
Seaborn KDEPlot - not enough variation in data?

<h3>Question</h3>

I have a data frame containing ~900 rows; I'm trying to plot KDEplots for some of the columns. In some columns, a majority of the values are equal to the same minimum value. When I include too many of these minimum values, the KDEPlot abruptly stops showing the minimums. For example, the following includes 600 values, of which 450 are the minimum, and the plot looks fine:

y = df.sort_values(by='col1', ascending=False)['col1'].values[:600]
sb.kdeplot(y)

But including 451 of the minimum values gives a very different output:

y = df.sort_values(by='col1', ascending=False)['col1'].values[:601]
sb.kdeplot(y)

Eventually I would like to plot bivariate KDEPlots of different columns against each other, but I'd like to understand this first.


<h3>Answer1:</h3>

The problem is the default algorithm that is chosen for the "bandwidth" of the kde. The default method is 'scott', which isn't very helpful when there are many equal values.

The bandwidth is the width of the Gaussians that are positioned at every sample point and summed up. Lower bandwidths follow the data more closely; higher bandwidths smooth everything out. The sweet spot is somewhere in the middle. In this case bw=0.3 could be a good option. To compare different KDEs, it is recommended to use exactly the same bandwidth for each of them.
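To make the "sum of Gaussians" description concrete, here is a minimal NumPy sketch of a KDE computed by hand. The function name `gaussian_kde_manual` and the toy sample are only illustrative assumptions; this is not the code seaborn runs internally.

```python
import numpy as np

def gaussian_kde_manual(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one per sample point.

    Hypothetical helper for illustration only.
    """
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample point
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # One normalized Gaussian per sample, evaluated on the grid
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    # Averaging keeps the total area equal to 1
    return kernels.mean(axis=1)

# Toy sample: three repeated values and two distinct ones
samples = np.array([-3.0, -3.0, -3.0, 0.0, 1.0])
grid = np.linspace(-8, 6, 2001)

narrow = gaussian_kde_manual(samples, grid, bandwidth=0.1)  # hugs the data
wide = gaussian_kde_manual(samples, grid, bandwidth=1.0)    # smooths it out
```

With the narrow bandwidth the repeated value at -3 produces a tall, thin spike; with the wide one the same mass is spread into a low bump.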

Here is some sample code to show the difference between bw='scott' and bw=0.3. The example data are 150 values from a standard normal distribution together with either 400, 450 or 500 fixed values.

<pre class="lang-py prettyprint-override">import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns; sns.set()

fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(10,5), gridspec_kw={'hspace':0.3})
for i, bw in enumerate(['scott', 0.3]):
    for j, num_same in enumerate([400, 450, 500]):
        y = np.concatenate([np.random.normal(0, 1, 150), np.repeat(-3, num_same)])
        sns.kdeplot(y, bw=bw, ax=axs[i, j])
        axs[i, j].set_title(f'bw:{bw}; fixed values:{num_same}')
plt.show()</pre>

The third plot gives a warning that the kde can not be drawn using Scott's suggested bandwidth.

PS: As mentioned by @mwaskom in the comments, in this case statsmodels.nonparametric.kde is used (not scipy.stats.gaussian_kde). There the default is <em>"scott" - 1.059 * A * nobs ** (-1/5.), where A is min(std(X),IQR/1.34)</em>. The min() explains the abrupt change in behavior. IQR is the "interquartile range", <em>the difference between the 75th and 25th percentiles</em>. Once the repeated value fills the sample up to the 75th percentile, both percentiles coincide, so the IQR, A and therefore the bandwidth all become 0.
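The jump between 450 and 451 minimum values in the question can be reproduced with a few lines of NumPy. This is only a sketch that follows the formula quoted above; `scott_bandwidth` is a hypothetical helper, not statsmodels' own code, and the evenly spaced non-minimum values are made-up stand-in data.

```python
import numpy as np

def scott_bandwidth(x):
    """Bandwidth from the quoted formula: 1.059 * A * nobs ** (-1/5),
    with A = min(std(X), IQR / 1.34). Hypothetical helper for illustration."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    iqr = q75 - q25
    a = min(np.std(x, ddof=1), iqr / 1.34)
    return 1.059 * a * len(x) ** (-1 / 5)

# 150 distinct values above the minimum (assumed data, evenly spaced)
others = np.linspace(-2.0, 2.0, 150)

# 600 values with 450 minimums, then 601 values with 451 minimums
bw_450 = scott_bandwidth(np.concatenate([others, np.full(450, -3.0)]))
bw_451 = scott_bandwidth(np.concatenate([others, np.full(451, -3.0)]))

print("bandwidth with 450 minimums:", bw_450)
print("bandwidth with 451 minimums:", bw_451)
```

With 450 of 600 values at the minimum, the interpolated 75th percentile still lies slightly above the minimum, so the bandwidth is small but positive. With 451 of 601 it lands exactly on the minimum, the IQR is 0, and the bandwidth collapses to 0.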


<h3>Answer2:</h3>

If the sample has repeated values, this implies that the underlying distribution is not continuous. In the data that you show to illustrate the issue, we can see a Dirac distribution on the left. Kernel smoothing can be applied to such data, but with care. Indeed, to approximate such data, we might use a kernel smoothing where the bandwidth associated with the Dirac is zero. However, in most KDE methods, there is only a single bandwidth for all kernel atoms. Moreover, the various rules used to compute the bandwidth are based on some estimate of the roughness of the second derivative of the PDF of the distribution. This cannot be applied to a discontinuous distribution.

We can, however, try to separate the sample into two sub-samples:

<ul><li>the sub-sample(s) with replications,</li> <li>the sub-sample with unique realizations.</li> </ul>

(This idea has already been mentioned by johanc.)

Below is an attempt to perform this classification. The np.unique method is used to count the occurrences of the replicated realizations. The replicated values are associated with Diracs, and their weight in the mixture is estimated from the fraction of these replicated values in the sample. The remaining realizations, the unique ones, are then used to estimate the continuous distribution with KDE.

The following function is useful to work around a limitation in the current OpenTURNS implementation of the draw method for a Mixture.

def DrawMixtureWithDiracs(distribution):
    """Draw a distribution which has Diracs.
    https://github.com/openturns/openturns/issues/1489"""
    graph = distribution.drawPDF()
    graph.setLegends(["Mixture"])
    for atom in distribution.getDistributionCollection():
        if atom.getName() == "Dirac":
            curve = atom.drawPDF()
            curve.setLegends(["Dirac"])
            graph.add(curve)
    return graph

The following script creates a use case with a Mixture containing a Dirac and a Gaussian distribution.

import openturns as ot
import numpy as np
distribution = ot.Mixture([ot.Dirac(-3.0), ot.Normal()], [0.5, 0.5])
DrawMixtureWithDiracs(distribution)

This is the result.

Then we create a sample.

sample = distribution.getSample(100)

This is where your problem begins. We count the number of occurrences of each realization.

array = np.array(sample)
unique, index, count = np.unique(array, axis=0, return_index=True, return_counts=True)

For all realizations, replicated values are associated with Diracs and unique values are put in a separate list.

sampleSize = sample.getSize()
listOfDiracs = []
listOfWeights = []
uniqueValues = []
for i in range(len(unique)):
    if count[i] == 1:
        uniqueValues.append(unique[i][0])
    else:
        atom = ot.Dirac(unique[i])
        listOfDiracs.append(atom)
        w = count[i] / sampleSize
        print("New Dirac =", unique[i], " with weight =", w)
        listOfWeights.append(w)

The weight of the continuous atom is the complement of the sum of the weights of the Diracs. This way, the sum of the weights will be equal to 1.

complementaryWeight = 1.0 - sum(listOfWeights)
weights = list(listOfWeights)
weights.append(complementaryWeight)

Now comes the easy part: the unique realizations can be used to fit a kernel smoothing. The KDE is then added to the list of atoms.

sampleUniques = ot.Sample(uniqueValues, 1)
factory = ot.KernelSmoothing()
kde = factory.build(sampleUniques)
atoms = list(listOfDiracs)
atoms.append(kde)

Et voilà: the Mixture is ready.

mixture_estimated = ot.Mixture(atoms, weights)

The following script compares the initial Mixture and the estimated one.

graph = DrawMixtureWithDiracs(distribution)
graph.setColors(["dodgerblue3", "dodgerblue3"])
curve = DrawMixtureWithDiracs(mixture_estimated)
curve.setColors(["darkorange1", "darkorange1"])
curve.setLegends(["Est. Mixture", "Est. Dirac"])
graph.add(curve)
graph

The figure seems satisfactory, given that the continuous distribution is estimated from a sub-sample whose size is only 50, i.e. one half of the full sample.
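If OpenTURNS is not at hand, the splitting step itself can be sketched with NumPy alone. The helper name `split_repeated_and_unique` and the synthetic sample are assumptions made for this illustration; the returned continuous part would still have to be passed to a KDE routine of your choice.

```python
import numpy as np

def split_repeated_and_unique(values):
    """Separate replicated values (Dirac atoms) from unique ones.

    Hypothetical helper; returns the atoms, their weights, the unique
    values, and the weight left for the continuous part.
    """
    values = np.asarray(values, dtype=float)
    unique, counts = np.unique(values, return_counts=True)
    repeated = counts > 1
    atoms = unique[repeated]
    # Weight of each Dirac = fraction of the sample equal to that value
    atom_weights = counts[repeated] / values.size
    continuous_part = unique[~repeated]
    # Complement, so that all weights sum to 1
    continuous_weight = 1.0 - atom_weights.sum()
    return atoms, atom_weights, continuous_part, continuous_weight

# Synthetic sample: 50 copies of -3 plus 50 standard normal draws
rng = np.random.default_rng(0)
sample = np.concatenate([np.full(50, -3.0), rng.normal(0.0, 1.0, 50)])

atoms, atom_weights, continuous_part, continuous_weight = (
    split_repeated_and_unique(sample)
)
print("Dirac atoms:", atoms, "weights:", atom_weights)
print("continuous part size:", continuous_part.size, "weight:", continuous_weight)
```

Treating every value that occurs more than once as a Dirac is a simple rule that suits data like this; rounded measurements with many ties would need a different criterion.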

Source: https://stackoverflow.com/questions/61797760/seaborn-kdeplot-not-enough-variation-in-data
