Find only relevant points in MATLAB

I have a MATLAB function that finds characteristic points in a sample. Unfortunately it only works about 90% of the time. But when I know at which places in the sample I am supposed to look, I can increase this to almost 100%. So I would like to know if there is a function in MATLAB that would allow me to find the range where most of my results are, so I can then recalculate my characteristic points. I have a vector which stores all the results, and the right results should lie inside a range of 3% between -24.000 and 24.000, whereas wrong results are always lower than the correct range. Unfortunately my background in statistics is very rusty, so I am not sure what this would be called. Can somebody give me a hint what I should be looking for? Is there a function built into MATLAB that would give me the smallest possible range where e.g. 90% of the results lie?

EDIT: I am sorry if I didn't make my question clear. Everything in my vector can only range between -24.000 and 24.000. About 90% of my results will be in a range which spans approximately 1.44 ([24-(-24)]*3% = 1.44). These are very likely to be the correct results. The remaining 10% are outside of that range and always lower (which is why I am not sure taking the mean value is a good idea). These 10% are false and result from blips in my input data. To find the remaining 10% I want to repeat my calculations, but now I only want to check the small range. So, my goal is to identify where my correct range lies, delete the values I have found outside of that range, and then recalculate my values, not on the range between -24.000 and 24.000, but rather on the small range where I already found 90% of my values.

Answer1:

The relevant points you're looking for are the percentiles:

    % generate sample data
    data = [randn(900,1); randn(50,1)*3 + 5; randn(50,1)*3 - 5];
    subplot(121), hist(data)
    subplot(122), boxplot(data)

    % find 5th, 95th percentiles (range that contains 90% of the data)
    limits = prctile(data, [5 95])

    % find data in that range
    reducedData = data(limits(1) < data & data < limits(2));
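The percentile limits cut 5% from each tail. Since the question asks for the smallest range that holds about 90% of the results, and the wrong values all sit on the low side, a shortest-interval search over the sorted data may be a closer fit. This is only a minimal sketch in base MATLAB; the sample data, the 0.9 coverage and all variable names are assumptions made for illustration.

```matlab
% assumed sample data: 900 "correct" values in a narrow band,
% 100 "wrong" values that are all lower
data = [linspace(9.5, 10.5, 900)'; linspace(-20, 0, 100)'];

coverage = 0.9;                  % assumed share of results the range must hold
x = sort(data(:));
n = numel(x);
k = ceil(coverage * n);          % number of points each candidate window holds

% width of every window of k consecutive sorted points
widths = x(k:n) - x(1:n-k+1);
[minWidth, i] = min(widths);     % narrowest window wins

lo = x(i);
hi = x(i+k-1);
reducedData = data(data >= lo & data <= hi);
```

Unlike the symmetric percentile cut, this does not assume the outliers are split evenly between both tails.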

Other approaches exist to detect outliers, such as the IQR outlier test and the three standard deviation rule, among many others:

    %% three standard deviation rule
    z = 3;
    bounds = z * std(data)
    reducedData = data( abs(data-mean(data)) < bounds );

and

    %% IQR outlier test
    Q = prctile(data, [25 75]);
    IQ = Q(2)-Q(1);
    %a = 1.5;   % mild outlier
    a = 3.0;    % extreme outlier
    bounds = [Q(1)-a*IQ , Q(2)+a*IQ]
    reducedData = data(bounds(1) < data & data < bounds(2));

BTW if you want to get the z value (|X|<z) that corresponds to 90% area under the curve, use:

    area = 0.9;   % two-tailed probability
    z = norminv(1-(1-area)/2)
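norminv belongs to the Statistics Toolbox. If that is not installed, the same value can be computed with erfinv from base MATLAB, because for a standard normal X we have P(|X| < z) = erf(z/sqrt(2)). A small sketch, reusing the variable names from the snippet above:

```matlab
area = 0.9;                   % two-tailed probability
z = sqrt(2) * erfinv(area);   % about 1.6449 for area = 0.9
```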

Answer2:

Maybe you should try the mean value (in MATLAB: mean) and the standard deviation (in MATLAB: std)?

What is the statistical distribution of your data?

See also this wiki page, section "Interpretation and application". In general, Chebyshev's inequality is very useful here because it holds for almost every distribution.

In most of the cases this should work:

    meanVal = mean(data)
    stDev = std(data)

and most of your values (at least 75%, by Chebyshev's inequality) will lie in the range:

<meanVal - 2*stDev, meanVal + 2*stDev>
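The 75% figure is Chebyshev's bound for k = 2: at least 1 - 1/k^2 of the values lie within k standard deviations of the mean. The following sketch checks this on made-up data with one-sided outliers, roughly like the data described in the question. The sample data and the names fraction and bound are assumptions for illustration.

```matlab
% assumed sample data: 900 values in a narrow band, 100 lower outliers
data = [linspace(9.5, 10.5, 900)'; linspace(-20, 0, 100)'];

k = 2;
meanVal = mean(data);
stDev   = std(data);

inRange  = abs(data - meanVal) < k * stDev;   % logical mask of values in range
fraction = mean(inRange);                     % share of values inside the range
bound    = 1 - 1/k^2;                         % Chebyshev lower bound, 0.75 for k = 2

rangeWidth = 2 * k * stDev;                   % total width of the 2-sigma range
```

On data like this the low outliers inflate the standard deviation, so rangeWidth comes out far wider than the 1.44 band. The bound still holds, but the range is too loose to isolate the correct results.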

Answer3:

It seems like you want to find the number x in [-24,24] that maximizes the number of sample points in [x,x+1.44]. Probably the fastest way to do this involves a sort of the sample points, which is ultimately O(n log n) time. A cheesy approximation would be as follows:

    n_brkpoints = 500;   % assumed value; choose n_brkpoints big, but < # of sample points?
    brkpoints = linspace(-24, 24-1.44, n_brkpoints);
    n_count = histc(data(:), [brkpoints, inf]);   % count # data points between breakpoints
    accbins = round(1.44 / (brkpoints(2) - brkpoints(1)));   % # of bins to accumulate (must be an integer)
    cscount = cumsum(n_count);   % half of the boxcar sum computation
    boxsum = cscount - [zeros(accbins,1); cscount(1:end-accbins)];   % 2nd half
    [dum, maxi] = max(boxsum);   % which interval has the maximal # counts?
    lorange = brkpoints(max(maxi - accbins + 1, 1));   % the lower range; boxsum(maxi) sums the accbins bins ending at bin maxi
    hirange = lorange + 1.44

This solution does fudge some of the corner cases around the bottom and top bins, etc.
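For comparison, the sort-based variant mentioned at the start of this answer can be sketched as a sliding window over the sorted points. This is a minimal sketch under assumptions, not a tuned implementation: the sample data and the names w and best are made up for illustration. It relies on the fact that some densest window always starts at a sample point.

```matlab
% assumed sample data: correct values in a narrow band, wrong values lower
data = [linspace(9.5, 10.5, 900)'; linspace(-20, 0, 100)'];
w = 1.44;                        % window width from the question

x = sort(data(:));               % the sort dominates the cost: O(n log n)
n = numel(x);
best = 0;                        % largest point count found so far
lorange = x(1);
j = 1;                           % index of the last point inside the window
for i = 1:n
    % advance j to the last point still inside [x(i), x(i) + w]
    while j < n && x(j+1) <= x(i) + w
        j = j + 1;
    end
    if j - i + 1 > best
        best = j - i + 1;
        lorange = x(i);          % the window starts at this sample point
    end
end
hirange = lorange + w;
```

Because j only ever moves forward, the loop itself is linear in the number of points, and no bin width has to be chosen.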

Note that if you're going the Chebyshev inequality route, the Vysochanskij-Petunin inequality is probably applicable (it assumes a unimodal distribution) and will give a slightly tighter bound.
