An outlier is an extreme observation that may distort the test results.
A teacher measured the heights of a group of students: 110, 115, 130, 145, 721, 151, 160, 128, 137.
The results were:
Average: 199.7, Standard deviation: 196.2.
The value 721 is clearly a typo.
After correcting the outlier to 121, the results are:
Average: 133, Standard deviation: 16.7.
If she analyzed the data on a public holiday and the correction were not so obvious, she might need to exclude the observation instead, giving the following results:
Average: 134.5, Standard deviation: 17.2.
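The three sets of results above can be reproduced with a short script. This is a minimal sketch that assumes the figures are the arithmetic mean and the sample standard deviation (n - 1 in the denominator); the variable names are illustrative only.

```python
from statistics import mean, stdev

heights = [110, 115, 130, 145, 721, 151, 160, 128, 137]

# Case 1: raw data, including the suspected typo 721
raw = (mean(heights), stdev(heights))

# Case 2: the typo corrected to 121
corrected_data = [121 if h == 721 else h for h in heights]
corrected = (mean(corrected_data), stdev(corrected_data))

# Case 3: the suspicious observation excluded
excluded_data = [h for h in heights if h != 721]
excluded = (mean(excluded_data), stdev(excluded_data))

for label, (avg, sd) in [("raw", raw), ("corrected", corrected), ("excluded", excluded)]:
    print(f"{label:>9}: average={avg:.1f}, standard deviation={sd:.1f}")
```

Correcting and excluding give similar results here, while the single wrong value inflates the standard deviation more than tenfold.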
Sometimes an observation's value is incorrect, for various reasons.
You should exclude these incorrect outliers.
Measurement tool error, or a wrong measurement process.
For example, when counting bacteria, some of the Petri dishes are contaminated and show a larger count.
Any human error, such as entering an incorrect value, reading the tool incorrectly, or lying.
When you use the wrong model, some valid values appear as outliers. Removing them would be a mistake; instead, you should fix the model.
You don't want to exclude these outliers!
The real statistical distribution is not symmetric, and the outlier is valid.
How to fix it?
Use the correct distribution, use a non-parametric test that does not assume normality, or transform the data so that it fits the normal distribution better.
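As a rough illustration of the transformation option, the sketch below applies a log transform to right-skewed data and checks it with the interquartile-range rule described later in this text. The data are hypothetical and not taken from the text, and the helper `iqr_outliers` is an illustrative name. Whether a log transform is appropriate depends on the real distribution.

```python
import math
from statistics import median

def iqr_outliers(data, k=1.5):
    """Return values outside Q1 - k*IQR .. Q3 + k*IQR.

    Quartiles are the medians of the lower and upper halves of the sorted data
    (for an odd-length list the middle value is left out of both halves).
    """
    s = sorted(data)
    half = len(s) // 2
    q1 = median(s[:half])
    q3 = median(s[-half:])
    iqr = q3 - q1
    return [x for x in s if x < q1 - k * iqr or x > q3 + k * iqr]

# Hypothetical right-skewed measurements (an assumption for illustration)
raw = [1, 2, 2, 3, 4, 5, 8, 12, 20, 60]
logged = [math.log(x) for x in raw]

raw_flags = iqr_outliers(raw)     # 60 is flagged on the original scale
log_flags = iqr_outliers(logged)  # nothing is flagged on the log scale
print(raw_flags, log_flags)
```

On the original scale the value 60 looks like an outlier; after the transform it falls inside the thresholds, which suggests the skewed shape rather than the value itself was the problem.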
The sampled population is composed of two or more groups with different characteristics.
How to fix it?
Analyze each group separately, or account for the grouping in the model, for example by adding a predictor to a regression.
Genuine extreme values occur with low probability.
With a large sample size, you will almost certainly get some such observations, and you must not exclude them from the research.
For example, in a normal distribution, the probability of a value more than two standard deviations from the average is about 0.05.
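The probability quoted above can be checked with Python's standard library. This sketch assumes a two-sided probability under a standard normal distribution. The exact value is about 0.0455, which rounds to 0.05. The sample sizes in the loop are hypothetical.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Two-sided probability of falling more than 2 standard deviations from the mean
p_beyond_2sd = 2 * (1 - z.cdf(2))
print(f"P(|Z| > 2) = {p_beyond_2sd:.4f}")

# Expected number of such genuine extreme values for a few hypothetical sample sizes
for n in (20, 100, 1000):
    print(f"n={n}: about {n * p_beyond_2sd:.1f} observations beyond 2 standard deviations")
```

With 1000 observations, dozens of values beyond two standard deviations are expected even when every measurement is correct.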
There are many ways to identify outliers. The following are two commonly used methods.
Standard deviation method, usually with k=3.
Lower = Average - k * Standard Deviation.
Upper = Average + k * Standard Deviation.
A potential problem is that the outliers themselves inflate the sample's standard deviation, which widens the thresholds and may hide the very outliers you are looking for.
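The problem described above can be seen with the student heights from the beginning of this text. The sketch below assumes the sample standard deviation; the function names `sd_fences` and `sd_outliers` are illustrative.

```python
from statistics import mean, stdev

def sd_fences(data, k=3):
    """Return (lower, upper) thresholds: average -/+ k * standard deviation."""
    avg = mean(data)
    sd = stdev(data)  # sample standard deviation
    return avg - k * sd, avg + k * sd

def sd_outliers(data, k=3):
    """Return the observations that fall outside the thresholds."""
    lower, upper = sd_fences(data, k)
    return [x for x in data if x < lower or x > upper]

heights = [110, 115, 130, 145, 721, 151, 160, 128, 137]

lower, upper = sd_fences(heights, k=3)
flagged_k3 = sd_outliers(heights, k=3)  # empty: 721 inflates the standard deviation
flagged_k2 = sd_outliers(heights, k=2)  # 721 is flagged only with a narrower band
print(round(upper, 1), flagged_k3, flagged_k2)
```

With k=3 the upper threshold is above 721, so the obvious typo is not flagged at all; in a small sample a single extreme value can mask itself.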
Interquartile range method, usually with k=1.5.
Interquartile range: IQR = Q3 - Q1.
Lower = Q1 - k * IQR.
Upper = Q3 + k * IQR.
Example with an even number of observations, k=1.5:
[21, 13, 14, 16, 38, 17, 18, 11, 20, 22, 22, 26].
Sort the list: [11, 13, 14, 16, 17, 18, 20, 21, 22, 22, 26, 38].
Divide the sorted list into two equal halves; the median of the lower half is Q1 and the median of the upper half is Q3.
[11, 13, 14, 16, 17, 18], [20, 21, 22, 22, 26, 38].
Q1=(14+16)/2=15.
Q3=(22+22)/2=22.
IQR=Q3-Q1=22-15=7.
Lower=Q1-k*IQR=15-1.5*7=4.5.
Upper=Q3+k*IQR=22+1.5*7=32.5.
An outlier is any observation below the lower threshold or above the upper threshold. Here, 38 is the only outlier.
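The worked example above can be sketched in code as follows. The quartiles are computed as the medians of the two halves, as in the text; statistical packages often use other quartile definitions and may give slightly different thresholds. The function name `iqr_fences` is illustrative.

```python
from statistics import median

def iqr_fences(data, k=1.5):
    """Return (q1, q3, lower, upper) for a list with an even number of values."""
    s = sorted(data)
    half = len(s) // 2
    q1 = median(s[:half])  # median of the lower half
    q3 = median(s[half:])  # median of the upper half
    iqr = q3 - q1
    return q1, q3, q1 - k * iqr, q3 + k * iqr

data = [21, 13, 14, 16, 38, 17, 18, 11, 20, 22, 22, 26]
q1, q3, lower, upper = iqr_fences(data)
outliers = [x for x in data if x < lower or x > upper]
print(q1, q3, lower, upper, outliers)  # 15.0 22.0 4.5 32.5 [38]
```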
Example with an odd number of observations (already sorted):
[11, 13, 14, 16, 17, 18, 20, 21, 22, 22, 26].
The number 18 is in the middle of the list. We chose to include it in both halves, but removing it from both halves is also a valid option.
[11, 13, 14, 16, 17, 18], [18, 20, 21, 22, 22, 26].
Q1=(14+16)/2=15.
Q3=(21+22)/2=21.5.
The rest of the calculation is identical to the earlier example with an even number of observations.
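The two options for the middle value of an odd-length list can be compared with a small helper. This is a sketch; the function `quartiles` and its `include_median` flag are illustrative names, not a standard API.

```python
from statistics import median

def quartiles(data, include_median=True):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    For an odd-length list, include_median decides whether the middle value
    belongs to both halves (True) or to neither (False).
    """
    s = sorted(data)
    n = len(s)
    half = n // 2
    if n % 2 == 0:
        lower_half, upper_half = s[:half], s[half:]
    elif include_median:
        lower_half, upper_half = s[:half + 1], s[half:]
    else:
        lower_half, upper_half = s[:half], s[half + 1:]
    return median(lower_half), median(upper_half)

odd = [11, 13, 14, 16, 17, 18, 20, 21, 22, 22, 26]

q1_in, q3_in = quartiles(odd, include_median=True)   # 15.0, 21.5 (as in the text)
q1_ex, q3_ex = quartiles(odd, include_median=False)  # 14, 22

k = 1.5
lower_in = q1_in - k * (q3_in - q1_in)
upper_in = q3_in + k * (q3_in - q1_in)
print(q1_in, q3_in, lower_in, upper_in)
print(q1_ex, q3_ex)
```

The two conventions give slightly different quartiles, so it is worth stating which one was used when reporting outliers.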