Outlier

An outlier is an extreme observation that may distort the test results.

Example

A teacher measured the heights of a group of students: 110, 115, 130, 145, 721, 151, 160, 128, 137.
She got the following results:
Average: 199.7, Standard deviation: 196.2.

The value 721 is clearly a typo.

The results after correcting the outlier to 121:
Average: 133, Standard deviation: 16.7.
If she were analyzing the data on a public holiday and the correction were not so obvious, she might need to exclude the observation instead, which gives the following results:
Average: 134.5, Standard deviation: 17.2.
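
The three sets of results above can be reproduced with a short Python sketch. It assumes the sample standard deviation (dividing by the number of observations minus one), which matches the figures in the example; the helper name summarize is only illustrative.

```python
from statistics import mean, stdev

heights = [110, 115, 130, 145, 721, 151, 160, 128, 137]

def summarize(values):
    """Return (average, sample standard deviation), rounded to one decimal."""
    return round(mean(values), 1), round(stdev(values), 1)

# The data as recorded, including the typo 721.
original = summarize(heights)
# The typo corrected to 121.
corrected = summarize([121 if h == 721 else h for h in heights])
# The suspicious observation excluded.
excluded = summarize([h for h in heights if h != 721])

print(original)   # (199.7, 196.2)
print(corrected)  # (133, 16.7)
print(excluded)   # (134.5, 17.2)
```

A single wrong value moves the average by about 67 and inflates the standard deviation more than tenfold, while correcting or excluding it gives nearly the same results.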

Why do we get outliers?

Observation errors

The recorded value of the observation is incorrect, for various reasons such as the following.
You should exclude these incorrect outliers.

Measurement error

A faulty measurement tool or an incorrect measurement process.

Experiment error

For example, when counting bacteria, some of the Petri dishes are contaminated and show a larger count.

Human error

Any human error, such as entering an incorrect value, reading the tool incorrectly, or lying.

Incorrect statistical model

Because you use the wrong model, some valid values appear as outliers. Removing these outliers would be a mistake; instead, you should fix the model.
You don't want to exclude these outliers!

Incorrect distribution

The real statistical distribution is not symmetric, and the outlier is valid.
How to fix it?
Use the correct distribution, use a non-parametric test for data that is not normally distributed, or transform the data to fit the normal distribution better.
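
As a rough illustration of the transformation option, the sketch below applies a base-2 logarithm to a small right-skewed list and checks it with the Tukey fence rule described under Detection Methods. The data values and the helper name tukey_outliers are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import math
from statistics import median

def tukey_outliers(values, k=1.5):
    """Return the values outside the Tukey fences (median-of-halves quartiles)."""
    data = sorted(values)
    half = len(data) // 2
    # For an odd-length list, the middle value is included in both halves.
    q1 = median(data[:half + len(data) % 2])
    q3 = median(data[half:])
    spread = q3 - q1  # interquartile range
    return [x for x in data if x < q1 - k * spread or x > q3 + k * spread]

# Hypothetical right-skewed data (not taken from the article).
skewed = [1, 2, 2, 4, 4, 4, 8, 8, 16, 32]
logged = [math.log2(x) for x in skewed]

print(tukey_outliers(skewed))  # [32]
print(tukey_outliers(logged))  # []
```

On the raw scale the value 32 is flagged, but after the log transformation it falls inside the fences, which suggests it is a valid value from a skewed distribution rather than an error.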

Mixture of populations

The sampled population is composed of two or more groups with different characteristics.
How to fix it?
Analyze each population separately, or account for the groups in the model, for example by adding a predictor to a regression.

Valid outliers

A genuine extreme value occurs with low probability.
When you use a large sample size, you will almost certainly get some such observations, and you must not exclude them from the research.
For example, in a normal distribution, there is a probability of about 0.05 of getting a value more than two standard deviations from the average.
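
The probability quoted above can be checked with Python's standard library. This is a minimal sketch; the sample size of 1000 is a hypothetical figure used only to show how many such values to expect.

```python
from statistics import NormalDist

# Probability that a normally distributed value lies more than
# two standard deviations from the average (both tails together).
p_beyond_2sd = 2 * (1 - NormalDist().cdf(2))

# Expected number of such values in a hypothetical sample of 1000 observations.
sample_size = 1000
expected_extremes = sample_size * p_beyond_2sd

print(round(p_beyond_2sd, 4))    # 0.0455
print(round(expected_extremes))  # 46
```

So the exact figure is slightly below 0.05, and a sample of 1000 valid observations would still be expected to contain dozens of values beyond two standard deviations.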

Detection Methods

There are many ways to identify outliers. The following are two of the commonly used methods.

Standard deviation

Usually with n=2.
Lower = Average - n*Standard Deviation.
Upper = Average + n*Standard Deviation.
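
A minimal Python sketch of these thresholds, applied to the height data from the example at the top of the page. It assumes the sample standard deviation; the function name sd_bounds is only illustrative.

```python
from statistics import mean, stdev

def sd_bounds(values, n=2):
    """Return (lower, upper): the average minus/plus n sample standard deviations."""
    avg = mean(values)
    sd = stdev(values)
    return avg - n * sd, avg + n * sd

heights = [110, 115, 130, 145, 721, 151, 160, 128, 137]
lower, upper = sd_bounds(heights)

# Observations outside the thresholds are flagged as outliers.
outliers = [h for h in heights if h < lower or h > upper]
print(outliers)  # [721]
```

Note that the outlier itself inflates the average and the standard deviation used to compute the thresholds, so this method can miss outliers in small samples.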

Tukey Fence

Usually with k=1.5.
Interquartile Range: IQR = Q3 - Q1.
Lower = Q1 - k*IQR.
Upper = Q3 + k*IQR.

Example, with k=1.5:

Even list

[21, 13, 14, 16, 38, 17, 18, 11, 20, 22, 22, 26].
Sort the list: [11, 13, 14, 16, 17, 18, 20, 21, 22, 22, 26, 38].
Divide the sorted list into two equal lists. Dividing each of these lists again into two lists gives its median: Q1 for the lower list and Q3 for the upper list.
[11, 13, 14, 16, 17, 18], [20, 21, 22, 22, 26, 38].
Q1=(14+16)/2=15.
Q3=(22+22)/2=22.
IQR=Q3-Q1=22-15=7.
Lower=Q1-k*IQR=15-1.5*7=4.5.
Upper=Q3+k*IQR=22+1.5*7=32.5.
An outlier is any observation that is below the lower threshold or above the upper threshold. Here, 38 is above 32.5, so it is an outlier.

Odd list

[11, 13, 14, 16, 17, 18, 20, 21, 22, 22, 26].
The number 18 is in the middle of the list. We chose to include it in both divided lists, but removing it from both lists is also a valid option.
[11, 13, 14, 16, 17, 18], [18, 20, 21, 22, 22, 26].
Q1=(14+16)/2=15.
Q3=(21+22)/2=21.5.
The rest of the calculation is identical to the even list.
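
The quartile procedure used in both examples can be sketched in Python as follows. The function name tukey_fences is illustrative, and the sketch follows the choice made here of including the middle value in both halves of an odd list; statistical packages may use other quartile conventions and give slightly different thresholds.

```python
from statistics import median

def tukey_fences(values, k=1.5):
    """Return (q1, q3, lower, upper) using median-of-halves quartiles."""
    data = sorted(values)
    half = len(data) // 2
    # For an odd-length list, the middle value goes into both halves.
    lower_half = data[:half + len(data) % 2]
    upper_half = data[half:]
    q1, q3 = median(lower_half), median(upper_half)
    spread = q3 - q1  # interquartile range
    return q1, q3, q1 - k * spread, q3 + k * spread

even_list = [21, 13, 14, 16, 38, 17, 18, 11, 20, 22, 22, 26]
odd_list = [11, 13, 14, 16, 17, 18, 20, 21, 22, 22, 26]

even_result = tukey_fences(even_list)
odd_result = tukey_fences(odd_list)
print(even_result)  # (15.0, 22.0, 4.5, 32.5)
print(odd_result)   # (15.0, 21.5, 5.25, 31.25)

# Observations outside the thresholds of the even list.
_, _, lower, upper = even_result
even_outliers = [x for x in even_list if x < lower or x > upper]
print(even_outliers)  # [38]
```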