Hawkins (1980) defines an outlier as an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.
Barnett and Lewis (1994) indicate that an outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs.
Simply put, extreme values in the data are called outliers.
Table of Contents
Causes of Outliers
Problem with Outliers
Types of Outliers
Outlier Detection
Methods for Handling the Outliers
Some Quick FAQs regarding Outliers
Causes of Outliers:
Causes can broadly be classified into two categories:
a) Naturally Occurring: anomalies caused by genuine novelties in the dataset.
For example, although the JEE Mains/Advanced exams are extremely tough, a handful of students still achieve near-perfect scores out of the lakhs who appear.
Another instance: performance in a sales job, which has no upper limit, may show the majority of employees earning average commissions while 3 or 4 employees earn much higher commissions because of their higher transaction volumes.
If we work on the dataset of scores of the students mentioned above, or on the sales commissions of the employees, we may find some data points that create an imbalanced dataset.
b) Non-Natural or Artificial Errors → errors caused by any of the following:
Data Entry Errors: Human errors during data collection, recording, or entry can cause outliers in data. For example, a customer's age is 24, but the data entry operator accidentally adds an extra zero, making it 240, ten times higher. Evidently, this value will be an outlier compared with the rest of the population.
Measurement Error: These errors arise when the measuring instrument turns out to be faulty. For example, suppose there are 10 weighing machines and one of them is faulty. Weights measured on the faulty machine will be higher or lower than those measured for the rest of the group, and these measurements can lead to outliers.
Experimental Error: For example, in a 100 m sprint with 7 runners, one runner missed the 'Go' call and started late. This made his run time longer than the other runners', so his total run time can be an outlier.
Intentional Outliers: These are commonly found in self-reported measures that involve sensitive data. For example, teens typically under-report the amount of alcohol they consume, and only a fraction report the actual value. Here the actual values may look like outliers because the rest of the teens are under-reporting their consumption.
Data Processing Error: Whenever we perform data mining, we extract data from multiple sources. It is possible that some manipulation or extraction errors may lead to outliers in the dataset.
Sampling error: For instance, we have to measure the height of athletes. By mistake, we include a few basketball players in the sample. This inclusion is likely to cause outliers in the dataset.
Problem with Outliers
Outliers create an imbalance in the dataset and hence are generally removed from the data. The measures of central tendency, the mean, median, and mode, can all be affected by the presence of outliers, with the mean being the most sensitive.
Furthermore, if the outliers are non-randomly distributed in the dataset, they can reduce normality.
They can also violate the basic assumptions of regression, ANOVA, and other statistical models.
Data outliers can also ruin and mislead the training process, resulting in longer training times, less accurate models, and ultimately poorer outcomes.
Types of Outliers
There are mainly 3 types of Outliers.
Point or Global Outliers: A data point is considered a global outlier if its value is far outside the entirety of the data set in which it is found. Example: in a class, all students' ages will be approximately similar, but if we see a record of a student with an age of 500, it is an outlier. It could be generated due to various reasons.
Contextual (Conditional) Outliers: A data point is considered a contextual outlier if its value significantly deviates from the rest of the data points in the same context. Note that the same value may not be considered an outlier if it occurred in a different context. Here, values are not outside the normal global range but are abnormal compared to the seasonal pattern. Example: the world economy fell drastically due to COVID-19, and the stock market crashed due to the scam in 1992 and again in 2020 due to COVID-19. The usual data points will be near each other, whereas the data points during such specific periods will be very far up or down. This is not due to error; these are actual observations.
Collective Outliers: A collection of observations that are anomalous but appear close to one another because they all have a similar anomalous value. A subset of data points within a data set is considered anomalous if those values, as a collection, deviate significantly from the entire data set, even though the individual data points are not themselves anomalous in either a contextual or global sense.
Outlier Detection
Before we start with the detection of outliers, we should know:
1. Which and how many features am I taking into account to detect outliers?
Univariate Analysis → These outliers can be found when we look at the distribution of a single variable.
Multivariate Analysis → These outliers are found in an n-dimensional space and to find them, we have to look at the distribution in multiple dimensions.
2. Can I assume a distribution for the values of my selected features?
Parametric Approach → Outliers could be identified by calculating the probability of the occurrence of an observation or by calculating how far the observation is from the mean. For example, in the case of a normal distribution, observations more than 3 standard deviations from the mean could be classified as outliers.
(In the original plot, the x-axis shows Revenue and the y-axis the probability density; the density curve of the actual data is shaded pink, the fitted normal distribution green, and the fitted log-normal distribution blue.)
Non-Parametric Approach → Let's look at a simple non-parametric approach, such as a box plot, to identify outliers. In the box plot of this data we can identify 7 observations, marked in green in the original plot, which could be classified as potential outliers. This plot is also an example of univariate analysis; a short sketch of both approaches follows.
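To make the two approaches concrete, here is a minimal sketch, assuming a small synthetic sample (the data and thresholds are illustrative, not taken from the original plots), that applies the three-sigma rule and draws a box plot with NumPy and Matplotlib.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic sample with a few injected extreme values (illustration only)
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(loc=100, scale=10, size=500), [200, 220, 15]])

# Parametric check: flag points more than 3 standard deviations from the mean
mu, sigma = data.mean(), data.std()
print(data[np.abs(data - mu) > 3 * sigma])

# Non-parametric check: points beyond the box-plot whiskers are potential outliers
fig, ax = plt.subplots()
ax.boxplot(data, vert=False)
ax.set_xlabel("Value")
plt.show()
```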
Methods for Detection of Outliers
1. We can use visualization:
A. Box Plot: It captures the summary of the data effectively and also provides insight about the 25th, 50th (median), and 75th percentiles, as well as outliers.
B. Scatter Plot: It is used when we want to determine the relationship between two variables and can be used to detect any outlier(s).
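As a quick illustration of how a scatter plot exposes an outlier in the relationship between two variables, here is a minimal sketch with made-up numbers (hours studied vs. marks scored are hypothetical).

```python
import matplotlib.pyplot as plt

# Hypothetical data: marks generally rise with hours studied,
# but the pair (7, 20) clearly breaks the trend
hours = [1, 2, 3, 4, 5, 6, 7, 8]
marks = [35, 42, 50, 55, 62, 70, 20, 85]

fig, ax = plt.subplots()
ax.scatter(hours, marks)
ax.set_xlabel("Hours studied")
ax.set_ylabel("Marks scored")
plt.show()
```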
2. Inter-Quartile Range (IQR) → It is the difference between Quartile 3 (Q3) and Quartile 1 (Q1). In this method, we define fences; any value below the lower fence or above the upper fence is considered an outlier.
e.g. → a value is an outlier if value < Q1 - 1.5*IQR or value > Q3 + 1.5*IQR
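A minimal sketch of the IQR fences with NumPy, assuming a small made-up sample with one obvious extreme value:

```python
import numpy as np

data = np.array([10, 11, 14, 15, 12, 13, 11, 55])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences: anything below Q1 - 1.5*IQR or above Q3 + 1.5*IQR is flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])   # flags the extreme value 55
```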
3. Z-score method → It tells us how far away a data point is from the mean, in units of standard deviation. We can set up a threshold value and use it to define outliers. Generally, the threshold is chosen as 3.0, since 99.7% of the values lie within ±3 standard deviations of the mean.
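A minimal sketch of the z-score check using SciPy; the data is synthetic and the |z| > 3 threshold is the conventional choice mentioned above.

```python
import numpy as np
from scipy import stats

# Synthetic data: ordinary values plus a few injected extremes (illustration only)
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(loc=50, scale=5, size=997), [120, 130, 140]])

# Flag observations whose absolute z-score exceeds the usual threshold of 3
z = np.abs(stats.zscore(data))
print(data[z > 3])
```

Note that with very small samples the maximum attainable |z| is limited, so a lower threshold such as 2.5 is sometimes used instead.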
4. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) focuses on finding neighbors by density (MinPts) within an 'n-dimensional sphere' of radius ɛ. A cluster can be defined as the maximal set of 'density connected points' in the feature space. DBSCAN defines different classes of points (a short usage sketch follows the list):
Core point: A point is a core point if its neighborhood (defined by ɛ) contains at least as many points as MinPts.
Border point: A border point lies in a cluster and its neighborhood contains fewer than MinPts points, but it is still 'density reachable' from other points in the cluster.
Outlier: An outlier (noise) point lies in no cluster; it is neither 'density reachable' nor 'density connected' to any other point, so it effectively forms its own 'cluster'.
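A minimal usage sketch with scikit-learn's DBSCAN; the data, eps (the radius ɛ), and min_samples (MinPts) values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense clusters plus a few isolated points (synthetic, illustration only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)),
               rng.normal(5, 0.3, size=(100, 2)),
               [[10, 10], [-8, 7], [6, -9]]])

# eps plays the role of the radius ɛ, min_samples the role of MinPts
db = DBSCAN(eps=0.8, min_samples=5).fit(X)

# scikit-learn labels noise points (outliers) with -1
print(X[db.labels_ == -1])
```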
5. Isolation Forests → The basic principle of an isolation forest is that outliers are few and far from the rest of the observations. To build a tree (training), the algorithm randomly picks a feature from the feature space and a random split value between that feature's minimum and maximum; this is done for all observations in the training set. To build the forest, an ensemble of trees is made and their results are averaged; observations that are isolated after only a few splits (a short average path length) are scored as outliers.
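A minimal usage sketch with scikit-learn's IsolationForest; the data and the contamination value (an assumed fraction of outliers) are illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Mostly "normal" observations with a few extremes appended (synthetic)
rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(50, 5, size=(500, 1)), [[120], [130], [-40]]])

# contamination is a rough guess at the expected fraction of outliers
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)

# predict() returns -1 for outliers and 1 for inliers
labels = iso.predict(X)
print(X[labels == -1].ravel())
```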
Methods for Handling the Outliers
Deleting observations: We delete outlier values if they are due to data entry or data processing errors, or if the number of outlier observations is very small. We can also trim the data at both ends to remove outliers.
Transforming and binning values: Transforming variables can also eliminate outliers. Taking the natural log of a value reduces the variation caused by extreme values. Binning is also a form of variable transformation; the decision tree algorithm deals with outliers well because it bins variables. We can also assign weights to different observations.
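A minimal sketch of a log transform and binning with pandas; the column name, values, and bin edges are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [30_000, 42_000, 55_000, 61_000, 2_500_000]})

# A natural log compresses the extreme value instead of dropping it
df["log_income"] = np.log(df["income"])

# Binning groups values into coarse categories, which also dampens outliers
df["income_band"] = pd.cut(df["income"],
                           bins=[0, 50_000, 100_000, np.inf],
                           labels=["low", "mid", "high"])
print(df)
```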
Imputing: We can also impute outliers using the mean, median, or mode. Before imputing values, we should analyze whether the outlier is natural or artificial; if it is artificial, we can go ahead with imputation. We can also use a statistical model to predict the values of outlier observations and then impute them with the predicted values.
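A minimal sketch of median imputation, assuming the outlier has first been flagged with the IQR fences; the data is made up, and 240 stands in for the kind of data entry error mentioned earlier.

```python
import pandas as pd

s = pd.Series([10, 11, 14, 15, 12, 13, 11, 240])   # 240 is a suspected entry error

# Flag outliers with the IQR fences, then replace them with the median
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
print(s.mask(mask, s.median()))
```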
Treat separately: If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat the two groups as different groups, build an individual model for each, and then combine the outputs.
Some Quick FAQs regarding Outliers
1. How does removing the outlier affect the mean?
Removing an outlier changes the value of the mean.
Let us understand this with sample data of 10, 11, 14, 15, and 55.
Mean = (10 + 11 + 14 + 15 + 55) / 5 = 105 / 5 = 21
Mean (without the outlier) = (10 + 11 + 14 + 15) / 4 = 50 / 4 = 12.5
Here, on removing the outlier 55 from the sample data, the mean changes from 21 to 12.5.
2. When should we remove outliers?
Errors in data entry or an insufficient data collection process result in outliers, and in such instances the outlier is removed from the data before further analysis. Sometimes, however, outliers rightly belong to the dataset and cannot be removed. An example is the marks scored by students, where a student scoring 100 (full marks) is an outlier that cannot be removed from the dataset.
3. Can normal distribution have outliers?
A normal distribution can also have outliers. The Z-value helps to identify them: data with Z-values beyond ±3 are considered outliers.
4. What percent of a normal distribution are outliers?
Since 99.7% of the data lies within a Z-value of ±3, the remaining 0.3% of the data can be considered outliers.