What Role Does Outlier Data Play In Data Analysis?

This article was published as a part of the Data Science Blogathon

When we started our data science journey and worked with our first data set, for example the iris data set, we did not have to do any data cleaning, but real-world data sets are far from perfect. There are many shortcomings in a data set which should be dealt with before fitting any model to it. If the data is not treated well, it might lead to biases and the results will not be reliable. This is where exploratory data analysis comes into the picture.

There are multiple steps involved in exploratory data analysis, such as identifying all the variables and their data types, univariate and bivariate analysis, handling missing values, dealing with outliers, etc. It is always advisable never to skip the exploratory data analysis step during any model building. One of the most important steps in exploratory data analysis is outlier detection. Outliers are extreme values that do not match the rest of the data points. They might have made their way into the dataset due to various errors. There are numerous ways to treat outliers, but we have to choose the best method based on the dataset.

Let us look at all the steps involved in understanding outliers and dealing with them.

What are outliers?

"A celebrity in the crowd of commoners is an outlier"

Image Source: Google Images https://wallhere.com/en/wallpaper/253405

The above statement might have given a fair clue about what outliers are. Anomalies or outliers are those data points that lie at a great distance from the rest of the data, like a sudden increase or decrease by many folds. In simple words, an outlier is a value that lies outside the range of the other values in the dataset. For example, while measuring the body temperature of patients in a hospital, there was an entry of 988 degrees Fahrenheit, which is clearly wrong. There might be a missing decimal point: it should have been 98.8 instead of 988.

Another example: while measuring the weights of high school students, there was an entry with a weight of 1234, which is highly unlikely. It could be a data entry error. An outlier is not necessarily an erroneous entry; in some cases it could be the genuine result of some experiment, and it is up to the data scientist to determine which. What counts as an outlier depends on the business problem and can change from case to case. It's always best to discuss with the business stakeholders before terming a data point an outlier. Outliers need special attention so that they don't cause any issues in the model results.

How do they affect the calculation? Biases due to outliers

If the outliers are not treated in the first step while doing the exploratory data analysis, it can lead to biases in the results. A bias has many unfavorable impacts, which could lead to poor business decisions and ultimately a loss to the business.

"Avoiding bias starts by recognizing that data bias exists, both in the data itself and in the people analyzing or using it," said Hariharan Kolam, CEO and founder of Findem, in his speech. Bias can be introduced not only by the data but also by the person working on it. Biases can be introduced subconsciously, but they will be there; we simply have to make sure that these biases are dealt with before modeling the data, so that they don't pose any threat to our end results.

Different algorithms to treat outliers

There are numerous algorithms to detect and treat outliers, of which the following are the most popularly used. Let's look at each algorithm in detail with examples.

Z score test

The z score test is one of the most commonly used methods to detect outliers. It measures the number of standard deviations an observation is away from the mean value. A z score of 1.5 indicates that the observation is 1.5 standard deviations above the mean, and -1.5 means that the observation is 1.5 standard deviations below the mean.

Z score = (x - mean) / std. deviation

Where x is the data point

If the absolute z score of an observation is 3 or more, it is generally treated as an anomaly or an outlier.

Let us use the above table and find the outliers in the weights of students by computing their z scores.

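    import pandas as pd
    import scipy.stats as stats

    # Load the student weights table from the Excel file
    student_info = pd.read_excel('student_weight.xlsx')

    # z score of every weight relative to the column mean and standard deviation
    z_score = stats.zscore(student_info['weights(in Kg)'])
    print(z_score)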

Output

    [-0.30359971 -0.32843404 -0.35326838 -0.34085121 -0.37189413 -0.34085121
     -0.29739113  2.99936649 -0.32843404 -0.33464263]

We can clearly see that the entry 588 is an outlier: its z score of about 3 stands far apart from all the others, and the z score test confirms it.
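As a minimal sketch of the thresholding step, the snippet below computes z scores for a small made-up list of weights (hypothetical values, not the contents of student_weight.xlsx) and flags entries whose absolute z score is at least 3. It uses the population standard deviation, which is also the default of scipy.stats.zscore; the cutoff of 3 is the rule of thumb from the text and may need adjusting for other data.

```python
import statistics

# Hypothetical student weights in kg; 588 is the suspicious entry.
weights = [52, 50, 48, 49, 46, 49, 53, 588, 50, 49, 51, 47]

mean = statistics.fmean(weights)
std = statistics.pstdev(weights)  # population standard deviation (ddof=0)

# z score = (x - mean) / standard deviation
z_scores = [(x - mean) / std for x in weights]

THRESHOLD = 3  # rule-of-thumb cutoff; an assumption, tune per dataset
outliers = [x for x, z in zip(weights, z_scores) if abs(z) >= THRESHOLD]

print(outliers)  # [588]
```

One caveat for small samples: with the population standard deviation, the largest possible z score among n points is sqrt(n - 1), which is why the ten-row output above tops out just under 3.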

Box plot

The box plot shows the distribution of the data points by dividing them into quartiles. It marks the minimum, maximum, median, first, and third quartiles of the dataset. The first, second, and third quartiles are also known as the lower quartile, median, and upper quartile. This is one of the visual methods to detect anomalies. Any data points which lie outside the box and whiskers of the plot can be treated as outliers.

    import matplotlib.pyplot as plt

    # Draw a box plot of the weights column
    fig = plt.figure(figsize=(10, 7))
    plt.boxplot(student_info['weights(in Kg)'])
    plt.show()

The graph below shows the box plot of the students' weights dataset. There is an observation lying far away from the box and whiskers of the plot, which shows that this data point is an outlier.
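The same check can be done numerically instead of by eye. The sketch below assumes the common 1.5 × IQR convention for the whisker limits (matplotlib's boxplot uses whis=1.5 by default) and reuses a made-up list of weights rather than the article's Excel file; the variable names are illustrative only.

```python
import statistics

# Hypothetical student weights in kg; 588 is the suspicious entry.
weights = [52, 50, 48, 49, 46, 49, 53, 588, 50, 49, 51, 47]

# Quartiles, interpolating linearly between data points.
q1, median, q3 = statistics.quantiles(weights, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Whisker limits under the 1.5 * IQR convention (an assumption here).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything beyond the fences would be drawn as an individual outlier point.
outliers = [x for x in weights if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence, outliers)
```

Unlike the z score, the quartiles are barely affected by the extreme value itself, so this rule tends to stay usable even when the outlier is very large.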

Isolation Forest

The isolation forest algorithm is an easy-to-implement yet powerful choice for outlier detection. Isolation forest is based on the decision tree algorithm: it isolates the outliers from the dataset by selecting a random feature and a split value between the maximum and minimum values of the selected feature. Because outliers are few and lie far from the rest of the data, they tend to be isolated in fewer splits than ordinary points.

The isolation forest method is preferred over other methods when the data set is huge and has many features, as it uses less memory compared to other techniques.

Below is the code for detecting outliers using isolation forest

    from sklearn.ensemble import IsolationForest

    # contamination is the expected share of outliers in the data
    model = IsolationForest(n_estimators=50, max_samples='auto', contamination=0.1, max_features=1.0)
    model.fit(student_info[['weights(in Kg)']])

    # Lower scores are more anomalous; predict() returns -1 for outliers and 1 for inliers
    student_info['scores'] = model.decision_function(student_info[['weights(in Kg)']])
    student_info['anomaly'] = model.predict(student_info[['weights(in Kg)']])

    # Keep only the rows flagged as outliers
    anomaly = student_info.loc[student_info['anomaly'] == -1]
    anomaly_index = list(anomaly.index)
    print(anomaly)

Output

[Image: output of the isolation forest code]
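To build some intuition for why random splits isolate outliers quickly, here is a deliberately simplified one-feature sketch. It is not scikit-learn's implementation: the helper names, the made-up weights, and the number of trials are all assumptions for illustration. It repeatedly picks a random split value between the current minimum and maximum, keeps the side containing the point of interest, and counts how many splits pass before that point stands alone.

```python
import random

def isolation_depth(values, target, rng, max_depth=50):
    """Count the random splits needed until `target` is alone in its partition."""
    current = list(values)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:  # only duplicates of the target remain; nothing left to split
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target.
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth

def mean_isolation_depth(values, target, trials=500, seed=0):
    """Average the isolation depth over many random trials."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, target, rng) for _ in range(trials)) / trials

# Hypothetical student weights in kg; 588 is the suspicious entry.
weights = [52, 50, 48, 49, 46, 49, 53, 588, 50, 49, 51, 47]

depth_outlier = mean_isolation_depth(weights, 588)
depth_typical = mean_isolation_depth(weights, 50)
print(depth_outlier, depth_typical)
```

The extreme value is usually cut off by the very first split, while a typical value needs several more. An isolation forest turns this difference in average depth into an anomaly score.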

DBSCAN

Density-based spatial clustering of applications with noise, popularly known as DBSCAN, is a clustering algorithm. DBSCAN divides the dataset into groups by checking how densely the data points are packed together, and the observations which fail to join any group are termed noise, that is, outliers.

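    from sklearn.cluster import DBSCAN

    # eps and min_samples depend on the data and usually need tuning
    model = DBSCAN(eps=0.8, min_samples=10).fit(student_info[['weights(in Kg)']])
    labels = model.labels_  # cluster label per row; -1 marks noise (outliers)

    # Students on the x-axis, weights on the y-axis, colored by cluster label
    plt.scatter(student_info['student_name'], student_info['weights(in Kg)'], c=labels, marker='o')
    plt.xlabel('Students', fontsize=16)
    plt.ylabel('Weights', fontsize=16)
    plt.title('Students Vs Weights', fontsize=20)
    plt.show()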
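To read the outliers off a fitted DBSCAN model directly, one can select the rows labelled -1, which is the label scikit-learn assigns to noise points. The sketch below uses made-up weights, and the eps=3 and min_samples=3 settings are illustrative guesses for this toy data, not recommended values.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical student weights in kg; 588 is the suspicious entry.
weights = np.array([52, 50, 48, 49, 46, 49, 53, 588, 50, 49, 51, 47], dtype=float)

# DBSCAN expects a 2-D array of shape (n_samples, n_features).
X = weights.reshape(-1, 1)

# A point needs at least 3 points (itself included) within 3 kg to seed a cluster.
labels = DBSCAN(eps=3, min_samples=3).fit_predict(X)

# Noise points, which belong to no cluster, carry the label -1.
outliers = weights[labels == -1]

print(labels)
print(outliers)
```

Because eps is a distance in the units of the data, it helps to scale the features first when they are measured on very different scales.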

How to treat them?

It might be tempting to simply remove the records that contain outliers, but it's not always the best approach. The outlier treatment method can vary from case to case and should be discussed with the business before finalizing it. There are different approaches, such as replacing the outlier with the mean or median value, or in some cases dropping the observation with the suspected outlier so as to avoid any bias. We tend to delete outliers if they are due to data entry errors caused by human error or data processing errors.
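The drop and replace options can be sketched as follows. This assumes the outlier positions have already been flagged by one of the detection methods; the weights and the flagged index are made up for illustration, and taking the median from the remaining (unflagged) values is one reasonable choice rather than the only one.

```python
import statistics

# Hypothetical student weights in kg; 588 is the suspicious entry.
weights = [52, 50, 48, 49, 46, 49, 53, 588, 50, 49, 51, 47]

# Positions flagged as outliers by an earlier detection step (assumed).
flagged = {7}

# Option 1: drop the flagged observations.
dropped = [x for i, x in enumerate(weights) if i not in flagged]

# Option 2: replace each flagged value with the median of the remaining values.
# statistics.fmean(dropped) could be used instead for mean replacement.
replacement = statistics.median(dropped)
imputed = [replacement if i in flagged else x for i, x in enumerate(weights)]

print(dropped)
print(imputed)
```

Dropping shrinks the dataset, while replacement keeps every row but plants an artificial value in it, so the choice is worth agreeing on with the business first.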

Depending on the size of the data set, it can be advisable to treat the outliers separately during model fitting: build one model which fits the outliers and a separate model for the rest of the dataset. However, this process can be time-consuming and add to the cost.

The media shown in this article on treating outliers are not owned by Analytics Vidhya and are used at the Author's discretion.