It’s always fascinating to observe outliers and understand them. Outliers are the data points that don’t seem to fit well with the rest of the data population. Ironically, although outliers are unusual by definition, they can be found in almost every dataset you get your hands on. Identifying outliers should be one of the first things anyone trying to understand or interpret the data does. I would argue that it is a sin to draw inferences from data without understanding the outliers in that dataset.
Why are they important?
An outlier may be the result of an error, or it may occur naturally. I remember a client who had to enter the weight of grain (traded in domestic and international markets) in a transaction system. A ‘common’ outlier in the subsequent reports was a data point where the client had forgotten to convert the unit of weight. Another client used to identify manufacturing plants whose end-of-month inventory was very different from that of the other plants in its industry conglomerate. Financial markets care about identifying outliers more than most. Identifying outliers in daily profit and loss changes can be very rewarding.
Identifying flight legs that perform very differently from others in terms of delays or load factor can be fruitful. Identifying specific crews or aircraft that perform differently from other resources can also be useful. For example, Quick Insights found outliers in a time series of average flight delays.
A common convention, and the one used here, is to treat any data point that is 3 or more standard deviations away from the mean of the population in a dataset as an outlier. Outliers can be found at the raw data level or at an aggregated level of analysis. It is good to identify them early in the data analysis so that their impact can be understood and controlled, because the decision to include or exclude those data points will affect the results of any further analysis.
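The 3-standard-deviation rule above can be sketched in a few lines of Python. This is only a minimal illustration, not a production method: the function name `find_outliers` and the sample grain weights are made up for the example, and it assumes the population standard deviation is the one intended.

```python
import statistics


def find_outliers(values, threshold=3.0):
    """Return (index, value) pairs lying `threshold` or more
    population standard deviations away from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # All values are identical, so nothing can be an outlier.
        return []
    return [
        (i, v)
        for i, v in enumerate(values)
        if abs(v - mean) / std >= threshold
    ]


# Hypothetical grain weights in tonnes; the last entry was keyed in
# as kilograms, i.e. the unit was never converted.
weights = [48, 52, 50, 49, 51, 50, 47, 53, 50, 49,
           51, 50, 52, 48, 50, 49, 51, 50, 50, 50000]

print(find_outliers(weights))
```

One caveat with this rule: when the population standard deviation is used, no value can sit more than sqrt(n - 1) standard deviations from the mean, so a column with fewer than 10 values can never produce an outlier under a threshold of 3.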
How to identify?
The good news is that it does not take a herculean effort to identify outliers. Some businesses have clear use cases that depend on outliers. Twitter puts considerable research effort into identifying anomalies in user traffic. Twitter’s user traffic can be a function of seasonality, geography, and external events, so it is very important for a company such as Twitter to detect changing flows of traffic. A sudden surge in traffic is an outlier, or anomaly. Monitoring outliers in profit and loss data has also been a very active research area for a long time.
Think of scatter plots, box plots, and histograms, and the spread they show, when thinking about outliers. These are the common ways to find outliers. They have been around for a long time and are now part of every BI tool. These techniques became very popular in the 1980s, when Six Sigma was gaining wide acceptance in industry and control charts were being implemented in manufacturing. Using them helped identify defects in processes.
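A box plot usually marks as outliers the points that fall beyond its whiskers. A widely used convention, which is assumed here and is not something this article prescribes, places the whiskers 1.5 times the interquartile range (IQR) beyond the first and third quartiles. The sketch below applies that convention; the function name `iqr_outliers` and the delay figures are invented for illustration.

```python
import statistics


def iqr_outliers(values, whisker=1.5):
    """Return the values lying outside the box-plot fences
    Q1 - whisker * IQR and Q3 + whisker * IQR."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical average delays (minutes) for one flight leg by day.
delays = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13, 95]

print(iqr_outliers(delays))
```

Unlike the standard-deviation rule, the quartiles are barely moved by a single extreme value, so this check still works on short columns.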
Analyzing box plots and histograms of a dataset is part of data discovery. Outlier discovery is just one step in that data discovery, but an important one. The good part of identifying and studying outliers is that there can only be a limited number of them. The maximum number of outliers that a column can contain is given by Chebyshev’s Inequality, which states that no more than 1/k² of a column can be k or more standard deviations away from the mean. Since we define outliers as being 3 or more standard deviations from the mean, no more than 1/9 ≈ 11% of the column can be outlying values. Another good part of studying outliers and trying to find the cause behind them is that one gets to learn the system through the data and uncover a lot more. The bad part of identifying the cause behind an outlier is that it is a time-consuming effort if BI reporting is not available.
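The Chebyshev bound is easy to turn into a quick sanity check on how many outliers a column could possibly hold. The helper names below (`chebyshev_max_fraction`, `max_outlier_count`) are hypothetical and the row counts are arbitrary; treat this as a sketch of the arithmetic only.

```python
import math


def chebyshev_max_fraction(k):
    """Upper bound on the share of values that can lie k or more
    standard deviations from the mean: 1 / k**2, capped at 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    return min(1.0, 1.0 / (k * k))


def max_outlier_count(n_rows, k=3.0):
    """Largest whole number of rows that could be outliers
    in a column of n_rows values."""
    if k <= 0:
        raise ValueError("k must be positive")
    return min(n_rows, math.floor(n_rows / (k * k)))


print(round(chebyshev_max_fraction(3), 4))  # 0.1111, i.e. about 11%
print(max_outlier_count(1000))              # 111
```

The bound makes no assumption about how the data is distributed, which is why it is so loose. In most real columns the share of values beyond 3 standard deviations is far smaller than 11%.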
Conclusion
Outliers should be rare, occasional events in the life span of a dataset. If outliers turn up frequently, the data generation technique or the data supplier is error prone. I would argue that identifying an outlier is not as hard as figuring out the cause behind its occurrence. Outliers can also bring surprises, and that’s the fun part of working with them.
The reason behind an outlier might or might not be straightforward, so finding it can be time consuming. A good BI dashboard should help you get to the root of the occurrence and discover more about the outliers.
Data scientists at Nalashaa are passionate about finding insights in data and the root cause behind data behavior. We help clients build dashboards with analytics insights that provide a 360-degree view of their data.