
Outlier Detection with Chi Square

by David Ochoa Corrales (@mxsundevice), October 31st, 2022

Too Long; Didn't Read

A simple method for outlier detection in Python: take a quantile of a Weibull distribution of the chi-square test. The chi-square test can be used to test whether the variance of a population is equal to a specified value. A moving chi-square series (χ2) gives us a constantly updated chi-square sequence and thereby easily groupable values.


This is a simple method for outlier detection. The procedure is basically a quantile of a Weibull distribution of the chi-square test, implemented in Python.


We may have some small data series such as stock levels, rolled product thickness, et cetera. Sometimes the data is collected manually, and we need to detect outliers as a first filter to catch human errors in the collected data and correct them before analysis. In other cases, the data needs to be broken down by season, process, etc.


I understand that running a clustering method on each variable, such as a Gaussian mixture for outlier detection, is too expensive, and chi-square is computationally cheaper.

The first question is: why each part of this process? The proposed approach is simply to find observation errors, and for that the problem is built on a calculation of measurement accuracy. The procedure is composed of:


  1. Chi-square test: it can be used to test whether the variance of a population is equal to a specified value.
  2. Weibull distribution: it can also model skewed data.
  3. Quantile function: the inverse of the cumulative distribution function; for a probability p it returns the value below which that fraction of the distribution lies.
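
For reference, the standard textbook forms of these three ingredients can be written down as follows. This is only a reminder of the general definitions, not necessarily the exact expressions used in the author's code; the symbols n, s, σ0, k and λ are notation introduced here.

```latex
% Chi-square test for a variance: n observations, sample variance s^2,
% hypothesized variance \sigma_0^2 (normal data assumed under H_0).
\chi^2 = \frac{(n-1)\,s^2}{\sigma_0^2} \sim \chi^2_{n-1}

% Weibull distribution with shape k > 0 and scale \lambda > 0.
F(x) = 1 - \exp\!\left[-\left(\frac{x}{\lambda}\right)^{k}\right], \qquad x \ge 0

% Quantile function: the inverse of F, for 0 \le p < 1.
Q(p) = \lambda \left(-\ln(1-p)\right)^{1/k}
```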

We cannot establish outlier fences by calculating the interquartile range, because the method has to be usable across products and seasons of a diverse nature. Instead, we can use a simple chi-square test and look for observation errors. We can proceed as a short recipe.


The Recipe:


  1. The first step is to display the density of the variable in question.

    I want to know the nature of the variable.
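
A minimal sketch of this first step, assuming only the Python standard library and a text histogram instead of a plot. The `thickness` readings and the function names `histogram_density` and `print_density` are hypothetical, made up for illustration.

```python
def histogram_density(values, bins=10):
    """Return (left_edge, density) pairs; density times bin width sums to 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant data
    counts = [0] * bins
    for v in values:
        # Clamp the maximum value into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    n = len(values)
    return [(lo + i * width, c / (n * width)) for i, c in enumerate(counts)]


def print_density(values, bins=10, bar_width=40):
    """Print a rough text rendering of the density, one bar per bin."""
    dens = histogram_density(values, bins)
    peak = max(d for _, d in dens)
    for left, d in dens:
        print(f"{left:8.2f} | {'#' * round(bar_width * d / peak)}")


# Hypothetical manually collected thickness readings (illustrative only).
thickness = [2.01, 1.98, 2.03, 2.00, 1.97, 2.02, 2.60, 1.99, 2.04, 2.01, 0.20, 2.00]
print_density(thickness, bins=8)
```

Most of the mass lands in one bin, with isolated bars far away from it; that shape is what the later steps try to separate into data and noise.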


2. Measurement accuracy calculation. The classical procedure, as presented for the Kalman filter, subtracts the median from the value; I divide the value by the median instead, to get a normalized, non-dimensional measure:

This normalized measure is used as the framework for the exercise; we can observe a little less skewness.
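
A short sketch of the normalization, assuming the standard library `statistics` module. The function name `normalized_accuracy` and the sample readings are hypothetical.

```python
import statistics


def normalized_accuracy(values):
    """Divide each value by the series median to get a unitless measure."""
    med = statistics.median(values)
    if med == 0:
        raise ValueError("median is zero; the ratio to the median is undefined")
    return [v / med for v in values]


# Hypothetical manually collected thickness readings (illustrative only).
thickness = [2.01, 1.98, 2.03, 2.00, 1.97, 2.02, 2.60, 1.99, 2.04, 2.01, 0.20, 2.00]
accuracy = normalized_accuracy(thickness)
print([round(a, 3) for a in accuracy])
```

After this step a typical observation sits near 1.0 whatever the original unit was, which is what lets the same recipe be reused across different variables.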


  3. Moving chi-square series (χ2). This allows us to create a constantly updated chi-square sequence and thereby obtain easily groupable values.

    With χ2 we can now see a “cheap” version of clustering.
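
The article does not spell out the formula behind the moving series, so the following is only one plausible reading: a rolling sum of the usual chi-square terms (observed − expected)² / expected over a trailing window, with the expected value taken as 1.0, the median of the normalized measure. The function name, the `window` parameter and the `expected` default are all assumptions.

```python
from collections import deque


def moving_chi_square(accuracy, window=3, expected=1.0):
    """Rolling sum of (observed - expected)^2 / expected over a trailing window.

    The first window - 1 entries use the shorter window available so far,
    so the output has the same length as the input.
    """
    terms = deque(maxlen=window)  # old terms drop out automatically
    series = []
    for a in accuracy:
        terms.append((a - expected) ** 2 / expected)
        series.append(sum(terms))
    return series


# Hypothetical normalized accuracy values (illustrative only).
accuracy = [1.00, 0.99, 1.01, 1.30, 1.00, 1.02, 0.10, 1.00]
chi2 = moving_chi_square(accuracy, window=1)
print([round(c, 4) for c in chi2])
```

With `window=1` each entry is the term of a single observation; a larger window smooths the series, but a single bad reading then raises every entry whose window contains it.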


4. Quantile (Q). Each business variable has its own range; in this case I select the 95th percentile (the 0.95 quantile) because the variable tends to have a flat range.
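
How the Weibull distribution is fitted is not shown in the article. As a sketch under that caveat, the code below fits the shape and scale by least squares on the Weibull probability plot (median-rank regression) and then evaluates the closed-form quantile. The fitting technique, the function names and the sample χ2 values are assumptions chosen to stay within the standard library.

```python
import math


def fit_weibull(values):
    """Estimate (shape k, scale lam) by least squares on the Weibull plot."""
    xs = sorted(v for v in values if v > 0)  # Weibull support is positive
    n = len(xs)
    if n < 2 or xs[0] == xs[-1]:
        raise ValueError("need at least two distinct positive values")
    # Median-rank plotting positions F_i = (i - 0.3) / (n + 0.4), then the
    # linearization ln(-ln(1 - F)) = k * ln(x) - k * ln(lam).
    pts = [(math.log(x), math.log(-math.log(1 - (i - 0.3) / (n + 0.4))))
           for i, x in enumerate(xs, start=1)]
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mx) ** 2 for p in pts)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts)
    k = sxy / sxx                 # slope of the fitted line
    lam = math.exp(mx - my / k)   # from intercept = -k * ln(lam)
    return k, lam


def weibull_quantile(p, k, lam):
    """Inverse Weibull CDF: the value below which a fraction p lies."""
    return lam * (-math.log(1 - p)) ** (1 / k)


# Hypothetical chi-square values from the previous step (illustrative only).
chi2 = [0.0001, 0.0004, 0.0002, 0.0009, 0.0003, 0.0885, 0.0001, 0.0006, 0.8105, 0.0002]
k, lam = fit_weibull(chi2)
q = weibull_quantile(0.95, k, lam)
print(f"shape={k:.3f} scale={lam:.5f} Q(0.95)={q:.5f}")
```

Zero χ2 values are dropped before fitting because the Weibull distribution is only defined for positive values; whether that is acceptable depends on how many exact zeros the series contains.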


5. Classify. With the value of Q we can classify the χ2 values:
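
A minimal sketch of the classification, assuming a simple rule where anything above Q is treated as noise. The threshold value, the labels and the function names are hypothetical.

```python
def classify(chi2, q):
    """Label each chi-square value as 'outlier' if it exceeds q, else 'data'."""
    return ["outlier" if c > q else "data" for c in chi2]


def split_by_label(values, labels):
    """Separate the original observations into (data, noise) lists."""
    data = [v for v, lab in zip(values, labels) if lab == "data"]
    noise = [v for v, lab in zip(values, labels) if lab == "outlier"]
    return data, noise


# Hypothetical inputs (illustrative only): raw readings, their per-observation
# chi-square values, and a threshold q taken from the quantile step.
readings = [2.01, 1.98, 2.60, 2.00, 0.20, 2.02]
chi2 = [0.0000, 0.0002, 0.0885, 0.0000, 0.8105, 0.0001]
q = 0.05

labels = classify(chi2, q)
data, noise = split_by_label(readings, labels)
print("data: ", data)
print("noise:", noise)
```

The comparison is strict, so a value exactly equal to Q is kept as data; flipping that is a one-character change if the business rule needs it.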

6. Voilà! With this we split the noise from the data of a non-linear variable:

Here we implemented a very simple outlier detection process. It has a bit more potential; in a few days I will write another post on this subject.

source code:

Sources and further reading:

Some rights reserved


Questions and comments are always welcome


Also Published Here