Central Limit Theorem

Introduction

What is the Central Limit Theorem? Aspiring data scientists often flounder at this question in interviews. Many find the central limit theorem challenging to grasp, yet it is one of the simplest concepts in statistics. In this article we will take a step-by-step approach to understanding how the Central Limit Theorem (CLT) works, along with its significance and applications in statistics. Let's begin!

The central limit theorem is one of the fundamental theorems in statistics and probability. It provides the foundation for many of the concepts we study in statistics and probability, such as hypothesis testing and building confidence intervals. It is a powerful theorem that every data science and machine learning professional should understand.

The formal definition of the central limit theorem states:

For a population with mean (µ) and standard deviation (σ), if we take sufficiently large random samples from the population with replacement, then the distribution of the sample means (also known as the sampling distribution of the mean) will approximate a normal distribution. This holds regardless of the shape of the source population's distribution, whether it is normal, skewed, uniform, or any other shape, provided the sample size (n) is sufficiently large (typically n > 30). When the population itself is normally distributed, the theorem holds even for smaller sample sizes (n < 30).

Did the definition go over your head? No need to worry; grasping its full meaning on the first attempt can be difficult. In the upcoming sections of the article, I will walk you through the various aspects of the definition of the central limit theorem, and we will also discuss its importance in statistics.

Before discussing the Central Limit Theorem in detail, let us first understand the concepts of a population and a sample.

Population and Sample

In statistics, a population is the entire group of individuals, items, units, or observations about which some information is to be ascertained. For example, all the books in a library, the weights of all newborn babies in India, all the students in a university, or all the tech startups in Asia are each a population. The number of elements in a population is known as the population size and is denoted by N.

In practice, collecting data for every entity in a population requires considerable effort, time, and resources, and is often infeasible because populations tend to be large. In such situations, a small part of the population is selected, and this part is used to make statistical inferences about, and estimate characteristics of, the whole population. This small part is called a sample. Thus, a sample can be defined as follows:

“A sample is a part / fraction / subset of the population”

To make accurate inferences, the sample has to be representative. A common way to obtain a representative sample is simple random sampling, in which every member of the population has an equal chance of being selected. The procedure of drawing a sample from the population is called sampling, and the number of observations in a sample is called its sample size, denoted by n.
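To make the idea concrete, here is a minimal sketch in Python using NumPy. The population (newborn weights) and the sample size n = 50 are purely illustrative assumptions, not anything prescribed by the theory:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical population: weights (kg) of N = 10,000 newborn babies.
population = rng.normal(loc=3.2, scale=0.5, size=10_000)
N = population.size  # population size

# Draw a simple random sample of size n = 50: every member of the
# population has an equal chance of being selected.
n = 50
sample = rng.choice(population, size=n, replace=True)

print(f"Population size N = {N}, sample size n = {n}")
```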

Let’s take a look at a couple of examples of population and samples to get an intuitive understanding of the concept:

  • A syringe full of blood drawn from a patient's vein is a sample of all the blood in the patient's circulation at that moment.
  • 50 pairs of denim jeans selected at random from the whole lot by an apparel manufacturer to check the stitching quality is another example of drawing a sample from a population. The manufacturer inspects a sample of 50 pairs, rather than the whole population, to draw a conclusion about whether the entire lot is likely to have been stitched correctly.

Population Parameter vs. Sample Statistic

  • A parameter is a (known or unknown) numeric summary that describes a characteristic of the entire population, such as the population mean (μ) and the population standard deviation (σ).
  • A statistic is a known numeric summary of the sample which can be used to make inferences about the population. The sample mean (x̅) and the sample standard deviation (s) are two common sample statistics.
  • A statistic describes the sample, while a parameter describes the population from which the sample was drawn. In inferential statistics, we use sample statistics to estimate population parameters, as the sketch below illustrates.
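A minimal sketch of the distinction, again assuming NumPy; the exponential population and the sample size are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
population = rng.exponential(scale=2.0, size=100_000)  # a skewed population

# Parameters: numeric summaries of the entire population.
mu = population.mean()     # population mean (μ)
sigma = population.std()   # population standard deviation (σ)

# Statistics: numeric summaries of a sample, used to estimate the parameters.
sample = rng.choice(population, size=40, replace=True)
x_bar = sample.mean()      # sample mean (x̅)
s = sample.std(ddof=1)     # sample standard deviation (s), Bessel-corrected

print(f"parameter μ = {mu:.3f}  vs  statistic x̅ = {x_bar:.3f}")
print(f"parameter σ = {sigma:.3f}  vs  statistic s = {s:.3f}")
```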

Central Limit Theorem Explained – Step by Step

Now that we know the formal definition of the central limit theorem and have learned the concepts of population and sample, we are in a better position to discuss the theorem in more detail.

  • Let us first consider a population distribution X with mean (μ) and standard deviation (σ). The distribution need not be Gaussian; it can be normal, left-skewed, right-skewed, uniform, or any other shape.
  • Now, we take m random samples (with replacement) S1, S2, S3, …, Sm, each of size n, from the population distribution. Then we calculate the arithmetic mean of each sample, giving x̅1, x̅2, x̅3, …, x̅m.
  • Next, we draw the distribution of the sample means (x̅1, x̅2, x̅3, …, x̅m) obtained in the previous step. The resulting distribution is called the "sampling distribution of the mean".
  • Now, according to the central limit theorem, as n increases, the distribution of the sample means approaches a normal distribution with mean approximately equal to the population mean (μ) and standard deviation equal to σ/√n. Equivalently, the mean of the sampling distribution of means equals the population mean, so the sample mean is an unbiased estimator of the population mean. The simulation sketch after this list walks through these steps in code.
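The steps above translate directly into a short simulation. This is only a sketch; the right-skewed exponential population and the values m = 5,000 and n = 30 are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Step 1: a non-normal population distribution X (right-skewed exponential).
population = rng.exponential(scale=2.0, size=1_000_000)
mu, sigma = population.mean(), population.std()

# Step 2: take m random samples (with replacement), each of size n,
# and compute the arithmetic mean of each sample.
m, n = 5_000, 30
samples = rng.choice(population, size=(m, n), replace=True)
sample_means = samples.mean(axis=1)

# Step 3: sample_means now holds the sampling distribution of the mean.
# Step 4: per the CLT, it should be roughly normal with mean ≈ μ
# and standard deviation ≈ σ/√n.
print(f"μ    = {mu:.3f}   mean of sample means = {sample_means.mean():.3f}")
print(f"σ/√n = {sigma / np.sqrt(n):.3f}   std of sample means  = {sample_means.std():.3f}")
```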

A natural question may have arisen in your mind: how large does the sample size have to be for the sampling distribution to become approximately normal?

The required sample size depends on the shape of the underlying population distribution. The more the population distribution differs from a Gaussian, the larger the sample size required.

Thus, for large sample sizes, the sampling distribution of the mean will approximate a normal distribution even if the population distribution is not normal. A sample size of 30 (i.e., n = 30) is conventionally considered large enough to see the effect of the Central Limit Theorem. If the population is normal, the theorem holds even for samples smaller than 30.
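One way to see this is to measure how the skewness of the sampling distribution shrinks toward zero (the skewness of a normal distribution) as n grows. A sketch assuming NumPy and SciPy are available, again with an illustrative exponential population:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(seed=2)
population = rng.exponential(scale=2.0, size=1_000_000)  # strongly right-skewed

# Skewness of the sampling distribution of the mean for increasing n;
# values near 0 indicate an approximately symmetric, normal-like shape.
for n in (2, 5, 30, 100):
    means = rng.choice(population, size=(5_000, n), replace=True).mean(axis=1)
    print(f"n = {n:>3}: skewness of sample means = {skew(means):.3f}")
```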

To learn about normal distributions, I recommend reading my article, Simplest Explanation of Normal Distribution.

Normality Characteristics of Central Limit Theorem

Normal distributions are defined by two parameters: the mean and the standard deviation. Let's now focus on the normality characteristics of the central limit theorem.

1st Characteristic:

In the central limit theorem, we are talking about three kinds of means:

  1. First is the original population mean (μ), which is a fixed number.
  2. Second is the sample mean (x̅i), which varies with each sample.
  3. Third is the new "mean of sample means", which is the mean of all the sample means (x̅1, x̅2, x̅3, …, x̅m) obtained during the deployment of the central limit theorem; symbolically this mean can be denoted as μx̅.

According to the central limit theorem, the mean of the sampling distribution of means (μx̅) will be approximately equal to the population mean (μ). In other words, the sample mean is an unbiased estimator of the population mean, i.e., μx̅ = μ. This is the first normality characteristic of the central limit theorem.
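A quick numerical check of this characteristic, sketched with an illustrative uniform population (chosen so the true mean is known to be 5):

```python
import numpy as np

rng = np.random.default_rng(seed=3)
population = rng.uniform(low=0.0, high=10.0, size=500_000)  # non-normal population
mu = population.mean()

# Mean of the sampling distribution of means (μx̅),
# built from m = 10,000 samples of size n = 30.
means = rng.choice(population, size=(10_000, 30), replace=True).mean(axis=1)

print(f"μ   = {mu:.4f}")
print(f"μx̅ = {means.mean():.4f}  # approximately equal to μ")
```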

2nd Characteristic:

The standard deviation of the sampling distribution of means (symbolically denoted σx̅, and also known as the standard error of the mean) is equal to σ/√n, where σ is the standard deviation of the population distribution and n is the sample size.

As we increase the sample size (n), the standard deviation of the sampling distribution of means (σx̅ = σ/√n) becomes smaller, since the square root of the sample size appears in the denominator. In other words, the sampling distribution clusters more tightly around the mean as the sample size increases. This is the second normality characteristic of the central limit theorem.

So sampling leaves the mean of the parent distribution unchanged while tightening the distribution around it, and the larger the sample size, the greater this effect.
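The same kind of sketch verifies this second characteristic: the empirical standard deviation of the sample means tracks σ/√n and shrinks as n grows (the population is again an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(seed=4)
population = rng.exponential(scale=2.0, size=500_000)
sigma = population.std()

# The standard deviation of the sampling distribution of means (σx̅,
# the standard error) should match σ/√n and decrease as n increases.
for n in (10, 40, 160):
    means = rng.choice(population, size=(10_000, n), replace=True).mean(axis=1)
    print(f"n = {n:>3}: empirical σx̅ = {means.std():.4f}   σ/√n = {sigma / np.sqrt(n):.4f}")
```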

Applications of Central Limit Theorem

The Central Limit Theorem has a variety of day-to-day applications, ranging from healthcare to industrial processes.

The sampling distribution of the mean of almost any distribution becomes approximately normal. This makes the central limit theorem a key concept in probability theory, because it implies that probabilistic and statistical methods that work for normal distributions are applicable to many problems involving non-normal distributions. Even if the population distribution is unknown or not normal, the CLT lets us treat the sampling distribution of the mean as approximately normal. This is what makes methods such as constructing confidence intervals possible.
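For instance, because the CLT makes x̅ approximately normal, an approximate 95% confidence interval for the population mean can be built as x̅ ± 1.96·s/√n. A sketch with made-up measurement data:

```python
import numpy as np

rng = np.random.default_rng(seed=5)

# Hypothetical data: 50 measurements from an unknown, non-normal process.
sample = rng.exponential(scale=2.0, size=50)

n = sample.size
x_bar = sample.mean()
s = sample.std(ddof=1)

# By the CLT, x̅ is approximately normal, so an approximate 95% confidence
# interval for the population mean is x̅ ± 1.96 · s/√n.
margin = 1.96 * s / np.sqrt(n)
print(f"95% CI for μ: ({x_bar - margin:.3f}, {x_bar + margin:.3f})")
```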

Final Notes

In this article we explored the central limit theorem along with its characteristics and applications. We learned how the theorem describes the shape of the distribution of sample means. It is critically important for making inferences in statistics.

If you have any queries or want to share feedback, let me know in the comments section below.
