Table of contents
Sampling Techniques: Introduction
Sampling
Different types of Sampling techniques
Choosing Between Probability and Non-Probability Samples
Probability Sampling
Non-probability sampling
Sampling errors and biases
Introduction:
Take COVID-19 vaccine clinical trials as an example. It is very difficult to conduct the trials on the entire population because of the time, money, and resources involved. In research methodology, sampling is a method that lets researchers infer information about a population from results on a subset of that population, without having to investigate every individual.
Similarly, consider a telecom company planning to build a machine learning model to predict which customers will churn from its network. One option is to collect every customer’s information and build the prediction model on all of it, but this requires high computational power and resources. A better approach is often to take a sample (a subset of customers) that represents the population (all customers) and build the machine learning model on that sample. This saves money and effort.
Sampling:
Sampling is the process of selecting a subset (a statistical sample) of individuals from a statistical population in order to study them and estimate characteristics of the population as a whole.
The population includes all members of a specified group, or all possible outcomes or measurements that are of interest. The exact population will depend on the scope of the study.
The sample consists of observations drawn from the population, so it is a subset of the population. The sample is the group of elements that actually took part in the study.
The sampling frame is the list or other information that identifies and locates the units of the population (the universe) from which the sample is drawn.
A good sample should satisfy the following conditions:
Representativeness: The sample should be as representative as possible of the population under study.
Accuracy: Accuracy is defined as the degree to which bias is absent from the sample. An accurate (unbiased) sample is one that closely represents the population.
Size: A good sample must be large enough for its results to be reliable.
Different types of Sampling techniques:
There are several different sampling techniques available, and they can be subdivided into two groups-
1. Probability sampling involves random selection, allowing you to make statistical inferences about the whole group.
There are four types of probability sampling techniques:
Simple random sampling
Systematic Sampling
Stratified random sampling
Cluster sampling
2. Non-probability sampling involves non-random selection based on convenience or other criteria, allowing you to easily collect initial data.
There are four types of non-probability sampling techniques:
Convenience sampling
Quota Sampling
Judgmental or purposive sampling
Snowball sampling
Choosing Between Probability and Non-Probability Samples
The choice between using a probability or a non-probability approach to sampling depends on a variety of factors:
- Objectives and scope of the study
- Method of data collection
- Precision of the results
- Availability of a sampling frame and resources required to maintain the frame
- Availability of extra information about the members of the population
Probability Sampling
Probability sampling is normally preferred when conducting major studies, especially when a population frame is available, ensuring that we can select and contact each unit in the population. It allows us to quantify the standard error of estimates, form confidence intervals, and formally test hypotheses.
The main disadvantages are the costs involved in the survey and the bias that can still arise in selecting the sample, for example when the sampling frame is incomplete.
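As a rough illustration of what "quantifying the standard error" means, the sketch below draws simple random samples of increasing size from a synthetic population and reports the sample mean, its standard error, and an approximate 95% confidence interval. The population values, the seed, and the helper name `mean_with_ci` are assumptions made for this example; the interval uses the normal approximation and ignores the finite population correction.

```python
import math
import random
import statistics

# Hypothetical population: 10,000 synthetic measurements (values are arbitrary).
rng = random.Random(3)
population = [rng.gauss(50, 10) for _ in range(10_000)]

def mean_with_ci(sample, z=1.96):
    """Return the sample mean, its standard error, and an approximate 95% CI."""
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, se, (mean - z * se, mean + z * se)

# Larger samples should give smaller standard errors and narrower intervals.
for n in (25, 100, 400):
    mean, se, (low, high) = mean_with_ci(rng.sample(population, n))
    print(f"n={n:4d}  mean={mean:6.2f}  se={se:5.2f}  95% CI=({low:.2f}, {high:.2f})")
```

This kind of calculation is only justified when the selection probabilities are known, which is what probability sampling provides.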
Simple random sampling
In Simple Random Sampling, each observation in the population is given an equal probability of selection, and every possible sample of a given size has the same probability of being selected. One possible method of selecting a simple random sample is to number each unit on the sampling frame sequentially and make the selections by generating numbers from a random number generator.
Simple random sampling can involve the units being selected either with or without replacement. Sampling with replacement allows a unit to be selected multiple times, whilst sampling without replacement allows a unit to be selected only once. Sampling without replacement is the most commonly used method.
Ex: Suppose a sample of 20 needs to be collected from a population of 100. Assign a unique number to each population member and randomly select 20 members with a random number generator.
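The example above can be sketched with Python's standard `random` module. The population of numbered members and the seed are assumptions chosen for illustration; `sample` draws without replacement and `choices` draws with replacement.

```python
import random

# Hypothetical sampling frame: population members numbered 1..100.
population = list(range(1, 101))

# A seeded generator keeps the illustration reproducible; the seed is arbitrary.
rng = random.Random(42)

# Without replacement: each unit can appear at most once in the sample.
sample_without_replacement = rng.sample(population, k=20)

# With replacement: the same unit may be drawn more than once.
sample_with_replacement = rng.choices(population, k=20)

print(sorted(sample_without_replacement))
print(sorted(sample_with_replacement))
```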
Applications
- Train and test split in machine learning problems
- Lottery methods
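A train and test split is a simple random partition of the rows of a dataset. The sketch below is a minimal standard-library version; the helper name `random_split`, the 80/20 ratio, and the stand-in rows are assumptions for illustration, and in practice a library routine would usually be used instead.

```python
import random

def random_split(rows, test_fraction, rng):
    """Randomly partition `rows` into (train, test) without replacement."""
    shuffled = list(rows)          # copy so the caller's data is left untouched
    rng.shuffle(shuffled)          # random order gives every row the same chance
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))            # hypothetical dataset of 100 row ids
train_rows, test_rows = random_split(rows, 0.2, random.Random(0))
print(len(train_rows), len(test_rows))  # 80 20
```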
Advantages
Minimum sampling bias as the samples are collected randomly.
Selection of samples is simple as random generators are used.
The results can be generalized due to representativeness.
Disadvantages
A complete sampling frame is required, and making every member of the population available for selection can be costly and time consuming.
Larger sample sizes may be needed.
Systematic sampling
In systematic random sampling, the researcher first randomly picks the first item from the population. Then, the researcher will select each nth item from the list. The procedure involved in systematic random sampling is very easy and can be done manually. The results are representative of the population unless certain characteristics of the population are repeated for every nth individual.
Steps in selecting a systematic random sample:
Calculate the sampling interval (the number of observations in the population divided by the number of observations needed for the sample).
Select a random start between 1 and the sampling interval.
Repeatedly add the sampling interval to the start to select the subsequent units.
Ex: Suppose a sample of 20 needs to be collected from a population of 100. The sampling interval is 100/20 = 5, which divides the population into 20 groups of 5 members each. Select a random member from the first group, then take every 5th member after it.
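The steps above can be sketched as follows. The function name `systematic_sample` and the numbered population are assumptions for illustration; the sketch also assumes the sample size is no larger than the frame and uses integer division for the interval.

```python
import random

def systematic_sample(frame, sample_size, rng):
    """Pick a random start in the first interval, then every k-th unit after it."""
    interval = len(frame) // sample_size   # sampling interval k
    start = rng.randrange(interval)        # 0-based random start within the first interval
    # Slice with step k; truncate in case the frame size is not a multiple of k.
    return frame[start::interval][:sample_size]

population = list(range(1, 101))           # hypothetical frame numbered 1..100
rng = random.Random(7)                     # arbitrary seed for reproducibility
sample = systematic_sample(population, 20, rng)
print(sample)
```

Because the whole sample is determined by the random start, there are only 5 possible samples here, which is why periodic patterns in the list can bias the result.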
Applications
Quality Control: Systematic sampling is extensively used in manufacturing industries for statistical quality control of products. Here a sample is obtained by taking an item from the current production stream at regular intervals.
Auditing: When auditing savings accounts, systematic sampling is the most natural way to sample a list of accounts to check compliance with accounting procedures.
Advantages
Cost and time efficient.
Spreads the sample more evenly over the population.
Disadvantages
The complete population should be known.
The sample can be biased if there are periodic patterns within the dataset.
Stratified random sampling
In stratified random sampling, the entire population is divided into multiple non-overlapping, homogeneous groups (strata), and the final members are then chosen randomly from each stratum. Each member of the population should belong to exactly one stratum, so that within a stratum every member has an equal chance of being selected by simple random sampling.
There are three types of stratified random sampling-
1. Proportionate Stratified Random Sampling
In this technique, the sample size of each stratum is proportionate to the stratum's share of the entire population. For example, if you have 3 strata with population sizes of 10, 20 and 30 respectively and the sampling fraction is 0.5, then the random samples drawn from the strata are of sizes 5, 10 and 15 respectively.
2. Disproportionate Stratified Random Sampling
The only difference between proportionate and disproportionate stratified random sampling is their sampling fractions. With disproportionate sampling, the different strata have different sampling fractions.
3. Optimal stratified sampling
The sample size drawn from each stratum is proportional to the standard deviation of the variable being studied within that stratum, so that more variable strata are sampled more heavily.
Ex: A company with 300k employees wants to run an employee satisfaction survey and plans to collect a sample of 1000 employees. The sample should contain employees of all levels and from all locations, so the company creates strata (groups) by level and location and selects part of the sample from each stratum.
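The proportionate allocation described above (strata of sizes 10, 20 and 30 with a sampling fraction of 0.5) can be sketched as below. The function name, the stratum labels, and the member ids are hypothetical; the sketch simply applies a simple random sample of the same fraction inside every stratum.

```python
import random

def proportionate_stratified_sample(strata, sampling_fraction, rng):
    """Draw a simple random sample of the same fraction from every stratum."""
    sample = {}
    for name, members in strata.items():
        n_h = round(len(members) * sampling_fraction)   # sample size for this stratum
        sample[name] = rng.sample(members, n_h)         # without replacement
    return sample

# Hypothetical strata of sizes 10, 20 and 30.
strata = {
    "A": [f"A{i}" for i in range(10)],
    "B": [f"B{i}" for i in range(20)],
    "C": [f"C{i}" for i in range(30)],
}
rng = random.Random(0)                                  # arbitrary seed
sample = proportionate_stratified_sample(strata, 0.5, rng)
sizes = {name: len(units) for name, units in sample.items()}
print(sizes)  # {'A': 5, 'B': 10, 'C': 15}
```

A disproportionate design would replace the single `sampling_fraction` with a separate fraction per stratum.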
Advantages
Greater level of representation from all the groups.
If there is homogeneity within strata and heterogeneity between strata, the estimates can be more precise than those from a simple random sample of the same size.
Disadvantages
Requires the knowledge of strata membership.
Might take longer and be more expensive.
Complex methodology.
Cluster sampling
Cluster sampling divides the population into multiple clusters for research. Researchers then select random groups with a simple random or systematic random sampling technique for data collection and data analysis.
Steps involved in cluster sampling:
Create the clusters from the population data.
Treat the list of clusters as the sampling frame.
Number each cluster.
Select clusters at random.
After selecting the clusters, either use the complete clusters for the study or apply another sampling method to pick the sample elements from within the clusters.
Ex: A researcher wants to study the academic performance of engineering students at a particular university. The researcher can divide the entire population by engineering college (the colleges are the clusters) and randomly pick some clusters for the study.
Types of cluster sampling:
One-stage cluster: In the above example, including all the students from the randomly selected engineering colleges is one-stage cluster sampling.
Two-stage cluster: In the same example, picking students from each selected cluster by random or systematic sampling is two-stage cluster sampling.
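Both variants can be sketched with one helper. The function name `cluster_sample`, the college and student ids, and the cluster counts are assumptions for illustration; the second stage here uses simple random sampling within each chosen cluster.

```python
import random

def cluster_sample(clusters, n_clusters, rng, per_cluster=None):
    """One-stage if per_cluster is None, otherwise two-stage cluster sampling."""
    chosen = rng.sample(sorted(clusters), n_clusters)    # randomly pick clusters
    sample = {}
    for name in chosen:
        members = clusters[name]
        if per_cluster is None:
            sample[name] = list(members)                  # one-stage: keep every member
        else:
            k = min(per_cluster, len(members))
            sample[name] = rng.sample(members, k)         # two-stage: sample within cluster
    return sample

# Hypothetical data: 10 colleges with 50 students each.
colleges = {f"college_{c}": [f"c{c}_s{i}" for i in range(50)] for c in range(10)}
rng = random.Random(1)                                    # arbitrary seed
one_stage = cluster_sample(colleges, 3, rng)
two_stage = cluster_sample(colleges, 3, rng, per_cluster=10)
print({name: len(units) for name, units in one_stage.items()})
print({name: len(units) for name, units in two_stage.items()})
```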
Advantages
Saves time and money.
It is very easy to use from the practical standpoint
Larger sample sizes can be used
Disadvantages
High sampling error
May fail to reflect the diversity in the sampling frame
Non-probability sampling
Non-probability samples are preferred when a high degree of accuracy in the results is not critical. They are inexpensive and easy to run, and no sampling frame is required. If a non-probability sample is carried out carefully, the bias in the results can be reduced.
The main disadvantage of non-probability sampling is that it is dangerous to make inferences about the whole population from it.
Convenience sampling
Convenience sampling is the easiest method of sampling: participants are selected based on their availability and willingness to take part in the survey. The results are prone to significant bias, as the sample may not be representative of the population.
Applications
Surveys conducted on social networking sites and in offices
Examples: Polls conducted on Facebook or YouTube. Only the people who are interested in the poll take part, so the results are prone to significant bias and may not be accurate.
Advantages
It is easy to get the sample
Low cost and participants are readily available
Disadvantages
Can’t generalize the results
Possibility of under or over representation of the population
Significant bias
Quota sampling
This method is mainly used by market researchers. The researchers divide the survey population into mutually exclusive subgroups. These subgroups are selected with respect to certain known features, traits, or interests. Samples from each subgroup are selected by the researcher.
Quota sampling can be divided into two groups-
Controlled quota sampling introduces certain restrictions in order to limit the researcher’s choice of samples.
Uncontrolled quota sampling resembles convenience sampling in that the researcher is free to choose the sample group members.
Steps involved in Quota Sampling
Divide the population into exclusive sub groups.
Identify the proportion of sub groups in the population.
Select the subjects for each subgroup.
Ensure the sample is the representative of population.
Ex: A painting company wants to do research on one of its products. The researcher uses quota sampling to pick a set number of painters, builders, agents and retail paint shop owners.
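The quota-filling step can be sketched as below. The function name, the quota numbers, and the stream of respondents are hypothetical. Note that respondents are accepted in the order they are encountered rather than at random, which is exactly why no sampling error can be calculated for this method.

```python
def quota_sample(respondents, quotas):
    """Accept respondents in arrival order until every subgroup quota is filled."""
    remaining = dict(quotas)                   # quota still open per subgroup
    sample = []
    for person, group in respondents:
        if remaining.get(group, 0) > 0:
            sample.append((person, group))
            remaining[group] -= 1
        if not any(remaining.values()):        # stop once all quotas are met
            break
    return sample

# Hypothetical stream of (respondent id, subgroup) pairs in the order they are met.
groups = ["painter", "builder", "agent", "shop_owner"]
respondents = [(i, groups[i % 4]) for i in range(40)]
quotas = {"painter": 4, "builder": 3, "agent": 2, "shop_owner": 1}

sample = quota_sample(respondents, quotas)
counts = {g: sum(1 for _, grp in sample if grp == g) for g in groups}
print(counts)  # {'painter': 4, 'builder': 3, 'agent': 2, 'shop_owner': 1}
```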
Advantages
Cost effective.
Doesn’t depend on sampling frames.
Allows the researchers to sample a subgroup that is of great interest to the study.
Disadvantages
Some groups may be overrepresented in the sample.
The sampling error cannot be calculated.
There is great potential for researcher bias, and the quality of the work may suffer from researcher incompetence or lack of experience.
Judgement (or Purposive) Sampling
In Judgement (or Purposive) Sampling, a researcher relies on his or her judgment when choosing members of the population to participate in the study. Researchers often believe that they can obtain a representative sample by using sound judgment, which will result in saving time and money.
As the researcher’s knowledge is instrumental in creating the sample, the results can be accurate when that judgment is sound, although no margin of error can be formally calculated.
Ex: A broadcasting company wants to research one of its TV shows. The researcher has an idea of the target audience and can choose the members of the population to participate in the study.
Advantages
Cost- and time-effective sampling method.
Allows researchers to approach their target market directly.
Almost real-time results.
Disadvantages
Vulnerability to errors in judgment by researcher
Low level of reliability and high levels of bias
Inability to generalize research findings
Snowball sampling
This method is commonly used in social sciences when investigating hard-to-reach groups. Existing subjects are asked to nominate further subjects known to them, so the sample increases in size like a rolling snowball. For example, when surveying risk behaviors amongst intravenous drug users, participants may be asked to nominate other users to be interviewed.
This sampling method involves primary data sources nominating other potential primary data sources to be used in the research. So the snowball sampling method is based on referrals from initial subjects to generate additional subjects. Therefore, when applying this sampling method members of the sample group are recruited via chain referral.
There are three patterns of snowball sampling:
Linear snowball sampling: Recruit one subject, who provides only one referral.
Exponential non-discriminative snowball sampling: Recruit one subject, who provides multiple referrals, all of whom are recruited.
Exponential discriminative snowball sampling: Recruit one subject, who provides multiple referrals, but only one subject is picked from those referrals.
Ex: Individuals with rare diseases. If a drug company is interested in doing research on individuals with a rare disease, it may be difficult to find them. The company can find a few individuals to participate in the study and ask them to refer other individuals from among their contacts.
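Chain referral can be sketched as a breadth-first walk over a referral network. The network, the names, and the function `snowball_sample` are hypothetical; limiting each subject to one referral gives the linear pattern, and allowing all referrals gives the exponential non-discriminative pattern.

```python
from collections import deque

def snowball_sample(referrals, seeds, max_size, max_referrals=None):
    """Recruit by chain referral, breadth first, until max_size subjects are reached."""
    sample = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(sample) < max_size:
        person = queue.popleft()
        sample.append(person)
        nominated = referrals.get(person, [])
        if max_referrals is not None:
            nominated = nominated[:max_referrals]   # cap referrals per subject
        for contact in nominated:
            if contact not in seen:                 # never recruit the same person twice
                seen.add(contact)
                queue.append(contact)
    return sample

# Hypothetical referral network: who each subject would nominate.
referrals = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan", "eve"],
    "dan": ["fay"],
    "eve": [],
}
linear = snowball_sample(referrals, ["ann"], max_size=4, max_referrals=1)
exponential = snowball_sample(referrals, ["ann"], max_size=4)
print(linear)       # ['ann', 'bob', 'dan', 'fay']
print(exponential)  # ['ann', 'bob', 'cat', 'dan']
```

The resulting sample depends entirely on the seed subjects and their contacts, which illustrates why such samples are hard to generalize from.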
Advantages
Researchers can reach rare subjects in a particular population
Low-cost and easy to implement
It doesn’t require a recruitment team to recruit the additional subjects
Disadvantages
The sample may not be representative of the population
Sampling bias may occur
Because the sample is likely to be biased, it can be hard to draw conclusions about the larger population with any confidence.
Sampling errors and biases
Sampling errors and biases are induced by the sample design. They include:
- Selection bias: When the true selection probabilities differ from those assumed in calculating the results.
- Random sampling error: Random variation in the results due to the elements in the sample being selected at random.
- Non-sampling error
Non-sampling errors are other errors which can impact final survey estimates, caused by problems in data collection, processing, or sample design. Such errors may include:
- Over-coverage: inclusion of data from outside of the population
- Under-coverage: the sampling frame does not include some elements of the population
- Measurement error: e.g. when respondents misunderstand a question, or find it difficult to answer
- Processing error: mistakes in data coding
- Non-response or Participation bias: failure to obtain complete data from all selected individuals
Conclusion:
Reducing sampling error is the major goal of any selection technique.
A sample should be big enough to answer the research question, but not so big that the process of sampling becomes uneconomical.
In general, the larger the sample, the smaller the sampling error and the more precise the estimates.
Decide the appropriate sampling method based on the study or use case.