1.2 Data, Sampling, and Variation in Data and Sampling
Recall from Section 1.1 that a variable is a characteristic/attribute of interest for each person or object (an observational unit or a case). The value of a variable can vary from one observational unit to another.
DATA
Data refer to the actual values of the variable under study that have been observed, measured, counted, etc. They may be numbers or they may be categories (words, phrases, numbers, or a combination describing the type of something). A datum is a single value.
Data may come from a population or from a sample. Most data fall into one of two categories: qualitative or quantitative. Quantitative variables can be further classified into two types: discrete or continuous.
Types of Variables
SAMPLING METHODS
Gathering information about an entire population often costs too much or is virtually impossible. Instead, we use a sample of the population. A sample should have the same characteristics as the population it is representing.
Random sampling is the process of randomly selecting individuals from a population. A random sample is any subset of a population chosen in such a way that each member of the population has a non-zero chance (not necessarily equal) of being selected for the sample. (Note: the textbook says each member initially has an equal chance of being selected for the sample, which is not entirely true; an equal chance is required only for the more specific type of sample described next.)
A simple random sample (SRS) is a more specific type of random sample in which every member of the population has an equal chance of being selected. In an SRS, all possible samples of the same size [latex]n[/latex] are equally likely to be selected.
Steps for obtaining a simple random sample
STEP 1: List all of the individuals in the population of interest; this list is called the frame. Number the individuals in the frame [latex]1[/latex] through [latex]N[/latex].
STEP 2: Use a random number table, graphing calculator, or statistical software (Random Number Generators) to generate [latex]n[/latex] distinct random numbers between [latex]1[/latex] and [latex]N[/latex], where [latex]n[/latex] is the desired sample size. The individuals whose numbers are generated form the sample.
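The two steps above can be sketched in Python using the standard library's random number generator. This is a minimal illustration, not a prescribed procedure: the frame of 20 students, the function name `simple_random_sample`, and the `seed` parameter are all assumptions made up for the example.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n individuals from the frame without replacement."""
    rng = random.Random(seed)  # seed is optional; it only makes the draw repeatable
    N = len(frame)
    # STEP 1: number the individuals in the frame 1 through N.
    numbered = {i + 1: individual for i, individual in enumerate(frame)}
    # STEP 2: generate n distinct random numbers between 1 and N.
    chosen_numbers = rng.sample(range(1, N + 1), n)
    # The individuals with the chosen numbers form the sample.
    return [numbered[k] for k in chosen_numbers]

# Hypothetical frame of 20 students (illustrative data only).
frame = [f"student_{i:02d}" for i in range(1, 21)]
sample = simple_random_sample(frame, n=5, seed=1)
print(sample)
```

Because `random.Random.sample` draws without replacement, no individual can appear twice, and every group of five students is equally likely to be chosen.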
OTHER SAMPLING METHODS
Stratified sampling
In stratified sampling, the population is divided into subgroups (strata), and then a proportionate number of individuals is selected from each subgroup (stratum) using simple random sampling. The members of a stratum share a common underlying characteristic.
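As a rough sketch of proportionate stratified sampling, the Python code below groups a population by stratum and then takes a simple random sample of the same fraction from each stratum. The population of 60 freshmen and 40 seniors, the 10% sampling fraction, and the function name `stratified_sample` are hypothetical choices for illustration.

```python
import random

def stratified_sample(population, stratum_of, fraction, seed=None):
    """Take a simple random sample of the given fraction from every stratum."""
    rng = random.Random(seed)
    # Divide the population into strata by the shared characteristic.
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    # Select a proportionate number from each stratum.
    sample = []
    for members in strata.values():
        k = round(fraction * len(members))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 60 freshmen and 40 seniors.
population = [("freshman", i) for i in range(60)] + [("senior", i) for i in range(40)]
sample = stratified_sample(population, stratum_of=lambda u: u[0], fraction=0.1, seed=0)
print(sample)
```

With a 10% fraction, the sample contains 6 freshmen and 4 seniors, so each stratum appears in the sample in the same proportion as in the population.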
Cluster sampling
In cluster sampling, the population is divided into subgroups (clusters), and one or more clusters are randomly selected to be in the sample. All the members of the selected clusters are in the cluster sample. Unlike strata, clusters are not defined by a shared underlying characteristic; ideally, each cluster resembles the mix in the population, representing all (or most) of its characteristics.
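The following Python sketch illustrates the idea under assumed data: four hypothetical classrooms serve as clusters, two of them are chosen at random, and every member of the chosen classrooms enters the sample. The names `cluster_sample` and `clusters` are invented for this example.

```python
import random

def cluster_sample(clusters, num_clusters, seed=None):
    """Randomly pick whole clusters and keep every member of each one."""
    rng = random.Random(seed)
    # Randomly select the clusters (not the individuals).
    chosen = rng.sample(sorted(clusters), num_clusters)
    # All members of the selected clusters are in the sample.
    sample = [member for name in chosen for member in clusters[name]]
    return chosen, sample

# Hypothetical clusters: four classrooms of five students each.
clusters = {f"classroom_{c}": [f"{c}{i}" for i in range(1, 6)] for c in "ABCD"}
chosen, sample = cluster_sample(clusters, num_clusters=2, seed=3)
print(chosen)
print(sample)
```

Note the contrast with the stratified approach: here randomness decides which groups are used, and no selection happens inside a chosen group.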
Systematic sampling
In systematic sampling, a starting point is randomly selected from the population, and then every [latex]n^{th}[/latex] member of the population after it is included in the sample.
Note: While Simple Random Sampling (SRS) ensures that each member of the population has an equal chance of being selected, other types of random sampling like stratified sampling, cluster sampling, or systematic sampling may not give every individual the same chance. In these methods, selection can depend on how the population is divided into groups (strata, clusters) or on a pre-defined interval, which can make the selection probabilities unequal.
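A minimal Python sketch of systematic sampling follows. It assumes the population is already listed in a frame and that the random starting point is chosen among the first `step` positions; the frame of 100 numbered members, the interval of 10, and the name `systematic_sample` are illustrative assumptions.

```python
import random

def systematic_sample(frame, step, seed=None):
    """Pick a random start, then take every step-th member after it."""
    rng = random.Random(seed)
    # Random starting position within the first `step` members (0-based index).
    start = rng.randrange(step)
    # Slice the frame: start, start + step, start + 2 * step, ...
    return frame[start::step]

# Hypothetical frame: members numbered 1 through 100.
frame = list(range(1, 101))
sample = systematic_sample(frame, step=10, seed=2)
print(sample)
```

Only the starting point is random; once it is fixed, the rest of the sample is determined by the interval.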
Convenience sampling
Convenience sampling involves selecting whoever is convenient or using results that are readily available. The results of convenience sampling may be very good in some cases and highly biased (favoring certain outcomes) in others. Generally, avoid this method.
Sampling Methods Illustrated
PRACTICE
Check Your Knowledge: Sampling Methods @Khan Academy
Sampling Errors and Non-sampling Errors
Are sampling errors really errors?
Review Statistical Language – Types of Errors.
For additional explanation, watch the following video:
Additional Take on Sampling Errors and Variation
Critical Evaluation
We need to evaluate the statistical studies we read about critically and analyze them before accepting the results of the studies. Common problems to be aware of include:
- Problems with samples: A sample must be representative of the population. A sample that is not representative of the population is biased, and biased samples give results that are inaccurate and not valid.
- Self-selected samples: Responses only by people who choose to respond, such as call-in surveys, are often unreliable.
- Sample size issues: Samples that are too small may be unreliable. Larger samples are better, if possible. In some situations, small samples are unavoidable and can still be used to draw conclusions. Examples: crash testing cars or medical testing for rare conditions.
- Undue influence: collecting data or asking questions in a way that influences the response.
- Non-response or refusal of subject to participate: The collected responses may no longer be representative of the population. Often, people with strong positive or negative opinions may answer surveys, which can affect the results.
- Causality: A relationship between two variables does not mean that one causes the other to occur. They may be related (correlated) because of their relationship through a different variable.
- Self-funded or self-interest studies: A study performed by a person or organization in order to support their claim. Is the study impartial? Read the study carefully to evaluate the work. Do not automatically assume that the study is good, but do not automatically assume the study is bad either. Evaluate it on its merits and the work done.
- Misleading use of data: improperly displayed graphs, incomplete data, or lack of context.
- Confounding: When the effects of multiple factors on a response cannot be separated. Confounding makes it difficult or impossible to draw valid conclusions about the effect of each factor.
Variation in Data
Variation is present in any set of data.
In one study, the contents of eight 16-ounce cans were measured, producing the following amounts (in ounces) of beverage:
15.8; 16.1; 15.2; 14.8; 15.8; 15.9; 16.0; 15.5
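To make the variation in these eight measurements concrete, the short Python sketch below summarizes them with a few standard measures. The variable names are chosen for illustration; the data are the measurements listed above.

```python
import statistics

# Amount of beverage (in ounces) measured in the eight 16-ounce cans.
amounts = [15.8, 16.1, 15.2, 14.8, 15.8, 15.9, 16.0, 15.5]

smallest = min(amounts)
largest = max(amounts)
spread = largest - smallest           # range of the data
mean = statistics.mean(amounts)       # average amount per can
std_dev = statistics.stdev(amounts)   # sample standard deviation

print(f"min = {smallest}, max = {largest}, range = {spread:.1f}")
print(f"mean = {mean:.4f}, sample standard deviation = {std_dev:.4f}")
```

The amounts range from 14.8 to 16.1 ounces, a spread of 1.3 ounces, and the mean is about 15.64 ounces, so no single can needs to contain exactly 16 ounces for the data to be plausible.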
Variation in Samples
Results from random samples of the same size taken from the same population following identical procedures will likely differ every time. This behavior is natural and is called sampling variability.
In reality, a sample will never be exactly representative of the population so there will always be some sampling error. As a rule, the larger the sample, the smaller the sampling error.
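A small simulation can illustrate both points: repeated samples give different results, and larger samples tend to land closer to the population value. The sketch below is built entirely on assumptions; the simulated population of 10,000 fill amounts, the sample sizes of 10 and 400, and the function name `sample_means` are invented for the example.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the simulation is repeatable

# Hypothetical population: 10,000 simulated fill amounts (in ounces).
population = [rng.gauss(16.0, 0.4) for _ in range(10_000)]
pop_mean = statistics.mean(population)

def sample_means(n, repeats=200):
    """Means of `repeats` simple random samples, each of size n."""
    return [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]

small = sample_means(10)    # many small samples
large = sample_means(400)   # many large samples

# How much the sample means vary from one sample to the next.
spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)

print(f"population mean: {pop_mean:.3f}")
print(f"spread of sample means, n = 10:  {spread_small:.3f}")
print(f"spread of sample means, n = 400: {spread_large:.3f}")
```

The sample means differ from one draw to the next in both cases, but those based on 400 observations cluster much more tightly around the population mean than those based on 10.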
In statistics, a sampling bias is created when a sample is collected from a population and some members of the population are not as likely to be chosen as others (remember, in a simple random sample each member of the population has an equal chance of being chosen). When a sampling bias occurs, incorrect conclusions may be drawn about the population being studied.
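The effect of unequal selection chances can be sketched with a toy simulation. Everything here is an assumption made for illustration: the population is simply the values 1 through 100, and the biased scheme makes a member's chance of selection proportional to its value, so small values are under-represented. Both schemes draw with replacement to keep the code short.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the simulation is repeatable

# Hypothetical population: the values 1 through 100.
population = list(range(1, 101))
pop_mean = statistics.mean(population)

# Fair scheme: every member is equally likely on each draw.
fair_sample = rng.choices(population, k=2000)

# Biased scheme: larger values are more likely to be chosen.
biased_sample = rng.choices(population, weights=population, k=2000)

fair_mean = statistics.mean(fair_sample)
biased_mean = statistics.mean(biased_sample)

print(f"population mean:    {pop_mean}")
print(f"fair sample mean:   {fair_mean:.2f}")
print(f"biased sample mean: {biased_mean:.2f}")
```

The fair sample mean stays close to the population mean, while the biased sample mean is pulled well above it, and taking a larger biased sample would not remove that gap.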
PRACTICE