AP Statistics study notes

1.7.5 Selecting Appropriate Summary Statistics

AP Syllabus focus:
‘Exploring methods for identifying outliers, including the 1.5×IQR rule and the standard deviation criterion.

- Differentiating between nonresistant (mean, standard deviation, range) and resistant (median, IQR) statistics, discussing their susceptibility to outliers.

- Guidance on selecting appropriate measures of center and variability based on data characteristics and presence of outliers.

- Skill 4.B: Developing the ability to choose suitable summary statistics for describing quantitative data, considering the data distribution and the presence of outliers.’

Selecting appropriate summary statistics requires understanding how different measures respond to extreme values, distribution shapes, and data characteristics to ensure accurate and meaningful quantitative descriptions.

Choosing Summary Statistics in Context

Selecting suitable measures of center and variability depends heavily on the shape of the distribution and the reliability of the data. AP Statistics emphasizes identifying outliers, distinguishing resistant from nonresistant statistics, and choosing measures that best reflect the data’s typical value and spread.

Understanding Outlier Identification

Outliers are unusually large or small observations that differ from the rest of the dataset. Their presence can distort nonresistant statistics and lead to misleading interpretations.

Outlier: A data point that is unusually far from the majority of the distribution.

Outliers may appear due to measurement error, natural variability, or unique subgroup behaviors. Before choosing summary statistics, identifying whether outliers exist—and whether they are meaningful—is critical.

Methods for Detecting Outliers

AP Statistics highlights two primary approaches to determine whether a value is an outlier.

This boxplot illustrates the five-number summary along with the interquartile range (IQR) and the conventional 1.5×IQR outlier fences. Values outside the fences are plotted as outliers, emphasizing how the IQR-based rule flags unusually large or small observations. The diagram reinforces why statistics based on the middle 50% of the data, such as the median and IQR, are considered resistant in the presence of outliers.

1.5×IQR Rule

  • Compute the interquartile range (IQR).

  • Determine cutoff boundaries:

    • Lower boundary: Q1 − 1.5×IQR

    • Upper boundary: Q3 + 1.5×IQR

  • Any value outside these bounds is considered a potential outlier.
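The steps above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the function names and the sample data are hypothetical, and the quartiles are computed as the medians of the lower and upper halves of the ordered data (excluding the overall median when the count is odd), which is one common classroom convention. Software packages may use other quartile methods and give slightly different fences.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves (assumed convention)."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # excludes the middle value when n is odd
    upper = ordered[-half:]
    return median(lower), median(upper)

def iqr_fences(values, k=1.5):
    """Return (lower fence, upper fence) using Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values):
    """Return the values that fall outside the 1.5*IQR fences."""
    low, high = iqr_fences(values)
    return [x for x in values if x < low or x > high]

data = [2, 4, 5, 6, 7, 8, 9, 30]   # made-up example data
print(iqr_fences(data))            # (-1.5, 14.5)
print(iqr_outliers(data))          # [30]
```

Here Q1 = 4.5 and Q3 = 8.5, so IQR = 4 and the fences sit at 4.5 − 6 = −1.5 and 8.5 + 6 = 14.5. Only the value 30 lies outside them.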

Standard Deviation Criterion

  • Observations more than about 2 to 3 standard deviations from the mean may be flagged as outliers in roughly symmetric distributions.
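As a rough sketch of this criterion, the snippet below flags values that lie more than k sample standard deviations from the mean. The function name, the cutoff parameter k, and the data are illustrative assumptions. It also hints at a weakness of the rule: the extreme value inflates the standard deviation that is used to judge it.

```python
from statistics import mean, stdev

def sd_outliers(values, k=2.0):
    """Return values more than k sample standard deviations from the mean."""
    center = mean(values)
    spread = stdev(values)   # sample standard deviation (divides by n - 1)
    return [x for x in values if abs(x - center) > k * spread]

data = [2, 4, 5, 6, 7, 8, 9, 30]   # made-up example data
print(sd_outliers(data, k=2))      # [30]
print(sd_outliers(data, k=3))      # []
```

In this data set the value 30 is about 2.4 standard deviations above the mean, so it is flagged with a cutoff of 2 but not with a cutoff of 3.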

These approaches guide whether resistant or nonresistant statistics should be used to summarize the dataset.

Resistant vs. Nonresistant Statistics

A central theme in this subtopic is understanding which statistics remain stable in the presence of outliers.

Nonresistant Statistics

Nonresistant statistics are strongly influenced by extreme values. Even a single outlier can drastically shift these measures.

Nonresistant Statistic: A numerical summary that changes noticeably when outliers are present.

Key nonresistant statistics include:
• Mean – shifts toward extreme values
• Standard deviation – increases with extreme spread
• Range – entirely determined by minimum and maximum values

Nonresistant measures are most appropriate when the distribution is roughly symmetric with no extreme deviations.

Resistant Statistics

Resistant statistics maintain stability despite unusual or extreme observations, making them especially valuable when data are skewed or contain outliers.

Resistant Statistic: A numerical summary that is minimally affected by extreme observations.

Important resistant statistics include:
• Median – represents the middle position in ordered data
• IQR – focuses on the middle 50% of values

Because resistant statistics depend on the positions of the middle values rather than on the size of the extremes, they remain reliable for skewed distributions or those with clear outliers.
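To make the contrast concrete, the following sketch computes both kinds of statistics for two made-up data sets that differ only in their largest value. The helper names and data are hypothetical, and the IQR uses the median-of-halves quartile convention as an assumption.

```python
from statistics import mean, median, stdev

def iqr(values):
    """IQR with Q1 and Q3 taken as medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[-half:]) - median(ordered[:half])

def summarize(values):
    """Collect nonresistant and resistant summaries in one dictionary."""
    return {
        "mean": mean(values),                    # nonresistant
        "sd": stdev(values),                     # nonresistant
        "range": max(values) - min(values),      # nonresistant
        "median": median(values),                # resistant
        "iqr": iqr(values),                      # resistant
    }

clean = [2, 4, 5, 6, 7, 8, 9, 10]          # made-up data, no outlier
with_outlier = [2, 4, 5, 6, 7, 8, 9, 30]   # same data, largest value replaced by 30

for label, values in [("clean", clean), ("with outlier", with_outlier)]:
    print(label, summarize(values))
```

Replacing 10 with 30 moves the mean from 6.375 to 8.875 and the range from 8 to 28, and the standard deviation grows as well, while the median stays at 6.5 and the IQR stays at 4.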

Selecting Measures of Center

The choice of the most informative measure of center depends on distribution shape and outlier presence.

Median for Skewed or Outlier-Prone Data

When distributions are skewed or contain outliers, the median becomes the preferred measure because it accurately reflects the typical observation without being pulled toward extreme values.

Mean for Symmetric Distributions

In symmetric and outlier-free distributions, the mean provides a more precise representation of center, incorporating the numerical value of every observation.

This graphic compares a symmetric distribution, where data are evenly balanced around the center, with a right-skewed distribution, where a long tail extends to the right. In symmetric settings, the mean is a stable and informative measure of center, while in skewed settings the tail and potential outliers suggest using the median and IQR instead. The density-style curves align with AP Statistics’ emphasis on distribution shape when selecting summary statistics; any real-world examples on the page extend beyond what is required here.

Selecting Measures of Variability

Measures of variability describe how spread out the data are. As with center, variability measures must be chosen based on the structure of the dataset.

IQR for Distributions with Outliers

The interquartile range (IQR) is the most appropriate measure of spread when outliers are present. Because it ignores the outer 25% of the data on each side, it avoids distortion caused by extreme values.

Standard Deviation for Symmetric Distributions

The standard deviation is informative when the distribution is approximately symmetric and free of large outliers. It measures roughly the typical distance of observations from the mean, so it is only meaningful when the mean itself is a reliable measure of center.

$$
s = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}
$$

$s$ = Sample standard deviation (units of original data)
$x_i$ = Individual observation
$\bar{x}$ = Sample mean
$n$ = Number of observations
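The formula can be checked numerically. The sketch below computes the sample standard deviation step by step and compares the result with Python's built-in `statistics.stdev`; the function name `sample_sd` and the data are illustrative assumptions.

```python
import math
from statistics import stdev

def sample_sd(values):
    """Sample standard deviation: sqrt of the sum of squared deviations over n - 1."""
    n = len(values)
    x_bar = sum(values) / n                              # sample mean
    squared_devs = sum((x - x_bar) ** 2 for x in values)
    return math.sqrt(squared_devs / (n - 1))

data = [4, 6, 8, 10, 12]   # made-up example data
print(sample_sd(data))                               # square root of 10, about 3.16
print(math.isclose(sample_sd(data), stdev(data)))    # True
```

For this data the mean is 8, the squared deviations are 16, 4, 0, 4, and 16 (sum 40), and dividing by n − 1 = 4 gives 10, so s is the square root of 10.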

Integrating Outlier Detection and Summary Selection

Selecting appropriate summary statistics involves connecting outlier identification with resistant versus nonresistant measures.

Key Principles for Choosing Statistics

• Use median and IQR when the distribution is skewed or outliers exist.
• Use mean and standard deviation when data are symmetric and reasonably free of extremes.
• Consider the context of the data—whether the goal is robustness, sensitivity, or detailed spread representation.
• Evaluate outliers before calculating summary measures to ensure the chosen statistics represent the dataset accurately.
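One possible way to express these principles as a decision rule is sketched below. It is a deliberate simplification: it checks only for outliers under the 1.5×IQR rule (with median-of-halves quartiles, an assumed convention) and does not assess skewness, which is usually judged from a graph. The function names and data are hypothetical.

```python
from statistics import mean, median, stdev

def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves of the sorted data."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])

def has_iqr_outliers(values):
    """True if any value lies outside the 1.5*IQR fences."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return any(x < low or x > high for x in values)

def choose_summary(values):
    """Pick resistant measures if outliers are flagged, otherwise mean and SD."""
    if has_iqr_outliers(values):
        q1, q3 = quartiles(values)
        return {"center": ("median", median(values)),
                "spread": ("IQR", q3 - q1)}
    return {"center": ("mean", mean(values)),
            "spread": ("standard deviation", stdev(values))}

print(choose_summary([2, 4, 5, 6, 7, 8, 9, 30]))   # outlier present: median and IQR
print(choose_summary([4, 6, 8, 10, 12]))           # no outliers: mean and SD
```

A rule like this cannot replace judgment about context, for example whether a flagged value is a recording error or a genuine observation.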

Understanding these principles aligns directly with Skill 4.B, emphasizing the thoughtful, context-based selection of summary statistics in quantitative data analysis.

FAQ

Visual impressions can be misleading, especially when the scale of a graph compresses variation. A value might look consistent with the rest of the distribution but still exceed the numerical cut-offs defined by the 1.5×IQR rule.

This happens most often in data sets with:
• Gradually increasing tails
• Moderate skewness
• Small sample sizes where each point influences the IQR noticeably

The numerical rule provides a consistent criterion even when visual inspection is ambiguous.

The standard deviation approach assumes an approximately symmetric and mound-shaped structure, so it responds strongly to how distant a value is from the mean.

The 1.5×IQR rule, however, is distribution-free and depends only on the middle 50% of the data. This leads to differences when:
• The data are skewed
• The mean is dragged by extreme values
• The standard deviation increases sharply due to tail behaviour

The two rules can disagree without either being “wrong”.

In small samples, a single extreme value has a disproportionately large effect on the mean and standard deviation, making resistant statistics more reliable.

In larger samples:
• The effect of one outlier is diluted
• The distribution’s overall shape becomes clearer
• Decisions can be based more confidently on skewness rather than isolated points

However, resistant statistics remain safer whenever the presence or origin of unusual values is uncertain.

Yes. Using both pairs can provide insight into how outliers or skewness influence the data.

This dual reporting is particularly helpful when:
• You want to demonstrate the impact of extreme values
• The suitability of resistant versus nonresistant measures is unclear
• Stakeholders require measures familiar from different disciplines

Presenting both sets can highlight discrepancies that guide further analysis or data cleaning.

Context determines whether extreme values are legitimate or errors, and this directly affects which statistics are meaningful.

Consider:
• In environmental or financial data, extremes may be genuine and important
• In human measurement data, implausible values often indicate recording mistakes
• Some fields prioritise conservative metrics that remain stable despite outliers

Understanding the real-world meaning of unusual values helps decide whether to emphasise resistant or nonresistant measures.

Practice Questions

Question 1 (1–3 marks)
A data set of reaction times is strongly right-skewed and contains several unusually large values.
(a) Which measure of centre is most appropriate for this data set?
(b) Which measure of spread is most appropriate?
(c) Briefly explain why these choices are suitable.

Question 1
(a) 1 mark for correctly identifying the median as the appropriate measure of centre.
(b) 1 mark for correctly identifying the interquartile range (IQR) as the appropriate measure of spread.
(c) 1 mark for explaining that the median and IQR are resistant to outliers and skewness, whereas the mean and standard deviation are distorted by extreme values.

Question 2 (4–6 marks)
A researcher records the daily number of hours employees spend on focused work. The distribution is roughly symmetric, with no clear outliers.
(a) State the most appropriate measures of centre and spread for this distribution.
(b) Suppose the researcher later discovers that one employee incorrectly entered “42 hours” for a single day. Explain how this value would affect nonresistant and resistant statistics.
(c) Based on the presence of this value, explain whether the original choice of summary statistics remains appropriate, justifying your reasoning.

Question 2
(a) 1 mark for identifying the mean as the appropriate measure of centre.
1 mark for identifying the standard deviation as the appropriate measure of spread.
(b) 1 mark for stating that nonresistant statistics (mean and standard deviation) are strongly affected by extreme values.
1 mark for stating that resistant statistics (median and IQR) change very little.
(c) 2 marks for explaining that the incorrect value introduces an outlier, making the mean and standard deviation unsuitable, and justifying that the median and IQR would now be more appropriate.
