AP Syllabus focus:
‘Useful intelligence tests must be standardized and show validity and reliability, including construct, predictive, test-retest, and split-half evidence.’
High-quality psychological testing depends on consistent procedures and strong evidence that scores are meaningful. In AP Psychology, focus on how intelligence tests are standardized and evaluated through reliability and validity evidence.
Standardization (Norming and Consistent Administration)
Standardization: uniform procedures for administering and scoring a test, plus established norms (typical performance) from a defined population.
A test is standardized so that differences in scores reflect differences among people, not differences in testing conditions.
What standardization typically includes
Uniform instructions (same wording, timing, materials, and scoring rules)
Trained administrators to reduce experimenter bias and scoring drift
Normative sample used to create norms for interpretation
Should be representative of the target population (age, gender, ethnicity, region, socioeconomic status)
Score reporting that allows comparison (often percentiles and standard scores)
Why norms matter
A raw score is rarely interpretable by itself; norms show how unusual a score is relative to others.
Norms must be updated periodically to remain accurate as populations change.
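As a minimal sketch of how norms make a raw score interpretable, the example below converts a score to a z-score and percentile. The mean of 100 and SD of 15 are the conventional IQ scaling; the specific score and function name are invented for illustration:

```python
from statistics import NormalDist

# Hypothetical norms from a standardization sample (conventional IQ scaling)
NORM_MEAN = 100
NORM_SD = 15

def interpret_score(score: float) -> tuple[float, float]:
    """Return (z-score, percentile) for a score under the norms above."""
    z = (score - NORM_MEAN) / NORM_SD
    percentile = NormalDist().cdf(z) * 100  # assumes an approximately normal distribution
    return z, percentile

z, pct = interpret_score(115)
print(f"Score 115 -> z = {z:.2f}, about the {pct:.0f}th percentile")
# Score 115 -> z = 1.00, about the 84th percentile
```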
Reliability (Consistency of Measurement)
Reliability: the consistency of test scores across time, items, or raters; a reliable test yields similar results under consistent conditions.
Reliability is about stability and precision. A test can be reliable without measuring what it claims (so reliability is necessary but not sufficient for validity).
Major reliability evidence named in the syllabus
Test-retest reliability: consistency of scores when the same people take the test again after a delay

Figure: A test-retest scatterplot comparing the same individuals' scores at Time 1 (x-axis) and Time 2 (y-axis). A tight, upward-sloping cloud of points illustrates high temporal stability: people who score high initially tend to score high again, which corresponds to a strong positive correlation.
Higher when the trait is stable and the interval is appropriate
Split-half reliability: consistency across halves of a test (e.g., odd vs. even items)
Indicates internal consistency, meaning items measure similar content/skills
Reliability tends to increase when the test has clear items, enough questions to sample the skill broadly, and consistent administration conditions.
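A minimal sketch of how these two reliability estimates might be computed from hypothetical score data (all scores are invented; the Spearman-Brown step is the standard correction applied when correlating two half-length tests):

```python
from statistics import correlation  # Python 3.10+

# Hypothetical data: the same five people tested twice (test-retest reliability)
time1 = [98, 110, 123, 87, 105]
time2 = [101, 108, 120, 90, 107]
test_retest_r = correlation(time1, time2)

# Hypothetical data: each person's total on odd items vs. even items (split-half)
odd_half  = [49, 56, 61, 44, 52]
even_half = [50, 54, 62, 43, 53]
half_r = correlation(odd_half, even_half)

# Spearman-Brown correction estimates reliability of the full-length test
split_half_reliability = (2 * half_r) / (1 + half_r)

print(f"Test-retest r: {test_retest_r:.2f}")
print(f"Split-half reliability (corrected): {split_half_reliability:.2f}")
```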
Measurement error and score interpretation
The standard error of measurement (SEM) is SEM = SD × √(1 − r), where:
SD = standard deviation of test scores (score units)
r = reliability coefficient (unitless)
SEM frames a score as an estimate rather than a perfectly exact value; higher reliability (higher r) generally implies a smaller SEM.
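As a rough worked example of the SEM formula above: an SD of 15 matches conventional IQ scaling, but the reliability of .90 and the observed score of 110 are illustrative values, not figures from any particular test:

```python
import math

sd = 15              # standard deviation of scores (conventional IQ scaling)
reliability = 0.90   # illustrative reliability coefficient

sem = sd * math.sqrt(1 - reliability)
print(f"SEM = {sem:.1f}")  # about 4.7 points

# A roughly 68% band around an observed score of 110 spans about +/- 1 SEM
observed = 110
print(f"Likely range: about {observed - sem:.0f} to {observed + sem:.0f}")
```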
Validity (Accuracy of What the Test Measures)
Validity: the extent to which a test measures what it claims to measure and supports appropriate interpretations and uses of scores.
Validity is supported by evidence, not guaranteed by a single statistic. For intelligence testing, validity focuses on whether scores reflect the intended construct and predict relevant outcomes.
Construct validity (named in the syllabus)
Construct validity concerns whether the test truly measures the theoretical attribute (the construct) such as “intelligence.”
Supported by patterns such as:
Expected relationships with other measures (convergent and discriminant evidence)
Appropriate factor structure (items cluster as theory predicts)
Performance differences that match theory (without unfair bias)
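For example, a minimal sketch of convergent versus discriminant evidence using invented scores: the new test should correlate strongly with an established intelligence measure (convergent) and only weakly with an unrelated trait such as extraversion (discriminant):

```python
from statistics import correlation  # Python 3.10+

# Hypothetical scores for the same six people on three measures
new_test       = [95, 108, 121, 88, 102, 115]
established_iq = [97, 110, 118, 90, 100, 117]  # convergent: expect a strong correlation
extraversion   = [24, 12, 23, 22, 17, 22]      # discriminant: expect a weak correlation

print(f"Convergent r:   {correlation(new_test, established_iq):.2f}")
print(f"Discriminant r: {correlation(new_test, extraversion):.2f}")
```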
Predictive validity (named in the syllabus)
Predictive validity is the extent to which test scores predict future performance or outcomes (e.g., academic success in settings where IQ-related skills matter).
Typically evaluated by correlating test scores with later criteria (grades, training performance), while checking for fairness across groups.
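A minimal sketch of that evaluation, again with invented data (the criterion here is a later GPA; in practice the same correlation would also be examined within subgroups to check fairness):

```python
from statistics import correlation  # Python 3.10+

# Hypothetical data: test scores at entry and GPA one year later for six students
entry_scores = [95, 108, 121, 88, 102, 115]
later_gpa    = [2.6, 3.1, 3.7, 2.4, 2.9, 3.4]

predictive_r = correlation(entry_scores, later_gpa)
print(f"Predictive validity coefficient: {predictive_r:.2f}")
```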
Reliability vs. validity (the key relationship)
Reliability: “Is the measurement consistent?”
Validity: “Is the interpretation/use accurate and supported?”
A test can be:
Reliable but not valid (consistently measuring the wrong thing)
Not reliable and therefore unlikely to be valid (too much random error)
FAQ
How large does a normative sample need to be?
There is no single cut-off; adequacy depends on intended uses and subgroups.
Larger samples are needed to report stable norms for many age bands or demographic groups.
Why might predictive validity differ across schools or settings?
Prediction depends on how well the criterion (e.g., grades) reflects comparable opportunities and evaluation standards.
Different teaching quality, grading practices, and selection effects can change observed correlations.
Is split-half reliability the same as internal consistency?
Split-half reliability is one method for estimating internal consistency.
Other internal consistency estimates combine information across all items rather than just two halves.
Can a test be valid for one purpose but not another?
Yes. Validity is about the interpretation and use of scores.
A test might predict job training success but be inappropriate for diagnosing learning needs without separate supporting evidence.
How do test publishers keep administration standardized?
They use administrator training, scripted instructions, auditing, and periodic re-certification.
Computer-based testing can also lock timing, item order, and scoring to reduce variation.
Practice Questions
Define test-retest reliability and explain what high test-retest reliability indicates about an intelligence test. (1–3 marks)
1 mark: Correct definition (scores are consistent across two administrations over time).
1 mark: States that high values indicate stability/consistency of measurement.
1 mark: Applies to intelligence testing (e.g., suggests scores are not largely due to random error across occasions).
An IQ test publisher claims their new test is “excellent” because it is standardized and has high split-half reliability. Evaluate this claim using standardization, reliability, and validity (construct and predictive) evidence. (4–6 marks)
1 mark: Explains standardization (uniform administration/scoring and norms from a representative sample).
1 mark: Explains split-half reliability as internal consistency.
1 mark: Notes reliability is necessary but not sufficient for validity.
1 mark: Describes construct validity as evidence the test measures intelligence as a construct.
1 mark: Describes predictive validity as predicting relevant future outcomes.
1 mark: Judgement that “excellent” requires validity evidence, not only standardisation and reliability.
