When designing an assessment strategy and selecting and evaluating assessment tools, it is important to consider a number of factors, including the following:

Reliability

The term reliability refers to consistency. Assessment reliability is demonstrated by the consistency of scores obtained when the same applicants are reexamined with the same or equivalent form of an assessment (e.g., a test of keyboarding skills). No assessment procedure is perfectly consistent. If an applicant's keyboarding skills are measured on two separate occasions, the two scores (e.g., net words per minute) are likely to differ.

Reliability reflects the extent to which these individual score differences are due to "true" differences in the competency being assessed and the extent to which they are due to chance, or random, errors. Common sources of such error include variations in:

  • Applicant's mental or physical state (e.g., the applicant's level of motivation, alertness, or anxiety at the time of testing)
  • Assessment administration (e.g., instructions to applicants, time limits, use of calculators or other resources)
  • Measurement conditions (e.g., lighting, temperature, noise level, visual distractions)
  • Scoring procedures (e.g., raters who evaluate applicant performance in interviews, assessment center exercises, or writing tests)

A goal of good assessment is to minimize random sources of error. As a general rule, the smaller the amount of error, the higher the reliability.

Reliability is expressed as a decimal number ranging from 0 to 1.00, where 0 means the scores consist entirely of random error. A reliability of 1.00 would mean the scores are free of any random error. In practice, scores always contain some amount of error, and their reliabilities are less than 1.00. For most assessment applications, reliabilities above .70 are likely to be regarded as acceptable.
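
To make the calculation concrete, the short sketch below shows one common way to estimate reliability: correlating scores from two administrations of the same test (test-retest reliability). The applicant scores are hypothetical and are included only to illustrate the computation.

    # Illustrative only: estimate test-retest reliability as the Pearson
    # correlation between two administrations of the same keyboarding test.
    # The applicant scores below are hypothetical.

    def pearson_r(x, y):
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        var_x = sum((a - mean_x) ** 2 for a in x)
        var_y = sum((b - mean_y) ** 2 for b in y)
        return cov / (var_x * var_y) ** 0.5

    # Net words per minute for the same six applicants on two occasions
    first_administration = [52, 61, 47, 70, 58, 64]
    second_administration = [55, 59, 45, 72, 60, 61]

    reliability = pearson_r(first_administration, second_administration)
    print(f"Estimated test-retest reliability: {reliability:.2f}")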

Consistency in assessment scores has practical importance because the scores are used to make important decisions about people. As an example, assume two agencies use similar versions of a writing skills test to hire entry-level technical writers. Imagine the consequences if the test scores were so inconsistent (unreliable) that applicants who applied at both agencies received low scores on one test but much higher scores on the other. The decision to hire an applicant might depend more on the reliability of the assessments than on his or her actual writing skills.

Reliability is also important when deciding which assessment to use for a given purpose. The test manual or other documentation supporting the use of an assessment should report details of reliability and how it was computed. The potential user should review the reliability information available for each prospective assessment before deciding which to implement. Reliability is also a key factor in evaluating the validity of an assessment. An assessment that fails to produce consistent scores for the same individuals examined under near-identical conditions cannot be expected to make useful predictions of other measures (e.g., job performance). Reliability is critically important because it places a limit on validity.

Validity

Validity refers to the relationship between performance on an assessment and performance on the job. Validity is the most important issue to consider when deciding whether to use a particular assessment tool because an assessment that does not provide useful information about how an individual will perform on the job is of no value to the organization.

There are different types of validity evidence. Which type is most appropriate will depend on how the assessment method is used in making an employment decision. For example, if a work sample test is designed to mimic the actual tasks performed on the job, then a content validity approach may be needed to establish that the content of the test convincingly matches the content of the job, as identified by a job analysis. If a personality test is intended to forecast the job success of applicants for a customer service position, then evidence of predictive validity may be needed to show that scores on the personality test are related to subsequent performance on the job.

The most commonly used measure of predictive validity is a correlation (or validity) coefficient. Correlation coefficients range in absolute value from 0 to 1.00. A correlation of 1.00 (or -1.00) indicates two measures (e.g., test scores and job performance ratings) are perfectly related. In such a case, you could perfectly predict the actual job performance of each applicant based on a single assessment score. A correlation of 0 indicates two measures are unrelated. In practice, validity coefficients for a single assessment rarely exceed .50. A validity coefficient of .30 or higher is generally considered useful for most circumstances (Biddle, 2005). 1
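
For illustration, the brief sketch below computes a validity coefficient as the correlation between assessment scores and later job performance ratings. The data are hypothetical and were chosen only to show the calculation; with these made-up values the coefficient comes out near .48.

    # Hypothetical illustration of a predictive validity coefficient: the
    # correlation between applicants' assessment scores and their later job
    # performance ratings. The values are invented for demonstration.
    import numpy as np

    assessment_scores = np.array([68, 75, 81, 59, 90, 72, 64, 85])
    performance_ratings = np.array([3.4, 3.1, 4.0, 3.3, 3.6, 2.8, 3.5, 3.9])

    validity = np.corrcoef(assessment_scores, performance_ratings)[0, 1]
    print(f"Validity coefficient: {validity:.2f}")  # about .48 for these values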

When multiple selection tools are used, you can consider the combined validity of the tools. To the extent the assessment tools measure different job-related factors (e.g., reasoning ability and honesty) each tool will provide unique information about the applicant's ability to perform the job. Used together, the tools can more accurately predict the applicant's job performance than either tool used alone. The amount of predictive validity one tool adds relative to another is often referred to as the incremental validity of the tool. The incremental validity of an assessment is important to know because even if an assessment has low validity by itself, it has the potential to add significantly to the prediction of job performance when joined with another measure.
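
One common way to quantify incremental validity is to compare the validity of a single tool against the multiple correlation obtained when a second, distinct tool is added as a predictor in a regression model. The sketch below illustrates the idea with hypothetical scores; it is a rough demonstration, not a substitute for a properly designed validation study.

    # Rough sketch of incremental validity using invented data: compare the
    # validity of one predictor alone with the multiple correlation (R) obtained
    # when a second, distinct predictor is added via least-squares regression.
    import numpy as np

    cognitive = np.array([70, 82, 65, 90, 75, 60, 88, 72])            # cognitive ability scores
    integrity = np.array([3.5, 2.9, 4.1, 3.0, 3.8, 3.2, 2.7, 4.0])    # integrity test scores
    performance = np.array([3.2, 3.6, 3.4, 4.1, 3.7, 2.9, 3.8, 3.9])  # job performance ratings

    def multiple_r(predictors, criterion):
        # Fit a least-squares regression and return the correlation between
        # predicted and observed criterion scores (the multiple correlation R).
        design = np.column_stack([np.ones(len(criterion))] + list(predictors))
        coefs, *_ = np.linalg.lstsq(design, criterion, rcond=None)
        return np.corrcoef(design @ coefs, criterion)[0, 1]

    r_alone = np.corrcoef(cognitive, performance)[0, 1]
    r_combined = multiple_r([cognitive, integrity], performance)

    # The combined R is never lower than the single-predictor correlation and
    # rises to the extent the second tool measures something job-related and new.
    print(f"Cognitive ability alone:       {r_alone:.2f}")
    print(f"Cognitive ability + integrity: {r_combined:.2f}")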

Just as assessment tools differ with respect to reliability, they also differ with respect to validity. The following table provides the estimated validities of various assessment methods for predicting job performance (represented by the validity coefficient), as well as the incremental validity gained from combining each with a test of general cognitive ability. Cognitive ability tests are used as the baseline because they are among the least expensive measures to administer and the most valid for the greatest variety of jobs. The second column is the correlation of the combined tools with job performance, or how well they collectively relate to job performance. The last column shows the percent increase in validity from combining the tool with a measure of general cognitive ability. For example, cognitive ability tests have an estimated validity of .51 and work sample tests have an estimated validity of .54. When combined, the two methods have an estimated validity of .63, an increase of 24% above and beyond what a cognitive ability test used alone could provide.
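
The percentages in the last column of the table follow from simple arithmetic on the validity coefficients, as the short illustration below shows for the work sample example described above (.51 alone versus .63 combined).

    # Percent increase in validity from combining a tool with a cognitive
    # ability test, using coefficients quoted in the text above.
    def percent_increase(combined_validity, baseline_validity=0.51):
        return 100 * (combined_validity - baseline_validity) / baseline_validity

    # Work sample tests combined with cognitive ability: .63 vs. .51 alone
    print(f"{percent_increase(0.63):.0f}%")  # about 24%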


Table 1: Validity of Various Assessment Tools Alone and in Combination
Assessment method                       Validity of          Incremental            % increase in validity from
                                        method used alone    (combined) validity    combining tool with cognitive ability
Tests of general cognitive ability      .51                  --                     --
Work sample tests                       .54                  .63                    24%
Structured interviews                   .51                  .63                    24%
Job knowledge tests                     .48                  .58                    14%
Accomplishment record*                  .45                  .58                    14%
Integrity/honesty tests                 .41                  .65                    27%
Unstructured interviews                 .38                  .55                     8%
Assessment centers                      .37                  .53                     4%
Biodata measures                        .35                  .52                     2%
Conscientiousness tests                 .31                  .60                    18%
Reference checking                      .26                  .57                    12%
Years of job experience                 .18                  .54                     6%
Training & experience point method      .11                  .52                     2%
Years of education                      .10                  .52                     2%
Interests                               .10                  .52                     2%

Note:

Table adapted from Schmidt & Hunter (1998). Copyright © 1998 by the American Psychological Association. Adapted with permission. 2

* Referred to as the training & experience behavioral consistency method in Schmidt & Hunter (1998).

Technology

The technology available is another factor in determining the appropriate assessment tool. Agencies that receive a large volume of applicants for position announcements may benefit from using technology to narrow down the applicant pool, such as online screening of resumes or online biographical data (biodata) tests. Technology can also overcome distance c