Test Development Basics

Understanding the Uses, Concepts and Types of Assessments

What are Assessments For?
Assessments are used in nearly every facet of our lives; from birth on, we are measured, evaluated and compared using metrics related to health, intellectual, artistic and physical aptitude, academic achievements, workplace skills, cognitive, social, emotional or "soft" skills, etc. But assessments do not live in a vacuum: they exist in order to support decisions. They can be used to identify critical needs or exceptional capabilities; compare us to norms or measure our specific competencies; allow us to gain admittance to a society or school or keep us from consideration for job opportunities or promotions.
It is vitally important to understand what decisions an assessment is designed to support, and how long the interpretation of results will be considered valid. A measurement, on its own, means nothing. We can measure the outdoor temperature at 65 degrees Fahrenheit, for example. But how we use that information depends on whether we were hoping to set up an outdoor trampoline or an ice skating rink. In the same vein, we need to know the purpose of an assessment tool in order to know what type of information and degree of precision is required. Is the purpose to baseline what someone knows and develop a study or training plan based on results? Is it to measure progress towards a goal, or identify areas needing more work? Is it to confer a certification, to decide on a hire, or to provide a final stamp of approval, evaluation or grade? Will future testing be necessary to maintain the assessment results, such as with many credentialing assessments? Or will the assessment confer finality, such as a high school or college degree, or passing a bar or medical board exam? Is the assessment one of many factors supporting a decision, or the sole deciding factor? Understanding the overall goal and context is the first step towards ensuring the right tool is developed.

You may have heard people talk about high-stakes assessment. How do the stakes affect the development of the assessment? A high-stakes assessment is one in which the decision based on it has important consequences - a professional credential, a degree, a job, a license to practice. Since the consequences of inaccurate measurement are serious, considerable effort must go into ensuring the quality of the measurement. Furthermore, since examinees in these cases have some incentive to cheat, those responsible for developing and administering such assessments must build significant precautions into the design and administration protocols to protect the security of the assessment and the validity of its results. A low-stakes assessment is one for which the results have less impact. Many personality tests, self-directed progress assessments, and baseline diagnostic assessments may be considered low-stakes. For these types of assessments, weeding out questions that don't work well may matter less than making the assessment available quickly and cost-effectively. Security is also less of a concern in low-stakes exams: there is usually not much benefit to cheating.

Reliability & Validity
Reliability and validity are related terms that speak to an assessment's usefulness as a basis for making decisions. Reliability is the extent to which an assessment will return consistent results. Validity is the extent to which the assessment provides information that can legitimately support a given decision. In order for the support to be legitimate, assessment results must be reliable. But reliable results don't necessarily mean that any inferences based on them are validly made.

Take, for example, a simple measurement tool such as a carpenter's level, used to check whether a surface is horizontal. If placed on a flat surface, it gives a reading based on the position of a bubble between two lines. If removed and then reset on the same surface, it should give exactly the same reading every time. If it does not, the tool is unreliable. If, however, you want a surface to be slanted at 30 degrees, a tool that only tells you whether something is level cannot be validly used to decide whether the angle is right, no matter how reliable it is. Hence, the tool can be reliable but not valid for that use.
Similarly, an assessment should yield the same results for test-takers of the same ability level, regardless of when or where the test-takers take it; and it must accurately and correctly measure what is relevant to the decision the results are used to support, without interference from factors that are extraneous or irrelevant.
It is important to note that you cannot look at an assessment and tell whether it is valid or reliable. Reliability and validity have to do with the use of the assessment, and not the assessment tool itself. For example, a standard measuring tape is reliable enough to be used validly in helping to put a bookshelf in the center of a wall, but is not reliable enough to be used validly for quality control on precision machinery parts. Only the study of assessment results and the building of a validity argument showing the connection between the assessment results and the decision it supports can determine whether the use of an assessment for a given purpose is both reliable and valid.

What does an assessment score mean? Criterion-Referenced vs. Norm-Referenced
Just knowing that someone got 52% of the questions right on a test tells us nothing about the person's abilities. Scores are only useful in context. There are two main contexts for interpreting test scores: criterion-referenced and norm-referenced.
Criterion-referenced assessments measure a person's achievement of stated criteria, for example, specific learning objectives in a course. A score of 52% would be interpreted against a predetermined standard for passing (or for particular grades or proficiency levels). Such standards may be arbitrary (for example, low-stakes assessments commonly use a 70% passing criterion regardless of the difficulty of the items), or they may be set through a formal standard-setting process, as is done for high-stakes assessments. Criterion-referenced assessments tell us about the individual's competency, but not how that individual did in comparison to others.
Norm-referenced assessments measure a person's overall results in comparison to a specific group of people. If a test-taker has scored in the 75th percentile of an assessment, that means that 75% of test-takers scored below and 25% scored above this person, but the measure doesn't tell us how much or what they know. Norm-referenced exams are common for large populations taking the same test; for example, SAT exams are norm-referenced. Norm-referenced tests may also be used to identify top or bottom achievers, for example when screening job candidates or identifying students for remedial education.
Criterion-referenced tests are used when we want to know whether an individual has met a standard. We care more that the pilot flying our plane knows everything needed to land safely than whether he or she scored better than 70% of the others taking the flying test.
In academics, too, we want to ensure that students have achieved our program and course learning objectives - so we score them against those objectives, not against each other.
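The two interpretations of a score can be sketched in a few lines of Python. This is only an illustration: the function names, the 70% cut score, and the reference-group scores are assumptions made up for the example, not data from a real assessment.

```python
from bisect import bisect_left


def criterion_result(percent_correct, cut_score):
    """Criterion-referenced: compare a score to a predetermined standard."""
    return "pass" if percent_correct >= cut_score else "fail"


def percentile_rank(score, group_scores):
    """Norm-referenced: percentage of the reference group scoring strictly below."""
    ordered = sorted(group_scores)
    below = bisect_left(ordered, score)  # count of scores lower than `score`
    return 100.0 * below / len(ordered)


# Hypothetical reference group (percent-correct scores), invented for illustration.
norm_group = [30, 35, 38, 40, 42, 45, 48, 50, 55, 60]
score = 52

print(criterion_result(score, cut_score=70))  # fail
print(percentile_rank(score, norm_group))     # 80.0
```

In this made-up case the same 52% fails the criterion while outscoring 80% of the reference group, which is why the type of interpretation has to be settled before a score can support a decision.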

Types of Assessment Uses
There are three main types of uses for assessments: diagnostic, formative, and summative. Diagnostic assessments support decisions about developing study or training plans by providing baseline information about individual strengths and weaknesses. These assessments often drill down to very fine levels of detail. Formative assessments support decisions about study or instruction by providing information about a learner's progress toward a goal. It is typical for these assessment results to be provided to learners along with commentary or other supporting material helping them see what they got right and wrong. Summative assessments support decisions about credentialing, final grades, hiring, or placement. They typically do not involve presenting feedback to individuals, as their purpose is not to instruct, but to provide a snapshot of ability or rank.

Types of Assessment Items
Just as there are many types of assessments, there are many varieties of item types. Items are the building blocks of an assessment, the questions that a person must answer or tasks he or she must perform. Some items evaluate a candidate's understanding of a topic, whether at a basic comprehension level or at an advanced analytical level (see Bloom's Taxonomy, glossary). Others measure performance of skills. These may assess specific skills relatively directly in a test, such as tests to determine if a car mechanic can fix an engine, or if a pilot can fly a plane, or a software engineer can identify, locate and fix a bug in a program. Other performance items may involve samples of representative work gathered over time, such as the evaluation of a fine arts or architectural portfolio. When people think of examinations, they most often think of the standard Multiple Choice Question (MCQ) with one correct response choice (the key) and several wrong choices (the distractors). But there are many other types, even on exams testing understanding of a topic, such as a written essay, an oral defense, fill-in-the-blank, matching one thing to another, manipulating visual elements, or simulation items. Assessment developers select item types based on the purpose of the assessment, delivery constraints, and time and cost factors.

Scoring Systems: human scorers, computer scoring, or blended systems
Assessment items may be scored by humans, computers, or a combination of both. Human raters may be needed for grading essays, for evaluating performance or competencies, for reviewing work that led to an answer, or for determining nuanced shadings between response alternatives. Computers regularly score not only exams delivered via computers, but also score sheets or bubble sheets submitted for exams taken in person at a testing center. Computers can score multiple-choice items, of course, but also many types of constructed response items, including essays, short answer, fill-in-the-blank or calculation problems, assuming a good scoring rubric or scoring rules and, for some types, a robust data set for the computers to "learn from" and model. Often a hybrid approach is used, particularly in essay scoring, where human raters handle the responses that are too atypical for the computer to score.

When human raters are used, it is critical to maintain the assessment's reliability and the validity of the score use by using well-constructed scoring rubrics and monitoring inter-rater and intra-rater reliability. Inter-rater reliability is the extent to which the same candidate will be scored the same way regardless of who the rater is. Intra-rater reliability is the degree to which a given rater rates the same response consistently. Both types of reliability are improved through regular training of raters and through established rating procedures, such as using multiple raters and having a policy for review when raters differ significantly. Solidly constructed rubrics also help ensure that what is being rated matches the objectives of the assessment and nothing else - not how much the rater likes the candidate's ideas, or what the rater thinks is important. Such rubrics must also define the minimum requirements a response has to meet at each scoring level.
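One way to keep an eye on rater consistency is to compare two sets of ratings of the same responses. The sketch below computes simple percent agreement and Cohen's kappa, a commonly used statistic that corrects agreement for chance; the choice of kappa, the function names, and the rubric scores are assumptions for illustration only.

```python
from collections import Counter


def percent_agreement(ratings_a, ratings_b):
    """Share of responses on which the two sets of ratings match exactly."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


def cohens_kappa(ratings_a, ratings_b):
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(ratings_a)
    observed = percent_agreement(ratings_a, ratings_b)
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    # Chance agreement: probability that both ratings land on the same level.
    expected = sum(counts_a[level] * counts_b[level] for level in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical level throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical rubric scores (levels 1-4) given to the same ten essays.
rater_1 = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
rater_2 = [3, 2, 3, 3, 1, 2, 4, 4, 2, 2]

print(percent_agreement(rater_1, rater_2))
print(cohens_kappa(rater_1, rater_2))
```

Passing in two raters' scores gives a view of inter-rater reliability; passing in the same rater's scores from two separate passes over the same responses gives a view of intra-rater reliability.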

Delivery & Administration
Where, by what means, and how often an assessment is offered, and how many forms (collections of items, the "test" itself) are developed and used simultaneously, are all factors affecting the test-taker's experience, the price of the assessment, and the security of the assessment. Many of the world's largest exam programs are still delivered in paper-and-pencil format (such as the SATs used for college applications). Computer testing has been a viable method for delivering exams for decades, with an entire industry of testing centers operating worldwide at colleges, private business centers, and independent testing centers. Computer testing over the Internet, also known as online delivery, is also widely available. Testing options for mobile devices are being developed as well.
It is common for test forms to be delivered as a fixed set of items, in which the same questions are presented in the same order to everyone taking a particular form of the test. However, large-scale programs are increasingly using linear on-the-fly testing (LOFT), in which different test-takers see the same number of items, but the items are randomly selected from a pool according to specifications, or computer-adaptive testing (CAT), in which different test-takers may see different numbers of items, depending on their level of ability. CAT generally means selecting more or less difficult items for a test-taker in real time during the administration of an assessment, based on whether their responses to previous items were correct. Such assessments are not scored by the number of items answered correctly, but using Item Response Theory (IRT), which estimates ability from item difficulty and test-taker responses.
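To make the idea of adaptive item selection and IRT scoring more concrete, here is a minimal sketch. It assumes the Rasch (one-parameter) model, the simplest IRT model, and a "closest difficulty" selection rule; operational CAT programs typically use richer models and add content and exposure constraints. All names, item difficulties, and responses are hypothetical.

```python
import math


def p_correct(ability, difficulty):
    """Rasch model: probability that a test-taker answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def estimate_ability(responses, start=0.0, iterations=25):
    """Maximum-likelihood ability estimate via Newton-Raphson.

    `responses` is a list of (item_difficulty, correct) pairs, with correct
    given as 1 or 0. The estimate is clamped to [-4, 4] so that all-correct
    or all-incorrect patterns do not run off to infinity.
    """
    theta = start
    for _ in range(iterations):
        probs = [p_correct(theta, b) for b, _ in responses]
        gradient = sum(u - p for (_, u), p in zip(responses, probs))
        information = sum(p * (1.0 - p) for p in probs)
        theta += gradient / information
        theta = max(-4.0, min(4.0, theta))
    return theta


def next_item(ability, pool, used):
    """Pick the unused item whose difficulty is closest to the current estimate."""
    candidates = [item for item in pool if item not in used]
    return min(candidates, key=lambda item: abs(pool[item] - ability))


# Hypothetical item pool: item id -> difficulty on the ability scale.
pool = {"A": -1.5, "B": -0.5, "C": 0.0, "D": 0.7, "E": 1.6}

# Hypothetical responses so far: easy and medium items right, hard item wrong.
responses = [(-1.0, 1), (0.0, 1), (1.0, 0)]
theta = estimate_ability(responses)
print(round(theta, 2))
print(next_item(theta, pool, used={"C"}))
```

Note that the ability estimate depends on which items were answered correctly and how difficult they were, not just on the count of correct answers, which is why two test-takers who see different items can still be placed on the same scale.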

Additional considerations in administration include whether to localize and translate assessments into different languages and dialects. There are several quality translation firms that specialize in assessment translations, to ensure that all of the care taken to develop original items is appropriately transferred to translated or localized items.

This has been a very basic primer in assessments.
