The Anatomy of High Stakes Test Construction

A step-by-step guide to test development and maintenance


A job task analysis (JTA), or job analysis, is the first step in the development of any professional credentialing assessment.  It should also be the first step in developing educational offerings or instructional design aimed at workforce readiness.  The JTA takes a close look at the actual job for which the credential is offered and identifies the knowledge, skills and abilities (KSAs) needed on the job.  It determines both the frequency and the criticality of each identified component.  It often clusters and links groups of KSAs that belong together, and prioritizes them to provide appropriate weighting of the content areas of the assessment to be developed.  Colleges and training programs often use job task analyses to drive curricular decisions and designs.  The JTA also forms the basis for the credentialing test blueprint, which guides the entire development process.  Subject-matter experts for JTAs are often professionals working in the field who are either already credentialed by the issuing organization or, in the case of a new credential, likely to be among the first cohort to earn it.  A good JTA ensures demographic diversity among the subject-matter experts, including both veteran professionals and distinguished newcomers to the field, so that recent changes in operations, technology, or regulations are reflected in the assessment.
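One common way to turn JTA ratings into blueprint weights is to multiply each task cluster's average frequency rating by its average criticality rating and normalize. The sketch below illustrates that convention with invented task names and ratings; real studies may use different scales or weighting formulas.

```python
# Hypothetical mean frequency (1-5) and criticality (1-5) ratings,
# averaged across subject-matter experts for each task cluster.
tasks = {
    "Patient assessment":  {"frequency": 4.6, "criticality": 4.8},
    "Equipment operation": {"frequency": 4.1, "criticality": 3.9},
    "Documentation":       {"frequency": 4.9, "criticality": 2.7},
    "Emergency response":  {"frequency": 1.8, "criticality": 5.0},
}

# One convention: importance = frequency x criticality,
# normalized so the blueprint weights sum to 100%.
importance = {name: r["frequency"] * r["criticality"] for name, r in tasks.items()}
total = sum(importance.values())
weights = {name: round(100 * score / total, 1) for name, score in importance.items()}

for name, pct in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {pct}% of the exam")
```

Note how "Emergency response" earns substantial weight despite low frequency: criticality keeps rare but high-stakes tasks on the blueprint.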


The test blueprint, or test plan, is the detailed basis for the assessment design.  Created by a committee of subject-matter experts in concert with test development experts, the test plan is a written, formal document that informs and structures the development of the assessment.  The test plan covers global design questions -- what the learning or performance objectives are, how the assessment will be used, what domain(s) will be covered, who will be eligible to take it, how often and where it will be offered, how long the results will remain valid, and what the minimum performance threshold for passing will be -- as well as very granular specifications regarding the content and sub-content areas and weightings, length of assessment, number and types of items and forms to be developed, cognitive or difficulty levels needed, other characteristics required in item fields, and even the determination of subject-matter expert credentials and sources for item-writing tasks.
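Because the blueprint drives everything downstream, it helps to think of it as structured data rather than free text. The fragment below is a minimal sketch of that idea; all domain names, weights, and counts are invented, and a real plan would carry many more fields (eligibility, retake policy, form counts, and so on).

```python
# Illustrative blueprint fragment; every name and number here is invented.
blueprint = {
    "total_items": 100,
    "passing_model": "single criterion-referenced cutscore",
    "domains": {
        "Domain I: Safety":      {"weight": 0.30},
        "Domain II: Operations": {"weight": 0.45},
        "Domain III: Regulation": {"weight": 0.25},
    },
}

def items_per_domain(bp):
    """Translate percentage weights into item counts for one form."""
    return {name: round(bp["total_items"] * spec["weight"])
            for name, spec in bp["domains"].items()}

# Sanity check: the domain weights must account for the whole exam.
assert abs(sum(s["weight"] for s in blueprint["domains"].values()) - 1.0) < 1e-9
print(items_per_domain(blueprint))
```

Encoding the plan this way lets later steps (item assignments, form assembly, bank audits) validate themselves against the same source of truth.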


Items are the test questions or problems, and therefore are the body of the assessment.  Although writing, reviewing, editing and banking are all separate components of test development, they are grouped here for functional efficiency.  Items should only be written in accordance with the test blueprint: without it, they may be good items, but they may not meet the test specifications or measure the intended outcomes.  They may introduce extraneous and irrelevant domains, or biases that affect a candidate's response.  They may duplicate items already in play, be disproportionately weighted in certain content areas, or be too difficult or too easy for the intended threshold level.

A thorough review and editing process ensures that submissions meet the assignment instructions and stated characteristics, that stems are factually correct and distractors plausible, that the writing conforms to the required style, and that items are clear, contextually appropriate, and free of bias.

An item bank is the database that stores items and maintains their security and exposure.  The banking system should keep all pertinent item characteristics, such as content areas and cognitive levels, in separate fields, so that tests can be built according to specifications.  Items that have been modified or recycled should carry version controls to ensure that parent items are never presented alongside their child items, and that an audit trail tracks the changes made to items.  There should be a process for assigning certain items to certain forms, and a process for retiring overexposed or underperforming items.  And the bank should be held securely, so that neither prospective examinees nor subject-matter experts can view the entire bank, and so that once an item has entered the bank, not even the test development team can make the slightest change to it without initiating a full review process.
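The parent/child exclusion described above is easy to enforce when version lineage is a first-class field on each item record. The sketch below is a hypothetical, greatly simplified record and check, not any particular banking system's schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Item:
    """Minimal illustrative item record; real banks carry many more fields."""
    item_id: str
    stem: str
    content_area: str
    cognitive_level: str
    status: str = "active"           # e.g. pilot / active / retired
    parent_id: Optional[str] = None  # set when this item is a revision of another

def form_conflicts(form_items):
    """Return IDs of child items whose parent also appears on the same form."""
    ids = {it.item_id for it in form_items}
    return [it.item_id for it in form_items if it.parent_id in ids]
```

A form builder can call `form_conflicts` as a gate before publishing, so a revised item and its original can never be co-presented.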


Once item development is complete, the test forms must be constructed.  Test forms are the actual tests that an examinee sees.  Each test form must meet the test specifications as to content area, weighting, cognitive level, and any other defining characteristics built into the items.  The item selection must meet the specifications for the length of the test, and the form must present correctly both visually and technologically.  As with items, forms must be reviewed and edited for clueing, overlap, version and eligibility.  Depending on how often the assessment is to be administered, in what format, and to how many people, more than one form will likely be required for any high-stakes assessment.  The design of additional forms, which must also meet the test specifications, is complicated by the need to ensure that all forms are equivalent: a candidate taking form A of the assessment should earn the same score if they took form B instead.
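At its core, form assembly is constrained selection from the bank. The sketch below shows only the simplest constraints -- per-area item counts and no overlap with an already-built form; real assembly engines also balance difficulty, cognitive level, and enemy-item rules. All names and the bank contents are invented.

```python
import random

def build_form(bank, spec, used=frozenset(), seed=0):
    """Draw items per content area to satisfy the blueprint counts,
    skipping items already placed on another form (illustrative only)."""
    rng = random.Random(seed)
    form = []
    for area, count in spec.items():
        pool = [item for item in bank if item["area"] == area
                and item["id"] not in used]
        if len(pool) < count:
            raise ValueError(f"bank too shallow for {area}")
        form.extend(rng.sample(pool, count))
    return form

# Invented mini-bank: four Domain I items, two Domain II items.
bank = ([{"id": f"Q{i}", "area": "Domain I"} for i in range(1, 5)]
        + [{"id": f"Q{i}", "area": "Domain II"} for i in range(5, 7)])
spec = {"Domain I": 2, "Domain II": 1}

form_a = build_form(bank, spec, seed=1)
form_b = build_form(bank, spec, used={i["id"] for i in form_a}, seed=2)
```

The `used` parameter is what keeps forms A and B disjoint; equivalence of the resulting forms still has to be established statistically, as discussed below.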


Whenever possible, it is best to field test the assessment.  Essentially a beta test, the new assessment is administered to a limited number of defined candidates before being finalized.  This allows the developer to confirm that the assessment is performing as expected, that all items are testing well, that the length and difficulty appear appropriate, that the intended venue for administration works, and so on.  When a test is being piloted, examinees do not receive their results immediately, because the scoring, or standard-setting, process uses actual responses to determine the cutscores, or scoring thresholds.  When scores are needed immediately, or when there are not enough beta testers, piloting may not be feasible.


The determination of passing scores helps those administering the examination interpret its results.  For a new assessment, this is done via a standard-setting study; if an existing exam has merely been revised, it might be accomplished with only an equating study.  Standard-setting studies are conducted by psychometric staff, and may use Modified Angoff, IRT, or other statistical processes.  Subject-matter experts determine the threshold-level descriptions and judge items against those descriptions.  Psychometricians then enter and analyze the judgment data and determine the appropriate cutscores.  A single cutscore may be all that is needed for many assessments (Pass/Fail, Hire/Don't Hire, Licensure/No Licensure); additional levels may be required for certain competency exams (novice/journeyman/master) or academic exams (A/B/C/Fail).  All decisions on how the exam will be scored and how scores will be used are made at the test planning stage; the cutscore-setting process interprets the examination (item and form) data and defines the meaning of each cutscore threshold, so that the predictive validity of the exam in correctly separating one level from another can be maximized.
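In the basic Modified Angoff procedure, each judge estimates, for each item, the probability that a minimally competent candidate would answer it correctly; the recommended raw cutscore is the sum of the mean ratings across items. The numbers below are invented for illustration, and operational studies typically add rounds of discussion, impact data, and rating adjustment.

```python
# Each row: one judge's Angoff ratings -- the estimated probability that
# a minimally competent candidate answers each item correctly (invented).
ratings = [
    [0.90, 0.60, 0.75, 0.40, 0.85],  # judge 1
    [0.80, 0.55, 0.70, 0.50, 0.90],  # judge 2
    [0.85, 0.65, 0.80, 0.45, 0.80],  # judge 3
]

n_judges = len(ratings)
n_items = len(ratings[0])

# Mean rating per item, then sum across items = recommended raw cutscore.
item_means = [sum(judge[i] for judge in ratings) / n_judges
              for i in range(n_items)]
cutscore = sum(item_means)
print(f"Recommended raw cutscore: {cutscore:.2f} out of {n_items}")
```

Here the panel's judgments imply a cutscore of 3.5 out of 5 raw points, which would then be rounded per policy and reviewed against impact data before adoption.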


The administration of an assessment can be undertaken by the credentialing organization itself or by the assessment developer.  Like every other step along the way, however, the key is to keep to the test plan.  At CEM, test administration begins with our technical team publishing the test to one or more test platforms and ensuring that items and forms display correctly and function as intended on each platform.  We maintain strict control of test security internally and with our testing platforms, and also monitor web activity to ensure our exams, items and forms are not exposed.  If test security is compromised during administration, the validity of the entire test is compromised.  We have policies and procedures covering examinee eligibility, retakes, scheduling, payment, recordkeeping and more.  We work with scoring vendors and determine how to report scores according to customer needs.  We also maintain a phone bank of customer service associates to handle candidate questions about the products, registration, scheduling or the testing venue(s).  Test administration is a large and often overlooked component of the entire test development cycle.


As with the cutscore determination, the post-administration, ongoing analysis of test items and forms is conducted by psychometric staff.  Such analysis is needed to determine whether a test is functioning properly.  Psychometricians collect examinee data and calculate form reliability and item difficulty and discrimination.  Many of our revision projects arise from psychometric review and findings indicating that items need to be modified or refreshed, or that previously unscored (pilot) items should be substituted for existing ones.  Spikes in the data may indicate a security breach requiring immediate substitution of new forms or items.
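Two of the statistics mentioned above are straightforward to compute from a scored 0/1 response matrix: item difficulty (the proportion answering correctly, where higher means easier) and KR-20 form reliability for dichotomous items. The matrix below is invented and far too small for operational use; item discrimination (e.g. item-total correlation) follows the same pattern.

```python
# 0/1 response matrix: rows = examinees, columns = items (invented data).
responses = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
]

n = len(responses)       # examinees
k = len(responses[0])    # items
totals = [sum(row) for row in responses]

# Item difficulty: proportion answering each item correctly.
p = [sum(row[i] for row in responses) / n for i in range(k)]

# KR-20 reliability for dichotomously scored items.
mean_total = sum(totals) / n
var_total = sum((t - mean_total) ** 2 for t in totals) / n
kr20 = (k / (k - 1)) * (1 - sum(pi * (1 - pi) for pi in p) / var_total)

print("difficulties:", [round(pi, 2) for pi in p])
print("KR-20:", round(kr20, 2))
```

Flat difficulties near 1.0 flag overexposed or trivial items, and a falling KR-20 across administrations is one of the signals that prompts the item substitutions described above.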


The test development cycle does not have a discrete endpoint.  New items must be developed and substituted for overexposed or poorly performing ones.  With new items come new forms, field testing, equating, administration and, ultimately, analysis.  Even the first step, conducting a job task analysis, should be repeated at least every five years to ensure compliance with new regulations and inclusion of ever-adapting technologies.  Major regulatory changes or market disruptions may require entirely new test plans, which in turn drive new item development.  In any given development plan or RFP, therefore, some thought should be given to how the organization intends to maintain the assessment.


Institute for Credentialing Excellence
Council for Adult and Experiential Learning
Competency-Based Education Network
National College Testing Association
Open Education Resource Foundation
National Council on Measurement in Education