A Psychometrician's Guide to Smart Test Development

Test Development Overview:

Those new to credentialing often think, “How difficult can it be to create a credentialing exam?” In fact, a common misconception is that it is just a matter of getting a couple of people to write the test questions (i.e., items), and then the exam is ready to go. Creating a quality credentialing exam is actually much more complicated and requires following a fairly standardized process to produce an examination that supports the purpose of the credential and is legally defensible.

This is because credentialing exams are often used to make high-stakes decisions. A high-stakes decision is one that has meaningful consequences for the candidate and/or the public. For instance, test scores from credentialing exams are often used to decide whether an individual:

is allowed to drive a car or motorcycle on the road
gets hired, promoted, or receives a pay raise, or
is deemed competent and safe enough to practice medicine with the public

It is important to remember that whether or not individuals obtain a credential can significantly impact their lives and sometimes the welfare of the public. Therefore, it is important that the test scores from credentialing exams are valid, reliable, and legally defensible.

Legal Defensibility

Legal defensibility refers to the ability of the credentialing program to withstand legal challenges. For instance, an individual or group may claim that the testing processes (e.g., test administration and/or scoring) or testing outcomes (e.g., whether a person passed or not) are not legally valid and sue for damages such as lost wages. Legal challenges can be costly for test sponsors, so it is important to demonstrate that testing industry best practices, such as those described in this guide, were followed to prevent these challenges from occurring in the first place.

Reliability

A credentialing program’s ability to withstand a legal challenge often relies upon whether the test scores are considered to be sufficiently reliable and valid. Reliability refers to whether or not an exam produces consistent (reproducible) test scores. In other words, if a candidate were to take the same exam two times in a row, without any sort of remediation in between retakes, a reliable examination would produce similar test scores for both testing events. If it were a credentialing exam, the candidate would have a very high likelihood of either passing or failing the exam both times. Reliability is generally a function of the number of items included on an exam and the quality of those items. Evidence of reliability can be provided via statistical indices.
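One widely used statistical index of reliability is Cronbach's alpha (for right/wrong items it is equivalent to KR-20). The guide does not name a specific index, so the sketch below is only an illustration; the function name and the response data are hypothetical.

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Cronbach's alpha for a candidates-by-items score matrix.

    scores: list of rows, one per candidate; each row lists that
    candidate's item scores (e.g., 1 = correct, 0 = incorrect).
    """
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across candidates
    item_vars = [pvariance([row[i] for row in scores]) for i in range(n_items)]
    # Variance of candidates' total scores
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 5 candidates, 4 items
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

Values closer to 1 indicate more consistent scores. What counts as "sufficiently reliable" depends on the stakes of the decision and is a judgment for the psychometrician.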

Let’s speak the same language!

There’s a lot of confusion within the testing industry as to the meaning of various credentialing concepts. This is in part because credentialing is the umbrella term used for the many types of programs that exist, including degrees, diplomas, licensure, certification, accreditation, and certificate programs. Credentialing is the process by which an organization grants formal recognition to entities (individuals, organizations, processes, services, or products) that meet certain criteria.

Two of the most popular types of credentialing programs are licensure and certification programs. The major differences between the two programs are:

Certification programs are voluntary whereas licensure programs are mandatory
Certifications are granted by a nongovernmental agency whereas licensure is granted by a governmental agency
Certifications are not required to practice a profession or perform certain activities whereas licensure is required.

A driver’s license is an example of a licensure program because it is granted by a governmental agency and required in order to drive certain types of vehicles.

Validity

The concept of validity is more complex and will not be discussed in depth here. At a basic level, the concept of validity addresses whether the test measures what it is supposed to measure, or more to the point, whether the quantity and quality of validity evidence supports the intended interpretation of the test scores. The important thing to remember about validity is that every step in the test development process can help or hurt your efforts to gather evidence to support the interpretations that you will make based on the test scores (i.e., those who pass the exam will perform up to identified standards or expectations in the real world). To gather the necessary validity evidence, it is not only critical that each step discussed in this guide be incorporated into your process, but that you give careful consideration to who is involved in the steps and how each step is documented.

SMEs

Throughout the test development process, active involvement from subject matter experts (SMEs), a representative from the credentialing organization, and a psychometrician is needed. When selecting SMEs, it is important to include persons who have backgrounds, characteristics, and experiences that cover the full realm of critical attributes of those in the target audience for the credential (e.g., work settings, education, geographic locations, specialty areas). In most cases, you will need eight to 12 SMEs to form a group that is representative of the target audience for your credential. In terms of the documentation, it is important to keep detailed and accurate records of who participated, the methodology used, and outcomes of the activity. If a psychometrician is facilitating the activity, he or she will likely document most, if not all, activities in a report for you.

This guide discusses each of the ten test development steps outlined in the following graphic. Together, they can help you achieve a foundational understanding of the process.

We believe this guide offers an excellent overview of current test development best practices and resource requirements. We welcome your feedback.

Step 1: Test Definition

The saying “if you don’t know where you’re going, chances are you won’t get there” applies to designing a credentialing program as well as a specific examination.

The test definition phase provides the foundation for the test development process and is probably the most crucial and yet most overlooked step in the process. This phase ensures that everyone involved in the process (key stakeholders, SMEs, psychometricians) has a shared understanding of critical aspects of the credentialing program and examination BEFORE development begins so you can get where you want to go in the most efficient and effective manner.

It is important to complete the test definition process prior to any other test development activities (e.g., job analysis, item writing) because it guides decision making throughout the test development process and keeps those decisions grounded in a common set of goals. The intent is to make sure key stakeholders (i.e., those who could influence or change the direction of the program) agree on certain program parameters (e.g., target audience) before beginning to work on the more expensive and less flexible steps. Otherwise, the test development process can become a “moving target” which results in lost time and money and often a poor quality product.

During the test definition phase, participating SMEs and client stakeholders arrive at decisions regarding the following topic areas (list is not exhaustive):

Purpose of examination
Intended interpretation of test scores
Target audience for credential
Eligibility requirements or paths to earning the credential
General scope of examination (i.e., at a high level, what content should or should not be measured on the exam)
Geographic locations in which examination will be offered
Overall structure and format of exam (e.g., performance or non-performance-based)
Definition of the minimally qualified candidate for the credential
Any other decisions that will impact the nature of the exam (e.g., vendor-specific or vendor-neutral, practice environment such as large vs. small companies)

The Minimally Qualified Candidate

Of particular importance is defining the expectations for a minimally qualified candidate (MQC) for the credential. The MQC can be thought of as the candidate who possesses the minimum level of knowledge and skills required to obtain the credential. The SMEs use this conceptual definition of a candidate (this candidate “persona”) throughout the entire test development process to help:

determine which knowledge and skills should be measured on the exam,
define the appropriate level of difficulty for the items (i.e., test questions) and the overall exam, and
describe the standard of performance required to earn the credential which is an integral part of the standard setting process.

Although the definition of the MQC will vary based upon the nature of the exam, it typically includes the types of job activities that can be performed with and without supervision, the level of experience (e.g., time on the job, number of hours of experience using the software, number of implementations, number of surgeries), and level of education and/or recommended training.

Creating an agreed-upon definition of minimal competence at the very beginning of the test development process is essential. Otherwise, it is very likely that your examination will either measure the wrong types of knowledge and skills, be too difficult or easy given the purpose of the exam, result in passing or failing the wrong candidates, or all of the above.

To illustrate the importance of this process, imagine the following scenario:

An organization wanted to develop its first credentialing program. Its director had worked for other credentialing organizations and felt comfortable starting the development of the program. She planned to bring in a psychometrician later in the process to help with running statistics and setting the passing/cut score for the examination.

The director gathered some SMEs and started outlining what should be measured on the examination. They came up with the examination domains and list of topics that should be measured in each domain. Once they were in agreement on this, they started writing the items. The director was familiar with item-writing guidelines and provided training for the SMEs. When they had written all the items, the director assembled a beta test form and administered it to the first group of candidates.

A psychometrician was then brought in to run the item statistics on the beta test items. Unfortunately, the item-analysis report showed that many of the items were much too easy (e.g., 95% of the beta candidates got them correct) for the target audience. This is because the director did not sufficiently define the MQC before writing the items.

For the examination to be consistent with the purpose of the program, the item difficulty level should be aimed at the level of the MQC for the credential. Without this guidance, the SMEs had written questions that were too easy and were not measuring the critical things that would differentiate someone who should hold the credential (i.e., possesses the knowledge and skills to perform at the requisite level) from those who should not yet hold the credential.
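The "too easy" finding in this scenario comes from a basic item statistic: the proportion of candidates who answer each item correctly, often called the item's p-value. A minimal sketch of that calculation follows. The 0.90 cutoff, the function names, and the beta data are hypothetical; the appropriate difficulty range depends on the MQC definition and the purpose of the exam.

```python
def item_difficulty(responses):
    """Proportion of candidates answering each item correctly.

    responses: list of rows, one per candidate, with 1 = correct, 0 = incorrect.
    """
    n_candidates = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_candidates
            for i in range(n_items)]

def flag_too_easy(p_values, threshold=0.90):
    """Return indices of items at or above a (hypothetical) easiness cutoff."""
    return [i for i, p in enumerate(p_values) if p >= threshold]

# Hypothetical beta-test data: 4 candidates, 3 items
beta_responses = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
]
p_values = item_difficulty(beta_responses)
easy_items = flag_too_easy(p_values)
print(p_values, easy_items)
```

Flagged items are candidates for SME review, not automatic removal; an easy item can still measure something critical.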

What Test Content Will Be Covered?

Another important decision made during the test definition process is defining, at a high level, what content is and is not covered in the examination. Identifying what will not be measured on the exam can be as important as what will be measured. Frequently, credentialing programs opt to exclude soft skills from the exam (e.g., communication skills, interpersonal skills, and ethics, unless there is a published code of ethics) since these are difficult to measure well in a written examination.

Other skills (project management, record keeping) are usually excluded because they are not core or unique to the job role or profession being credentialed, and employers can evaluate these skills via other existing credentialing programs. Various factors (e.g., market needs, resources for test development and maintenance) impact the decision as to the scope of test content. Nevertheless, the resulting scope needs to be clear to candidates and stakeholders of the credential.

Other decisions that can have a large impact on the test development steps include:

Will the examination be offered internationally? If yes, will the exam be translated and localized?
Related to technology and tools, should the content of the items be vendor-neutral (i.e., classes or types of tools) or vendor-specific? If vendor-specific, which vendors and products should be included?
Will you seek accreditation for the program under the National Commission for Certifying Agencies (NCCA) or the American National Standards Institute (ANSI) ISO/IEC 17024?
Which item types will be considered for the examination? Are these item types offered by the testing vendors that you are considering for test delivery? If performance exams are being considered, is there a good understanding of resources required to develop, administer, and maintain these items?

There’s a Cost for Changing Program Parameters

Years ago, Kryterion was hired to help a technology company finish the development of a certification exam. A group of SMEs had already written a significant number of items related to the company’s specific technology.

During the first meeting, the SMEs and Kryterion psychometrician met with the CEO of the company who decided that the exam should be an industry-based certification, and, therefore, vendor-neutral (i.e., not specific to the company’s version of technology). Because that key decision was made after test development began, most of the existing items had to be discarded or almost completely rewritten.

This is why it’s so important to make sure assumptions are surfaced and key decisions are documented very early in the process. Changing major program parameters later on can be quite costly, time consuming and resource-intensive.

Step 2: Job Analysis

The job analysis study (aka practice analysis, job task analysis, role delineation) is typically the most expensive and least favorite of the test development activities, but it is absolutely foundational for the credentialing program’s quality and legal defensibility. The job analysis identifies the critical tasks, knowledge, and skills required for competent performance of the job role or profession being credentialed. The critical tasks, knowledge, and skills demonstrate the link between the content that will be measured on the examination and the actual job or job role being performed. Specifically, the job analysis:

Refers to the investigation of positions or job classes to obtain information about job duties and tasks, responsibilities, necessary worker characteristics (e.g., KSAs), working conditions, and/or other aspects of the work (AERA, APA, & NCME, 2014, p. 220).

Why a Job Analysis Is a Must

Many credentialing programs wonder if the effort and cost of the job analysis study are really necessary. Couldn’t the process be abbreviated or skipped entirely? The risk of skipping this step or doing an inadequate study is enormous. Your program’s credibility in the marketplace is largely determined by how well candidates feel your exam matches and validates their personal experience in the profession or job role. For example, have you ever taken a credentialing exam and thought that much of it didn’t apply to your actual job? A properly conducted job analysis will help you pinpoint those aspects of the job role or profession that deserve focus, taken from the responses of your own candidates and credential holders, to ensure your exam product matches their experience. Creating an examination without doing this step often leads to failed programs, or sometimes, legal action against the credentialing body.

Without a properly conducted and documented job analysis study, the credential being created has no legal defensibility because it cannot be demonstrated that the credential is actually job related. Even with an extremely reliable or precise exam, if you cannot demonstrate a link back to competent job performance, you will lose a challenge to your examination in a U.S. court. Additionally, for high-stakes credentials (e.g., those required to work, or those associated with promotions or pay increases), candidates are more likely, as a group, to search for any reason to challenge the examination in court, particularly if they do not pass.

A properly conducted job analysis study insulates your credentialing program from the vast majority of legal claims that can be made against your examination. It does so by directly linking the exam content to critical job tasks and by ensuring input from a demographically representative population of credential holders and others who are thoroughly knowledgeable about the credentialed job role or profession.

The first (and possibly the most important) step of a good job analysis study is recruiting the right group of SMEs. Most importantly, the SMEs must have a thorough understanding of the job role or profession being credentialed. As a group, the SMEs should be as representative as possible of all critical characteristics of the target audience for the credential (which typically requires between 8 and 12 SMEs). This means assembling a group that is ideally:

1. Geographically dispersed, or at least representative of your profession’s geographic practice areas. This is particularly important if you have practitioners across international boundaries who will be credentialed with the same exam. Regions with expected differences in practice based on location (e.g., legal frameworks) ought to be represented in your SME focus group.

2. Varied in terms of experience in the profession. Different specialties or work settings will bring varying views to the process that will help pinpoint exactly what is core to competent practice. This also applies to length of time in the profession, as it is a good idea to have both newer and more experienced members of the profession.

3. As diverse as possible in terms of other demographics, but particularly in protected class status (e.g., age, race, ethnicity, gender). Exams created under a singular societal lens will be more easily scrutinized, especially if the challenge is for disparate impact of results.

A Cautionary Tale

It can be tempting to assemble a group of SMEs based on convenience (e.g., most proximal, familiar SMEs), but there can be dire consequences of this approach. The following example illustrates this point:

The manager of an information technology certification was being pressured to complete a job analysis study and create a revised examination in a short time frame. In an effort to meet the timeline, he hurriedly scheduled a job analysis meeting and invited product owners/managers and others who had detailed knowledge of the product. He did not have a group that adequately understood the typical product user, and specifically the job role being credentialed. The result was an exam with questions that were too advanced and not relevant for the credential target audience. Candidates complained when the exam didn’t seem relevant and many failed to earn the credential. The job analysis study had to be redone and many of the items on the exam had to be replaced.

In this example, not only were time and money wasted in having to redo the job analysis and examination, but the credibility of the certification program also suffered.

It will likely require effort to recruit the right group of SMEs—the best and the brightest are often also the busiest—but it is well worth the effort to accomplish a more comprehensive study of the job role or profession, include diverse perspectives in deciding on examination content, and help to ensure the best representation possible of the job role or profession within your exam.

Job Analysis Steps

A job analysis study frequently takes three or four months to complete (and sometimes longer) and typically includes the following steps:

Literature review to prepare for the job analysis meeting
Meeting to develop the content for a job analysis survey
Pilot testing and completion of the job analysis survey
Survey administration to the target candidate population and data collection
Data analysis of the survey results
Meeting to review data results and determine examination content

As noted above, there is a lot of preparatory work that leads up to conducting a good job analysis study. During the literature review, the psychometrician gathers information from you, in addition to doing other research, to develop a broad picture of what the job role or profession entails in terms of work duties, tasks, skills, and underlying knowledge. This information is then compiled and presented to your group of SMEs who will review the content, make suggested additions, and remove irrelevant content from consideration.

The end goal of the process is to create a job analysis survey (containing a list of job-related content and some background information questions) to be distributed to a larger sample of the credential target audience.

It is recommended that this job analysis meeting with your SMEs be held as a 2.5-day in-person meeting, as the process of creating the lists of tasks, knowledge, and skills can take considerable discussion to reach consensus among peers. If an in-person meeting is not feasible, this process can be done through a series of web conferences, but it will take much longer and frequently the quality of the work product is somewhat diminished compared to the work product of an in-person meeting.

Job Analysis Surveys

Once the content of the job analysis survey (i.e., tasks, knowledge, skills, and background information questions) is finalized, the survey is piloted with a small group of individuals in the credential target audience to look for any missing tasks, skills, or knowledge that may have been overlooked. Corrections are made after the pilot test. The survey is then distributed to the wider sample of the credential target audience. It is important to have a survey distribution plan that will allow you to obtain a sample that is adequately large and representative of the target audience.

The total number of completed surveys required depends on the size of your target audience, but often close to 400 usable surveys are needed to adequately sample the total population.
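The guide does not say how this figure is derived. One common rule of thumb that lands in the same range is Cochran's sample size formula with a finite population correction, sketched below. The 95% confidence level (z = 1.96), the ±5% margin of error, the assumption of simple random sampling, and the function name are all assumptions made for illustration.

```python
import math

def required_sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Completed surveys needed for a given margin of error.

    population: size of the credential target audience
    margin:     acceptable margin of error (assumed +/-5%)
    z:          z-score for the confidence level (1.96 for 95%)
    p:          assumed response proportion; 0.5 is the most conservative
    """
    # Cochran's formula for a very large population
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction shrinks the requirement for small audiences
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

for size in (1_000, 10_000, 100_000):
    print(size, required_sample_size(size))
```

Under these assumptions the requirement levels off a little under 400 as the target audience grows, and smaller audiences need fewer completed surveys. Your psychometrician may use a different method suited to your sampling plan.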

If possible, it is helpful to have names and email addresses for survey recipients so that reminders can be customized and distributed to those who have not yet completed the survey.

In the job analysis survey, recipients are asked to rate the level of importance, frequency, and/or significance of each job-related task, knowledge, and skill to actual job performance. The purpose is to identify the primary tasks and supporting knowledge and skills expected of a credentialed individual. The results of a properly performed job analysis will provide data that illustrate the importance of individual job tasks, the frequency with which they are performed, and the importance of underlying knowledge and skills for competent performance. Because job analysis surveys tend to be lengthy, it is helpful to keep the survey open for at least three weeks and offer incentives for completing it (e.g., continuing education credits, prize drawing). If response rates are low, it may be necessary to extend the survey data collection period.

Once the survey data have been cleaned and analyzed by the psychometrician, the data then undergo a thorough review by your SMEs to ensure the results appear to represent expected performance, and the sample adequately represents the target audience as a whole. This review is typically conducted via web conference, and final data-driven decisions are made regarding whether each job task, knowledge, or skill should be included or excluded on the exam based on the survey data.
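As a rough illustration of the data summary SMEs might review, the sketch below averages importance and frequency ratings per task and flags low-rated tasks for discussion. The 1-to-5 rating scales, the cutoffs, the task names, and the function names are hypothetical; actual decision rules are set by the psychometrician and SMEs.

```python
from statistics import mean

def summarize_ratings(ratings):
    """Average importance and frequency ratings per task.

    ratings: {task: [(importance, frequency), ...]}, one pair per respondent,
    on hypothetical 1-5 scales.
    """
    return {
        task: {
            "importance": mean(imp for imp, _ in pairs),
            "frequency": mean(freq for _, freq in pairs),
        }
        for task, pairs in ratings.items()
    }

def tasks_to_review(summary, min_importance=3.0, min_frequency=2.5):
    """Tasks falling below either (hypothetical) cutoff, for SME discussion."""
    return [
        task for task, stats in summary.items()
        if stats["importance"] < min_importance
        or stats["frequency"] < min_frequency
    ]

# Hypothetical survey responses from three respondents
ratings = {
    "Configure user accounts": [(5, 4), (4, 5), (5, 5)],
    "Archive legacy reports": [(2, 1), (3, 2), (2, 2)],
}
summary = summarize_ratings(ratings)
flagged = tasks_to_review(summary)
print(flagged)
```

The flags only inform the SME review. A rarely performed task can still belong on the exam if the consequences of doing it poorly are severe.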

The important knowledge and skills that will be measured on the exam are linked to one or more important job tasks. In essence, this creates a list of job tasks that are widely performed throughout the target audience and are considered important for competent practice within the profession (i.e., they are valid) as well as a list of knowledge and skills deemed important for competent performance of the job tasks. The final list of content is the basis for a validated test blueprint. A good job analysis study will produce content for the test blueprint that informs what needs to be measured to demonstrate competence as well as to show how those things should be measured.

Step 3: Test Blueprint

The test blueprint (aka test specifications, test plan, exam content outline, etc.) is the foundational piece of documentation for your credentialing examination and provides a roadmap of how each examination is to be constructed. It includes important information, such as the specific content to be measured on the exam, total number of items, number of items in each content domain or topic, number of items at each cognitive complexity level (if applicable), and item formats or types to be used (e.g., multiple-choice, matching).

The test blueprint is created based on the results of the job analysis study and input from SMEs. The process is typically guided by a psychometrician and entails calculating the weights of a draft test blueprint (based on the job analysis data) followed by a web conference with SMEs and representatives from the credentialing body.

As with the job analysis study, this process requires a group of eight to 12 SMEs who are representative of the credential target audience. The goal of the test blueprint development process is a test with valid inferences and reliable measurement. The test blueprint ensures that content domains are weighted according to their importance to competent practice of a credentialed individual (i.e., how often a knowledge and/or skill is used or the potential consequence of not applying a knowledge and/or skill properly).
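The weighting step can be sketched as follows: each domain receives a share of the items proportional to a score derived from the job analysis data. The guide does not specify a formula, so the domain scores, the domain names, and the largest-remainder rounding used here are assumptions for illustration only.

```python
def allocate_items(domain_scores, total_items):
    """Split total_items across domains in proportion to their scores.

    domain_scores: {domain: score}, e.g., a hypothetical combination of
    importance and frequency ratings from the job analysis survey.
    """
    total_score = sum(domain_scores.values())
    exact = {d: total_items * s / total_score for d, s in domain_scores.items()}
    # Start from the whole-number part of each exact share
    counts = {d: int(x) for d, x in exact.items()}
    leftover = total_items - sum(counts.values())
    # Largest-remainder rounding: give leftover items to the domains
    # with the biggest fractional parts so the total matches exactly
    by_remainder = sorted(exact, key=lambda d: exact[d] - counts[d], reverse=True)
    for domain in by_remainder[:leftover]:
        counts[domain] += 1
    return counts

# Hypothetical domain scores from a job analysis
domain_scores = {
    "Installation": 12.0,
    "Configuration": 18.0,
    "Troubleshooting": 15.0,
    "Security": 5.0,
}
blueprint = allocate_items(domain_scores, total_items=60)
print(blueprint)
```

A draft produced this way is only a starting point; SMEs and the credentialing body review and adjust the weights before the blueprint is finalized.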

Additionally, candidates use the test blueprint to prepare for taking the exam. The test blueprint content domains or topics provide the framework for test performance feedback to failing candidates.

Finally, test developers use the test blueprint to ensure that each and every test form is built to the same breadth and depth of content validated in the job analysis study and samples that content in equal proportions for all candidates. For example, if your credentialing program creates multiple test forms that it administers over the same time period (i.e., some candidates get assigned to Form A, and others get assigned to Form B), the test blueprint ensures that both test forms equally represent the same content in the same proportions, and with the same types of items.

This ensures that candidates, regardless of the test form they receive, are being tested on the same job-related subject matter (i.e., fairness) and that the credentialing decisions being made are based on results of an equitable sample of the knowledge and skills pertinent to competent job performance (i.e., validity). This same concept also applies to programs that offer only a single test form at one time; each new or revised test form must meet the requirements of the test blueprint to ensure a fair and consistent testing process for all candidates.
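A simple way to picture this check is to count the items on each form by content domain and compare the counts with the blueprint. The sketch below uses hypothetical item IDs, domain names, and function names; a real form review would also check cognitive levels and item types.

```python
from collections import Counter

def blueprint_mismatches(form_items, blueprint):
    """Compare a test form's domain counts with the blueprint.

    form_items: list of (item_id, domain) pairs on the form
    blueprint:  {domain: required number of items}
    Returns {domain: (required, actual)} for every domain that differs.
    """
    actual = Counter(domain for _, domain in form_items)
    domains = sorted(set(blueprint) | set(actual))
    return {
        d: (blueprint.get(d, 0), actual.get(d, 0))
        for d in domains
        if blueprint.get(d, 0) != actual.get(d, 0)
    }

# Hypothetical blueprint and two draft forms
blueprint = {"Configuration": 2, "Security": 1}
form_a = [("A1", "Configuration"), ("A2", "Configuration"), ("A3", "Security")]
form_b = [("B1", "Configuration"), ("B2", "Security"), ("B3", "Security")]

print(blueprint_mismatches(form_a, blueprint))
print(blueprint_mismatches(form_b, blueprint))
```

An empty result means the form matches the blueprint's domain counts; any entry shows where the form needs items swapped before release.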

Step 4: Item Development

Writing items for an examination is not as simple as it may appear. Writing items for the purpose of measuring candidates’ ability to competently perform the job role or specific job activities takes a considerable amount of time and effort that many programs overlook. Most item writers think that writing items is a matter of determining what candidates should know and asking them a direct question about it.

The result is a large number of mostly definitional items that require rote memorization to correctly answer and do not really assess the candidate’s true understanding of a topic or how it is competently applied when performing the job. Worse still, these items typically do not do a good job of separating the truly competent from those who are less competent, which is the primary purpose of a credentialing program.

Memorizing Isn’t Professional

A good rule of thumb is that if someone can just memorize training material and pass an exam, it should likely not be used for certification or licensure purposes. In general, certification and licensure exams should focus on practical or applied knowledge, not memory retention.

Another misconception is that the goal is to write items that “trick” candidates into choosing the incorrect response. On the contrary, the goal is to develop clearly written items in which minimally qualified candidates (MQCs) are more likely to choose the correct answer than less-qualified or competent candidates.

There is a delicate balance to strike in item writing. Writers must home in on items that target the MQCs taking the exam while also writing items that assess their level of competence. New item writers often lack training that allows them to create items that focus on particular job-related content, reflect the desired level of difficulty or complexity, and serve as precise measures of performance that avoid introducing unwanted error into the measurement of candidate competency.

This results in a lack of familiarity with how to avoid writing subtle clues into items that highlight or hint at the correct answer. Lack of training may also inadvertently produce items that allow candidates to pass your examination based on their test-taking strategies instead of competence. You may have heard about programs that train candidates in “test-taking strategies” that result in something referred to as “test-wiseness.” Test-wise candidates can easily scrutinize poorly written items to determine correct answers, even when they know nothing about the subject matter.

They can do this because novice item writers have inadvertently left subtle clues to the correct response in the item and/or written some answer options in ways that are obviously wrong, not attractive enough, or even humorous, all of which serve to make the correct answer stand out even more.

Having a group of SMEs trained as item writers is essential to generating quality test content for a credentialing examination. SMEs bring their knowledge of the test content that is required to write valid items. Psychometricians then train SMEs to focus an item on a single piece of job-related content from the blueprint at the level of complexity and difficulty that is appropriate for the target population. In this manner, an item bank of individual small measures is formed that samples the larger whole of candidate competency delineated on the test blueprint.

Training also covers item writing techniques that help avoid giving candidates clues to the correct answers, unintentionally distracting candidates through language/tone of the items, or using items that are freely available to the public already (e.g., in reference sources that candidates use to study). Such training will not only help your credentialing program develop high-quality items that accurately measure candidate competence, but will also provide items that are of much higher fidelity than those written to only measure memorization of information.

Training typically takes the form of an item-writing workshop, which can be accomplished in two ways:

An online training session via a web conference that leads the SMEs through the Dos and Don’ts of writing quality items, then allows them access to item banks to begin creating items according to pre-assigned targets.
An in-person training session followed by a workshop to allow SMEs to collaborate, receive more coaching/feedback, and get real-time updates on item writing needs.

Item Review

However, merely training item writers to write good items is not enough to ensure a high-quality examination that adequately measures candidate competence. Each and every item should undergo a thorough vetting process that begins with a psychometric and grammar edit to ensure the items comply with item-writing guidelines, clearly convey what is being asked, are not offensive or distracting to any subgroup of persons in the credential target audience, are grammatically correct, etc. All of the items should also undergo a review by a group of SMEs. The SMEs should collectively review the items for:

Congruence with the knowledge/skills in the test blueprint
Technical accuracy
Scoring accuracy
Clarity
Importance to practice
Appropriateness of difficulty level
Plausibility of incorrect options (i.e., distractors)

To the extent that newly written items do not meet the criteria listed above, they will need further revision to meet the criteria or be retired from use before they are ever administered to candidates. If items suitably meet the criteria, they should be administered to actual candidates as unscored items to either gather measurement statistics that will confirm their suitability for use as scored test items or suggest further revision (or retirement) is needed.

More on this process will be discussed in the Beta Testing section. The key point is that you must ensure that items on your exam can reliably measure job-related competence. The better all of your items perform, the more accurate your decisions will be when you award candidates a pass or fail result.

When organizations skip any of these critical item development steps, there are always significant performance issues with the items. They affect the validity and reliability of the examination and often leave organizations with tough and risky decisions to make.

The Surprisingly Flexible Multiple-Choice Item

Multiple choice examinations are sometimes criticized as inferior to performance tests in predicting how a candidate will perform on the job. Many credentialing programs underestimate the flexibility of the multiple-choice item format in the measurement of complex behavior or judgement. Multiple-choice items can be used to test performance. A simple example is a math problem that asks a candidate for a financial planner certification to calculate the tax obligations on a stock sale for a client.

Your client sells stock. The long-term capital gains are $5,000. At a 15% tax rate, what is your client’s tax obligation on this sale?

$150
$500
$750
$1250

While simple, this example represents what can be done with multiple choice items regarding more complex tasks—like evaluating a situation and making a judgement, making medical decisions based on data, or selecting appropriate tools for a defined task. All of these have valid and defined answers, and incorrect responses that can be placed into a multiple-choice item. The only difference between these examples is the complexity of the task being presented to the candidate.

Item Format vs. Item Writer

Disdain for the multiple-choice item format stems from its common misuse. All too frequently, inexperienced item writers create multiple-choice items that measure the wrong thing, and the multiple-choice item format is blamed rather than the item writer.

For example, consider an item that is supposed to measure whether a candidate can correctly troubleshoot an issue. Rather than designing an item that requires candidates to demonstrate the ability to think through the situation and identify the root cause and a solution, a poorly trained item writer may develop an item that assesses candidates’ ability to recall the steps in a root-cause analysis.

Candidates take the exam, and the item doesn’t measure their ability to perform the actual task of identifying the issue in a specific scenario. The exam loses credibility as a measure of candidate competence to perform the task, yet the multiple-choice item format is blamed rather than the item’s failure to assess the content at the right level of complexity.

Remember from the job analysis section that the quality of the examination as your product and your credibility in the marketplace are largely determined by how well candidates feel your exam matches and validates their personal experience in the job role or profession.

Multiple choice items can be written to measure target behaviors as closely as possible to actually observing a candidate perform them, even if they cannot actually require the behavior to occur during the test. This can be accomplished by providing a realistic scenario to candidates, and/or data that might inform performance of a task on the job. Candidates can be asked what the next step in the process should be, what the missing step is, or to make a judgment call based on the available information.

Simple recall of something memorized is unlikely (if not impossible) when candidates are required to mentally work through the required behaviors based on their experience. This then allows them to answer the item in a way that reflects actual performance of a complex task. Items written to do this are more difficult to write as they require more nuanced options and potentially complex item stems. However, the multiple choice item format can be used to create measurements that approximate actual performance of complex tasks.

Step 5: Beta Testing

The purpose of beta testing is to collect a sufficient amount of candidate response data on a newly written or revised item in order to statistically analyze its performance and determine whether it should be used as a scored item on an operational or “live” exam. It is a quality control measure to ensure that only those items that meet certain psychometric standards are used as scored items on the operational exam. Beta testing is also used to establish the baseline statistics for your items and identify how items might be improved before being used to determine candidates’ test scores.

To begin the beta testing phase, you need the newly written or revised items that your SMEs have determined are good and valid measures of job performance. Since not all items will perform acceptably, extra items (typically 33 to 50 percent more than is needed for the operational forms) will need to be beta tested to ensure there will be a sufficient number of acceptable items for the operational exams. The item statistics derived from beta testing can then inform decisions on how/where the item should be used in the future. For example:

If the item has acceptable item statistics with no content concerns, it can then be used as a scored item on an operational test form.
If the item shows some problematic item statistics, it can be revised in an attempt to alter the response pattern to the item and yield better statistics (i.e., adjusting the item content to alter its difficulty or increase its precision). It is very important to note that if an item is altered after beta testing, it should be beta tested again. This is because altering the content of the item will change how candidates respond to it and, therefore, will impact its statistical performance.
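The overage guideline above (beta testing 33 to 50 percent more items than the operational forms need) is simple arithmetic. The sketch below is only an illustration: the function name and the example of a 100-item operational need are assumptions, not part of the guide.

```python
def beta_items_needed(operational_items: int, overage_percent: int) -> int:
    """Return how many items to beta test for a given operational need.

    overage_percent is the extra share of items to beta test, e.g. 33 or 50.
    Integer arithmetic is used so the result always rounds up to a whole item.
    """
    if operational_items < 0 or overage_percent < 0:
        raise ValueError("inputs must be non-negative")
    return (operational_items * (100 + overage_percent) + 99) // 100


# Hypothetical example: 100 scored items are needed across the operational forms.
low_estimate = beta_items_needed(100, 33)
high_estimate = beta_items_needed(100, 50)
print(f"Beta test between {low_estimate} and {high_estimate} items")
```

Rounding up is a deliberate choice here, since falling one item short of a blueprint requirement is costlier than beta testing one extra item.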

Typically, beta testing is best accomplished in one of two ways:

1. A Beta Exam consists of administering a test form that contains all (or a large percentage of) unused items to establish baseline performance for the test form. Typically, this is only done for new examinations or those without an established passing/cut score. Test forms developed in this way also exceed the number of items expressly needed to meet the test blueprint by one-third to one-half to account for any items that do not perform well enough to be scored. This allows for some latitude in decision making when creating the final set of scored items for the operational form, on which beta participant scores are calculated. In this method, beta participant scores are delayed for a few weeks, until the test form is finalized (i.e., the scored items that meet the test blueprint are identified) and a standard setting study is completed to establish the passing/cut score. As such, it is important to properly manage beta participants’ expectations by communicating a realistic timeline for their results.

Beta Exam Q&A

How are beta exams administered?

The beta exams should be administered in the same type of testing environment in which the operational exam will be administered. For instance, if the operational exam is to be administered in a proctored environment, the beta exam should be as well.

Who should participate in the beta testing process?

Beta exam participants should be representative of the target audience for the credential in terms of level of experience, training, practice environment, etc. Many organizations use actual candidates for the credential. If this is not possible, the candidates should be as close to the target exam population as possible.

Ideally, approximately 70 percent of the beta participants should be as similar as possible to the profile of the Minimally Qualified Candidate (MQC) for the credential (i.e., those who would just barely pass the exam), with 15 percent of the beta candidates being lower than minimally qualified (but still informed on the topic areas) and the remaining 15 percent of the beta candidates being higher than minimally qualified (e.g., subject matter expert level). The reason this is important is that classical test statistics are sample dependent, meaning that if many of the candidates have a significantly higher or lower level of expertise than the minimally competent candidate (see Step 1: Test Definition), the statistical results may be somewhat biased. In other words, items may appear easier or more difficult than they actually are for the MQC and may be inappropriately excluded from the operational test forms.
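The 70/15/15 recruiting mix described above can be turned into head-count targets for a beta window. This is a minimal sketch; the function name, the group labels, and the example of 200 participants are hypothetical.

```python
def beta_sample_targets(total_participants: int) -> dict[str, int]:
    """Split a beta sample into the approximate 15/70/15 mix described above.

    The two 15 percent groups are rounded down and any remainder goes to the
    MQC-like group, so the three targets always sum to the total.
    """
    if total_participants < 0:
        raise ValueError("total_participants must be non-negative")
    below = total_participants * 15 // 100
    above = total_participants * 15 // 100
    mqc_like = total_participants - below - above
    return {
        "below_minimally_qualified": below,
        "similar_to_mqc": mqc_like,
        "above_minimally_qualified": above,
    }


# Hypothetical example: recruiting 200 beta participants.
targets = beta_sample_targets(200)
print(targets)
```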

It is helpful to understand the level of expertise and/or experience and other important characteristics of your beta participants so this information can be considered as an additional data point during the passing score determination process. If information is not gathered for the beta candidates in another process, a survey can be included with the beta exam to gather these data.

How long do beta participants have to wait for their results?

Beta candidates often receive their results six to eight weeks after the beta testing window closes. If it takes a while to gather a sufficient amount of candidate response data, beta participants who took the exam at the beginning of a beta testing window may have to wait a few months to receive their results.

Are there any strategies for increasing participation in a beta exam?

Yes. Because beta exams typically consist of more items than the operational exam, test sponsors often provide a discount on the examination fee and other incentives (e.g., discount on a course or conference admission) to recruit beta participants. It is also helpful to communicate the purpose and importance of beta testing.

2. Item Seeding involves placing a small number of unused and unscored items on a “live” test form. This is an industry best practice for an established credentialing program to replenish its item bank on an ongoing basis. The number of unscored items included on each test form varies depending upon: the number of items needed to create new forms; the need to build a bank of replacement items for future use; and/or the ongoing replacement of older scored items. After the unscored items have been administered to candidates, their statistical performance is assessed, and they are either graduated for use as scored items on later test forms or sent back for revisions and beta tested again. Best practice is to NOT identify the beta items so candidates will respond to each item as if it contributes to their score, which yields accurate data. This method allows for items to be replaced on subsequent test forms without delaying candidate scores or results. It also maintains the passing/cut score determined during a previous standard setting study or through a statistical equating procedure. When using this method, it may be necessary to extend testing time to ensure candidates have adequate time to respond to the additional items.

Credentialing programs employ each of the above methods at different stages in their test development lifecycle. Credentialing programs starting from scratch will need to undertake the first method because they will be using all newly developed items. Credentialing programs that have just finished a job analysis may also need to utilize the first method, if a large number of items were developed to accommodate the new test blueprint. Once a passing/cut score is determined on the first test form under a new test blueprint, credentialing programs typically shift to the second method of beta testing in order to continue building the pool of items from which to draw replacements, create entirely new test forms, or mitigate test security breaches. Programs that elect not to continue beta testing new items tend to run out of “extra” usable items, which may be required if item content on a live form is breached or is no longer accurate (e.g., due to changes in technology, regulations, etc.) and needs to be replaced. A good beta testing routine can help you identify problematic items, avoid rescores, and maintain a steady stream of new examination content.

Step 6: Item Analysis

The quality of each test item is not truly known until a statistical analysis of the items has been performed. SMEs cannot reliably predict which items will measure at the desired level of difficulty or discriminate well between high-performing and low-performing candidates, as demonstrated by the fact that approximately one third of the items (no matter the industry) do not perform as desired.

Best practice is to run a statistical analysis after beta testing to ensure items are performing acceptably before using them as scored items as well as at regular intervals to ensure item and examination performance does not degrade in response to overexposure (e.g., being administered to a large number of candidates, including those who fail and retest) or a change impacting specific items (e.g., regulatory change, change in industry).

When evaluating the quality of items, the goal is to ensure the item is fair to candidates and provides useful measurement data for differentiating candidates of different qualification levels.

For most credentialing programs, two classical test theory (CTT) statistics are primarily used in evaluating item performance—item difficulty and item discrimination. Items should have an item difficulty that is clustered around the examination passing/cut score because the purpose of credentialing examinations is to identify those candidates that have the requisite knowledge and skills to perform at a minimal level of competence in relation to the scope and level of the credential (i.e., this is the measurement point of the examination).
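In CTT, item difficulty is commonly reported as the proportion of candidates who answered the item correctly (often called the p-value). The sketch below assumes responses have already been scored as 1 (correct) or 0 (incorrect); the function name and the sample data are hypothetical.

```python
def item_difficulty(scored_responses: list[int]) -> float:
    """Return the proportion of candidates who answered the item correctly.

    scored_responses holds one entry per candidate: 1 = correct, 0 = incorrect.
    Higher values mean an easier item for this particular group of candidates.
    """
    if not scored_responses:
        raise ValueError("at least one response is required")
    return sum(scored_responses) / len(scored_responses)


# Hypothetical data: 7 of 10 candidates answered the item correctly.
responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
print(f"Item difficulty: {item_difficulty(responses):.2f}")
```

Because the statistic is computed from whoever happened to take the item, the same item can yield a different value with a different candidate group.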

CTT and IRT Measurement

There are two theories of measurement used in the statistical analysis of items and examinations: classical test theory (CTT) and item response theory (IRT). A major advantage of a CTT analysis is that it can be performed on examinations that have lower candidate counts (e.g., fewer than 200 candidates). Another advantage is that CTT statistics are easy to interpret. The largest criticism of this method is that the results are sample dependent, meaning the results are impacted by the characteristics of the specific candidates whose data was included in the analysis. So, if a group of candidates was particularly well-prepared for the examination, the items might appear easier than they would be for another group of candidates who were less prepared.

This is a particular concern when using a CTT analysis to evaluate the items on a beta examination and the beta participants are not representative of the target audience of the examination (e.g., the organization used either already-certified individuals as beta participants or students who did not meet eligibility criteria for the certification).

IRT analyses are more complex and require specialized software and knowledge to properly conduct the analyses. An IRT analysis also requires a higher candidate count and data that fits the statistical assumptions for the IRT model to be used. Recommendations from psychometricians vary on the minimum candidate count for this type of analysis, but frequently the range is 200 to 500 candidates, depending on the IRT model to be used.

The key advantage of an IRT analysis is that the analysis is sample independent, meaning the results are not dependent on the particular characteristics of the specific candidates whose data was included in the analysis. Additionally, the analysis puts the candidate’s ability and item difficulty on the same scale, which is useful in making more accurate certification decisions.
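To illustrate what placing candidate ability and item difficulty on the same scale means, the sketch below uses the Rasch (one-parameter logistic) model, one of the simplest IRT models. The guide does not say which IRT model a program should use, so the model choice and the function name are assumptions made for illustration only.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch (1PL) model.

    Ability and difficulty are expressed on the same logit scale, so only the
    difference between them matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# When ability equals item difficulty, the candidate has a 50% chance of success.
print(rasch_probability(0.0, 0.0))
# A candidate one logit above the item's difficulty is more likely to succeed.
print(round(rasch_probability(1.0, 0.0), 3))
```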

Item Discrimination

Items should have an item discrimination statistic that is in the good or acceptable range. Ideal item discrimination is achieved when most of the high-performing candidates on the examination (i.e., candidates who achieve a high score on the examination) select the correct response(s) to the item, and most of the low-performing candidates (i.e., candidates who achieve a low score on the examination) select the incorrect responses to the items.
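One widely used discrimination statistic is the point-biserial correlation between the item score (0 or 1) and the candidate's total score. The guide does not name a specific statistic, so treating point-biserial as the discrimination index is an assumption, as are the function name and sample data below.

```python
import math


def point_biserial(item_scores: list[int], total_scores: list[float]) -> float:
    """Correlation between a 0/1 item score and the total examination score.

    Positive values mean higher-scoring candidates tended to answer the item
    correctly; negative values mean the opposite.
    """
    n = len(item_scores)
    if n != len(total_scores) or n < 2:
        raise ValueError("need two equally sized lists with at least 2 entries")
    mean_item = sum(item_scores) / n
    mean_total = sum(total_scores) / n
    covariance = sum(
        (i - mean_item) * (t - mean_total)
        for i, t in zip(item_scores, total_scores)
    )
    var_item = sum((i - mean_item) ** 2 for i in item_scores)
    var_total = sum((t - mean_total) ** 2 for t in total_scores)
    if var_item == 0 or var_total == 0:
        raise ValueError("scores must vary to compute a correlation")
    return covariance / math.sqrt(var_item * var_total)


# Hypothetical data: the three highest scorers answered correctly.
item = [1, 1, 1, 0, 0]
totals = [90, 85, 80, 60, 55]
print(f"Discrimination: {point_biserial(item, totals):.2f}")
```

In practice, many programs compute this against a total score that excludes the item being analyzed; that refinement is omitted here for brevity.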

When selecting items for a test form, items with a difficulty level in the desired range and good discrimination statistics are chosen first. If these “good” items are exhausted before the requirements of the test blueprint are met, it may be necessary to include items that are easier or more difficult than is ideal, or items that have less than ideal item discrimination.

It is important to minimize the number of items with less than ideal statistics to ensure a test form that is sufficiently reliable and supports your assessment measurement goals.

For the items that do not perform well, the item statistics provide a helpful roadmap for determining the issue(s). Sometimes the item distractors require candidates to make too fine a distinction between them and the key, making the item exceedingly difficult. In other instances, the key may be more obvious to candidates than was intended or the distractors may be too obviously wrong, making the item too easy.

Option Analysis

The option analysis, which identifies the response patterns of candidates in each performance group (e.g., which response options the high-performing candidates are choosing), is particularly helpful in pinpointing the reason the item is not performing well. To illustrate this point, an option analysis for an item is provided on the next page:

The option analysis would allow you to raise specific issues about the item with SMEs to see if the item could be revised and improved.

The item had negative discrimination because more of the high-performing candidates chose distractor B than key A. SMEs could evaluate if B is correct or partially correct, something in the stem is making some of the high performers think B is the correct answer, or some other explanation (e.g., change in industry or regulations that impacted item).
No candidates answered D, so SMEs should try to come up with a more plausible distractor for D.
Very few candidates answered C, so SMEs may also want to try to write a more plausible distractor for C as well.
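An option analysis like the one discussed above can be tabulated by counting which option each performance group selected. The response data below is invented purely to mirror the pattern described (key A, more high performers choosing B, few choosing C, none choosing D); it is not the guide's actual table, and the function name is hypothetical.

```python
from collections import Counter


def option_analysis(responses, options=("A", "B", "C", "D")):
    """Count selected options per performance group.

    responses is a list of (performance_group, selected_option) pairs.
    Options nobody selected are reported with a count of 0.
    """
    counts_by_group = {}
    for group, option in responses:
        counts_by_group.setdefault(group, Counter())[option] += 1
    return {
        group: {option: counts.get(option, 0) for option in options}
        for group, counts in counts_by_group.items()
    }


# Hypothetical responses for an item whose key is A.
responses = (
    [("high", "A")] * 4 + [("high", "B")] * 6
    + [("low", "A")] * 5 + [("low", "B")] * 4 + [("low", "C")] * 1
)
table = option_analysis(responses)
for group, counts in table.items():
    print(group, counts)
```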

Please note that not all items are worthy of the effort that is required to revise them. Items that are measuring concepts that are not critically important to competent performance or not encountered by a majority of candidates should likely be retired since they do not contribute valuable information about a candidate’s competency. The problematic items that are measuring an important concept and are just too difficult or have one or two answer options that need improvement can often be easily revised by SMEs.

When an item has been revised after it has been administered (on a beta or operational examination), it must be beta tested again to see if the revisions have positively or negatively affected the item performance. Any revisions to an item beyond fixing a typo can change how candidates respond to the item (e.g., how difficult the item is or which answer options the high-performing or low-performing candidates choose).

Step 7: Forms Build

New test forms must always be assembled to meet all requirements of the test blueprint. This adherence to the test blueprint is what ensures that each candidate taking the examination (no matter which test form is received) will be measured on the same job-related content as all other candidates. For example, consider the following blueprint for the topic of test development:

The Example Test Blueprint above shows a need for 100 total items on each test form. Additionally, each test form specifically requires those 100 items be divided among three content domains or topics: 40 items covering domain I, 40 items covering domain II, and 20 items covering domain III. Further, the total number of items within each domain is subdivided by cognitive level classifications that must also be matched when creating new test forms (e.g., 10 Recall items, 20 Analysis items, and 10 Synthesis items for domain I).
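Checking a draft form against the blueprint amounts to counting items per domain and cognitive level and comparing the counts with the targets. In the sketch below, only the domain totals (40/40/20) and the domain I split come from the text; the splits for domains II and III, the function name, and the data layout are hypothetical placeholders.

```python
from collections import Counter

# (domain, cognitive level) -> required number of items on each form.
BLUEPRINT = {
    ("I", "Recall"): 10, ("I", "Analysis"): 20, ("I", "Synthesis"): 10,
    # Hypothetical splits: the text only gives totals of 40 and 20 here.
    ("II", "Recall"): 10, ("II", "Analysis"): 20, ("II", "Synthesis"): 10,
    ("III", "Recall"): 5, ("III", "Analysis"): 10, ("III", "Synthesis"): 5,
}


def blueprint_gaps(form_items, blueprint):
    """Return {cell: actual - required} for every cell that misses its target.

    form_items is a list of (domain, cognitive_level) pairs, one per item.
    An empty result means the form matches the blueprint exactly.
    """
    actual = Counter(form_items)
    cells = set(blueprint) | set(actual)
    return {
        cell: actual.get(cell, 0) - blueprint.get(cell, 0)
        for cell in cells
        if actual.get(cell, 0) != blueprint.get(cell, 0)
    }


# A form built to match the blueprint, and one that is a Recall item short.
complete_form = [cell for cell, count in BLUEPRINT.items() for _ in range(count)]
short_form = complete_form[1:]  # drops one domain I Recall item

print(blueprint_gaps(complete_form, BLUEPRINT))
print(blueprint_gaps(short_form, BLUEPRINT))
```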

Multiple Test Forms

In addition to meeting the requirements of the test blueprint, it is also desirable to have the test forms be a similar level of difficulty and performance (e.g., have similar reliability, decision consistency, and standard error of measurement statistics).

In most cases, assembling new test forms is a manual process, which involves selecting the best quality items to cover each domain/cognitive level combination on a test blueprint. The items on each test form should not give clues about how to answer another item and should not measure the exact same knowledge as another item (called “enemy items”). All items included as scored items on the test form should also have defensible statistics. The process just described is used to create fixed linear forms, which are the most common type of test form.

When using fixed linear forms, candidates who receive the same test form receive all the same items, but the items (and answer options) may be delivered in a randomized order for each candidate. There are test form construction methods that automate the process (e.g., LOFT, CAT), but these techniques require much larger item banks than fixed linear forms and huge amounts of item characteristic documentation and/or advanced statistical models such as item response theory (IRT) to function appropriately. (See sidebar on IRT/CTT in Step 6: Item Analysis.) Typically, these methods are used by larger programs due to cost and other resource requirements.

Test Security

The form construction method used will depend on examination security considerations and other factors of the credentialing program (e.g., annual candidate count, practical constraints of item and test development). Most credentialing programs will want to create and maintain at least two concurrent fixed linear test forms with about 20% item overlap between the two forms.
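The overlap between two concurrent forms can be checked by comparing their item identifiers. This is a minimal sketch; the item ID format, the function name, and the two example forms are hypothetical.

```python
def form_overlap(form_a: set[str], form_b: set[str]) -> float:
    """Return the share of form A's items that also appear on form B."""
    if not form_a:
        raise ValueError("form_a must contain at least one item")
    return len(form_a & form_b) / len(form_a)


# Hypothetical forms of 100 items each that share items 81 through 100.
form_a = {f"ITEM-{number:03d}" for number in range(1, 101)}
form_b = {f"ITEM-{number:03d}" for number in range(81, 181)}

print(f"Overlap: {form_overlap(form_a, form_b):.0%}")
```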

This is important for a few reasons, but the primary one is to increase test security. Having at least two concurrent test forms that are largely distinct in terms of item content decreases the amount of exposure of each individual item on the test forms (i.e., candidates see each item fewer times), which means your items will last longer before needing replacement (candidates talk about tests, even if you tell them not to).

Additionally, when a breach of test security does occur, it increases the likelihood that only one of the two test forms will be breached, which allows you to continue testing while a replacement form is being developed.

Fair Testing

In addition to bolstering test security, having two concurrent test forms also ensures fair testing to your candidates. As mentioned previously, when both tests are aligned exactly to the test blueprint, all candidates are being assessed with the same sampling of job-related content.

Not all candidates pass on the first attempt, however, so having a second concurrent form ensures that candidates can be granted at least one re-take of the exam without seeing the exact same set of items again. This reduces measurement error by preventing candidates from memorizing and responding to the same items, which also ensures that candidates re-taking the exam do not have any inherent advantage over those who passed on their first attempt. This all supports the goal of fairly assessing the competency of candidates and having confidence in the examination results.

Step 8: Standard Setting

A standard setting study is a process by which a passing score or cut score for a test form is selected to serve as the inflection point for all pass or fail determinations for candidates. For the cut score to be defensible, a criterion-referenced methodology must be used that ties the cut score to the theoretical performance on the examination by a candidate who would only just barely meet the minimum standards for competence.

The Angoff Method

There are multiple criterion-referenced standard setting methodologies (e.g., modified Angoff, bookmark, etc.), but the modified Angoff procedure is used most frequently to arrive at the cut score(s) for credentialing examinations.

Named after the late William Angoff, the Angoff procedure asks a demographically representative committee of SMEs (e.g., geographically, experientially) to come to agreement on what a candidate who just barely surpasses minimum performance standards on the examination would be expected to know and do (i.e., minimally qualified candidate profile).

After training on the process and a calibration exercise, SMEs are then asked to use that theoretical profile of the minimally qualified candidate (MQC) to estimate the percentage of MQCs who would be expected to get each item correct (called Angoff ratings). This is done for each scored item on the examination.

The individual ratings of SME panelists are then consolidated and compared. Items with ratings that are too varied (i.e., ratings that differ by more than 20-30 percentage points) are discussed with the entire SME panel. SME panelists share their rationales for their initial Angoff ratings.

After the discussion of the perceived difficulty of the item for the MQC, SMEs are given the opportunity to revise their ratings. Throughout this process, the psychometrician may periodically share information to help guide discussion (e.g., actual item difficulty statistics).

Once all Angoff ratings have been reviewed and finalized, the ratings are averaged across all items on the test form to derive an initial Angoff cut score. The Angoff cut score can be thought of as the lowest score an MQC is likely to achieve on the test form. In addition to the Angoff cut score, the variability in SME ratings or standard error of measurement of the test form is used to calculate a range of acceptable cut scores around the Angoff cut score that may be selected as a cut score. Nevertheless, the SME panelists would need to provide a sufficient and defensible argument for deviating from the Angoff cut score.
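The rating steps described above (flagging items with widely varying ratings, then averaging the final ratings) can be sketched as follows. The ratings, item IDs, and function names are hypothetical, and the 20-point default is just the lower end of the 20-30 point range mentioned in the text. The range of acceptable cut scores around the Angoff value is not computed here.

```python
from statistics import mean


def items_needing_discussion(ratings, max_spread=20):
    """Return item IDs whose SME ratings differ by more than max_spread points.

    ratings maps each item ID to a list of Angoff ratings (percentages, 0-100),
    one per SME panelist.
    """
    return [
        item_id
        for item_id, item_ratings in ratings.items()
        if max(item_ratings) - min(item_ratings) > max_spread
    ]


def angoff_cut_score(ratings):
    """Average each item's ratings across SMEs, then average across items.

    The result is a percent-correct cut score for the form.
    """
    if not ratings:
        raise ValueError("at least one rated item is required")
    return mean(mean(item_ratings) for item_ratings in ratings.values())


# Hypothetical ratings from three SMEs on a three-item form.
ratings = {
    "item_1": [60, 70, 65],
    "item_2": [40, 80, 60],
    "item_3": [75, 80, 85],
}
print("Discuss:", items_needing_discussion(ratings))
print(f"Initial Angoff cut score: {angoff_cut_score(ratings):.1f}%")
```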

Criterion-Referenced Standard Setting

The SME panel then unanimously recommends a final cut score to the governing body of the credentialing program for approval. Once finalized, the cut score becomes the criterion-referenced passing point for that specific test form.

The cut score for alternate forms and/or future forms of the examination will be determined by equating the initial form’s cut score (determined through the standard setting process) to the equivalent score on the subsequent forms.

Thus, a fair, criterion-referenced cut score is applied to all test forms to establish minimum competence of candidates, regardless of differences in individual form difficulty (i.e., harder forms have lower cut scores, easier forms have higher cut scores).

Who was William Angoff?

William Angoff (1919-1993) was an American research scientist who worked for the Educational Testing Service for 43 years.

He graduated from Harvard University and later earned a master’s degree and Ph.D. from Purdue University. During WWII, Angoff worked for the U.S. Army as a psychological testing expert.

He was hired by the Educational Testing Service in 1950, becoming Director of Developmental Research in 1976. He helped improve the Scholastic Aptitude Test (SAT). In 2019, the test was taken by 2.2 million American high school students.

During his professional career, Angoff made major contributions to educational measurement and authored seminal publications on psychometrics including the definitive Scales, Norms and Equivalent Scores. Angoff was known for his commitment to the highest technical standards and for his ability to make complex measurement issues widely accessible and understandable.

Versus Arbitrary Cut Scores

The process described above stands in contrast to other common ways of determining cut scores for test forms (e.g., selecting an arbitrary cut score of 70% or norm-referenced methods such as grading on the curve or percentile rank). The reason credentialing programs go through the rigorous process of a criterion-referenced standard setting study is because arbitrary and norm-referenced cut scores change meaning depending on the difficulty of the test forms and the ability level of the candidate group taking them. This sets up an indefensible platform for making pass/fail decisions about candidates, over time and between test forms, which is easily challenged in court.

For example, a credentialing program chooses an arbitrary standard of 70% for passing an examination. There is no supporting evidence that 70% corresponds to the minimum level of competence. In fact, depending on the difficulty of the initial test form, 70% could far exceed minimum competence (i.e., qualified candidates would fail), or worse, it could underestimate required competence (i.e., non-qualified candidates would pass). This gets compounded as more test forms are developed because applying 70% to new test forms may or may not represent the same level of competence being measured on all other test forms, particularly if test forms are not constructed to be equally difficult.

Arbitrary standards have no place in valid certification and licensure testing, and, in our opinion, no place in any decision-making process.

Versus Norm-Referenced Standards

Norm-referenced standards are only slightly better than arbitrary standards. In setting norm-referenced standards, at least some attempt is made to gauge the ability level of candidates compared to the difficulty of the test they are taking. However, since the ability of candidate groups can vary between administrations (i.e., one group is more able or prepared than another), norm-referenced standards will differentially impact individual candidates depending on the cohort with whom they took the exam.

For example, a qualified candidate may take the exam with a group of peers who are all exceedingly qualified and/or prepared. Applying a norm-referenced standard to this group may result in a failing score for the otherwise qualified candidate because their peers were even more qualified. Worse, an unqualified candidate may take the exam with a group of other unqualified people. Applying a norm-referenced standard to this group would result in unqualified people passing even if no one was truly competent. This is also compounded by differences in form difficulty (e.g., a qualified candidate taking the exam with more qualified peers could also be assigned to a highly difficult test form, further decreasing the qualified candidate’s chance at passing).
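The cohort effect described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the scores are made up, the norm-referenced rule ("pass if at or above the cohort median") is just one simple way to grade on the curve, and the fixed cut score of 68 is a hypothetical value standing in for the result of a standard setting study.

```python
from statistics import median

# Hypothetical criterion-referenced cut score (percent correct), assumed to
# come from a standard setting study. Not a recommended value.
CUT_SCORE = 68

def passes_norm_referenced(score, cohort_scores):
    """Hypothetical norm-referenced rule: pass if at or above the cohort median."""
    return score >= median(cohort_scores)

def passes_criterion_referenced(score, cut_score=CUT_SCORE):
    """Criterion-referenced rule: pass if at or above a fixed cut score."""
    return score >= cut_score

# Made-up cohorts; each list includes the candidate of interest.
strong_cohort = [75, 78, 82, 85, 88, 91, 94]  # the qualified candidate scored 75
weak_cohort = [40, 45, 50, 52, 55, 56, 58]    # the unqualified candidate scored 56

qualified_norm = passes_norm_referenced(75, strong_cohort)
unqualified_norm = passes_norm_referenced(56, weak_cohort)
qualified_crit = passes_criterion_referenced(75)
unqualified_crit = passes_criterion_referenced(56)

print("Norm-referenced:      qualified passes?", qualified_norm,
      "| unqualified passes?", unqualified_norm)
print("Criterion-referenced: qualified passes?", qualified_crit,
      "| unqualified passes?", unqualified_crit)
```

Under the norm-referenced rule, the outcome depends on who else sat the exam. Under the fixed cut score, the same score always yields the same decision.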

Conclusion

Conducting a standard setting study and establishing a criterion-referenced cut score is the only way to ensure all candidates are fairly assessed. A criterion-referenced cut score applies the exact same standard of competence to each individual candidate, ensuring that truly competent candidates pass while non-competent candidates are excluded.

Further, criterion-referenced cut scores can be validly applied to other test forms through statistical equating procedures to ensure that the cut score on every form translates to the same level of competence, even if the cut score is higher on one form (because it is easier) or lower on another (because it is more difficult).
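As a rough illustration of how a cut score can move between forms while representing the same level of competence, here is a sketch of linear equating, one of the simpler equating methods. It is not necessarily the procedure a given program or psychometrician would use. The score lists are made up, the function name `linear_equate` is hypothetical, and the sketch assumes the two forms were taken by equivalent candidate groups. Operational equating uses far larger samples and often relies on common items.

```python
from statistics import mean, pstdev

def linear_equate(score, from_scores, to_scores):
    """Map a score on one form to the scale of another form by matching
    the means and standard deviations of the two score distributions."""
    mu_from, mu_to = mean(from_scores), mean(to_scores)
    sd_from, sd_to = pstdev(from_scores), pstdev(to_scores)
    return mu_to + (sd_to / sd_from) * (score - mu_from)

# Made-up raw scores from two equivalent groups (assumption).
form_a_scores = [60, 65, 70, 75, 80]  # base form, where the standard was set
form_b_scores = [55, 60, 65, 70, 75]  # new form, which turned out harder

cut_a = 68  # hypothetical cut score from the standard setting study on Form A
cut_b = linear_equate(cut_a, form_a_scores, form_b_scores)

print(f"Cut score on Form A: {cut_a}")
print(f"Equated cut score on Form B: {cut_b:.1f}")
```

Because Form B is harder in this made-up data, its equated cut score comes out lower than the cut score on Form A, in line with the point above.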

Step 9: Scoring Beta Participants

After Steps 1-8 are completed and the exam is finalized, beta participants receive their score feedback. It is important to note that beta participants are typically scored on the same set of items that are contained on the operational exam that “goes live” for administration to other candidates. This ensures that the testing process is fair for all candidates, whether or not they participated in the beta testing process.

Best Practices in Score Reporting

Reporting the examination results to the candidates (beta participants or other candidates) is one of the most significant communications of any credentialing program. The general rule for providing results to candidates is to provide only the information that will be meaningful to the candidate.

For passing candidates, the results reporting can be a relatively straightforward endeavor. At a minimum, the passing candidate should get his or her passing result and information on what is needed to finalize obtaining the credential or other information related to holding or maintaining the credential (e.g., appropriate use of certification mark, recertification requirements). If desired, passing candidates can also be provided with their overall score on the examination (raw, percentage, or scaled) and/or their content domain/topic scores.

However, it should be noted that providing additional score information to candidates beyond a passing result (e.g., “Congratulations, you passed!”) may have the potential for misuse in the field, and should be thoughtfully considered prior to being distributed.

For failing candidates, the results reporting includes a few more considerations, some of which originated early in the test development process. Candidates who fail an examination should be provided with their fail result, as well as any eligibility rules and information for re-taking the exam. It is best practice to give failing candidates feedback that will help them identify their weak areas so they can better prepare to retake the exam. However, if feedback to failing candidates is to be given in the form of scores, the examination must be developed to facilitate that information being meaningful and useful to the candidates.

For example, presenting an overall failing score (raw, percentage, or scaled) to candidates can be useful only when presented with information about how far they fell below the cut score (i.e., how much they need to improve). Additionally, provision of content domain/topic scores relies on each domain/topic containing a sufficient number of items to yield valid and reliable results (which had to be considered when developing the test blueprint). This equally applies if domain/topic scores are to be reported to passing candidates.

While the minimum number of items in a domain/topic or subscale varies depending on the homogeneity of content, 15 to 20 items is generally accepted as the lower range. Only valid and reliable domain/topic scores will provide meaningful feedback that failing candidates can then use to focus their studies (e.g., in topics where scores were comparatively low) and improve their scores.
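A simple check along these lines can be run against a test blueprint before committing to domain-level score reports. The sketch below is illustrative only. The blueprint, the domain names, and the function name are hypothetical, and the threshold of 15 items is taken from the lower end of the range mentioned above. A program may reasonably choose a higher minimum.

```python
# Lower end of the 15-to-20 item range discussed above (a judgment call).
MIN_ITEMS_FOR_SUBSCORE = 15

def reportable_domains(blueprint, min_items=MIN_ITEMS_FOR_SUBSCORE):
    """Return, for each domain, whether it has enough scored items
    to support reporting a domain-level score."""
    return {domain: count >= min_items for domain, count in blueprint.items()}

# Hypothetical blueprint: domain name -> number of scored items on the form.
blueprint = {
    "Safety Procedures": 30,
    "Client Assessment": 18,
    "Professional Ethics": 8,
}

report_plan = reportable_domains(blueprint)
for domain, ok in report_plan.items():
    status = "report domain score" if ok else "too few items, do not report"
    print(f"{domain}: {blueprint[domain]} items -> {status}")
```

Domains that fall short can be merged with related domains for reporting purposes or left out of the score report, depending on what the program decides.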

Misusing Exam Scores

Employers may discover that candidates have their overall score for their credentialing exam and begin to use that as a personnel selection criterion (i.e., candidates with higher test scores get preference for hiring or promotion). That would be a misuse of examination results that could impact candidates adversely since the scores on the examination beyond the cut score are not validated to differentiate between higher levels of performance. Errantly using exam raw or percentage scores for personnel decisions also does not account for the difficulty of individual test forms (i.e., a candidate may have achieved a lower overall score on a more difficult form, but still passed the examination).

Step 10: Test Maintenance

If you have reviewed and applied Steps 1-9 to your credential development process, congratulations are in order. Your credentialing program is launch-ready.

However, creating a credentialing exam is not a “one and done” proposition. To maximize the return on your credential development investment, you need a carefully considered test maintenance plan.

This means that you will need to monitor the examination’s performance on an ongoing basis and update the test forms at regular intervals. An examination monitoring and maintenance plan is critical to mitigate examination security risks and ensure that the examination continues to support the purpose of your credentialing program.

Failure to establish a test maintenance process is a source of great program risk. The value of your credentialing program can be diminished if its content is perceived to be outdated or individuals attain the credential through fraudulent means (e.g., cheating).

There is no magic formula or prescription for examination maintenance activities. Each program must consider its own risk factors and circumstances to create its examination monitoring and maintenance plan. Your plan should include the following activities:

Monitor the examination and item statistics
Monitor the currency of item content (e.g., changes to professional practice, product releases, and/or regulatory changes)
Monitor candidate comments about the items, examination, or program, if applicable
Monitor comments or feedback from stakeholders (e.g., consumers, employers, third-party payors)
Create new items (i.e., test questions)
Beta test (i.e., pretest) the new or revised items
Create new or revised test forms
Determine the cut score for the new or revised test forms

The schedule or intervals for these activities will vary based on a variety of factors.

Practical considerations typically drive the schedule of monitoring activities. For example, the critical factor controlling the schedule for examination and item statistics generation is candidate volume. To achieve reliable statistics, at least 100 candidates per test form is usually desired. If that is not feasible, analyses can be run with as few as 60 candidates. At a minimum, examination and item statistics should likely be reviewed annually.
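The volume guidance above can be turned into a small helper for deciding when a form has enough data for analysis. This is only a sketch of the rule of thumb as stated. The function name, the labels, and the per-form counts are hypothetical, and the thresholds of 100 and 60 are guidelines, not hard limits.

```python
DESIRED_N = 100  # candidates per form usually desired for reliable statistics
MINIMUM_N = 60   # analyses can be run with as few as this many candidates

def analysis_readiness(candidate_count):
    """Classify a form's candidate volume against the rule-of-thumb thresholds."""
    if candidate_count >= DESIRED_N:
        return "ready"
    if candidate_count >= MINIMUM_N:
        return "usable with caution"
    return "keep collecting data"

# Hypothetical candidate counts per test form since the last review.
form_counts = {"Form A": 134, "Form B": 72, "Form C": 41}

readiness = {form: analysis_readiness(n) for form, n in form_counts.items()}
for form, status in readiness.items():
    print(f"{form}: {form_counts[form]} candidates -> {status}")
```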

Candidate volume will also help you determine reasonable intervals for reviewing candidate comments (e.g., monthly, quarterly, or yearly). Gather stakeholder feedback about the program when there are opportunities to do so. At a minimum, you will want to do this every few years before a job analysis to ensure that adjustments needed to meet market demands are considered when developing a new test blueprint.

The intervals for reviewing item content also depend on practical considerations. For instance, IT certification programs may need to review current items in relation to product releases to ensure that they have not become outdated. In fact, every type of credentialing program needs to identify the factors that put its items at risk of becoming outdated (e.g., regulatory changes, changes in practice related to new technology). For many programs, an annual review of the items is sufficient. For credentialing programs with only 100 or 200 candidates per year, it may be reasonable to review the item statistics and item relevance at the same time each year.

The schedule for the statistical monitoring of the items and test forms has implications for the examination maintenance activities since these statistics are needed to evaluate the items for continued use on test forms.

A primary goal of your examination maintenance efforts will be to manage examination security risk and limit the risk posed by outdated item content. Here are some questions to consider related to these risks.

Do you have a high annual candidate count? The higher the volume, the higher the item exposure and examination security risk.
Where is your examination delivered? Certain regions of the world are at a higher risk for examination security breaches due to cultural differences in how tests are perceived.
Is your program considered a high-stakes credential (i.e., required to work in the industry or greatly valued)? If the credential is required to practice or work in the industry or if the credential is highly valued, candidates have a greater incentive to undermine test security during the examination.
Are candidates likely to share information about the examination or items (e.g., many candidates work for the same employer, candidates are likely to take a prep class from the same provider)?
What are your retesting policies for failing candidates (e.g., length of time a candidate must wait before retaking the exam, number of times that a candidate can retest)? The less stringent the retesting policies, the greater the risk will be to examination security.
At what intervals may the item content become outdated? IT certifications require faster item and examination development cycles to keep up with product releases.
Will you seek accreditation for your program? If so, there are standards related to maintaining examination security and having a sufficient number of items to remain operational in the event of a security breach.

Your answers to the preceding questions will help identify how many test forms you need to develop for concurrent use and how frequently you should create new test forms to adequately maintain the examination. No matter the number of test forms and the frequency with which they need to be revised, your program should have a plan for ongoing new item development and beta testing.

Your item writers should be generating new content and revising and validating these items for use on an examination on a regular basis. Incorporate the revision of items that did not perform well or have become outdated into the ongoing item development process. Beta test new or revised items as unscored items on the test forms. Once data from about 100 examinees has been gathered on one set of beta test items, replace them with a new set.
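The rotation rule for beta items can be tracked with very little code. The sketch below assumes a hypothetical tally of examinees per beta item set. The set names and the function name are made up, and the target of 100 is the approximate figure given above, not a fixed requirement.

```python
BETA_TARGET = 100  # approximate number of examinees needed per beta item set

def sets_ready_to_rotate(response_counts, target=BETA_TARGET):
    """Return the names of beta item sets that have gathered enough data
    and can be swapped out for a new set of unscored items."""
    return sorted(name for name, n in response_counts.items() if n >= target)

# Hypothetical tallies: beta item set -> examinees who have seen it so far.
response_counts = {"beta_set_1": 112, "beta_set_2": 47, "beta_set_3": 100}

ready = sets_ready_to_rotate(response_counts)
print("Ready to replace:", ready)
```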

At your established intervals, new test forms can be created. Whether you are replacing 10 or 20 items in an existing test form or creating a new one, you will need to determine its cut score. In fact, any change to the scored items on an examination requires that the cut score be evaluated. Replacing or editing the scored items on an examination WITHOUT evaluating the cut score negatively impacts the validity of examination results (and therefore, their defensibility).

This process of monitoring items related to performance and content, revising and retiring items, developing new items, beta testing items and creating new test forms constitutes a life cycle for items and examinations. This life cycle is the life blood of a healthy credentialing program.

In the absence of routine care, human beings’ ability to function begins to fail. When credential examinations lack routine care, typical symptoms include the use of poorly performing items to sample job-related content and overexposed examinations. One result can be errors in determining candidate competence (i.e., passing those who should not pass, and failing those who should pass). Worse, examinations can sometimes have too few items to meet their test blueprint, which results in a test that lacks legal defensibility.

Because there are so many factors to consider when formulating a plan for monitoring and maintaining your examination, you may find it helpful to consult with a psychometrician.

It is particularly important to budget sufficient funds and recruit enough SMEs to support the examination maintenance activities in your plan. The continued success of your credentialing program relies on your ability to keep the examination a valid and reliable measure of candidate competence.

You are now officially wicked smart…

Congratulations! You’ve now reviewed all ten steps of Smart Test Development. All were informed by the 40+ years’ collective experience of our Kryterion Psychometrics Team. You’re now better positioned for the launch of your first or next certification program. The ten steps can also guide you in re-engineering an existing exam that may not be measuring up to expectations.

If you have any questions about the content of this guide, feel free to contact the Kryterion Psychometrics Team as follows: [email protected]

If you’d like to discuss specific aspects about your current, next or first(!) credentialing exam, just click the link below. We’ll follow up via email to schedule your free, no-obligation 30-minute phone call. Talk soon!

https://www.kryterion.com/psychometric-services-form
