I help write lots of “assessment questions”—some are for quizzes after courses, others are just review questions during courses. Some are for industry certifications like the CISSP and CSSLP. Some help teach, others gauge the effectiveness of teaching. I tend to be pretty good at it, so I put together some of my thoughts on how I do it.
The following is from Wikipedia: “Psychometrics is the field of study concerned with the theory and technique of educational and psychological measurement, which includes the measurement of knowledge, abilities, attitudes, and personality traits. The field is primarily concerned with the study of measurement instruments such as questionnaires and tests. It involves two major research tasks, namely: (i) the construction of instruments and procedures for measurement; and (ii) the development and refinement of theoretical approaches to measurement.”
These are my standard practices for writing questions used to assess knowledge, aptitude, attitude, or personality. Guidance in this document is drawn primarily from my experiences at (ISC)2 writing the CISSP and CSSLP questions.
Why do we need to create good assessment questions? Quite simply, we must ensure we’re testing what we think we’re testing. If we’re attempting to discover someone’s ability to recall facts about a given subject, we don’t want their score to be biased by their ability to create essay answers, work out logic puzzles, or catch overly clever wordplay in questions and answers. Similarly, if we’re assessing someone’s attitude and personality[1], we don’t want to ask questions that rely on depth of technical knowledge. This approach removes (to the extent possible) human bias from the assessment process and gives us greater confidence that the results reflect the candidate’s true ability in the areas important to us.
There are standard terms. They are useful.
Candidate — The person taking the assessment.
Stem — The question or statement that needs to be answered or completed.
Key — The right answer. The answer that is best from among the choices. Ideally the key is the best of all possible answers, not just the best of what’s on the assessment.
Distractors — Wrong answers.
Options — Collectively the distractors and key.
Rules for Good Questions
Multiple-choice assessments are preferred because they are easy to score, they limit the effectiveness of guessing, and they can probe very specific knowledge without requiring essay answers, making the assessment more efficient for everyone.
All questions have the same format: a stem and exactly four[2] possible responses.
There are essentially two forms of question: sentence completion and question-answer. Here are two examples:
Multi-tier applications often use a
- Bell-LaPadula model.
- parametric polymorphism paradigm.
- model-view-controller paradigm.
- discretionary access control model.
The Payment Card Industry (PCI) Data Security Standard (DSS) requires which of the following activities?
- Public disclosure of software defects
- Agile software development methodologies
- Encryption of all personally identifiable information (PII)
- Use of a software source code scanning tool
There is exactly one right answer. No questions have “(a) and (b)” or “all of the above” as distractors or keys. The option “none of these” is discouraged and should be used only when there is an unequivocally correct option that is obviously not included or implied by the distractors.
All options (distractors and keys) are approximately the same length (i.e., they look visually equivalent). When there is significant variability in length, they are often arranged such that the shortest option is first and each option gets progressively longer.
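If you manage a question bank programmatically, the shortest-to-longest ordering convention is easy to apply. A minimal sketch (the `arrange_options` helper is hypothetical, not from any standard tool; the sample options are borrowed from example Q2b below):

```python
# Hypothetical helper: order a question's options from shortest to longest
# so that length differences are less conspicuous, while tracking where
# the key ends up after sorting.

def arrange_options(options, key):
    """Sort options by length (shortest first); return (ordered, new key index).

    `options` is a list of answer strings; `key` is the index of the
    correct answer in the original list.
    """
    key_text = options[key]
    ordered = sorted(options, key=len)
    return ordered, ordered.index(key_text)

options = [
    "track user sessions.",
    "keep personal information confidential.",
    "provide database access control.",
    "defeat cryptanalysis attacks.",
]
# The key is originally at index 1; after sorting it moves to the end,
# since it happens to be the longest option.
ordered, key_index = arrange_options(options, 1)
```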
All distractors should be equally capable of distracting the candidate. If some or all distractors are clearly “made up,” then there will be an artificially high correctness rate for that stem. (See “Rules for Scoring”)
Grammar must be parallel. If the stem is a sentence that needs completing, then all four options must complete the stem with correct grammar. If the stem asks a question, then all four options must answer it similarly. For example, in Q1, option (b) is not parallel to the others because it does not use the infinitive “To do XYZ…” format.
Q1: Which of the following is the BEST reason to hash passwords when they are stored in a database?
- (a) To protect passwords from unauthorized disclosure.
- (b) Speeding up comparison of the user’s login with the stored hash.
- (c) To have consistent sizes of data in the database.
- (d) To protect against social engineering attacks.
When distractors are too strong and might trick a knowledgeable candidate, as in Q1, words such as BEST or MOST IMPORTANT can be added. In such a case, those words are always written in all capitals and bolded.
There are no “except” questions. That is, the stem should never say “All of the following are best practices EXCEPT…”
All acronyms are spelled out when first used. From then on, just the acronym is used. This is on a per-question basis, since the ordering of questions and the set of questions selected for a given assessment may change. Just because something is an acronym does not mean it is a proper noun. That is, its constituent words do not necessarily need to be capitalized. For example: “user acceptance testing (UAT)” is correct. “User Acceptance Testing (UAT)” is incorrect. At the beginning of a sentence it would be “User acceptance testing (UAT).” In other cases, the constituent words are already a proper noun, thus they are capitalized: National Institute of Standards and Technology (NIST).
Words from the stem must not appear in the key, unless they also appear in a majority of the distractors. It is not necessary that the same word appear in all four answers, but if a word from the stem is repeated in the key, then words should be repeated in the distractors. In example Q2a, the key is (a), but it includes cluing. The word “personal” appears in the key, but no distractors have words from the stem.
Q2a: Encrypting personally identifiable information in a database helps
- keep personal information confidential.
- provide role-based access control.
- track user sessions.
- defeat brute-force attacks.
In example Q2b, the question is improved by leaving the cluing in (a) and repeating words from the stem in options (b) and (d).
Q2b: Encrypting personally identifiable information in a database helps
- keep personal information confidential.
- provide database access control.
- track user sessions.
- defeat cryptanalysis attacks.
An alternative improvement would be to rewrite (a) to something like “achieve confidentiality goals,” and leave (b) and (d) alone. The example Q2b is superior because (b), (c), and (d) are all weak to begin with. The mock cluing makes them marginally stronger.
Writing distractors is the hardest part of writing a good assessment question. You will spend the vast majority of your time and effort making good distractors.
Distractors must be real things. For example, you cannot use a term like “blue hat hacking” or “test vulnerability analysis” because such things are not real terms that are accepted in the industry.
Distractors must be plausible. In question Q2b above, option (c) is a poor distractor because it stands out as very obviously not related to the stem.
Parallelism also applies across all options. Example Q3 is a good example of parallelism. “Asymmetric” and “symmetric” are combined with “cipher” and “hash function” to create four parallel choices. If one of these options were very different, it would stand out as either a likely key or an obviously bad distractor.
Q3: The Advanced Encryption Standard (AES) is an example of a(n)
- asymmetric cipher
- symmetric cipher
- asymmetric hash function
- symmetric hash function
Avoid absolutes in distractors, especially as reasons for making them illegitimate. If “performing code analysis” would be legitimate, but “ALWAYS performing code analysis” is not, then it is a bad distractor.
Distractors are usually bad because:
- They are ridiculous. They couldn’t possibly be related to the stem.
- In an effort to be wrong, they include a mish-mash of terms that obviously don’t harmonize, as in “user acceptance testing of the architecture.”
- They are grammatically weak (i.e., they complete the sentence poorly).
- They are too short or too long.
- They don’t include cluing when the key does.
Avoid narrative statements that do not add meaningful information. Consider the first sentence of Q4a; it does not add any meaningful information.
Q4a: An organization is using a tool to scan its source code as part of its compliance with the Payment Card Industry (PCI) Data Security Standard (DSS). When the source code analysis tool mistakenly identifies a vulnerability in a module when, in fact, there is no vulnerability, this is called a
- false positive.
When rewritten in Q4b, only the essential words from Q4a are used, focusing the question on the topic.
Q4b: If a source code analysis tool mistakenly identifies a vulnerability in a module when, in fact, there is no vulnerability, this is called a
- false positive.
Avoid absolutes in stems: “A firewall must NEVER permit which kinds of packets?” or “Source code must ALWAYS be protected using…”
Avoid weak terms (“should,” “could,” “ought to,” etc.). Questions should ask about incontrovertible issues.
Avoid teaching or providing information in the stem. One question should not provide clues to the answer of another question. For example, a stem beginning “Because source code analysis tools frequently warn about false positives…” would provide clues to question Q4b.
Everything the candidate needs should be in the stem. That is, the candidate should not need to read all the options to decide what the stem is driving at. A stem such as “Source code control prevents…” followed by four options is bad, because source code control prevents many things. The only way to answer the question is to look at the four options and see which one makes sense.
Minimize the number of definitional items. Example Q3 is definitional. If the question boils down to “what is encryption?” or “what is a digital signature?,” it is not particularly good. Stems should focus on what is accomplished by the application of knowledge. Instead of a stem that asks “What is a digital signature?,” a stem could ask “To verify the integrity of a document, software can check for and validate a…” and the key can be “digital signature.” Beware of having other stems elsewhere in the assessment that say “Digital signatures provide document integrity by…”
Similar to definitional guidance, stems should avoid anything that can be answered through rote memorization. “X.509 is a standard related to…” is a bad stem because anyone doing free text association is likely to get it right if they’ve ever been exposed to X.509 in any way.
Keys must be affirmative. That is, they should always represent the desirable practice that the candidate should want to do.
Keys should not stand out because of the presence or absence of absolutes. If a distractor is wrong because it says “Always fix the bug before release” and the key is right because it says “Fix some bugs before release and triage the remaining ones,” then the key is bad.
Keys should be randomized throughout the assessment. Double-check that “a,” for example, is not the correct answer a disproportionate number of times.
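This double-check is easy to automate. A minimal sketch (the `key_distribution` helper and the 0.15 tolerance are my own illustrative choices, not part of any standard):

```python
# Tally how often each answer position (a-d) is the key, and flag any
# position whose share of correct answers strays too far from the
# expected 1/4 for a four-option format.

from collections import Counter

def key_distribution(keys, options=4, tolerance=0.15):
    """Return (counts, flagged): per-letter key counts, plus the share of
    any letter whose frequency deviates from 1/options by more than
    `tolerance`."""
    counts = Counter(keys)
    expected = 1 / options
    total = len(keys)
    flagged = {
        letter: count / total
        for letter, count in counts.items()
        if abs(count / total - expected) > tolerance
    }
    return counts, flagged

# Example: in this 10-question assessment, "a" is the key 70% of the
# time, so it gets flagged.
counts, flagged = key_distribution(list("aaabacadaa"))
```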
Keys are usually bad because they provide too much information to candidates who don’t know the answer. That usually happens for one of these reasons:
- The key reuses a word from the stem
- The key stands out as being the best fit grammatically (i.e., it completes the stem best grammatically, regardless of the subject)
- The key is the longest or the shortest option
- The key is the simplest or the most complicated option
A surprising number of keys can be found by looking for “compound” options. That is, the presence of “and” or “or” or “but”. Very often the key is the only option that has “this and that” where the distractors have only a single concept.
Rules for Scoring
When items are just for reviewing material, scoring isn’t important.
When assessments are scored, they are usually scored on the total number of correct answers out of the total number of questions. This means it is to the student’s advantage to guess and we want to limit the effectiveness of guessing. That is, candidates with insufficient knowledge of the subject matter should guess the correct answer about 25% of the time on a given question.
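The 25% figure is just the arithmetic of a four-option format: a blind guess has a 1-in-4 chance per question. As a trivial sketch (the function name is illustrative):

```python
# Expected number of correct answers from pure guessing on a
# multiple-choice assessment with a fixed number of options per question.

def expected_guess_score(num_questions, options_per_question=4):
    """Each question contributes 1/options to the expected score."""
    return num_questions / options_per_question

# On a 100-question, four-option assessment, guessing alone yields
# about 25 correct answers on average.
baseline = expected_guess_score(100)  # → 25.0
```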
[1] For example, as part of pre-employment screening to determine their fitness for a particular team.
[2] Five is also acceptable, but three is too few and six is too many. Choose one amount and stick with it throughout the assessment.