## Abstract

Item test analysis is an aid in identifying items that need to be eliminated from an assessment. An automatic elimination procedure based on item statistics, therefore, could help to increase the quality of a test in an objective manner. This was investigated by studying the effect of a standardized elimination procedure on the test results of a second-year course over a period of 6 successive years in 1,624 candidates. Cohort effects on item elimination were examined by determining the number of additional items that had to be eliminated from three different tests in 3 successive academic years in two cohorts. Items that were part of more than one test and had to be eliminated according to the procedure in at least one of the tests appeared to have to be retained, according to the same procedure, in most of the other tests. The procedure harmed high-scoring students relatively more often than other students, and the number of eliminated items appeared to be cohort dependent. As a consequence, automatic elimination procedures obscure the transparency of the grading process unacceptably and transform valid tests into inadequate samples of the course content.

- assessment
- content review
- multiple choice
- psychometric criteria

Multiple-choice questions are highly efficient in handling large numbers of candidates. They are objective and can be scored easily and quickly, and a single test can contain a large number of diverse items, allowing broad coverage of learning objectives. However, multiple-choice questions are also prone to construction errors, in particular those of the simple true/false type (5, 12, 18). After an exam, therefore, one or more of the items may not meet expectations. An item test analysis may help to identify such items. Most of the available test and item analysis software programs mark malfunctioning items automatically using psychometric indexes such as the item difficulty index and the item total correlation. After reading such an analysis, many teachers feel compelled to eliminate the marked items as a precaution to avoid discussions about the test quality with disappointed students who failed the exam.

In accordance with these feelings, a largely automatic elimination procedure based on preliminary item test analyses was introduced at our faculty in an attempt to increase the quality of written assessments in an objective manner. Because of the lack of studies concerning the effects of automatic item elimination based on explicit psychometric criteria, we investigated the effects of an elimination procedure on a second-year test for students in medicine over a period of 6 successive years.

## METHODS

### Participants

The elimination procedure was investigated by studying the effects of the procedure on the test results of a second-year course in medicine on renal physiology. From the academic years 1998/1999 through 2003/2004, a total of 1,624 candidates, spread over six regular multiple-choice tests, were tested in this course. Furthermore, the number of items eliminated by the procedure was determined in the same cohort across three different year levels. It was calculated for the cohort starting in 2000, with 332 participants in the selected first-year test and 255 participants in the selected third-year test, and for the cohort starting in 2001, with 330 participants in the selected first-year test and 288 participants in the selected third-year test.

### Analysis

Multiple-choice questions were analyzed by calculating the item difficulty index (*P* value) and the item total correlation (Rit score). The *P* value is the number of candidates who answered the item correctly divided by the total number of students who answered the item. The lower the *P* value, the more difficult a particular item is. The Rit score is the point-biserial correlation between the item score and test score. A high correlation means that candidates who did well on the item also did well on the test. A negative correlation means that candidates who performed well on the test did not choose the correct item response. The Rit score consequently shows the degree to which each item discriminates between candidates with high scores and those with low scores and therefore represents a discrimination index.
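The two statistics can be sketched in plain Python, assuming per-candidate 0/1 item scores and total test scores are available; the data below are invented for illustration and are not from the study.

```python
# Item difficulty (P value) and item total correlation (Rit score) as described above.
from statistics import mean, pstdev

def p_value(item_scores):
    """Item difficulty: fraction of candidates answering the item correctly."""
    return sum(item_scores) / len(item_scores)

def rit_score(item_scores, test_scores):
    """Point-biserial (Pearson) correlation between item score and test score."""
    mi, mt = mean(item_scores), mean(test_scores)
    cov = mean((i - mi) * (t - mt) for i, t in zip(item_scores, test_scores))
    return cov / (pstdev(item_scores) * pstdev(test_scores))

item = [1, 0, 1, 1, 0, 1, 0, 1]    # invented data: 1 = correct answer
total = [9, 4, 8, 7, 5, 9, 3, 6]   # invented total test score per candidate
print(round(p_value(item), 3), round(rit_score(item, total), 3))  # → 0.625 0.869
```

A positive Rit here reflects that the candidates who answered this item correctly also tended to score well on the test as a whole.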

According to the automatic item elimination procedure used by our faculty, all items of an assessment were first classified in 1 of 15 classes (Table 1). The class of an item was determined by its *P* value and Rit score. Since the proportion of candidates answering an item correctly in a multiple-choice test is influenced by guessing, not the *P* value itself but the *P* value corrected for guessing, the *P*_{c} value, was used to classify the item. This *P*_{c} value is calculated by *P*_{c} = *P* − [(1 − *P*)/(*k* − 1)], where *k* is the number of item alternatives. Rit_{min} was used to classify the item according to its Rit score and represents the lowest positive Rit score of the test that is significant at the 5% level. If Rit_{min} is 0.19, for example, an item of this test with three alternatives, a *P* value of 0.42, and a Rit score of 0.18 has to be classified in *class (E)9*. If this item instead had a Rit score of 0.20, it would have had to be classified in *class (R)14*.
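The guessing correction can be checked against the worked example in the text (three alternatives, *P* = 0.42); only the formula is reproduced here, not the class boundaries of Table 1.

```python
# P value corrected for guessing, as defined above: Pc = P - (1 - P) / (k - 1).
def p_corrected(p, k):
    """Correct the raw proportion correct for random guessing among k alternatives."""
    return p - (1 - p) / (k - 1)

# The item from the example: three alternatives, raw P value 0.42.
print(round(p_corrected(0.42, 3), 2))  # → 0.13
```

Note that an item answered correctly less often than chance level (*P* < 1/*k*) gets a negative *P*_{c}.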

The 15 classes of the automatic item elimination procedure represent 3 main classes that indicate whether an item has to be eliminated unconditionally (*class E*), possibly eliminated (*class PE*), or retained (*class R*). The effect of this elimination procedure was examined by investigating the effects of three separate subprocedures: traditional, minimum, and maximum subprocedures.

#### The traditional subprocedure.

In the traditional subprocedure, a student delegation of the year class makes an inventory of all student feedback. Additionally, items with a *P* value of <0.30 and/or a Rit score of <0.20 in the item test analysis are flagged as suspicious. Items with valid student feedback and items marked as suspicious are inspected subsequently for grammatical errors, inconsistencies, obvious clues, and content relevance. The inspection is done in a postassessment discussion by the student delegation and the teachers who constructed the items. Based on the outcome of this collective inspection, the items are either retained or removed from the examination.

#### The minimum subprocedure.

The minimum subprocedure starts with the items that remain after the traditional subprocedure. Next, all items of *class E* of the automatic item elimination procedure are removed from the examination.

#### The maximum subprocedure.

The maximum subprocedure starts with the items that remain after the minimum subprocedure. Next, all items of *class PE* of the automatic item elimination procedure are removed from the examination.

After performing the traditional subprocedure, items are obtained that remain if none of the items of *class E* and *class PE* of the automatic item elimination procedure are eliminated. After performing the minimum subprocedure, items are obtained that remain if all items of only *class E* of the automatic item elimination procedure are eliminated. After performing the maximum subprocedure, the items are obtained that remain if all of the items proposed by the automatic elimination procedure (i.e., of *classes E* and *PE*) are eliminated.

### Grading

After each subprocedure, all candidates were graded using the following formula: test score = (score − guess score) × 10/(maximum score − guess score), where “score” is the number of items correctly answered by the candidate, “maximum score” is the score if all items of the test had been answered correctly, and “guess score” is the score if all items of the test had been answered by random guessing alone.

If an item has *k* alternatives, the probability of a correct guess amounts to 1/*k*. By multiplying this probability by the number of items with *k* alternatives, the guess score of all items with *k* alternatives was computed. The total guess score of the test was calculated by adding up the guess scores for each number of alternatives. A student failed the test if the test score was lower than 5.5.

### Test Analysis

Finally, the reliability and difficulty index of the test were determined. The reliability of the test was investigated using the Kuder-Richardson 20 (KR20) formula. The arithmetic average of the scores was used as the difficulty index of the test.
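Both test-level statistics can be sketched for a small invented 0/1 score matrix (rows are candidates, columns are items); the standard KR20 formula is assumed, with the population variance of the total scores.

```python
# KR20 reliability and test difficulty index (mean score), as described above.
from statistics import mean, pvariance

def kr20(score_matrix):
    """Kuder-Richardson 20: (n/(n-1)) * (1 - sum(p*q) / variance(totals))."""
    n_items = len(score_matrix[0])
    totals = [sum(row) for row in score_matrix]
    pq = 0.0
    for j in range(n_items):
        p = mean(row[j] for row in score_matrix)  # fraction correct on item j
        pq += p * (1 - p)
    return n_items / (n_items - 1) * (1 - pq / pvariance(totals))

scores = [  # invented data: 5 candidates, 4 items
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
]
print(round(kr20(scores), 3))              # reliability
print(mean(sum(row) for row in scores))    # difficulty index: average score
```

With this few candidates and items the reliability estimate is of course meaningless; the sketch only shows the computation.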

### Statistics

All item test analyses were performed independently of our department by an institute specialized in the statistical evaluation of test results: the assessment service of the University Educational Centre.

Paired comparisons were performed using the Wilcoxon signed-ranks test and unpaired comparisons with the Mann-Whitney test. *P* values of <0.05 were considered statistically significant. All statistical analyses were performed using a commercial software package (SPSS version 12.00 for Windows, SPSS).

## RESULTS

Table 2 shows the main effects of the three subprocedures on the regular multiple-choice tests of the second-year course. The six tests refer to the same course with respect to design and teachers as well as learning objectives but were given in 6 different successive years. In these 6 tests, 30 items were eliminated by the traditional subprocedure.

### False/True Items

All false/true items in the tests were simple, not multiple, false/true items. After following the traditional subprocedure for all 6 tests, a total of 567 items remained to test 1,624 candidates. Of these items, 297 (52.38%) were false/true items.

Due to the minimum subprocedure, 32 additional items were eliminated. Most of these items (65.63%) were classified in *class 15* (10 items) and *class 10* (11 items) of the automatic item elimination procedure; 21 of the eliminated items (65.63%) were false/true items.

Due to the maximum subprocedure, an additional 18 items were eliminated. Of these items, 12 (66.67%) were false/true items.

### Recurrent Items

Some of the items deleted during the minimum or maximum subprocedure were part of more than one test: 15 different multiple-choice questions were found in more than one test. Figure 1 shows how often each of these questions was classified in which main class according to the automatic item elimination procedure.

### Effect on Grading

#### From failed to passed.

After following the traditional subprocedure, 71.77 ± 12.08% of the 270.7 ± 41.1 students of a year class passed the exam (Table 2). After following the minimum procedure, 78.21 ± 8.95% of the students of a year class passed the exam, and after following the maximum procedure, 78.39 ± 11.76% did so. In other words, an additional 6.44% of the students of a year class passed due to the minimum subprocedure and an additional 6.62% due to the maximum procedure.

#### Lower marks.

At the same time, 7.53 ± 8.70% of the students of a year class received a lower mark due to the minimum subprocedure and 8.60 ± 7.23% due to the maximum subprocedure. Summarized over all 6 years, 137 of the 1,624 candidates (8.44%) received a lower mark due to the minimum subprocedure and 139 of the 1,624 candidates (8.56%) due to the maximum subprocedure. The candidates who were graded lower by the minimum subprocedure scored 6.45 ± 1.96 after the traditional subprocedure and 6.38 ± 1.98 after the minimum subprocedure. The other candidates scored 6.30 ± 1.43 and 6.66 ± 1.44, respectively. The distribution of the test scores is shown in Fig. 2. The test scores of the students who were graded lower by the minimum subprocedure decreased significantly by 0.064 ± 0.042 but increased by 0.363 ± 0.250 in the other students. The candidates who were graded lower by the maximum subprocedure scored 5.43 ± 1.82 according to the traditional subprocedure and 5.31 ± 1.85 according to the maximum subprocedure. The other candidates scored 6.39 ± 1.42 and 6.80 ± 1.46, respectively. The test scores of the students who were graded lower by the maximum subprocedure decreased significantly by 0.124 ± 0.108 but increased by 0.414 ± 0.304 in the other students.

## DISCUSSION

Neither the criteria of the item elimination procedure nor the grading using a correction for guessing were the subjects of investigation in this study. In this study, we investigated the effects of an automatic elimination procedure on some important characteristics of a test such as its capability to determine what and how much had been learned by the students, the number of passed students, and the discrimination between candidates.

The items eliminated by both the minimum and the maximum procedures, and hence according to the automatic item elimination procedure, referred relatively more often to multiple-choice questions of the false/true type than to multiple-choice questions of other types. This agrees with the earlier findings of others demonstrating that false/true items are more susceptible to discussion (2, 5).

### Are the Students or Items Bad?

Most of the items eliminated by the minimum procedure appeared to be classified in *classes 10* and *15*. The most decisive role in the forced elimination of items during the automatic elimination procedure, consequently, is played by a *P* value lower than the guess score. Such a low *P* value may indicate that the item was flawed, but it may also indicate that the item was too difficult for most of the candidates. Because flawed items should be eliminated from an exam but extremely difficult (and easy) items are needed to adequately sample course content and objectives (13, 19), more insight into the cause of the low *P* values is crucial. Since the Rit score, by definition, strongly decreases not only at high *P* values but also at low *P* values, the Rit score is of less importance in solving this problem.

If the items eliminated by the automatic elimination procedure were bad instead of difficult, then the students who got a lower test score due to the elimination procedure answered these items correctly only by chance. As a consequence, one would expect that the test scores of these students were at best distributed in the same way as those of the other students, but at least not higher. The students who got a lower test score due to automatic item elimination according to the minimum subprocedure, however, scored above average compared with other students (Fig. 2). The percentage of students who got a lower test mark due to this elimination procedure amounted to 8.4% of all students, but appeared to be increased in the best (as well as the worst) performing quartile of the students (Fig. 3). In the 10% highest scoring students, for example, this percentage even amounted to 19% (31 students). As a consequence, the automatic item elimination harmed the higher scoring students more often than other students. This suggests that many of the eliminated items are not bad but merely difficult or refer to objectives that were not learned by most of the students.

The question of whether it is justified to use low *P* values and Rit scores as indisputable arguments in a largely automatic item elimination procedure was investigated further by comparing the *P* values and Rit scores of the eliminated items that appeared in more than one test. The large variation in the classification of these items over the various tests (Fig. 1) convincingly illustrates that the *P* value and Rit score of a multiple-choice question depend not only on the item itself but on the test population and test circumstances as well. As demonstrated previously (7), the item test statistics were hopelessly confounded with the particular sample of examinees who completed the assessment. Most of the recurrent items that had to be eliminated (*class E*) according to the automatic item elimination procedure in one of the tests could (*class PE*) or even had to be retained (*class R*) according to the same procedure in most of the other tests. This finding illustrates once more that most of the eliminated items were eliminated not because they were flawed but because they were difficult or referred to course objectives that were not learned by most of the students.

The question of whether the items or the students are bad turns up once more if the relatively large number of eliminated items is noticed in the assessment of year 2001 compared with that of year 2002 (Table 2). This large number could have been caused by poor quality of the items, a large number of difficult items, inferior teaching, or even by factors related to the students themselves. Therefore, for the students who were tested for the second-year course in 2001 (cohort 2000), we determined the number of additional items that had to be eliminated unconditionally by the automatic elimination procedure (i.e., according to the minimum procedure) compared with the traditional procedure in both a first-year course and a third-year course. Similar numbers were determined for the students who were tested for the second-year course in 2002 (cohort 2001). Figure 4 shows the resulting numbers of additionally eliminated items as a percentage of the total numbers of test items for both cohorts. These percentages were significantly higher for all three tests in cohort 2000 compared with cohort 2001. Because these tests refer to three completely different courses with different learning objectives and different teachers in different academic years, it is rather improbable that the teaching was significantly worse, the quality of the test items significantly lower, or the item difficulty significantly higher in all three courses of one specific cohort. The results shown in Fig. 4 suggest again, therefore, that many of the items eliminated by the automatic elimination procedure were eliminated not because they were bad but because of factors related to the student cohort.

Because most of the recurrent items that had to be eliminated in one test had to be retained in other tests, because the elimination procedure harmed the high scoring students more often than other students, and because the number of eliminated items appeared to be cohort dependent, there is little reason to believe that these items are flawed and missed by the thorough postassessment inspection described above.

### Effects of the Elimination Procedure

The number of passed students was significantly increased by the minimum subprocedure. By eliminating the items of *class PE* as well, the number of passed students was hardly changed. That is, if all items that may be eliminated according to the automatic item elimination procedure were flawed items, the elimination of *class PE* items in addition to the forced elimination of *class E* items adds little to the decision of whether a student has passed or failed the exam. From this point of view, the elimination procedure could be simplified by restricting the item classification to two classes (*classes E* and *R*) instead of three classes.

If the items were eliminated not because they were flawed but because of factors related to the students, as suggested above, the elimination procedure reduces to an ordinary relative grading process. The formula that was used for calculating the test score, however, suggests that the students were graded using an absolute instead of a relative grading procedure. As a consequence, the automatic elimination procedure obscures the transparency of the grading process by converting an apparently absolute grading process into a relative one.

As frequently demonstrated (3, 6, 9), exams have a decisive impact on the study behavior of students: “The impact of assessment on the educational process is a variable which allows little compromise” (17), “The assessment tail wags the curriculum dog” or “Grab the students by the tests and their hearts and minds will follow” (16), and “Students don't learn what you expect, they learn what you inspect” (11). Because difficult and unpopular items are answered correctly by only a very small percentage of students, by definition, such items receive low *P* values in test analyses and will be eliminated irrevocably during automatic elimination procedures. Each time a student is tested, therefore, the student will learn that s/he is not tested on difficult and unpopular items. If students really don't learn what you expect, but what you inspect, they actually learn that it is highly inefficient to spend much time on difficult or unpopular learning objectives. The repeated use of the automatic item elimination procedure, in other words, creates a hidden curriculum (10) in which our future doctors learn to avoid the more difficult and unpopular problems.

If items eliminated by the automatic elimination procedure are not eliminated because they are flawed, the procedure reduces a test to an inadequate sample of the course objectives. Hence, the validity of a test is diminished by the elimination procedure, although the numerical test statistics seem to improve (Table 2). In this study, consequently, arguments were found to justify the concerns reported earlier by others: “not to eliminate all items from a test which show poor item statistics regardless of their content” (15).

### Comparison With Other Elimination Procedures

Compared with the criteria of the investigated automatic item elimination procedure, more established criteria for malfunctioning items usually use higher values for the lower limit of an acceptable Rit score (8, 14) as well as of the *P* value (1). Fully automatic item elimination procedures according to such criteria, therefore, will lead to even more inequitable item eliminations.

The use of Rit_{min} as the minimum value of acceptable Rit scores in the classification used here differs from that of most other classifications, which usually use a fixed Rit score as the minimum value (8, 14). Rit_{min} varies inversely with the number of candidates. Because of the large numbers of students in medicine, Rit_{min} is generally lower than the minimum values used by other classifications. Hence, fewer items will be eliminated by the procedure on account of a low Rit score. An item with a Rit score just below Rit_{min}, on the other hand, has to be eliminated from a test with few candidates but may be retained in a test with many candidates. According to the automatic item elimination procedure, therefore, one and the same item may lower the test quality unacceptably in one test but not in another, depending solely on the number of candidates. Since the quality of an item has nothing to do with the number of candidates, the use of a fixed limit for the Rit score is to be preferred in item test analyses.
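The dependence of Rit_{min} on the number of candidates can be illustrated numerically. The paper does not specify the exact significance test, so the sketch below assumes a one-sided test of a Pearson correlation at the 5% level, with the t distribution approximated by the normal distribution (reasonable for the large groups involved).

```python
# Smallest correlation significant at the 5% level, as a function of group size n.
# Assumes a one-sided z-test on r * sqrt(n - 2) / sqrt(1 - r^2); solving for r gives
# r = z / sqrt(z^2 + n - 2), where z is the one-sided 5% normal critical value.
from math import sqrt
from statistics import NormalDist

def rit_min(n, alpha=0.05):
    """Approximate smallest Rit score significant at level alpha for n candidates."""
    z = NormalDist().inv_cdf(1 - alpha)  # ~1.645 for alpha = 0.05
    return z / sqrt(z * z + n - 2)

for n in (50, 100, 300):
    print(n, round(rit_min(n), 3))
```

Under this assumption the threshold drops from roughly 0.23 for 50 candidates to below 0.10 for 300, which makes concrete why the same item can fall below Rit_{min} in a small group and above it in a large one.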

Alternatives to the automatic item elimination procedure, such as item response theory (IRT) measurement models, were not investigated in this study. One of the main advantages of IRT measurement over classical test analysis, however, appears to be that it is not sample dependent, unlike the item elimination procedure we investigated in this study, and that it separates student proficiency from the inherent difficulty of individual test items (7). IRT, on the other hand, requires data from many hundreds of individuals and seems less appropriate for the deliberate miscellanies of knowledge and ability that are typically assessed in medical examinations. It is therefore not clear that either method is better than the other (4). In any case, whether classical test analysis or one of the alternative methods is used, the evaluation of the quality of a test by statistical analyses alone seems inadequate.

### Conclusions

Item test analysis does help identify items that may be considered for elimination after postassessment inspection by a student delegation and responsible teachers. The quality of a test, however, appears not to be increased by an automatic elimination procedure since most of the eliminated items are not flawed at all. Automatic item elimination obscures the transparency of the grading process, is often harmful to high performing students, encourages future doctors to avoid difficult and unpopular problems, and transforms valid tests into inadequate samples of the course objectives. The evaluation of the quality of a test, consequently, has to be a mix of statistical analyses and, above all, human judgment, in which content review by students and peers is imperative.

## Acknowledgments

The authors acknowledge all members of the assessment service of the University Educational Centre of Groningen for the constructive participation in item test analyses.

- © 2007 American Physiological Society