We already know from Chapter 2 that a test is said to be valid if it measures accurately what it is intended to measure. We create language tests in order to measure such essentially theoretical constructs as ‘reading ability’, ‘fluency in speaking’, ‘control of grammar’, and so on. For this reason, in recent years the term construct validity has been increasingly used to refer to the general, overarching notion of validity.
It is not enough to assert that a test has construct validity; empirical evidence is needed. Such evidence may take several forms, including the subordinate forms of validity, content validity and criterion-related validity. We shall begin by looking at these two forms of evidence in turn, and attempt to show their relevance for the solution of language testing problems. We shall then turn to other forms of evidence.
The first form of evidence relates to the content of the test. A test is said to have content validity if its content constitutes a representative sample of the language skills, structures, etc. with which it is meant to be concerned. It is obvious that a grammar test, for instance, must be made up of items relating to the knowledge or control of grammar. But this in itself does not ensure content validity. The test would have content validity only if it included a proper sample of the relevant structures. Just what the relevant structures are will depend, of course, upon the purpose of the test. We would not expect an achievement test for intermediate learners to contain just the same set of structures as one for advanced learners. In order to judge whether or not a test has content validity, we need a specification of the skills or structures, etc. that it is meant to cover. Such a specification should be made at a very early stage in test construction. It is not to be expected that everything in the specification will always appear in the test; there may simply be too many things for all of them to appear in a single test. But it will provide the test constructor with the basis for making a principled selection of elements for inclusion in the test. A comparison of test specification and test content is the basis for judgements as to content validity. Ideally these judgements should be made by people who are familiar with language teaching and testing but who are not directly concerned with the production of the test in question.
What is the importance of content validity? First, the greater a test’s content validity, the more likely it is to be an accurate measure of what it is supposed to measure, i.e. to have construct validity. A test in which major areas identified in the specification are under-represented – or not represented at all – is unlikely to be accurate. Secondly, such a test is likely to have a harmful backwash effect. Areas that are not tested are likely to become areas ignored in teaching and learning. Too often the content of tests is determined by what is easy to test rather than what it is important to test. The best safeguard against this is to write full test specifications and to ensure that the test content is a fair reflection of these. For this reason, content validation should be carried out while a test is being developed; it should not wait until the test is already being used. Advice on the writing of specifications is to be found in Chapter 7.
The second form of evidence of a test’s construct validity relates to the degree to which results on the test agree with those provided by some independent and highly dependable assessment of the candidate’s ability. This independent assessment is thus the criterion measure against which the test is validated.
There are essentially two kinds of criterion-related validity: concurrent validity and predictive validity. Concurrent validity is established when the test and the criterion are administered at about the same time. To exemplify this kind of validation in achievement testing, let us consider a situation where course objectives call for an oral component as part of the final achievement test. The objectives may list a large number of ‘functions’ which students are expected to perform orally, to test all of which might take 45 minutes for each student. This could well be impractical. Perhaps it is felt that only ten minutes can be devoted to each student for the oral component. The question then arises: can such a ten-minute session give a sufficiently accurate estimate of the student’s ability with respect to the functions specified in the course objectives? Is it, in other words, a valid measure?
From the point of view of content validity, this will depend on how many of the functions are tested in the component, and how representative they are of the complete set of functions included in the objectives. Every effort should be made when designing the oral component to give it content validity. Once this has been done, however, we can go further. We can attempt to establish the concurrent validity of the component.
To do this, we should choose at random a sample of all the students taking the test. These students would then be subjected to the full 45-minute oral component necessary for coverage of all the functions, using perhaps four scorers to ensure reliable scoring (see next chapter). This would be the criterion test against which the shorter test would be judged. The students’ scores on the full test would be compared with the ones they obtained on the ten-minute session, which would have been conducted and scored in the usual way, without knowledge of their performance on the longer version. If the comparison between the two sets of scores reveals a high level of agreement, then the shorter version of the oral component may be considered valid, inasmuch as it gives results similar to those obtained with the longer version. If, on the other hand, the two sets of scores show little agreement, the shorter version cannot be considered valid; it cannot be used as a dependable measure of achievement with respect to the functions specified in the objectives. Of course, if ten minutes really is all that can be spared for each student, then the oral component may be included for the contribution that it makes to the assessment of students’ overall achievement and for its backwash effect. But it cannot be regarded as an accurate measure in itself.
References to ‘a high level of agreement’ and ‘little agreement’ raise the question of how the level of agreement is measured. There are, in fact, standard procedures for comparing sets of scores in this way, which generate what is called a ‘correlation coefficient’ (or, when we are considering validity, a ‘validity coefficient’) – a mathematical measure of similarity. Perfect agreement between two sets of scores will result in a coefficient of 1. Total lack of agreement will give a coefficient of zero. To get a feel for the meaning of a coefficient between these two extremes, read the contents of the box on page 29.
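The standard procedure in question is the Pearson product-moment correlation between the two sets of scores. The sketch below shows how such a coefficient could be computed; the student scores are invented purely for illustration and do not come from any real validation study.

```python
# A hedged sketch of how a validity coefficient is obtained: it is the
# Pearson correlation between scores on the test being validated and
# scores on the criterion measure.
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented scores for ten students: the ten-minute oral test versus the
# full 45-minute criterion test described in the text.
short_test = [12, 15, 9, 18, 14, 11, 16, 13, 10, 17]
full_test  = [48, 55, 40, 68, 52, 45, 60, 50, 43, 64]

r = pearson(short_test, full_test)
print(round(r, 2))
```

With these invented figures the two rank orders agree closely, so the coefficient comes out near 1; real validation data would of course be messier.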
Whether or not a particular level of agreement is regarded as satisfactory will depend upon the purpose of the test and the importance of the decisions that are made on the basis of it. If, for example, a test of oral ability was to be used as part of the selection procedure for a high-level diplomatic post, then a coefficient of 0.7 might well be regarded as too low for a shorter test to be substituted for a full and thorough test of oral ability. The saving in time would not be worth the risk of appointing someone with insufficient ability in the relevant foreign language. On the other hand, a coefficient of the same size might be perfectly acceptable for a brief interview forming part of a placement test.
To get a feel for what a coefficient means in terms of the level of agreement between two sets of scores, it is best to square that coefficient. Let us imagine that a coefficient of 0.7 is calculated between the two oral tests referred to in the main text. Squared, this becomes 0.49. If this is regarded as a proportion of one, and converted to a percentage, we get 49 per cent. On the basis of this, we can say that the scores on the short test predict 49 per cent of the variation in scores on the longer test. In broad terms, there is almost 50 per cent agreement between one set of scores and the other. A coefficient of 0.5 would signify 25 per cent agreement; a coefficient of 0.8 would indicate 64 per cent agreement. It is important to note that a ‘level of agreement’ of, say, 50 per cent does not mean that 50 per cent of the students would each have equivalent scores on the two versions. We are dealing with an overall measure of agreement that does not refer to the individual scores of students. This explanation of how to interpret validity coefficients is very brief and necessarily rather crude. For a better understanding, the reader is referred to the Further reading section at the end of the chapter.
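The box’s rule of thumb amounts to a one-line calculation: square the coefficient and express it as a percentage. A minimal sketch:

```python
# Convert a validity coefficient into the rough 'percentage agreement'
# figure described in the box: square it and multiply by 100.
def percent_agreement(coefficient):
    return round(coefficient ** 2 * 100, 1)

print(percent_agreement(0.7))  # the 0.7 example from the box
print(percent_agreement(0.5))
print(percent_agreement(0.8))
```

As the box stresses, this figure describes overall agreement between the two sets of scores, not the proportion of individual students who would receive equivalent scores.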
It should be said that the criterion for concurrent validation is not necessarily a proven, longer test. A test may be validated against, for example, teachers’ assessments of their students, provided that the assessments themselves can be relied on. This would be appropriate where a test was developed that claimed to be measuring something different from all existing tests.
The second kind of criterion-related validity is predictive validity. This concerns the degree to which a test can predict candidates’ future performance. An example would be how well a proficiency test could predict a student’s ability to cope with a graduate course at a British university. The criterion measure here might be an assessment of the student’s English as perceived by his or her supervisor at the university, or it could be the outcome of the course (pass/fail, etc.). The choice of criterion measure raises interesting issues. Should we rely on the subjective and untrained judgements of supervisors? How helpful is it to use final outcome as the criterion measure when so many factors other than ability in English (such as subject knowledge, intelligence, motivation, health and happiness) will have contributed to the outcome? Where outcome is used as the criterion measure, a validity coefficient of around 0.4 (only 16 per cent agreement) is about as high as one can expect. This is partly because of the other factors, and partly because those students whose English the test predicted would be inadequate are not normally permitted to take the course, and so the test’s (possible) accuracy in predicting problems for those students goes unrecognised.
As a result, a validity coefficient of this order is generally regarded as satisfactory. The Further reading section at the end of the chapter gives references to reports on the validation of the British Council’s ELTS test (the predecessor of IELTS), in which these issues are discussed at length.
Another example of predictive validity would be where an attempt was made to validate a placement test. Placement tests attempt to predict the most appropriate class for any particular student. Validation would involve an enquiry, once courses were under way, into the proportion of students who were thought to be misplaced. It would then be a matter of comparing the number of misplacements (and their effect on teaching and learning) with the cost of developing and administering a test that would place students more accurately.
Content validity, concurrent validity and predictive validity all have a part to play in the development of a test. For instance, in developing an English placement test for language schools, Hughes et al (1996) validated test content against the content of three popular course books used by language schools in Britain, compared students’ performance on the test with their performance on the existing placement tests of a number of language schools, and then examined the success of the test in placing students in classes. Only when this process was complete (and minor changes made on the basis of the results obtained) was the test published.
Other forms of evidence for construct validity
Investigation of a test’s content validity and criterion-related validity provides evidence for its overall, or construct, validity. However, they are not the only sources of evidence. One could imagine a test that was meant to measure reading ability, the specifications for which included reference to a variety of reading sub-skills, including, for example, the ability to guess the meaning of unknown words from the context in which they are met. Content validation of the test might confirm that these sub-skills were well represented in the test. Concurrent validation might reveal a strong relationship between students’ performance on the test and their supervisors’ assessment of their reading ability. But one would still not be sure that the items in the test were ‘really’ measuring the sub-skills listed in the specifications.
The word ‘construct’ refers to any underlying ability (or trait) that is hypothesised in a theory of language ability. The ability to guess the meaning of unknown words from context, referred to above, would be an example. It is a matter of empirical research to establish whether or not such a distinct ability exists, can be measured, and is indeed measured in that test. Without confirming evidence from such research, it would not be possible to say that the part of a test that attempted to measure that ability has construct validity. If all of the items in a test were meant to measure specified abilities, then, without evidence that they were actually measuring those abilities, the construct validity of the whole test would be in question.
The reader may ask at this point whether such a demanding requirement for validity is appropriate for practical testing situations. It is easy to see the relevance of content validity in developing a test. And if a test has criterion-related validity, whether concurrent or predictive, surely it is doing its job well. But does it matter if we can’t demonstrate that parts of the test are measuring exactly what we say they are measuring?
I have some sympathy for this view. What is more, I believe that gross, commonsense constructs like ‘reading ability’ and ‘writing ability’ are unproblematic. Similarly, the direct measurement of writing ability, for instance, should not cause us too much concern: even without research we can be fairly confident that we are measuring a distinct and meaningful ability (albeit a quite general and not closely defined ability). Once we try to measure such an ability indirectly, however, we can no longer take for granted what we are doing. We need to look to a theory of writing ability for guidance as to the form an indirect test should take, its content and techniques.
Let us imagine that we are indeed planning to construct an indirect test of writing ability that must for reasons of practicality be multiple choice. Our theory of writing tells us that underlying writing ability are a number of sub-abilities, such as control of punctuation, sensitivity to demands on style, and so on. We construct items that are meant to measure these sub-abilities and administer them as a pilot test. How do we know that this test really is measuring writing ability? One step we would almost certainly take is to obtain extensive samples of the writing ability of the group to whom the test is first administered, and have these reliably scored. We would then compare scores on the pilot test with the scores given for the samples of writing. If there is a high level of agreement (and a coefficient of the kind described in the previous section can be calculated), then we have evidence that we are measuring writing ability with the test.
So far, however, although we may have developed a satisfactory indirect test of writing, we have not demonstrated the reality of the underlying constructs (control of punctuation, etc.). To do this we might administer a series of specially constructed tests, measuring each of the constructs by a number of different methods. In addition, compositions written by the people who took the tests could be scored separately for performance in relation to the hypothesised constructs (control of punctuation, for example). In this way, for each person, we would obtain a set of scores for each of the constructs. Coefficients could then be calculated between the various measures. If the coefficients between scores on the same construct are consistently higher than those between scores on different constructs, then we have evidence that we are indeed measuring separate and identifiable constructs. This knowledge would be particularly valuable if we wanted to use the test for diagnostic purposes.
Another way of obtaining evidence about the construct validity of a test is to investigate what test takers actually do when they respond to an item. Two principal methods are used to gather such information: think aloud and retrospection. In the think aloud method, test takers voice their thoughts as they respond to the item. In retrospection, they try to recollect what their thinking was as they responded. In both cases their thoughts are usually tape-recorded, although a questionnaire may be used for the latter. The problem with the think aloud method is that the very voicing of thoughts may interfere with what would be the natural response to the item. The drawback to retrospection is that thoughts may be misremembered or forgotten. Despite these weaknesses, such research can give valuable insights into how items work (which may be quite different from what the test developer intended).
All test validation is to some degree a research activity. When it goes beyond content and criterion-related validation, theories are put to the test and are confirmed, modified, or abandoned. It is in this way that language testing can be put on a sounder, more scientific footing. But it will not all happen overnight; there is a long way to go. In the meantime, the practical language tester should try to keep abreast of what is known. When in doubt, where it is possible, direct testing of abilities is recommended.
Validity in scoring
It is worth pointing out that if a test is to have validity, not only the items but also the way in which the responses are scored must be valid. It is no use having excellent items if they are scored invalidly. A reading test may call for short written responses. If the scoring of these responses takes into account spelling and grammar, then it is not valid (assuming the reading test is meant to measure reading ability). By measuring more than one ability, it makes the measurement of the one ability in question less accurate. There may be occasions when, because of misspelling or faulty grammar, it is not clear what the test taker intended. In this case, the problem is with the item, not with the scoring. Similarly, if we are interested in measuring speaking or writing ability, it is not enough to elicit speech or writing in a valid fashion. The rating of that speech or writing has to be valid too. For instance, overemphasis on such mechanical features as spelling and punctuation can invalidate the scoring of written work (and so the test of writing).
A test is said to have face validity if it looks as if it measures what it is supposed to measure. For example, a test that pretended to measure pronunciation ability but which did not require the test taker to speak (and there have been some) might be thought to lack face validity. This would be true even if the test’s construct and criterion-related validity could be demonstrated. Face validity is not a scientific notion and is not seen as providing evidence for construct validity, yet it can be very important. A test which does not have face validity may not be accepted by candidates, teachers, education authorities or employers. It may simply not be used; and if it is used, the candidates’ reaction to it may mean that they do not perform on it in a way that truly reflects their ability. Novel techniques, particularly those which provide indirect measures, have to be introduced slowly, with care, and with convincing explanations.
How to make tests more valid
In the development of a high-stakes test, which may significantly affect the lives of those who take it, there is an obligation to carry out a full validation exercise before the test becomes operational.
In the case of teacher-made tests, full validation is unlikely to be possible. In these circumstances, I would recommend the following:
First, write explicit specifications for the test (see Chapter 7) which take account of all that is known about the constructs that are to be measured. Make sure that you include a representative sample of the content of these in the test.
Second, whenever feasible, use direct testing. If for some reason it is decided that indirect testing is necessary, reference should be made to the research literature to confirm that measurement of the relevant underlying constructs has been demonstrated using the testing techniques that are to be employed (this may often result in disappointment, another reason for favouring direct testing).
Third, make sure that the scoring of responses relates directly to what is being tested.
Finally, do everything possible to make the test reliable. If a test is not reliable, it cannot be valid. Reliability is dealt with in the next chapter.
Test developers must make every effort to make their tests as valid as possible. Any published test should supply details of its validation, without which its validity (and suitability) can hardly be judged by a potential purchaser. Tests for which validity information is not available should be treated with caution.
Consider any tests with which you are familiar. Assess each of them in terms of the various kinds of validity that have been presented in this chapter. What empirical evidence is there that the test is valid? If evidence is lacking, how would you set about gathering it?
At first sight, validity seems a quite straightforward concept. On closer examination, however, it can seem impossibly complex, with some writers even finding it difficult to separate it from the notion of reliability in some circumstances. In the present chapter, I have tried to present validity in a form which can be grasped by newcomers to the field and which will prove useful in thinking about and developing tests. For those who would like to explore the concept in greater depth, I would recommend: Anastasi and Urbina (1997) for a general discussion of test validity and ways of measuring it; Nitko (2001) for validity in the context of educational measurement; and Messick (1989) for a long, wide-ranging and detailed chapter on validity that is much cited in the language testing literature. His 1996 paper discusses the relationship between validity and backwash.
Bachman and Palmer (1981) was a notable early attempt to introduce construct validation to language testing. A still interesting example of test validation (of the British Council ELTS test) in which a number of important issues are raised, is described and evaluated in Criper and Davies (1988) and Hughes, Porter and Weir (1988). More recent accounts of validation can be found in Wall et al (1994) and Fulcher (1997). Cohen (1984) describes early use of ‘think-aloud’ and retrospection. Buck (1991) and Wu (1998) provide more recent examples of the use of introspection. Storey (1997) uses ‘think-aloud’. Bradshaw (1990) investigates the face validity of a placement test. Weir et al (1993) and Weir and Porter (1995) disagree with Alderson (1990a, 1990b) about the evidence for certain reading comprehension skills. Cumming and Berwick (1996) is a collection of papers on validation in language testing. Bachman and Cohen (1998) is a collection of papers concerned with the relationship between second language acquisition and language testing research. For the argument (with which I do not agree) that there is no criterion against which ‘communicative’ language tests can be validated (in the sense of criterion-related validity), see Morrow (1986). Bachman’s (1990) book – much referred to and influential in the field of language testing – discusses validity and other theoretical issues in depth.
- When the term ‘construct validity’ was first used, it was in the context of psychological tests, particularly of personality tests. There was real concern at the time about the number of such tests which purported to measure psychological constructs, without offering evidence that these constructs existed in a measurable form. The demand was therefore that evidence of these constructs be provided as part of demonstrating a test’s validity.
- Sometimes the size of a correlation coefficient can be misleading, an accident of the particular sample of people taking the test(s). If, for example, there are ‘extreme’ scores from outstandingly good or outstandingly poor takers of the test(s), the coefficient may be higher than the performance of the group as a whole warrants. See Nitko (2001) for detail.
- Because the full range of ability is not included, the validity coefficient is an underestimate (see previous footnote).
- However, one may question the validity of the scales used to assess performance in, say, writing. How far do they reflect the development or acquisition of the skills they refer to? This may not be important in proficiency testing, where the scales may be based on levels of skills needed for a particular purpose (a job, for example). In achievement testing, scales that are not consistent with patterns of development may lack validity.
Imagine that a hundred students take a 100-item test at three o’clock one Thursday afternoon. The test is not impossibly difficult or ridiculously easy for these students, so they do not all get zero or a perfect score of 100. Now what if, in fact, they had not taken the test on the Thursday but had taken it at three o’clock the previous afternoon? Would we expect each student to have got exactly the same score on the Wednesday as they actually did on the Thursday? The answer to this question must be no. Even if we assume that the test is excellent, that the conditions of administration are almost identical, that the scoring calls for no judgement on the part of the scorers and is carried out with perfect care, and that no learning or forgetting has taken place during the one-day interval, nevertheless we would not expect every individual to get precisely the same score on the Wednesday as they got on the Thursday. Human beings are not like that; they simply do not behave in exactly the same way on every occasion, even when the circumstances seem identical.
But if this is the case, it implies that we can never have complete trust in any set of test scores. We know that the scores would have been different if the test had been administered on the previous or the following day. This is inevitable, and we must accept it. What we have to do is construct, administer and score tests in such a way that the scores actually obtained on a test on a particular occasion are likely to be very similar to those which would have been obtained if it had been administered to the same students with the same ability, but at a different time. The more similar the scores would have been, the more reliable the test is said to be.
Look at the hypothetical data in Table 1(a). They represent the scores obtained by ten students who took a 100-item test (A) on a particular occasion, and those that they would have obtained if they had taken it a day later. Compare the two sets of scores. (Do not worry for the moment about the fact that we would never be able to obtain this information. Ways of estimating what scores people would have got on another occasion are discussed later. The most obvious of these is simply to have people take the same test twice.) Note the size of the difference between the two scores for each student.
Table 1(a): Scores on test A (invented data)
Student | Score obtained | Score which would have been obtained on the following day
Now look at Table 1(b), which displays the same kind of information for a second 100-item test (B). Again note the difference in scores for each student.
Table 1(b): Scores on test B (invented data)
Student | Score obtained | Score which would have been obtained on the following day
Which test seems the more reliable? The differences between the two sets of scores are much smaller for test B than for test A. On the evidence that we have here (and in practice we would not wish to make claims about reliability on the basis of such a small number of individuals), test B appears to be more reliable than test A.
Look now at Table 1(c), which represents scores of the same students on an interview using a five-point scale.
Table 1(c): Scores on interview (Invented data)
Student | Score obtained | Score which would have been obtained on the following day
In one sense the two sets of interview scores are very similar. The largest difference between a student’s actual score and the one which would have been obtained on the following day is 3. But the largest possible difference is only 4! Really the two sets of scores are very different. This becomes apparent once we compare the size of the differences between students with the size of differences between scores for individual students. They are of about the same order of magnitude. The result of this can be seen if we place the students in order according to their interview score, the highest first. The order based on their actual scores is markedly different from the one based on the scores they would have obtained if they had had the interview on the following day. This interview turns out in fact not to be very reliable at all.
The reliability coefficient
It is possible to quantify the reliability of a test in the form of a reliability coefficient. Reliability coefficients are like validity coefficients (Chapter 4). They allow us to compare the reliability of different tests. The ideal reliability coefficient is 1. A test with a reliability coefficient of 1 is one which would give precisely the same results for a particular set of candidates regardless of when it happened to be administered. A test which had a reliability coefficient of zero (and let us hope that no such test exists!) would give sets of results quite unconnected with each other, in the sense that the score that someone actually got on a Wednesday would be no help at all in attempting to predict the score he or she would get if they took the test the day after. It is between the two extremes of 1 and zero that genuine test reliability coefficients are to be found.
Certain authors have suggested how high a reliability coefficient we should expect for different types of language tests. Lado (1961), for example, says that good vocabulary, structure and reading tests are usually in the .90 to .99 range, while auditory comprehension tests are more often in the .80 to .89 range. Oral production tests may be in the .70 to .79 range. He adds that a reliability coefficient of .85 might be considered high for an oral production test but low for a reading test. These suggestions reflect what Lado sees as the difficulty in achieving reliability in the testing of the different abilities. In fact the reliability coefficient that is to be sought will depend also on other considerations, most particularly the importance of the decisions that are to be made on the basis of the test. The more important the decisions, the greater the reliability we must demand: if we are to refuse someone the opportunity to study overseas because of their score on a language test, then we have to be pretty sure that their score would not have been much different if they had taken the test a day or two earlier or later. The next section will explain how the reliability coefficient can be used to arrive at another figure (the standard error of measurement) to estimate likely differences of this kind. Before this is done, however, something has to be said about the way in which reliability coefficients are arrived at.
The first requirement is to have two sets of scores for comparison. The most obvious way of obtaining these is to get a group of subjects to take the same test twice. This is known as the test-retest method. The drawbacks are not difficult to see. If the second administration of the test is too soon after the first, then subjects are likely to recall items and their responses to them, making the same responses more likely and the reliability spuriously high. If there is too long a gap between administrations, then learning (or forgetting!) will have taken place, and the coefficient will be lower than it should be. However long the gap, the subjects are unlikely to be very motivated to take the same test twice, and this too is likely to have a depressing effect on the coefficient. These effects are reduced somewhat by the use of two different forms of the same test (the alternate forms method). However, alternate forms are often simply not available.
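Whichever method supplies the two sets of scores, the coefficient itself is commonly computed as the Pearson correlation between them. The following is a minimal sketch of that calculation; all the candidate scores are invented for illustration.

```python
# Estimating a reliability coefficient as the Pearson correlation
# between two administrations of a test (test-retest or alternate
# forms). The scores below are invented for the example.

def pearson(xs, ys):
    """Pearson product-moment correlation between two score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Scores of six candidates on two administrations of the same test.
first_administration = [56, 62, 48, 71, 65, 50]
second_administration = [58, 60, 50, 73, 63, 52]

print(round(pearson(first_administration, second_administration), 2))
```

With these (deliberately similar) score sets the coefficient comes out close to 1; had the second set been unrelated to the first, it would fall towards zero.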
It turns out, surprisingly, that the most common methods of obtaining the necessary two sets of scores involve only one administration of one test. Such methods provide us with a coefficient of internal consistency. The most basic of these is the split half method. In this the subjects take the test in the usual way, but each subject is given two scores. One score is for one half of the test, the second score is for the other half. The two sets of scores are then used to obtain the reliability coefficient as if the whole test had been taken twice. In order for this method to work, it is necessary for the test to be split into two halves which are really equivalent, through the careful matching of items (in fact where items in the test have been ordered in terms of difficulty, a split into odd-numbered items and even-numbered items may be adequate). It can be seen that this method is rather like the alternate forms method, except that the two ‘forms’ are only half the length.
It has been demonstrated empirically that this altogether more economical method will indeed give good estimates of alternate forms coefficients, provided that the alternate forms are closely equivalent to each other.
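The split half procedure, including the Spearman-Brown adjustment for the reduced length of the two halves (see the notes at the end of this chapter), might be sketched as follows. The item scores are invented for illustration.

```python
# A sketch of the split-half method. Each candidate's item scores are
# split into odd- and even-numbered items; the two half scores are
# correlated, and the Spearman-Brown formula adjusts the result upward
# to estimate the reliability of the full-length test.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def split_half_reliability(item_scores):
    """item_scores: one list of 0/1 item scores per candidate."""
    odd = [sum(row[0::2]) for row in item_scores]    # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]   # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    # Spearman-Brown correction for doubling the test length.
    return 2 * r_half / (1 + r_half)

# Invented 0/1 scores of five candidates on an eight-item test.
scores = [
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0, 0, 0],
]
print(round(split_half_reliability(scores), 2))
```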
The standard error of measurement and the true score
While the reliability coefficient allows us to compare the reliability of tests, it does not tell us directly how close an individual’s actual score is to what he or she might have scored on another occasion. With a little further calculation, however, it is possible to estimate how close a person’s actual score is to what is called their ‘true score’. Imagine that it were possible for someone to take the same language test over and over again, an indefinitely large number of times, without their performance being affected by having already taken the test, and without their ability in the language changing. Unless the test is perfectly reliable, and provided that it is not so easy or difficult that the student always gets full marks or zero, we would expect their scores on the various administrations to vary. If we had all of these scores we would be able to calculate their average score, and it would seem not unreasonable to think of this average as the one that best represents the student’s ability with respect to this particular test. It is this score, which for obvious reasons we can never know for certain, which is referred to as the candidate’s true score.
We are able to make statements about the probability that a candidate’s true score (the one which best represents their ability on the test) is within a certain number of points of the score they actually obtained on the test. In order to do this, we must first know the standard error of measurement of the particular test. The calculation of the standard error of measurement is based on the reliability coefficient and a measure of the spread of all the scores on the test (for a given spread of scores, the greater the reliability coefficient, the smaller will be the standard error of measurement). How such statements can be made using the standard error of measurement of the test is best illustrated by an example.
Suppose that a test has a standard error of measurement of 5. An individual scores 56 on that test. We are then in a position to make the following statements:
We can be about 68 per cent certain that the person’s true score lies in the range 51-61 (i.e. within one standard error of measurement of the score actually obtained on this occasion).
We can be about 95 per cent certain that their true score is in the range 46-66 (i.e. within two standard errors of measurement of the score actually obtained).
We can be 99.7 per cent certain that their true score is in the range 41-71 (i.e. within three standard errors of measurement of the score actually obtained).
These statements are based on what is known about the pattern of scores that would occur if it were in fact possible for someone to take the test repeatedly in the way described above. About 68 per cent of their scores would be within one standard error of measurement, and so on. If in fact they only take the test once, we cannot be sure how their score on that occasion relates to their true score, but we are still able to make probabilistic statements as above.
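The derivation and use of these ranges can be sketched in a few lines of code. The reliability coefficient and standard deviation below are invented values, chosen so that the standard error of measurement works out to the 5 used in the example above.

```python
# The standard error of measurement (SEM) and true-score ranges.
# The reliability and spread figures are invented for illustration.

def sem(standard_deviation, reliability):
    """SEM = spread of scores times the square root of (1 - reliability)."""
    return standard_deviation * (1 - reliability) ** 0.5

def true_score_band(actual_score, sem_value, n_sems):
    """Range within n_sems standard errors of the actual score."""
    return (actual_score - n_sems * sem_value,
            actual_score + n_sems * sem_value)

s = sem(standard_deviation=12.5, reliability=0.84)
print(round(s, 1))                 # SEM of 5, as in the example
print(true_score_band(56, 5, 1))   # ~68 per cent band: (51, 61)
print(true_score_band(56, 5, 2))   # ~95 per cent band: (46, 66)
```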
In the end, the statistical rationale is not important. What is important is to recognise how we can use the standard error of measurement to inform decisions that we take on the basis of test scores. We should, for example, be very wary of taking important negative decisions about people’s future if the standard error of measurement indicates that their true score is quite likely to be equal to or above the score that would lead to a positive decision, even though their actual score is below it. For this reason, all published tests should provide users with not only the reliability coefficient but also the standard error of measurement.
A relatively new approach to the statistical analysis of test data, known as item response theory (IRT), allows an even better estimate of how far an individual test taker’s actual score is likely to diverge from their true score. While classical analysis gives us a single estimate for all test takers, IRT gives an estimate for each individual, basing this estimate on that individual’s performance on each of the items on the test. Examples of this estimate, usually referred to as the ‘standard error’ of the individual’s score, can be found in Appendix 1.
What has been said so far in this chapter has been concerned with the consistency of scores that candidates obtain on a test. In criterion referenced testing, we are often less interested in scores than in whether a candidate has reached the criterion which has been set. In this case, the consistency which we are looking for is referred to as ‘decision consistency’ (rather than ‘reliability’).
We want to know whether a test is consistent in deciding whether or not candidates have reached the criterion. Imagine a case where 50 candidates take a test (perhaps two alternate forms of it) twice. Those who reach the criterion may be called ‘masters’ (in the sense of having mastered the skills, or whatever, that are being tested) and those who do not reach it may be called ‘non-masters’. Of the 50 candidates:
18 are masters on both occasions
15 are non-masters on both occasions
9 are masters on the first occasion but non-masters on the second
8 are non-masters on the first occasion but masters on the second
So, out of 50 candidates, 33 are assigned to the same category (master or non-master) on both occasions. Thirty-three out of 50 can be expressed as a percentage (66%) or as a proportion (0.66). This last value, 0.66, is known as the ‘per cent agreement’ and is an accepted estimate of decision consistency. For other methods of estimating decision consistency (and they are not limited to just two groups, masters and non-masters), see the Further Reading section.
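The per cent agreement calculation can be expressed in a few lines of code. The master/non-master classifications below are constructed to match the figures in the worked example.

```python
# Per cent agreement as an estimate of decision consistency.
# Each pair is (classification on first occasion, on second occasion).

def percent_agreement(pairs):
    """Proportion of candidates classified the same way on both occasions."""
    agree = sum(1 for a, b in pairs if a == b)
    return agree / len(pairs)

# 18 master/master, 15 non/non, 9 master/non, 8 non/master (as above).
pairs = ([("M", "M")] * 18 + [("N", "N")] * 15 +
         [("M", "N")] * 9 + [("N", "M")] * 8)

print(percent_agreement(pairs))   # 0.66
```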
We have seen the importance of reliability. If a test is not reliable then we know that the actual scores of many individuals are likely to be quite different from their true scores. This means that we can place little reliance on those scores. Even where reliability is quite high, the standard error of measurement (or the standard errors obtained through IRT) serves to remind us that in the case of some individuals there is quite possibly a large discrepancy between actual score and true score. This should make us very cautious about making important decisions on the basis of the test scores of candidates whose actual scores place them close to the cut-off point (the point that divides ‘passes’ from ‘fails’). We should at least consider the possibility of gathering further relevant information on the language ability of such candidates.
Having seen the importance of reliability, we shall consider, later in the chapter, how to make our tests more reliable. Before that, however, we shall look at another aspect of reliability.
In the first example given in this chapter we spoke about scores on a multiple choice test. It was most unlikely, we thought, that every candidate would get precisely the same score on both of two possible administrations of the test. We assumed, however, that scoring of the test would be ‘perfect’. That is, if a particular candidate did perform in exactly the same way on the two occasions, they would be given the same score on both occasions. That is, any one scorer would give the same score on the two occasions, and this would be the same score as would be given by any other scorer on either occasion.
It is possible to quantify the level of agreement given by the same or different scorers on different occasions by means of a scorer reliability coefficient, which can be interpreted in the same way as the test reliability coefficient. In the case of the multiple choice test just described, the scorer reliability coefficient would be 1. As we noted in chapter 3, when scoring requires no judgement, and could in principle or in practice be carried out by a computer, the test is said to be objective. Only carelessness should cause the scorer reliability coefficients of objective tests to fall below 1.
However, we did not make the assumption of perfectly consistent scoring in the case of the interview scores discussed earlier in the chapter. It would probably have seemed to the reader an unreasonable assumption. We can accept that scorers should be able to be consistent when there is only one easily recognised correct response. But when a degree of judgement is called for on the part of the scorer, as in the scoring of performance in an interview, perfect consistency is not to be expected. Such subjective tests will not have scorer reliability coefficients of 1. Indeed there was a time when many people thought that scorer reliability coefficients (and also the reliability of the test) would always be too low to justify the use of subjective measures of language ability in serious language testing. This view is less widely held today. While the perfect reliability of objective tests is not obtainable in subjective tests, there are ways of making it sufficiently high for test results to be valuable. It is possible, for instance, to obtain scorer reliability coefficients of over 0.9 for the scoring of compositions.
It is perhaps worth making explicit something about the relationship between scorer reliability and test reliability. If the scoring of a test is not reliable, then the test results cannot be reliable either. Indeed the test reliability coefficient will almost certainly be lower than scorer reliability, since other sources of unreliability will be additional to what enters through imperfect scoring. In a case I know of, the scorer reliability coefficient on a composition writing test was .92, while the reliability coefficient for the test was .84. Variability in the performance of individual candidates accounted for the difference between the two coefficients.
How to make tests more reliable
As we have seen, there are two components of test reliability: the performance of candidates from occasion to occasion, and the reliability of the scoring. We will begin by suggesting ways of achieving consistent performances from candidates and then turn our attention to scorer reliability.
Take enough samples of behaviour
Other things being equal, the more items that you have on a test, the more reliable that test will be. This seems intuitively right. If we wanted to know how good an archer someone was, we wouldn’t rely on the evidence of a single shot at the target. That one shot could be quite unrepresentative of their ability. To be satisfied that we had a really reliable measure of the ability we would want to see a large number of shots at the target.
The same is true for language testing. It has been demonstrated empirically that the addition of further items will make a test more reliable. There is even a formula (the Spearman-Brown formula, see Appendix 1) that allows one to estimate how many extra items similar to the ones already in the test will be needed to increase the reliability coefficient to a required level. One thing to bear in mind, however, is that the additional items should be independent of each other and of existing items. Imagine a reading test that asks the question: ‘Where did the thief hide the jewels?’ If an additional item following that took the form, ‘What was unusual about the hiding place?’, it would not make a full contribution to an increase in the reliability of the test. Why not? Because it is hardly possible for someone who got the original question wrong to get the supplementary question right. Such candidates are effectively prevented from answering the additional question; for them, in reality, there is no additional question. We do not get an additional sample of their behaviour, so the reliability of our estimate of their ability is not increased.
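The Spearman-Brown formula itself is simple enough to sketch. The reliability values below are invented for illustration: one function predicts the reliability of a test made k times as long, and the other inverts the formula to find the lengthening factor needed to reach a target.

```python
# A sketch of the Spearman-Brown ('prophecy') formula for estimating
# the effect of lengthening a test. Values are invented for illustration.

def lengthened_reliability(r, k):
    """Estimated reliability after making the test k times as long."""
    return k * r / (1 + (k - 1) * r)

def length_factor_needed(r, target):
    """How many times longer the test must be to reach target reliability."""
    return target * (1 - r) / (r * (1 - target))

r = 0.70  # current reliability coefficient (invented)
print(round(lengthened_reliability(r, 2), 2))    # reliability if doubled
print(round(length_factor_needed(r, 0.85), 2))   # factor needed for .85
```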
Each additional item should as far as possible represent a fresh start for the candidate. By doing this we are able to gain additional information on all of the candidates – information that will make test results more reliable. The use of the word ‘item’ should not be taken to mean only brief questions and answers. In a test of writing, for example, where candidates have to produce a number of passages, each of those passages is to be regarded as an item. The more independent passages there are, the more reliable will be the test. In the same way, in an interview used to test oral ability, the candidate should be given as many ‘fresh starts’ as possible. More detailed implications of the need to obtain sufficiently large samples of behaviour will be outlined later in the book, in chapters devoted to the testing of particular abilities.
While it is important to make a test long enough to achieve satisfactory reliability, it should not be made so long that the candidates become so bored or tired that the behaviour they exhibit becomes unrepresentative of their ability. At the same time, it may often be necessary to resist pressure to make a test shorter than is appropriate. The usual argument for shortening a test is that it is not practical for it to be longer. The answer to this is that accurate information does not come cheaply: if such information is needed, then the price has to be paid. In general, the more important the decisions based on a test, the longer the test should be. Jephthah used the pronunciation of the word ‘shibboleth’ as a test to distinguish his own men from Ephraimites, who could not pronounce sh. Those who failed the test were executed. Any of Jephthah’s own men killed in error might have wished for a longer, more reliable test.
Exclude items which do not discriminate well between weaker and stronger students
Items on which strong students and weak students perform with similar degrees of success contribute little to the reliability of a test. Statistical analysis of items (Appendix 1) will reveal which items do not discriminate well. These are likely to include items which are too easy or too difficult for the candidates, but not only such items. A small number of easy, non-discriminating items may be kept at the beginning of a test to give candidates confidence and reduce the stress they feel.
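One simple statistic of this kind, sketched here with invented item data, is the discrimination index: the proportion of a high-scoring group answering an item correctly minus the proportion of a low-scoring group doing so. Items with an index near zero discriminate poorly.

```python
# A simple discrimination index for a single item, using invented data.

def discrimination_index(high_group, low_group):
    """Both arguments are lists of 0/1 scores on a single item."""
    p_high = sum(high_group) / len(high_group)
    p_low = sum(low_group) / len(low_group)
    return p_high - p_low

# An item answered correctly by 9 of 10 strong candidates but only
# 3 of 10 weak candidates discriminates well:
good_item = discrimination_index([1] * 9 + [0] * 1, [1] * 3 + [0] * 7)

# An item answered correctly by 6 of 10 in both groups does not:
poor_item = discrimination_index([1] * 6 + [0] * 4, [1] * 6 + [0] * 4)

print(round(good_item, 2), round(poor_item, 2))
```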
Do not allow candidates too much freedom
In some kinds of language test there is a tendency to offer candidates a choice of questions and then to allow them a great deal of freedom in the way that they answer the ones that they have chosen. An example would be a test of writing where the candidates are simply given a selection of titles from which to choose. Such a procedure is likely to have a depressing effect on the reliability of the test. The more freedom that is given, the greater is likely to be the difference between the performance actually elicited and the performance that would have been elicited had the test been taken, say, a day later. In general, therefore, candidates should not be given a choice, and the range over which possible answers might vary should be restricted. Compare the following writing tasks:
- Write a composition on tourism.
- Write a composition on tourism in this country.
- Write a composition on how we might develop the tourist industry in this country.
- Discuss the following measures intended to increase the number of foreign tourists coming to this country:
- More/better advertising and/or information (Where? What form should it take?)
- Improved facilities (hotels, transportation, communication, etc.)
- Training of personnel (guides, hotel managers, etc.)
The successive tasks impose more and more control over what is written. The fourth task is likely to be a much more reliable indicator of writing ability than the first. The general principle of restricting the freedom of candidates will be taken up again in chapters relating to particular skills. It should perhaps be said here, however, that in restricting the students we must be careful not to distort too much the task that we really want to see them perform. The potential tension between reliability and validity is addressed at the end of the chapter.
Write unambiguous items
It is essential that candidates should not be presented with items whose meaning is not clear or to which there is an acceptable answer which the test writer has not anticipated. In a reading test I once set the following open-ended question, based on a lengthy reading passage about English accents and dialects: ‘Where does the author direct the reader who is interested in non-standard dialects of English?’ The expected answer was the Further Reading section of the book. A number of candidates answered ‘page 3’, which was the place in the text where the author actually said that the interested reader should look in the Further Reading section. Only the alertness of those scoring the test revealed that there was a completely unanticipated correct answer to the question. If that had not happened, a correct answer would have been scored as incorrect. The fact that an individual candidate might interpret the question in different ways on different occasions means that the item is not contributing fully to the reliability of the test.
The best way to arrive at unambiguous items is, having drafted them, to subject them to the critical scrutiny of colleagues, who should try as hard as they can to find alternative interpretations to the ones intended. If this task is entered into in the right spirit – one of good-natured perversity – most of the problems can be identified before the test is administered. Pre-testing of the items on a group of people comparable to those for whom the test is intended (see chapter 7) should reveal the remainder. Where pre-testing is not practicable, scorers must be on the lookout for patterns of response that indicate that there are problem items.
Provide clear and explicit instructions
This applies both to written and oral instructions. If it is possible for candidates to misinterpret what they are asked to do, then on some occasions some of them certainly will. It is by no means always the weakest candidates who are misled by ambiguous instructions; indeed it is often the better candidates who are able to provide the alternative interpretation. A common fault of tests written for the students of a particular teaching institution is the supposition that the students all know what is intended by carelessly worded instructions. The frequency of the complaint that students are unintelligent or have wilfully misunderstood what they were asked to do reveals that the supposition is often unwarranted. Test writers should not rely on the students’ powers of telepathy to elicit the desired behaviour. Again, the use of colleagues to criticise drafts of instructions (including those which will be spoken) is the best means of avoiding problems. Spoken instructions should always be read from a prepared text in order to avoid introducing confusion.
Ensure that tests are well laid out and perfectly legible
Too often, institutional tests are badly typed (or handwritten), have too much text in too small a space, and are poorly reproduced. As a result, students are faced with additional tasks which are not ones meant to measure their language ability. Their variable performance on the unwanted tasks will lower the reliability of a test.
Make candidates familiar with format and testing techniques
If any aspect of a test is unfamiliar to candidates, they are likely to perform less well than they would do otherwise (on subsequently taking a parallel version, for example). For this reason, every effort must be made to ensure that all candidates have the opportunity to learn just what will be required of them. This may mean the distribution of sample tests (or of past test papers), or at least the provision of practice materials in the case of tests set within teaching institutions.
Provide uniform and non-distracting conditions of administration
The greater the differences between one administration of a test and another, the greater the differences one can expect between a candidate’s performance on the two occasions. Great care should be taken to ensure uniformity. For example, timing should be specified and strictly adhered to; the acoustic conditions should be similar for all administrations of a listening test. Every precaution should be taken to maintain a quiet setting with no distracting sounds or movements.
We turn now to ways of obtaining scorer reliability, which, as we saw above, is essential to test reliability.
Use items that permit scoring which is as objective as possible
This may appear to be a recommendation to use multiple choice items, which permit completely objective scoring. This is not intended. While it would be a mistake to say that multiple choice items are never appropriate, it is certainly true that there are many circumstances in which they are quite inappropriate. What is more, good multiple choice items are notoriously difficult to write and always require extensive pre-testing. A substantial part of chapter 8 is given over to the shortcomings of the multiple choice technique.
An alternative to multiple choice is the open-ended item which has a unique, possibly one-word, correct response which the candidates produce themselves. This too should ensure objective scoring, but in fact problems with such matters as spelling which make a candidate’s meaning unclear (say, in a listening test) often make demands on the scorer’s judgement. The longer the required response, the greater the difficulties of this kind. One way of dealing with this is to structure the candidate’s response by providing part of it. For example, the open-ended question ‘What was different about the result?’ may be designed to elicit the response ‘Success was more closely associated with high motivation’. This is likely to cause problems for scoring. Greater scorer reliability will probably be achieved if the question is followed by:
……………………… was more closely associated with …………………
Items of this kind are discussed in later chapters.
Make comparisons between candidates as direct as possible
This reinforces the suggestion already made that candidates should not be given a choice of items and that they should be limited in the way that they are allowed to respond. Scoring compositions all on one topic will be more reliable than if the candidates are allowed to choose from six topics, as has been the case in some well-known tests. The scoring should be all the more reliable if the compositions are guided as in the example above, in the section ‘Do not allow candidates too much freedom’.
Provide a detailed scoring key
This should specify acceptable answers and assign points for acceptable partially correct responses. For high scorer reliability the key should be as detailed as possible in its assignment of points. It should be the outcome of efforts to anticipate all possible responses and should have been subjected to group criticism. (This advice applies only where responses can be classed as partially or totally ‘correct’, not in the case of compositions, for instance.)
Train scorers
This is especially important where scoring is most subjective. The scoring of compositions, for example, should not be assigned to anyone who has not learned to score compositions from past administrations accurately. After each administration, patterns of scoring should be analysed. Individuals whose scoring deviates markedly and inconsistently from the norm should not be used again.
Agree acceptable responses and appropriate scores at the outset of scoring
A sample of scripts should be taken immediately after the administration of the test. Where there are compositions, archetypical representatives of different levels of ability should be selected. Only when all scorers agree on the scores to be given to these should real scoring begin. More will be said in chapter 9 about the scoring of compositions.
For short answer questions, the scorers should note any difficulties they have in assigning points (the key is unlikely to have anticipated every relevant response), and bring these to the attention of whoever is supervising that part of the scoring. Once a decision has been taken as to the points to be assigned, the supervisor should convey it to all the scorers concerned.
Identify candidates by number, not name
Scorers inevitably have expectations of candidates that they know. Except in purely objective testing, this will affect the way that they score. Studies have shown that even where the candidates are unknown to the scorers, the name on a script (or a photograph) will make a significant difference to the scores given. For example, a scorer may be influenced by the gender or nationality of a name into making predictions which can affect the score given. The identification of candidates only by number will reduce such effects.
Employ multiple, independent scoring
As a general rule, and certainly where testing is subjective, all scripts should be scored by at least two independent scorers. Neither scorer should know how the other has scored a test paper. Scores should be recorded on separate score sheets and passed to a third, senior, colleague, who compares the two sets of scores and investigates discrepancies.
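The senior colleague’s comparison can be as simple as flagging pairs of scores that differ by more than an agreed tolerance. A minimal sketch, with invented candidate numbers, scores, and tolerance:

```python
# Flagging discrepancies between two independent scorers' marks.
# Candidate numbers, scores, and the tolerance of 2 are invented.

def flag_discrepancies(scores_a, scores_b, tolerance=2):
    """Return candidate numbers whose two scores differ by more than
    the agreed tolerance and so need investigation."""
    return [cand for cand in scores_a
            if abs(scores_a[cand] - scores_b[cand]) > tolerance]

# Each scorer's marks, keyed by candidate number (not name).
scorer_a = {101: 14, 102: 9, 103: 17, 104: 11}
scorer_b = {101: 13, 102: 15, 103: 16, 104: 12}

print(flag_discrepancies(scorer_a, scorer_b))   # only candidate 102
```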
Reliability and validity
To be valid a test must provide consistently accurate measurements. It must therefore be reliable. A reliable test, however, may not be valid at all. For example, as a writing test we could require candidates to write down the translation equivalents of 500 words in their own language. This might well be a reliable test; but it is unlikely to be a valid test of writing.
In our efforts to make tests reliable, we must be wary of reducing their validity. Earlier in this chapter it was admitted that restricting the scope of what candidates are permitted to write in a composition might diminish the validity of the task. This depends in part on what exactly we are trying to measure by setting the task. If we are interested in candidates’ ability to structure a composition, then it would be hard to justify providing them with a structure in order to increase reliability. At the same time we would still try to restrict candidates in ways which would not render their performance on the task invalid.
There will always be some tension between reliability and validity. The tester has to balance gains in one against losses in the other.
- What published tests are you familiar with? Try to find out their reliability coefficients. What method was used to arrive at these? What are the standard errors of measurement?
- The TOEFL test has a standard error of measurement of 15. A particular American college states that it requires a score of 600 on the test for entry. What would you think of students applying to that college and making scores of 605, 600, 595, 590, 575?
- Look at your own institutional tests. Using the list of points in the chapter, say in what ways you could improve their reliability.
- What examples can you think of where there would be a tension between reliability and validity? In cases that you know, do you think the right balance has been struck?
For more on reliability in general and the relationship between different estimates of reliability and the different factors that account for it, see Anastasi and Urbina (1997). For reliability in educational measurement see Nitko (2001) and Feldt and Brennan (1989) – the latter being highly technical.
For four ‘relatively easy to calculate’ estimates of decision consistency see Brown (1990). For further discussion of consistency in criterion referenced testing, see Brown and Hudson (2002). Nitko (2001) also deals with decision consistency.
For what I think is an exaggerated view of the difficulty of achieving high reliability in more communicative tasks, see Lado (1961). This may have been written forty years ago, but the same beliefs are still expressed today in certain quarters.
- Because of the reduced length, which will cause the coefficient to be less than it would be for the whole test, a statistical adjustment has to be made using the Spearman-Brown formula (see Appendix 1).
- Note that a reliability coefficient can be misleading if there are even just a couple of candidates who score much higher (and/or much lower) than the others. The presence of such scores will cause the reliability coefficient to be misleadingly high. This is because the statistical methods used to estimate reliability compare the size of differences between candidates with the size of differences ‘within’ candidates (i.e. between a candidate’s two scores). The greater the relative difference between candidates, the greater will be the reliability coefficient. The differences between candidates will be exaggerated by the inclusion in the study of untypical candidates of the kind identified above. It is this which leads to an inappropriate estimate of reliability. See Nitko (2002) for details.
- These statistical statements are based on what is known about the way a person’s scores would tend to be distributed if they took the same test an indefinitely large number of times (without the experience of any test-taking occasion affecting performance on any other occasion). The scores would follow what is called a normal distribution (see Woods, Fletcher, and Hughes, 1986, for discussion beyond the scope of the present book). It is the known structure of the normal distribution which allows us to say what percentage of scores will fall within a certain range (for example, about 68 per cent of scores will fall within one standard error of measurement of the true score). Since about 68 per cent of actual scores will be within one standard error of measurement of the true score, we can be about 68 per cent certain that any particular actual score will be within one standard error of measurement of the true score.
- It should be clear that there is no such thing as a ‘good’ or a ‘bad’ standard error of measurement. It is the particular use made of particular scores in relation to a particular standard error of measurement which may be considered acceptable or unacceptable.
- A criterion-referenced test may be very consistent yet yield a low reliability coefficient. This is because candidates’ scores, although they classify the candidates consistently, may be very limited in range (see footnote 2). For this reason, it is recommended that one should use methods specifically designed for criterion-referenced tests.
- The reliability of one person scoring the same test responses on different occasions is called ‘intra-scorer reliability’; the reliability of different people scoring the same test responses is called ‘inter-scorer reliability’.