Welcome to Brighton 2018, come and join the fun!
The annual IATEFL conference starts officially tomorrow, Tuesday, but quite a few people were already here this morning, braving the blustery early morning rain to sign up for the Pre-Conference Event (PCE) day. This is a day when each of the special interest groups (SIGs) organises a complete day dedicated to an interesting aspect of its ‘area of expertise’. I belong to the Tea SIG, which does not mean that we sit around drinking cups of tea all day (although there was some of that as well): Tea stands for ‘Testing, Evaluation and Assessment’. This year was dedicated to testing listening, and the day consisted of thought-provoking talks in the morning, followed by lunch and then a practical focus in the afternoon as we mapped a listening activity and then developed items for it. These were then given a mini trial by another group, who provided feedback. This way of structuring the day worked well: everyone was fresh in the morning and receptive to the input provided in the traditional presentation format, but by the afternoon, I think, most people were happy to be doing something more practical. The idea was that everyone should go away having learned something that both made them think and was practical too.
Authenticity or Reliability… Hey, what about both?
John Field began the day by calling for listening tests to be designed in such a way as to realistically test what listeners at different levels can be expected to do. He underlined the need for ‘cognitive validity’, or rather asked: does the behaviour elicited from test takers correspond to the requirements of listening in real-world contexts? His model of how listening works follows various stages, although they are not necessarily linear: decoding the sounds comes first, followed by a word search, recognising the boundaries between one word and another. Then comes parsing, where different elements are recognised and labelled to some extent, followed by the construction of meaning and finally the construction of discourse. In a nutshell (which I probably shouldn’t say, but anyway), what this means is that at lower levels the focus should be on testing at word level, and only at higher levels can discourse meaning be tested. He stressed that ‘knowledge is not recognition’ and that if we are testing higher levels, we should be careful not to be testing complex cognitive processes which go beyond listening. John said a lot more, and I have barely done his fascinating talk any justice at all, but it was a great start to the day and he left us with this thought: the perceptual prominence of any word or clause is central to a correct response to the item. If something is not stressed in some way in a listening text, then it is not realistic to expect it to be identified, and it is unfair to create an item around it.
Sheila Thorn then took over and talked about authentic listening. She has long battled for this, taking on examination boards and doubters of all kinds. Her basic intuition is that so many people study a language, then go to the country where it is spoken and flounder around in the dark, understanding very little. Something is definitely wrong here. She suggests that rather than simplifying texts at lower levels, we should be giving learners longer authentic texts but testing them only on the content which is comprehensible, so that they are exposed to authentic listening but tested on content that they can understand. She also stressed how unrealistic it would be ‘in real life’ to answer multiple choice questions whilst listening. You don’t listen to a podcast and answer multiple choice questions, after all. She suggested that tasks should be more natural and connected to summarising skills, which is closer to what we might do normally when listening.
Yes, but what about the Stats?
Rita Green talked about the need to collect statistical evidence when developing tests. This means trialing items, ‘playing the detective’ as you evaluate and interpret the data you collect, and only then banking those items if they correspond to your requirements. If, for instance, an item proves to have distractors that are far too easy, or if very few people answer one question, it needs to be looked at and either revised or dropped. She described classical test theory, which mostly measures test-taking populations and the tests themselves, but added that modern test theory takes things a step further, actually examining individual test takers and, for instance, the degree of error associated with every item and every test taker. Modern test theory also looks at fit statistics, which does not mean how fast the test taker can run away from the examiner, but whether items or individuals perform in predictable or unpredictable ways. Care, of course, must be taken with how the data is collected and interpreted, but Rita concluded by saying that ‘without field trials and data analysis we are working blind: the more valid the test, the more reliable the test scores.’
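To give a flavour of what ‘playing the detective’ with trial data can look like, here is a minimal sketch of classical item analysis in Python. The data is entirely invented, and this is my own illustration, not Rita’s actual procedure or software: it computes each item’s facility (the proportion of test takers who got it right) and a crude discrimination index (whether high scorers tend to get the item right), the kind of numbers that would flag an item as too easy or poorly functioning.

```python
# Invented trial data: each row is one test taker's responses
# (1 = correct, 0 = incorrect) to five items.
responses = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
]

n_takers = len(responses)
n_items = len(responses[0])
totals = [sum(row) for row in responses]      # each taker's total score
mean_total = sum(totals) / n_takers

results = []
for i in range(n_items):
    item = [row[i] for row in responses]
    facility = sum(item) / n_takers           # proportion answering correctly
    # Crude discrimination: covariance between getting this item right
    # and the total score (real analyses use point-biserial or IRT models).
    disc = sum((item[k] - facility) * (totals[k] - mean_total)
               for k in range(n_takers)) / n_takers
    results.append((facility, disc))
    print(f"item {i + 1}: facility = {facility:.2f}, discrimination = {disc:.2f}")
```

In this toy data, item 4 is answered correctly by everyone, so its facility is 1.0 and its discrimination is 0: exactly the sort of item Rita would send back for revision or drop, since it tells you nothing about who listens well.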
The Afternoon Session
After lunch attention tends to flag somewhat, so this was the perfect moment to do something practical. Under the expert guidance of the afternoon moderator team of Thom Kiddle, Felicity O’Dell and Russell Whitehead, we embarked on a voyage of discovery through the process of item writing. This involved firstly listening to a text and mapping it for gist, key points etc., and then comparing our results in small groups. We then wrote items for that text (our groups were assigned multiple choice). We began by deciding on the context: the learners’ age, interests, needs etc., and whether we would allow them to watch the video or not. We decided not to, as a text appeared in the middle which provided the gist of the news story, so we would not have been testing listening. We then swapped items, trialed the ones produced by another group and added our constructive feedback. The items were then returned to the original writers. This was a perfect way to work in the afternoon, and whilst time was short, it gave a glimpse of what it means to be an item writer, which, I think, was extremely valuable for all involved.
A few thoughts
All this gave us a lot to think about, and the discussion with the panel later was interesting and quite lively at times. The question of test purpose was broached, as were other issues such as Global English, test context, and test taker aims and needs. The thorny issue of whether or not to opt for multiple choice also came up, and the answer was that although such items may not be natural, they are practical. Practicality was another key issue, one which jars somewhat with the notion of authenticity: marking authentic tasks such as summarising requires a lot more ‘rater power’ than multiple choice questions do. I, personally, do not feel that multiple choice items are ‘evil’, but they should be one of several options, and the needs of the test takers must be central. It is useless for PhD students who have to write long articles or theses to take tests that only require 250-word essays. I know that example is about writing, but it is just as true of listening: the IELTS listening exam is not ‘academic’, even for those doing the academic version. For students who intend to do MAs etc., surely it makes more sense to test them on their ability to listen to lectures. Apparently, however, IELTS will soon be revised, so, one step at a time, we are perhaps moving in new directions.
The fun?
If this all sounds quite serious to you and you’re wondering about the fun element, I have to say that serious things can be fun, but a large element of this conference is the social side of things. Friends meet up at the conference and exchange their news and experiences, and new friends are always made here. This evening saw the first event, a welcome reception culminating in dancing, so things definitely got off to a good start.
Hope to see you around the conference. 🙂