Test Development and Administration, December 13-17, 2004
The following is a Guest Discussion that took place on the NIFL-Assessment Listserv from December 13 through December 17, 2004. The topic is Test Development and Administration.
Summary of Discussion
The discussion was led by April Zenisky, Director of Computer-Based Testing Initiatives, and ABE Test Developer; Center for Educational Assessment, University of Massachusetts at Amherst. It opened with questions regarding the development of a test for managers in training in which the focus of the test was on critical thinking and problem-solving skills. April noted that computer-based tests (CBTs) can be more conducive to assessing non-tangible items such as critical thinking and problem-solving skills because they use various item formats other than multiple choice questions. An interest of April's is the degree to which adult education students are familiar with computer use and response actions (clicking, drag-and-drop, scrolling, and typing), and what their needs around computer skills generally are. This led to a discussion in which it was noted that certain tests that have switched from pencil/paper format to CBT are no longer equivalent because they change the domain of test items to include computer skill and use. It was also noted that the teacher or tester faces a learning curve in using CBTs as opposed to traditional paper and pencil tests. April's research has indicated that although the challenges are greater with CBTs because of the expanded domain, users like the benefits because they bring new skills for use in the real world. It was also noted that CBTs provide better opportunities for people with learning disabilities because of the possible expanded item formats. The discussion concluded with a focus on Universal Test Design. Several on-line resources related to the discussion topics were provided.
Good morning, afternoon, and evening to you all,
Our conversation on test admin and development runs from today, Dec. 13, through Friday, Dec. 17. I'm jumping right in here: IF YOUR STATE IS DEVELOPING ASSESSMENTS (or you already did that or you want to start doing it), WE WOULD LIKE TO HEAR FROM YOU!
Also, if your state/program/system is challenged by training test
administrators, please let us know what successes you are experiencing,
as well as the difficulties that you are faced with.
Here is the information on this week's guest topic:
Test Development and Administration
Guest: April Zenisky, Director of Computer-Based Testing Initiatives,
and ABE Test Developer; Center for Educational Assessment, University of Massachusetts at Amherst.
Suggested preparations for this week's discussion:
- "A Basic Primer for Understanding Standardized Tests and Using
April Zenisky, Lisa Keller, Stephen G. Sireci
(an overview of validity, reliability, interpreting scale scores,
standard error of measurement)
- "ACLS, SABES, UMass: Perfect Together!"
(an overview of the collaboration and products developed by these
agencies, including new statewide assessments)
Thanks and looking forward to the discussion.
Moderator, NIFL Assessment Discussion List, and
Coordinator/Developer LINCS Assessment Special Collection at
I am working with a large organization to develop assessments for managers in training. They want the questions to focus on measuring whether or not a management trainee can think critically and problem solve. The questions will focus on IMPACT and INDICATIONS of daily company procedures.
The types of questions need to assess if the trainee understands the
impact of a particular procedure on the productivity, bottom-line, etc. for the business. We also want to assess whether or not the trainees can identify the indicators that something is, or is not, going (or set up) according to procedures.
For example, does the trainee understand the IMPACT if department
materials are unorganized and/or dirty? That the mailbag has gone unopened? Can s/he identify the INDICATORS that something is not running according to procedure, is out of place or that an employee is not performing their job requirements?
This assessment needs to be written in standardized format (i.e., multiple
choice, true/false, some fill in the blank and a few short answers).
I've created many assessments for basic skills, so have an understanding of assessment development. However, I can see that testing intangible problem-solving skills will require a different approach. (By the way, the precise training modules have not been developed.)
I am working with the company to identify the proficiencies, but would
very much appreciate samples of how to phrase the questions so that they will accurately measure the proficiencies in the above-mentioned format.
Does anyone have samples or guidelines I might use?
Thanks for your input.
Julie's question is especially interesting to me as my own research interests as a test developer involve alternate item types (not multiple-choice) that are computer based. Specifically, I am interested in uses of innovative or novel types that are highly engaging to the test-taker but still auto-scored. In many cases, there might be one correct answer, but the questions might be written to allow for more than one right (or wrong) answer.
For example, the different item formats for use in computerized testing allow you to incorporate different actions such as clicking, drag-and-drop, scrolling, and typing. The drag-and-drop option is nice because people can rearrange onscreen items in different ways given the frame of the question (for example, put things in some order (numerical, alphabetical, chronological, etc.)) and it lets them visualize the groupings they are creating.
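To make the auto-scoring idea concrete, here is a minimal sketch of how a drag-and-drop ordering item might be scored when more than one arrangement counts as correct. Everything in it is an assumption for illustration only: the function name `score_ordering_item`, the pairwise partial-credit rule, and the sample dates are hypothetical and do not describe any particular testing system.

```python
from itertools import combinations


def score_ordering_item(response, accepted_orders):
    """Return a score from 0.0 to 1.0 for a drag-and-drop ordering item.

    response: list of labels in the order the test-taker left them on screen.
    accepted_orders: list of orderings that earn full credit (labels unique).
    Partial credit (an assumed rule) is the share of label pairs placed in the
    same relative order as the best-matching accepted ordering.
    """
    best = 0.0
    for key in accepted_orders:
        if sorted(response) != sorted(key):
            continue  # response does not contain the same labels as this key
        pairs = list(combinations(key, 2))
        if not pairs:
            continue  # fewer than two labels, nothing to order
        position = {label: i for i, label in enumerate(response)}
        # Count pairs the test-taker put in the same relative order as the key.
        agree = sum(1 for a, b in pairs if position[a] < position[b])
        best = max(best, agree / len(pairs))
    return best


# Hypothetical item: put three dates in chronological order.
key = [["1865", "1920", "1964"]]
full = score_ordering_item(["1865", "1920", "1964"], key)
partial = score_ordering_item(["1920", "1865", "1964"], key)
```

Because the score is computed against every accepted ordering and the best match is kept, the same function covers items with a single right answer and items where several arrangements are equally acceptable.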
So, here in Massachusetts my colleagues at the Center for Educational Assessment and I are working with the Massachusetts DoE and practitioners in the state to develop new, computerized assessments
that are directly based on the Massachusetts curriculum. The MA assessments will be different from the computerized BEST Plus in that the student will enter answers directly into the computer, which
is different from the BEST Plus' approach (which works well for that domain being tested -- a test administrator scores a student's spoken answers on a few dimensions and enters those into the computer).
One thing I'm interested in learning more about myself is people's experiences with adult learners and computerized testing. Some work I'm involved in now is finding out about adult learners' familiarity with computerized response actions (clicking, drag-and-drop, scrolling, and typing). Does anyone out there have any experience with or thoughts about using computer technology for testing with adults that they'd be interested in sharing with the list? More broadly, for practitioners, what do you find that your students can do with respect to using a computer and what skills do you see them needing assistance with? What might you consider to be the important
navigational and other features of computerized activities/assessments?
I'm looking forward to hearing from you!
This is a great question. So often we turn to computers to help us solve our problems, taking for granted that using a computer relies on skills that are far from universal. When the TOEFL switched to a computer-adaptive test I felt like I was the only one who noticed that they were comparing apples and oranges- a lone voice crying in the wilderness.
Suddenly students were required to write an essay on the computer instead of on paper, and they could no longer look back at questions once they had passed. Working in a second language is enough of a barrier to higher ed- forcing people to wait for computer literacy on TOP of that was seen by many learners as yet another hurdle. I don't think we can assume literacy.
One thing I have realized in working with the BEST Plus is how much Tech Support the TESTERS need- starting up a computer, opening a file, loading a CD, using a touch pad, even plugging it in. These are teachers; they are people who have graduated college, people with email accounts. We really need to be careful not to presume too much computer ability. Just because it seems like second nature to me, using a computer relies on a whole new set of reading, writing and cognitive skills compared to pencil and paper testing.
Thanks for asking!
Framingham Adult ESL PLUS
So great to hear from you -- about a year to the day of our scoring meeting in Amherst!
Your questioning of the comparability of computer-based tests (CBTs) to paper-and-pencil counterparts is well-founded. I see a lot of research on this topic, and often the results that are out there are very dependent on a confluence of domain content and population. Sometimes paper scores are 'better' (i.e., higher), and sometimes the computer scores are and there's not much rhyme or reason to it, but one common thread seems to be that people _like_ computerization more. Many
teachers and ABE students that I have spoken with recently have expressed valid concerns about fairness with computerized tests, but see how CBTs (and the practice with computer skills that comes
along with them) have benefits for students --- for example, I've been told that many employers (Home Depot and others) now require most job applications to be completed at computer workstations.
On the issue of score comparability and the TOEFL, often scores from a 'new' computerized test are not intended to be equivalent to the old paper-based test being replaced. I am less familiar with the specifics of the TOEFL testing program, but a similar example is the American Institute of Certified Public Accountants CPA Exam. The old paper test was based on test specifications from some number of years ago, and the new computerized version (released April 2004) was created in
response to a more recent job analysis that found that the information entry-level accountants needed had changed over the years, and that computers were an integral part of how accountants were doing their work. Thus, the logical choice for the CPAs was a computerized assessment with funky new formats (very performance-oriented with actual spreadsheets and tax forms and the like embedded in the test). I don't think that anyone at the AICPA, however, would argue that the old paper and new computerized CPA Exam scores are equivalent -- they measure somewhat different things by design.
I do know that the TOEFL folks did a lot of research into item types and what kinds of skills should be required of those test-takers, and the computer-adaptive test that they came out with is intended to reflect something a little different than the previous version.
Thanks also for bringing up the need for test administrator support -- as procedures and practices change from one mode of testing to another, _standardization_ between test administrations (a critical part of ensuring fairness for all test-takers) has to be maintained.
Yes, the issue does revolve around the domain being tested. I can see the CPA exam switching the definition of its behavioral domain to include computer ability.
However, I don't ever remember hearing that the TOEFL was redefining the domain it is intended to test to include computer skills. Sure, an argument can be made that these skills are a part of college success, but it's the Test of English as a Foreign Language, not TOE and Computer Skills for FL. TOECSFL? Doesn't have that ring! :-)
That argument ignores the fact that this creates an uneven playing field. Native English speakers take no computer literacy exam to determine entry.
Forgive the vitriol, April; it's not aimed at you; I know that you don't work on the TOEFL. It's just that this test has become a real gatekeeper for many people who have become close to me.
This is the problem- the advantages of computer-based testing are numerous, but do they add new skills that are not part of the domain? Or, do they compensate for skills that are part of the domain but may be lacking? Holding a pencil, or scanning text with your finger may be part of the domain of literacy being tested, and computers don't allow for that.
No worries, Kevin! This is the kind of conversation psychometricians live for. :-)
I like the new acronym (TOECSFL) you devised. As far as the TOEFL itself, I noodled around on the ETS website and found several research reports where they've looked into some of the computer familiarity issues in the TOEFL population (http://www.ets.org/ell/research/download.html). I'm inclined to view such reports as one source of evidence, and also I think that people's real-world experiences are another valuable source of data about how it works out in practice. Has the TOEFL construct changed? I just am not familiar enough with the old and new to answer for sure.
One 1998 report (linked on the above Web page) by Kirsch, Jamieson, Taylor, and Eignor looked at 90,000 TOEFL candidates and found that "16 percent of the TOEFL population was
judged to have low computer familiarity, another 34 percent to have moderate familiarity, and approximately 50 percent to have high familiarity" -- and how those low, medium, and high labels were defined is in the report. Then again, a July 2004 report by Breland, Lee, and Muraki looked at mode analysis for essays and saw a slight difference in scores favoring handwritten over computerized. On one hand, and then the other.....
One question you raised -- "do they add new skills that are not part of the domain?" -- is a critical validity question regarding construct-irrelevant variance. In other words, is some part of people's scores due to a skill other than the construct _supposed_ to be measured by the test? As part of those 'good practices' that I was getting into in my earlier posting in response to Eileen's question, there's a raft of quantitative and qualitative evidence to be collected to inform practice one way or another.
I think there's also some effort to strike a balance in standardized testing here too. I think the term "standardized test" is often viewed as another synonym for "bunch of multiple-choice questions" and one reason folks are looking to the computer is the potential to ask people to demonstrate their knowledge in more contextualized problems.
Here's an example of a contextualized question. Given an entire passage about eclipses (sp?), take the sentence "This is the only part of the Sun that can be seen during an eclipse such as the one in
February 1979." The task is, if the word "one" is bolded -- what does that refer to? Rather than just four decontextualized answer choices, a test-taker has the entire passage to choose from, thereby minimizing the test-wiseness part of MC questions -- sure it's harder, but perhaps more useful for identifying true mastery of antecedents and the like.
What do you think? Kevin, and/or others?
I'm glad to hear you're working on developing more computerized tests. My interest is focused on accessibility for people with disabilities, and a computerized assessment is inherently far more accessible for many types of disabilities, especially when combined with other accessible features/software that might interface with the assessment, like JAWS or other text-to-speech software programs. It would help, too, if the student were able to adjust the size/type of font, screen colors, and be able to highlight text for visual tracking. The presentation of one question at a time is also helpful, as well as the use of a keyboard or speech-to-text software for short-answer or essay questions. I'm sure there's many more possibilities....
Thanks for your good work!
Disabilities Project Manager
Arkansas Adult Learning Resource Center
Thanks for your thoughts! Accessibility is definitely one aspect that computerized testing can help with, particularly from the perspective of the principles of Universal Test Design. Indeed, a computerized test _can be_ preferable to paper for exactly all the reasons you mention -- there's a huge potential there for the test medium to be customized by and for the individual.
One thing that I wonder about in this regard is computer familiarity and different learners' knowledge of the buttons and bells and whistles to make some of those adjustments. Tutorials are far
and away critical to any computerized testing application, as are opportunities for navigating around practice systems prior to the "for real" test administration.
The issue of computer familiarity is one that is very important to us in our work in Massachusetts. Given the wide ranges of 1) student academic proficiencies and 2) students' computer familiarity
levels and 3) computer use in programs, the interaction between a test-taker and a computer to measure academic skills is an issue we are focusing on -- to what extent would content and not computer familiarity (construct-irrelevant variance) be tested by a computer-based test in our target population?
Just a footnote: here's a link to read some more about Universal Test Design, for those whose interest may be piqued....
[Thurlow, M., Quenemoen, R., Thompson, S., & Lehr, C. (2001). Principles and characteristics of inclusive assessment and accountability systems (Synthesis Report 40). Minneapolis, MN: National
Center on Educational Outcomes. Available at
I had heard of Universal Design but not of Universal Test Design. Could you talk about that a bit?
I am also very interested to hear how you see principles and practice
connecting in your work. For example, "I believe this and here's how it
shows up in what I do" or "I do this because I believe that."
In response to Eileen's questions about Universal Test Design:
(Patti, Marie, and other folks, please feel free to jump in too.) Universal Test Design is an evolving approach to developing assessments that are designed and developed from the beginning to be accessible and valid for the widest range of students, including
students with disabilities and students with limited English proficiency. Rather than thinking about making test accommodations that some might perceive as 'boosts' for individuals with disabilities or
English language learners, the thinking here (as with more general Universal Design) is that universal test design is intended to apply to all test takers by considering the entire test population at the outset of and throughout test development. What are the actual skills and
abilities being tested? How can all members of the testing population demonstrate their mastery? While the need for assistive technologies may not always be eliminated for all students, the approach is about being inclusive from the outset.
I say that UTD is evolving because it really is an emerging area of focus for research in testing. There's been a lot of excellent accommodations research done over the years that has really set the
stage for this, and increasingly there's an interest from the get-go in really thinking about the larger sense of fairness in testing and how that can be promoted to the greatest extent possible.
UTD has a role in all aspects of test development, from test conceptualization (defining constructs explicitly and including all students fully) and test construction (developing items minimizing
measurement of extra stuff (linguistic complexity, etc.)) to tryout (including full range in tryout), item analysis (complete statistical and sensitivity analyses to evaluate and eliminate items exhibiting differential item functioning), and revision.
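As a rough illustration of what a differential item functioning (DIF) check can look like, here is a minimal sketch using the Mantel-Haenszel common odds ratio, which is one commonly used DIF statistic. The thread does not say which statistic is used in this work, so the choice of method, the function name `mantel_haenszel_dif`, and the counts below are all assumptions made up for the example.

```python
import math


def mantel_haenszel_dif(strata):
    """Compute the Mantel-Haenszel odds ratio and delta-scale DIF index.

    strata: iterable of (ref_correct, ref_incorrect, focal_correct,
    focal_incorrect) counts for one item, one tuple per matched ability
    level (for example, a band of total test scores).
    Returns (alpha, delta), where delta = -2.35 * ln(alpha).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score bands
        numerator += a * d / n
        denominator += b * c / n
    if numerator == 0 or denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    alpha = numerator / denominator
    delta = -2.35 * math.log(alpha)
    return alpha, delta


# Hypothetical counts for one item across three score bands.
example_strata = [(30, 20, 25, 25), (40, 10, 35, 15), (45, 5, 44, 6)]
alpha, delta = mantel_haenszel_dif(example_strata)
```

An odds ratio near 1 (delta near 0) suggests that matched test-takers in the two groups perform similarly on the item; a negative delta means the item is relatively harder for the focal group, and such items would be candidates for the sensitivity review described above.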
Here's a link to a page from the National Center on Educational Outcomes that provides additional links to UTD resources.
As far as connecting principles and practices, Universal Test Design and the Standards for Educational and Psychological Testing (AERA, APA, NCME, 1999) are the specific texts that formally guide our work at UMass. One specific example comes from an evaluation I am doing right now with students and teachers at several adult learning centers in Massachusetts. Rather than making assumptions about the computer interface and environment that we are working in, I believe that the
students and their teachers are _the_ source for informing testing practice. Our findings thus far (with students at a range of levels) are informative in every way. Another example is working with the Massachusetts DOE and SABES staff on standardizing test administration procedures for the TABE, BEST Plus, and REEP. Of course, while the publishers of each of those assessments have provided documents of administration guidelines, helping to communicate the
importance of the "how you go about doing testing" is a big part of good measurement practice, to me.