Sufficient statistics are in to start a full-scale analysis of the Star Trek quiz. (The Babylon 5 quiz needs to be extended with more difficult questions before it can be effectively analyzed.) Details follow.

In the previous version, the only modelling was that of a single classroom. This one is far more involved. The responses were modelled on a two-parameter Rasch fit, which means each question has two assigned parameters: difficulty and discrimination. Unlike the single-classroom analysis of the last version, these definitions are driven by the ability of the student answering the question.

If a student has ability level x, then there is a certain probability of answering a question correctly. If the difficulty of the question has the same value x as the testee’s ability, then there is a 50% probability that the student answers correctly. Students with lower ability have a lower probability of answering correctly, while students with higher ability have a greater probability of answering correctly. The rate at which these probabilities change is related to the item’s discrimination value. A higher discrimination means the change is more sudden.
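For readers who prefer code to prose, here is a minimal Python sketch of that behaviour. It uses the logistic formula quoted later in this post; the function name `p_correct` and the example item (difficulty 42, with two made-up discrimination values) are assumptions for illustration, not part of the actual analysis.

```python
import math

def p_correct(ability, difficulty, discrimination):
    """Probability of a correct answer under the two-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# Hypothetical item with difficulty 42, shown at two discrimination values.
for a in (0.02, 0.05):
    probs = [round(p_correct(x, 42, a), 2) for x in (0, 42, 84)]
    print(f"discrimination {a}: P at ability 0, 42, 84 = {probs}")
```

At ability 42 both curves sit at exactly 50%. The higher discrimination pushes the probabilities at 0 and 84 further from 50%, which is the "more sudden" change described above.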

148 people responded to the quiz, and only 18 of them got 100%. 17 of the 20 questions can be assigned values on this scale. The remaining three questions (“Who played Spock,” “James T. Kirk’s middle name” and “bad shirt colour”) were too easy to be effectively modelled. (In fact, everybody answered “Who played Spock” correctly.) The rest were modelled with a basic structure as follows.

First, I needed to define the parameters of the normal distribution. These are extremely preliminary, as 12% of the testing population got 100%. When a testee scores 100%, his or her ability cannot be measured. At best, a confident lower limit can be assigned. In the long term, I intend to add enough difficult questions that nobody gets 100% on the test, but every question is answered correctly by at least one testee. Still, there appeared to be enough spread that the 18 testees who scored 100% could be removed and the remaining scores formed a decent normal (bell, Gaussian) distribution. Furthermore, that distribution’s upper tail would account for 14% of the population, which is close enough to the 12% that I went ahead with the fit using this data.

To complete the fit, one must arbitrarily choose the mean and standard deviation for the distribution. Both of these values are taken to be 42. Once that has been established, rudimentary fits can be performed to determine the difficulty and discrimination of the questions. Done properly, I would refit student performance, update the norms, and keep going back and forth with Newton-Raphson or Runge-Kutta methods until the fit is optimal for all parameters. That is the way this will be handled once I start programming the test software. While using Google Forms to administer the test, my analysis tools amount to spreadsheets, so I’m only doing a first generation fit at this point. Approximately 2/3 of testees fall between 0 and 84 for their overall performance.
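The "approximately 2/3" figure follows directly from the chosen scale: 0 and 84 are one standard deviation either side of the mean. A quick check in Python, assuming abilities are exactly normal with mean 42 and standard deviation 42 (the variable names are mine):

```python
from statistics import NormalDist

# Ability scale as defined in the post: mean 42, standard deviation 42.
ability = NormalDist(mu=42, sigma=42)

# Share of the population within one standard deviation of the mean.
share = ability.cdf(84) - ability.cdf(0)
print(f"{share:.1%}")  # 68.3%
```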

The actual test parameters came out as follows:

Question | Difficulty | Discrimination |
---|---|---|
1 – Series creator | -52 | 0.03 |
2 – Who played Spock? | Lower than we can measure | Indeterminate |
3 – T. in “James T. Kirk” | Can’t be statistically measured. (146 out of 148 answered this correctly.) | Indeterminate |
4 – Ship’s engineer | -57 | 0.04 |
5 – Last name only? | -110 – This is a very flaky fit. 129 out of 148 answered correctly. | 9 |
6 – Serial number of ship | -47 | 0.03 |
7 – First captain | 119 | 0.05 |
8 – Shirt colour | Too easy to measure. (144 out of 148 answered correctly.) | Indeterminate |
9 – First through Guardian of Forever | 33 | 0.04 |
10 – Real name of Leo Walsh | 46 | 0.035 |
11 – When was Kirk surgically altered? | 91 | 0.015 |
12 – Theme music composer | 92 | 0.025 |
13 – Art director / production designer reference on TNG? | 1 | 0.06 |
14 – Final episode? | 122 | 0.02 |
15 – Cyrano Jones profession | -3 | 0.03 |
16 – What is quadrotriticale? | 56 | 0.03 |
17 – Which episode was referenced in the 30 year anniversary? | 1 | 0.05 |
18 – Species that experiences Pon Farr? | -34 | 0.05 |
19 – Episode without nasty kids? | 36 | 0.02 |
20 – Episode with no sequel? | 20 | 0.02 |

Our DC comics quiz is complete, and our Marvel comics quiz is in progress. Expect to see the first of these on Monday.

Here’s my thing with this chart: I have no idea what the hell it means.

The second column on the chart represents the relative difficulty of the questions. If 50% of the testees answered the question correctly, then the difficulty is 42. If only 1/6 answered correctly, then the difficulty is 84, and so forth. The discrimination determines how quickly the questions transition from people getting them wrong to getting them right as ability increases. Let’s use question 10 as an example.
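One way to read that rule of thumb is that the difficulty is the ability level exceeded by the same fraction of the population that answered correctly. The sketch below assumes exactly that, on a normal ability scale with mean 42 and standard deviation 42. The helper name `difficulty_from_fraction` is hypothetical, and this is a rough first-pass reading rather than the actual spreadsheet fit.

```python
from statistics import NormalDist

# Ability scale from the post: mean 42, standard deviation 42.
ability = NormalDist(mu=42, sigma=42)

def difficulty_from_fraction(fraction_correct):
    """Ability level that exactly `fraction_correct` of the population exceeds.

    Only defined for fractions strictly between 0 and 1, so a question that
    everybody answered correctly cannot be placed on the scale this way.
    """
    return ability.inv_cdf(1.0 - fraction_correct)

print(difficulty_from_fraction(0.5))    # the mean of the scale
print(difficulty_from_fraction(1 / 6))  # roughly one standard deviation above the mean
```

Under this assumption 50% correct lands on 42, and 1/6 correct lands a little below 84, since one standard deviation above the mean cuts off slightly less than 1/6 of a normal population.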

The difficulty of question 10 is 46; just over half of the testing population answered the question correctly. (The mean and standard deviation are both 42. In high school math terms, this has a z-score of 0.095.) With a discrimination of 0.035, we get a relatively rapid transition between people answering correctly and incorrectly. (The probability P of answering correctly, with difficulty b, discrimination a and ability x, is P = 1/(1+EXP(-a(x-b))) in this model.)
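The arithmetic for question 10 can be reproduced in a few lines of Python. The values come from the table above, while the variable names and the choice to evaluate the model at the mean ability are mine, for illustration only.

```python
import math

mean, sd = 42, 42   # ability scale chosen in the post
b, a = 46, 0.035    # question 10: difficulty and discrimination

# z-score of the question's difficulty on the ability scale.
z = (b - mean) / sd
print(round(z, 3))  # 0.095

# Model probability that a testee of exactly average ability answers correctly.
p_at_mean = 1.0 / (1.0 + math.exp(-a * (mean - b)))
print(round(p_at_mean, 3))
```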

The discrimination is easier to see when comparing questions 11 and 12. Their difficulty scores are nearly identical (91 and 92, respectively). However, testees with an estimated ability of -50 had approximately a 10% chance of answering #11 correctly, while they only had approximately a 2.5% chance of answering #12 correctly. #11 does a weaker job of distinguishing between people who are above and below the 91 point difficulty score.
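Those two percentages fall out of the formula given earlier. Here is a small sketch using the table values for questions 11 and 12; the function name `p_correct` is an assumption of this example.

```python
import math

def p_correct(ability, difficulty, discrimination):
    """Two-parameter logistic probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# Table values: question 11 (b=91, a=0.015) and question 12 (b=92, a=0.025).
p11 = p_correct(-50, 91, 0.015)
p12 = p_correct(-50, 92, 0.025)
print(f"#11: {p11:.1%}   #12: {p12:.1%}")
```

Both results are close to the rounded figures quoted above: roughly one in ten for #11 and a few percent for #12.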

Questions 7 and 14 also had comparable difficulty scores (119 and 122), but their discrimination scores are dramatically different (0.05 and 0.02). You must be more capable to have a 50% chance of answering #14 correctly than #7. However, a person only needs an ability of approximately 12 to have a 10% chance of answering #14 correctly, but a person needs an approximate ability of 75 before he or she has a 10% chance of answering #7 correctly. Question 7 does a much better job of “detecting” if someone has a high ability than question 14 does.
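The ability thresholds quoted for questions 7 and 14 come from solving the same formula for x instead of P, which gives x = b + ln(P/(1-P))/a. A minimal sketch follows; the helper name `ability_for` is hypothetical.

```python
import math

def ability_for(p, difficulty, discrimination):
    """Ability at which the model gives probability p of a correct answer."""
    return difficulty + math.log(p / (1.0 - p)) / discrimination

# Table values: question 14 (b=122, a=0.02) and question 7 (b=119, a=0.05).
x14 = ability_for(0.10, 122, 0.02)
x7 = ability_for(0.10, 119, 0.05)
print(round(x14, 1), round(x7, 1))  # 12.1 75.1
```

Setting p to 0.5 returns the difficulty itself, which is the 50% point of the curve.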

The specific values are primarily interesting to those of us who actually model these responses. I included the chart for those who want to compare to the contents of the summer school course on assessment. If you are not in that group, then the main point is that we are building questions that fit the models so effectively that we are confident the high quality geek test we are developing is going to work.

I should’ve probably phrased my statement in a way that better represents the fact that I am a math teacher’s worst nightmare (; I don’t understand that chart and you are not going to make me understand it. I’m one of those people with a complete blind-spot for math and numbers in general. I sometimes manage to fake it by use of eidetic memory and I sometimes find statistics interesting, but the process of math just does not ever work for me.

Okay, how about a short version:

Higher difficulty means you have to know your Star Trek better to answer correctly. (In other words, higher difficulty means it’s more difficult. That sounds obvious but believe me, it isn’t.) 42 = “average” difficulty.

Discrimination means the question does a better job of identifying how well people know their stuff. That means the question is more useful for this purpose. 0.04 means it does a pretty good job. 0.02 isn’t great.

(That sound about right, Blaine?)

In a nutshell, yes.

How about publishing the scores per respondent? Yes, I took this and the B5 quiz.

I can’t publish per respondent without publishing the e-mail addresses that were provided, and I’m not about to do that. If you want to know your individual performance, send me a private e-mail from your listed e-mail address or including your unique identifier and I’ll send you your results.

What I can provide is the breakdown in a frequency table.

Score: Number of responses with that score.

0% – 15%: 0 responses

20%: 1 response

25%: 0 responses

30%: 1 response

35%: 1 response

40%: 2 responses

45%: 1 response

50%: 3 responses

55%: 9 responses

60%: 10 responses

65%: 12 responses

70%: 13 responses

75%: 10 responses

80%: 21 responses

85%: 18 responses

90%: 17 responses

95%: 11 responses

100%: 18 responses

Mean: 78%

Median and mode: 80%

Standard deviation: 17%
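Anyone who wants to check those summary figures can rebuild them from the frequency table. The sketch below types the counts in by hand and uses Python's standard `statistics` module; the variable names are mine.

```python
from statistics import mean, median, multimode, stdev

# Score (percent) -> number of responses, copied from the table above.
counts = {20: 1, 30: 1, 35: 1, 40: 2, 45: 1, 50: 3, 55: 9, 60: 10,
          65: 12, 70: 13, 75: 10, 80: 21, 85: 18, 90: 17, 95: 11, 100: 18}

# Expand the table into one entry per respondent.
scores = [score for score, n in counts.items() for _ in range(n)]

print(len(scores))            # 148 respondents
print(round(mean(scores)))    # 78
print(median(scores))         # 80.0
print(multimode(scores))      # [80]
print(round(stdev(scores), 1))
```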

The Marvel test is in progress. Can we have a link please? I may be an expert for Babylon 5, but I’d be the perfect average student for a marvel test. :-)

Never mind, I thought you were talking about the execution of the test, not the creation.

Yes, I know reading comprehension is important for tests AND blogs. :-)