A Case for Relative Grading

Absolute grading is, and has historically been, the overwhelming standard in the world's formal education institutions. Generally speaking, in this system each question on a test is worth a predetermined, fixed number of points, and each student is awarded between zero and that many points for the question, depending on the correctness and completeness of their answer.

I argue that this simplistic method violates at least two fundamental principles of accurate evaluations of competence. I will show that relative grading, in which students’ scores are evaluated in terms of other students’ results as well as their own, has the potential to mitigate these issues with absolute grading by accounting for factors that it neglects.

It is worth noting that the chief benefits of a more accurate grading system do not lie in the increased accuracy itself, per se. First, a grading system that accurately captures the extent of students' knowledge and understanding incentivizes them to study and to pay attention during lectures.

Second, evaluating students by comparing them to their peers motivates each of them to reach their full potential. Relatively even grade distributions will incentivize the more advanced to get ahead, while relatively uneven grade distributions will incentivize the less advanced to keep up. 

Third, a relative grading system discourages cheating and yet could potentially encourage collaboration in the appropriate circumstances. 

Fourth, a relative grading system can hold teachers more accountable by making it much harder for them to manipulate the difficulty of their tests so that their students' grade distributions reflect better on them.

Finally, a relative grading system, especially one guided by the two principles explained below, facilitates an "answer at your own risk" paradigm designed to gauge students' confidence in their answers.

The first principle that absolute grading, as defined above, violates is that the ratio of the reward for getting a question completely right to the penalty for getting it completely wrong should be proportional to the question's difficulty. Ideally, the difficulty of each question would be determined retroactively, by software, from the group's scores on that question.

In an absolute grading system, if a question is worth k points, this could be translated into a reward of k/2 points and a penalty of k/2 points, making the ratio one. However, any two values that add up to k will work. For this reason especially, one could argue that the standard system defined in the introduction is already self-balancing, in the sense that the expected score on each question is inversely proportional to its difficulty in the first place.

This reasoning would indeed be valid if the test included every conceivable question, as there could then be no sampling error. In any case, the lack of an objective metric for question difficulty makes it only fair to assign all students the same set of questions, so that none has an unfair advantage. A small extension that is sometimes employed is to allow students to select a subset of the questions to answer.

However, this can only work in the current system if either the optional questions and their respective weights are known to the student before their selection, or only the optional questions themselves are known in advance and every allowed subset carries the same total weight. In summary, implementation of the first principle primarily affects not the relative distribution of the students' scores on an identical test but their absolute distribution, so to speak. Thus, the first principle primarily provides a more sophisticated way to grade on a curve.
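
As a sketch of how the first principle might look in code, here is one possible implementation in Python. The difficulty estimate (one minus the group's mean fractional score) and the proportionality constant c are my own assumptions for illustration, not part of the proposal itself:

```python
def difficulty(scores):
    """Retroactive difficulty estimate: one minus the group's mean score,
    where each score is a fraction of full credit in [0, 1]."""
    return 1.0 - sum(scores) / len(scores)

def reward_penalty_split(k, d, c=1.0):
    """Split a question's k points into a reward and a penalty such that
    reward / penalty = c * d, where d is the question's difficulty in (0, 1)
    and c is an assumed proportionality constant."""
    reward = k * c * d / (1.0 + c * d)
    penalty = k - reward
    return reward, penalty
```

For a 10-point question that half the class answered perfectly and half missed entirely, the estimated difficulty is 0.5, so (with c = 1) the reward is one third of the points, the penalty two thirds, and their ratio is exactly the difficulty.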

The second principle is that the weight of each question should be inversely proportional to the standard deviation of the group's scores on it. Software is especially important for this principle's implementation, as this factor is both considerably harder to predict and much more computationally intensive to determine. There are three main justifications for this principle. First, standard deviation positively correlates with randomness. Second, it provides for an arbitrarily large set of questions, which encourages students to study more comprehensively and, relatedly, may well allow them to "play to their strengths," depending on the test's rules and whether it is feasible to complete the test in the given time. Third, it "intelligently" rewards marginally better scores on questions that reflect disproportionately more preparation.
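
A minimal sketch of the weighting itself, assuming scores are recorded as fractions of full credit; the function name and the guard against zero variance are my own additions:

```python
import statistics

def question_weights(scores_by_question, floor=1e-6):
    """Weight each question inversely to the population standard deviation
    of the group's scores on it (each score a fraction of full credit)."""
    weights = {}
    for question, scores in scores_by_question.items():
        sd = statistics.pstdev(scores)
        # A zero standard deviation (everyone scored alike) would give an
        # infinite weight; cap it, since such a question cannot change any
        # student's relative standing anyway.
        weights[question] = 1.0 / max(sd, floor)
    return weights
```

A question on which the class split evenly between full and zero credit gets a low weight, while one on which the scores bunched tightly gets a high weight.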

As an example of the first justification, take true-or-false questions. If half of the class got a question right and half got it wrong, this is exactly the result expected from a completely ignorant group, and it is also the point at which the standard deviation of all-or-nothing scores reaches its maximum. Similarly, if a multiple-choice question has four possible answers and only a fourth of the class answered correctly, this would almost surely indicate a problem with the question, even more so than in the true-or-false case, since matching the pure-guessing rate is a stronger signal when there are four options; here, too, the standard deviation of the scores is high. There is no need in the new system for the teacher to throw out such a question, as its weight will automatically be reduced, perhaps nearly to nothing. This gets at some of the essence of the second justification: if a teacher is concerned that a particular question may confuse students, they can simply include it in the test and let the software judge by the results.
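
These two scenarios can be checked numerically with Python's statistics module. The class sizes below are arbitrary; what matters is the fraction of correct answers. Note that, for 0/1 scores, the fifty-fifty true-or-false split yields the largest possible standard deviation:

```python
import statistics

# All-or-nothing scores: 1 for a correct answer, 0 otherwise.
tf_scores = [1, 0] * 15        # 30 students, half right on a true/false question
mc_scores = [1, 0, 0, 0] * 8   # 32 students, a fourth right on a 4-option question

sd_tf = statistics.pstdev(tf_scores)   # 0.5, the maximum for 0/1 scores
sd_mc = statistics.pstdev(mc_scores)   # about 0.433

# For comparison, a question most of the class answered correctly
# (27 of 30 right) has a noticeably smaller standard deviation (0.3),
# so under inverse-proportional weighting both chance-level questions
# above end up down-weighted relative to it.
sd_good = statistics.pstdev([1] * 27 + [0] * 3)
```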

As for the third justification, imagine an essay question on which nearly the entire class scored between eighty and ninety percent credit; in other words, the standard deviation of the students' scores was low. The question should then be worth more, so that the students who studied harder receive a larger reward for the extra studying that produced their extra five to ten percent.
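
Numerically, with weights inversely proportional to the standard deviation as the second principle prescribes, the effect looks like this (the specific scores are hypothetical):

```python
import statistics

# Hypothetical essay scores clustered between 80% and 90% credit.
essay = [0.82, 0.90, 0.85, 0.88, 0.84, 0.86, 0.83]
# A widely spread question, for comparison.
spread = [0.30, 0.95, 0.60, 0.10, 0.85, 0.45, 0.75]

sd_essay = statistics.pstdev(essay)    # small: the class bunched together
sd_spread = statistics.pstdev(spread)  # large

# Inverse-proportional weights: the tightly bunched essay question counts
# for more, so the hard-won extra five to ten percent matters more.
weight_essay = 1.0 / sd_essay
weight_spread = 1.0 / sd_spread
```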

These ideas are ready to be implemented in software and adopted, and I think it is quite a shame that, as far as I know, they have not been. There are, of course, more avenues to explore here, such as considering whether the distribution of scores on a question is roughly normal or skews toward the higher or lower range, and how that information could be used to tweak the test itself, not merely the students' grades. Furthermore, the potential effects on student collaboration and cheating would need careful consideration. Finally, I have discussed only the principles themselves, not the best mathematical formulas and statistical techniques through which to implement them.