When Accuracy is the Wrong Evaluation Measure

Say you are developing a process returning a list of results, and you want to evaluate how well your process is doing. This is a process more complex than a database or dictionary lookup. You expect it cannot be perfect, but you have criteria for judging whether each result in the list is correct or not, and whether it returned all the intended results. Let’s call this general case information retrieval.¹

Definition: Information Retrieval – A process returning zero to many results. Each returned result can be evaluated as correct or incorrect. Likewise results not returned can be evaluated as correctly or incorrectly rejected.

So information retrieval systems have two valid states in relation to any target in the potential result set, correct selection and correct rejection. There are also two ways the system can be in error, namely incorrectly rejecting a result and incorrectly returning a result.

Definition: Type I Error – false negative or incorrectly rejecting a result.

Definition: Type II Error – false positive or incorrectly returning a result.

	correct	incorrect
returned:	good	Type II error
rejected:	Type I error	good

The naive approach to evaluating an information retrieval system is a simple comparison of correct results to incorrect results, measuring either accuracy or error, usually as a percentage.

Definition: Accuracy – (correctly returned + correctly rejected) / (set of potential results)

Definition: Error – (Type I and II errors) / (set of potential results)

Accuracy and error are complements of each other, and while they are good measures for many day-to-day problems, their usefulness falls apart when we evaluate information retrieval systems where the set of items that should be rejected is large.

We need to measure something else to evaluate systems like this, precision and recall.

Definition: Precision – Fraction (percent) of returned results in an information retrieval system that are correct.

(correctly returned) / (correctly returned + Type II errors)

Definition: Recall – Fraction (percent) of correct results in an information retrieval system that are returned.

(correctly returned) / (correctly returned + Type I errors)

By themselves neither precision or recall is a particularly good measure. Simply return nothing and precision will be “off the charts” (literally, since you will divide by zero!), but then recall will be zero. Alternately return everything and recall will be 100%, but precision suffers.

By calculating the harmonic mean of precision and recall we have a much better measure of information retrieval performance in one number.²

Definition: F Score – A measure of system accuracy derived from precision and recall. The most general case, giving equal weighting to both parameters, is the harmonic mean

F₁ = (2 * precision * recall) / (precision + recall)

You can vary the weighting of precision and recall by applying the formula with an additional parameter.

F_a = ((1 + a²) * precision * recall) / ((a² * precision) + recall)

Setting a > 1 weights recall higher, while a < 1 weights precision higher.

There is one more measure, fallout, which measures the rate of Type II errors. It shows the probability of a false positive, which may be an important measure, depending on your system.³

Definition: Fallout – probability of a Type II error (false positive) in an information retrieval system.

(Type II errors) / (Type II errors + correctly rejected)

Finally, lets look at the numbers from some made-up information retrieval systems. In the first two sets notice as we scale-up the proportion of of correctly rejected results in the set of potential results, accuracy and error get better and better, but precision, recall, and F score remain consistent (and the F score is consitently poor despite improving accuracy). In the last set notice the different ways we can achieve high accuracy with a poor F score.

correctly	correctly	Type I	Type II					F
returned	rejected	error	error	accuracy	error	precision	recall	score	fallout
25	99	100	3	54.6%	45.4%	89.3%	20.0%	32.7%	2.9%
25	990	100	3	90.8%	9.2%	89.3%	20.0%	32.7%	0.3%
25	9,900	100	3	99.0%	1.0%	89.3%	20.0%	32.7%	0.0%
25	99,000	100	3	99.9%	0.1%	89.3%	20.0%	32.7%	0.0%

100	99	3	25	87.7%	12.3%	80.0%	97.1%	87.7%	20.2%
100	990	3	25	97.5%	2.5%	80.0%	97.1%	87.7%	2.5%
100	9,900	3	25	99.7%	0.3%	80.0%	97.1%	87.7%	0.3%
100	99,000	3	25	100.0%	0.0%	80.0%	97.1%	87.7%	0.0%

34	99,850	115	1	99.9%	0.1%	97.1%	22.8%	37.0%	0.0%
100	99,700	100	100	99.8%	0.2%	50.0%	50.0%	50.0%	0.1%
75	99,700	75	150	99.8%	0.2%	33.3%	50.0%	40.0%	0.2%
125	99,625	245	5	99.8%	0.3%	96.2%	33.8%	50.0%	0.0%
195	99,525	5	275	99.7%	0.3%	41.5%	97.5%	58.2%	0.3%

Notes

[1] This is a restricted definition of information retrieval. As you can see in the Wikipedia article the term frequently applies when results have varying degrees of relevancy.

[2] Since perfect precision and recall together result in an F score of 1.0, I represent it as a percentage, which would probably make a real statistician cringe.

[3] For consistency’s sake also shown as a percentage.

craftyThoughts

Composability -> Expressiveness -> Correctness -> Testability -> Intention

When Accuracy is the Wrong Evaluation Measure

Notes

Leave a Reply Cancel reply