Say you are developing a process that returns a list of results, and you want to evaluate how well it is doing. The process is more complex than a database or dictionary lookup: you expect it cannot be perfect, but you have criteria for judging whether each result in the list is correct or not, and whether it returned all the intended results. Let’s call this general case *information retrieval*.^{1}

**Definition:** Information Retrieval – A process returning zero to many results. Each returned result can be evaluated as correct or incorrect. Likewise, results not returned can be evaluated as correctly or incorrectly rejected.

So information retrieval systems have two valid states in relation to any target in the potential result set, *correct selection* and *correct rejection*. There are also two ways the system can be in error, namely *incorrectly rejecting a result* and *incorrectly returning a result*.

**Definition:** Type I Error – false positive, or incorrectly returning a result.

**Definition:** Type II Error – false negative, or incorrectly rejecting a result.

| | correct | incorrect |
|---|---|---|
| returned | good | Type I error |
| rejected | Type II error | good |

The naive approach to evaluating an information retrieval system is a simple comparison of correct results to incorrect results, measuring either *accuracy* or *error*, usually as a percentage.

**Definition:** Accuracy – (correctly returned + correctly rejected) / (set of potential results)

**Definition:** Error – (false positives + false negatives) / (set of potential results)

Accuracy and error are complements of each other, and while they are good measures for many day-to-day problems, their usefulness falls apart when we evaluate information retrieval systems where the set of items that *should be rejected* is large.
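As a minimal sketch (in Python; the function names here are mine, not from any library), both measures fall out directly from the four counts, and they always sum to one:

```python
def accuracy(correctly_returned, correctly_rejected, false_pos, false_neg):
    """Fraction of the whole candidate set that was handled correctly."""
    total = correctly_returned + correctly_rejected + false_pos + false_neg
    return (correctly_returned + correctly_rejected) / total

def error(correctly_returned, correctly_rejected, false_pos, false_neg):
    """Fraction of the whole candidate set that was handled incorrectly."""
    total = correctly_returned + correctly_rejected + false_pos + false_neg
    return (false_pos + false_neg) / total

# 25 correct results returned, 99 correct rejections,
# 3 incorrectly returned, 100 incorrectly rejected:
print(round(accuracy(25, 99, 3, 100), 3))  # → 0.546
```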

We need to measure something else to evaluate systems like this, *precision and recall*.

**Definition:** Precision – Fraction (percent) of returned results in an information retrieval system that are correct.

(correctly returned) / (correctly returned + false positives)

**Definition:** Recall – Fraction (percent) of correct results in an information retrieval system that are returned.

(correctly returned) / (correctly returned + false negatives)
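Expressed as a Python sketch (the names are my own), the two measures share a numerator and differ only in which kind of error appears in the denominator:

```python
def precision(correctly_returned, false_positives):
    """Of the results the system returned, what fraction were correct?"""
    return correctly_returned / (correctly_returned + false_positives)

def recall(correctly_returned, false_negatives):
    """Of the results that were actually correct, what fraction came back?"""
    return correctly_returned / (correctly_returned + false_negatives)

print(round(precision(25, 3), 3))   # → 0.893
print(round(recall(25, 100), 3))    # → 0.2
```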

By itself, neither precision nor recall is a particularly good measure. Simply return nothing and precision will be “off the charts” (literally, since you will divide by zero!), but then recall will be zero. Alternatively, return everything and recall will be 100%, but precision suffers.

By calculating the harmonic mean of precision and recall we have a much better measure of information retrieval performance in one number.^{2}

**Definition:** F Score – A measure of system performance derived from precision and recall. The balanced case, giving equal weight to both parameters, is the harmonic mean

F_{1} = (2 * precision * recall) / (precision + recall)

You can vary the weighting of precision and recall by applying the formula with an additional parameter.

F_{a} = ((1 + *a*^{2}) * precision * recall) / ((*a*^{2} * precision) + recall)

Setting *a* > 1 weights recall higher, while *a* < 1 weights precision higher.
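The weighted formula can be sketched in a few lines of Python (the function name is my own); with the default *a* = 1 it reduces to the plain harmonic mean above:

```python
def f_score(precision, recall, a=1.0):
    """Weighted harmonic mean of precision and recall.

    a = 1 gives the balanced F1 score; a > 1 weights recall
    more heavily, a < 1 weights precision more heavily.
    """
    return ((1 + a**2) * precision * recall) / ((a**2 * precision) + recall)

p, r = 25 / 28, 25 / 125            # precision ≈ 89.3%, recall = 20.0%
print(round(f_score(p, r), 3))      # → 0.327
```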

There is one more measure, fallout, which measures the rate of false positives. It shows the probability of returning a result that should have been rejected, which may be an important measure, depending on your system.^{3}

**Definition:** Fallout – probability of a false positive (an incorrectly returned result) in an information retrieval system.

(false positives) / (false positives + correctly rejected)
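As a quick Python sketch (the function name is mine):

```python
def fallout(false_positives, correctly_rejected):
    """Probability that an item which should be rejected is returned."""
    return false_positives / (false_positives + correctly_rejected)

print(round(fallout(3, 99), 3))   # → 0.029
```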

Finally, let’s look at the numbers from some made-up information retrieval systems. In the first two sets, notice that as we scale up the proportion of correctly rejected results in the set of potential results, accuracy and error get better and better, but precision, recall, and F score remain constant (and the F score is consistently poor despite the improving accuracy). In the last set, notice the different ways we can achieve high accuracy with a poor F score.

| correctly returned | correctly rejected | false negatives | false positives | accuracy | error | precision | recall | F score | fallout |
|---|---|---|---|---|---|---|---|---|---|
| 25 | 99 | 100 | 3 | 54.6% | 45.4% | 89.3% | 20.0% | 32.7% | 2.9% |
| 25 | 990 | 100 | 3 | 90.8% | 9.2% | 89.3% | 20.0% | 32.7% | 0.3% |
| 25 | 9,900 | 100 | 3 | 99.0% | 1.0% | 89.3% | 20.0% | 32.7% | 0.0% |
| 25 | 99,000 | 100 | 3 | 99.9% | 0.1% | 89.3% | 20.0% | 32.7% | 0.0% |
| 100 | 99 | 3 | 25 | 87.7% | 12.3% | 80.0% | 97.1% | 87.7% | 20.2% |
| 100 | 990 | 3 | 25 | 97.5% | 2.5% | 80.0% | 97.1% | 87.7% | 2.5% |
| 100 | 9,900 | 3 | 25 | 99.7% | 0.3% | 80.0% | 97.1% | 87.7% | 0.3% |
| 100 | 99,000 | 3 | 25 | 100.0% | 0.0% | 80.0% | 97.1% | 87.7% | 0.0% |
| 34 | 99,850 | 115 | 1 | 99.9% | 0.1% | 97.1% | 22.8% | 37.0% | 0.0% |
| 100 | 99,700 | 100 | 100 | 99.8% | 0.2% | 50.0% | 50.0% | 50.0% | 0.1% |
| 75 | 99,700 | 75 | 150 | 99.8% | 0.2% | 33.3% | 50.0% | 40.0% | 0.2% |
| 125 | 99,625 | 245 | 5 | 99.8% | 0.3% | 96.2% | 33.8% | 50.0% | 0.0% |
| 195 | 99,525 | 5 | 275 | 99.7% | 0.3% | 41.5% | 97.5% | 58.2% | 0.3% |
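Every row above can be reproduced from just the four raw counts; here is a sketch (my own helper, not from any library) checked against the first row:

```python
def metrics(returned_ok, rejected_ok, false_neg, false_pos):
    """All six measures, as percentages, from the four confusion counts."""
    total = returned_ok + rejected_ok + false_neg + false_pos
    p = returned_ok / (returned_ok + false_pos)
    r = returned_ok / (returned_ok + false_neg)
    return {
        "accuracy": 100 * (returned_ok + rejected_ok) / total,
        "error": 100 * (false_neg + false_pos) / total,
        "precision": 100 * p,
        "recall": 100 * r,
        "F score": 100 * 2 * p * r / (p + r),
        "fallout": 100 * false_pos / (false_pos + rejected_ok),
    }

first_row = metrics(returned_ok=25, rejected_ok=99, false_neg=100, false_pos=3)
print({k: round(v, 1) for k, v in first_row.items()})
# → {'accuracy': 54.6, 'error': 45.4, 'precision': 89.3, 'recall': 20.0,
#    'F score': 32.7, 'fallout': 2.9}
```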

## Notes

[1] This is a restricted definition of *information retrieval*. As you can see in the Wikipedia article the term frequently applies when results have varying degrees of relevancy.

[2] Since perfect precision and recall together result in an F score of 1.0, I represent it as a percentage, which would probably make a real statistician cringe.

[3] For consistency’s sake also shown as a percentage.