ROC-AUC
Basic Concepts
Consider a two-class prediction problem (binary classification), in which the outcomes are labeled either as positive or negative. There are four possible outcomes from a binary classifier:
- True Positive (TP): the prediction is positive and the actual value is positive.
- True Negative (TN): the prediction is negative and the actual value is negative.
- False Positive (FP): the prediction is positive and the actual value is negative.
- False Negative (FN): the prediction is negative and the actual value is positive.
Precision measures the accuracy of positive predictions. It is the ratio of true positive predictions to the total positive predictions (true positives + false positives). High precision indicates that fewer false positives are present.
Recall (also known as sensitivity or true positive rate) measures the ability of a model to identify all relevant instances. It is the ratio of true positive predictions to the actual number of positive cases (true positives + false negatives). High recall means the model successfully captured most of the actual positives.
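F1 score is the harmonic mean of precision and recall, so it is high only when both are high:
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$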
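The three metrics above can be computed directly from the confusion-matrix counts. The following is a minimal sketch; the function name `precision_recall_f1` and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration, not part of the original text.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion-matrix counts.

    Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    """
    # Precision: share of positive predictions that are actually positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were predicted positive.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For instance, with 8 true positives, 2 false positives and 4 false negatives, precision is 8/10 and recall is 8/12.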
Calculation and Interpretation
A probabilistic interpretation
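The ROC curve can be parameterized by the true positive rate $\mathrm{TPR}$ and the false positive rate $\mathrm{FPR}$.
Given a threshold parameter $T$, the instance is identified as positive if its score $X > T$ and negative otherwise. $X$ follows a probability density function $f_1(x)$ if the instance actually belongs to the positive class, and $f_0(x)$ otherwise. Therefore, the true positive rate is given by $\mathrm{TPR}(T) = \int_T^{\infty} f_1(x)\,dx$ and the false positive rate is given by $\mathrm{FPR}(T) = \int_T^{\infty} f_0(x)\,dx$.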
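On a finite sample, the two rates become simple fractions of the positive and negative instances whose score exceeds the threshold. The sketch below illustrates this; the function name `tpr_fpr`, the 0/1 label encoding and the strict `score > threshold` rule are assumptions chosen for the example.

```python
def tpr_fpr(labels, scores, threshold):
    """Empirical (TPR, FPR) when predicting positive iff score > threshold.

    Assumption: labels are encoded as 1 (positive) and 0 (negative).
    """
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s > threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    # Fraction of actual positives scored above the threshold.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    # Fraction of actual negatives scored above the threshold.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr
```

Evaluating this function over a range of thresholds and plotting the resulting (FPR, TPR) pairs traces out an empirical ROC curve.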
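Note that the area under the ROC curve equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one:
$$\mathrm{AUC} = \int_0^1 \mathrm{TPR}\, d(\mathrm{FPR}) = P(X_1 > X_0),$$
where $X_1$ and $X_0$ denote the scores of a positive and a negative instance, respectively.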
Read more on the ROC Wikipedia page and in a CSDN blog post.
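Suppose the numbers of positive and negative samples in a batch are $M$ and $N$, respectively. Then
$$\mathrm{AUC} = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \mathbb{I}\left(s_i^{+}, s_j^{-}\right),$$
where $s_i^{+}$ and $s_j^{-}$ are the prediction scores of the $i$-th positive and the $j$-th negative sample, and $\mathbb{I}$ is a characteristic function:
$$\mathbb{I}(a, b) = \begin{cases} 1, & a > b \\ 0.5, & a = b \\ 0, & a < b \end{cases}$$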
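The pairwise definition can be sketched as a double loop over every (positive, negative) pair. This is an illustrative implementation only; the name `auc_pairwise`, the 0/1 label encoding and the choice to count a tied pair as 0.5 are assumptions.

```python
def auc_pairwise(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Assumptions: labels are 1 (positive) or 0 (negative); a tied pair
    contributes 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    total = 0.0
    for p in pos:          # every positive sample ...
        for n in neg:      # ... is compared with every negative sample
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))
```

The nested loop visits each positive-negative pair exactly once, which is why the cost grows with the product of the two class sizes.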
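The time complexity is $O(MN)$. Denote by $r_i$ the rank of the $i$-th sample when all $M+N$ samples are sorted by prediction score in ascending order, i.e., $r_i = 1$ for the sample with the smallest prediction score, and $r_i = M+N$ for the sample with the largest prediction score.
For each positive sample, its rank $r_i$ equals the number of samples scored at or below it, itself included. It therefore counts every pair in which this positive sample is picked over a negative sample (because of its higher score), plus its pairings with positive samples.
Therefore, one can first sum up the ranks of all positive samples, then subtract the cases where two positive samples are paired. Note that for the highest-ranked positive sample, there are $M-1$ positive samples below it, corresponding to a subtraction of $M$ once the sample itself is included. Repeating this for every positive sample gives a total subtraction of
$$M + (M-1) + \cdots + 1 = \frac{M(M+1)}{2}.$$
Finally we have
$$\mathrm{AUC} = \frac{\sum_{i \in \text{positive}} r_i - \frac{M(M+1)}{2}}{MN}.$$
The time complexity is $O((M+N)\log(M+N))$, dominated by the sort, which is noticeably better than $O(MN)$ when $M$ and $N$ are large.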
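The rank-based computation can be sketched with a single sort. The name `auc_by_rank` and the 0/1 label encoding are assumptions for illustration. Assigning tied scores the average of their ranks is also an assumption added here (the text above does not discuss ties); it keeps the result in line with counting a tied pair as one half.

```python
def auc_by_rank(labels, scores):
    """AUC from the sum of ascending ranks of the positive samples.

    Assumptions: labels are 1 (positive) or 0 (negative); tied scores
    share the average of the ranks they occupy.
    """
    m = sum(1 for y in labels if y == 1)   # number of positives
    n = len(labels) - m                    # number of negatives
    if m == 0 or n == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")

    # Sample indices sorted by ascending score; the sort dominates the cost.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        # Extend [start, end] over a group of equal scores.
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        avg_rank = (start + end) / 2 + 1   # 1-based average rank of the group
        for k in range(start, end + 1):
            ranks[order[k]] = avg_rank
        start = end + 1

    pos_rank_sum = sum(r for y, r in zip(labels, ranks) if y == 1)
    # Remove the contribution of positive-positive pairings: m(m+1)/2.
    return (pos_rank_sum - m * (m + 1) / 2) / (m * n)
```

After sorting, the remaining work is a single linear pass, so the function avoids enumerating positive-negative pairs altogether.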
Now we calculate AUC with an example:
- Calculate AUC with the original formula:
- Calculate AUC with the quicker formula:
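As an illustration only, consider a hypothetical batch (not the data of the original example) with two negative samples scored 0.1 and 0.4 and two positive samples scored 0.35 and 0.8, so both classes have size 2. The pairwise approach scores each positive-negative pair as 1 when the positive sample has the higher score and 0 when it has the lower one; the rank approach sums the ascending ranks of the positive samples and subtracts the positive-positive pairings. Under these assumptions both give the same value:

```latex
% Pairwise count: compare each positive score with each negative score
% (0.35 > 0.1), (0.35 < 0.4), (0.8 > 0.1), (0.8 > 0.4)
\mathrm{AUC} = \frac{1 + 0 + 1 + 1}{2 \cdot 2} = \frac{3}{4} = 0.75

% Rank sum: ascending ranks are 0.1 -> 1, 0.35 -> 2, 0.4 -> 3, 0.8 -> 4,
% so the positive samples hold ranks 2 and 4
\mathrm{AUC} = \frac{(2 + 4) - \frac{2(2+1)}{2}}{2 \cdot 2} = \frac{6 - 3}{4} = 0.75
```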
Footnotes
- Fawcett, Tom (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27, 861–874.