ROC-AUC

此内容尚不支持你的语言。

Basic Concepts

Consider a two-class prediction problem (binary classification), in which the outcomes are labeled either as positive or negative. There are four possible outcomes from a binary classifier:

True Positive(TP): the prediction is positive and the actual value is positive.
True Negative(TN): the prediction is negative and the actual value is negative.
False Positive(FP): the prediction is positive and the actual value is negative.
False Negative(FN): the prediction is negative and the actual value is positive.

Precision measures the accuracy of positive predictions. It is the ratio of true positive predictions to the total positive predictions (true positives + false positives). High precision indicates that fewer false positives are present.

\begin{equation*} \text{Precision} = \frac{TP}{TP+FP} \end{equation*}

Recall (also known as sensitivity or true positive rate) measures the ability of a model to identify all relevant instances. It is the ratio of true positive predictions to the actual number of positive cases (true positives + false negatives). High recall means the model successfully captured most of the actual positives.

\begin{equation*} \text{Recall} = \frac{TP}{TP+FN} \end{equation*}

F1 score is

F1 = \frac{2}{1/\text{Precision}+1/\text{Recall}}

Calculation and Interpretation

some probablistic interpretation

The ROC curve $\{(x,y)\}$ can be parameterized by $y=TPR(T)$ and $x=FPR(T)$ .

Given a threshold parameter $T$ , the instance is identified as positive if $X>T$ and negative otherwise. $X$ follows a probability density function $f_{1}(x)$ if the instance actually belongs to the class positive, and $f_{0}(x)$ otherwise. Therefore, the true positive rate is given by $TPR(T)=\int_{T}^{\infty}f_{1}(x)dx$ and the false positive rate is given by $\text{FPR}(T)=\int_{T}^{\infty}f_{0}(x)dx$ .

Note that

FPR^\prime(T)=\frac{d}{dT}\int_{T}^{\infty}f_{0}(x)fx=-f_{0}(T)

\begin{aligned} \text{AUC} & = \int_{x=0}^{1}ydx \\ & = \int_{x=0}^{1}TPR(T)dx \\ & = \int_{\infty}^{-\infty}TPR(T)FPR^\prime(T)dT \\ & = \int_{-\infty}^{\infty}TPR(T)f_{0}(T)dT \\ & = \int_{-\infty}^{\infty}\int_{T}^{\infty}f_{1}(x)dxf_{0}(T)dT \\ & = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\mathbf{1}\{x\geq T\}f_{1}(x)dx f_{0}(T)dT \\ & = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\mathbf{1}\{T^\prime\geq T\}f_{1}(T^\prime)f_{0}(T)dT^\prime dT \\ & = P(X_{1}\geq X_{0}) \end{aligned}

Read more on ROC’s Wiki page and some CSDN blog post

Suppose the numbers of positive and negative samples in a batch are $M$ and $N$ , respectively.

\text{AUC} = \frac{\sum\mathbf{I}(P_{pos}, P_{neg})}{M\cdot N}

where $\mathbf{I}$ is a characteristic function:

\mathbf{I}(p, q)=\left\{ \begin{aligned} 1, &\quad p > q \\ 0.5, &\quad p = q \\ 0, &\quad p < q \end{aligned} \right.

The time complexity is $O(MN)$ . Denote by $r_{i}$ the rank score of $i$ -th sample, i.e., for $x_{j}$ with the smallest prediction score, $r_{j}=0$ and for $x_{k}$ with the largest prediction score, $r_{k}=M+N-1$ .

For each positive sample, its rank is precisely the same amount of contribution towards the count of pairs where positive sample is picked over negative sample (because of higher score).

Therefore, one can first sum up all ranks from positive samples, then subtract those cases where two positive samples are paired. Note that for the highest ranked positive sample, there are $M-1$ positive samples below it, corresponding to a subtraction of $M-1$ . A simple deduction leads to the final subtraction is

(M-1)+\cdots+1+0=\frac{(M-1)M}{2}.

Finally we have

\text{AUC} = \frac{\sum_{i\in\text{Positive}}r_{i}-\frac{(M-1)M}{2}}{M\cdot N}

The time complexity is $O((M+N)\log(M+N))$ , which is noticeably better than $O(MN)$ when $M$ and $N$ are large.

Now we calculate with an example:

\begin{array}{|c|c|c|} \hline \text{real label} & \text{pred score} & \text{rank} \\ \hline 1 & 0.9 & \mathbf{5} \\ 1 & 0.7 & \mathbf{4} \\ 0 & 0.6 & 3 \\ 1 & 0.55 & \mathbf{2} \\ 0 & 0.2 & 1 \\ 0 & 0.1 & 0 \\ \hline \end{array}

Calculate AUC with the original formula:

\text{AUC} = \frac{\mathbf{1}_{0.9}\cdot 3+\mathbf{1}_{0.7}\cdot 3+\mathbf{1}_{0.55}\cdot 2}{3\cdot 3}=\frac{8}{9}.

Calculate AUC with the quicker formula:

\text{AUC} = \frac{5+4+2}{3\cdot 3}-\frac{3-1}{2\cdot 3}=\frac{8}{9}.

Fawcett, Tom (2006); An introduction to ROC analysis, Pattern Recognition Letters, 27, 861–874. ↩

ROC-AUC

Basic Concepts

Calculation and Interpretation

Footnotes