Competency 6.2: Learn about key diagnostic metrics and their uses.
Metrics for Classifiers
Accuracy:
The easiest measure of model goodness is accuracy. It is also called agreement when measuring inter-rater reliability.
Accuracy = # of agreements/ Total # of assessments
It is generally not considered a good metric across fields: when cases are unevenly distributed across categories, a high accuracy can be uninformative. E.g. a Kindergarten Failure Detector Model that, in the extreme case, always says Pass can still reach 92% accuracy.
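A minimal Python sketch of this point. The data are made up for illustration: 92 of 100 hypothetical students pass, and the "detector" always says Pass.

```python
def accuracy(actual, predicted):
    """Fraction of cases where the prediction agrees with the actual label."""
    agreements = sum(a == p for a, p in zip(actual, predicted))
    return agreements / len(actual)

# Hypothetical data (an assumption, not from the course): 92 pass, 8 fail.
actual = ["Pass"] * 92 + ["Fail"] * 8
# A degenerate detector that always says Pass.
predicted = ["Pass"] * 100

acc = accuracy(actual, predicted)
# Number of failing students the detector actually flags.
failures_caught = sum(a == "Fail" and p == "Fail" for a, p in zip(actual, predicted))
print(acc, failures_caught)
```

The accuracy looks high even though not a single failing student is detected.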
Kappa:
Kappa = (Agreement - Expected Agreement) / (1 - Expected Agreement)
If Kappa value
= 0, agreement is at chance
= 1, agreement is perfect
= -1, agreement is perfectly inverse
> 1, something is wrong
< 0, agreement is worse than chance
0<Kappa<1, no absolute standard. For data-mined models, 0.3-0.5 is considered good enough for publishing.
Kappa is scaled by the proportion of each category, so its value is influenced by the data set. We can compare Kappa values within the same data set, but not between two data sets.
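The Kappa formula can be sketched in Python as below. This assumes the usual Cohen's Kappa definition of expected agreement (the sum, over categories, of the product of each side's category proportions); the function name and the sample labels are hypothetical.

```python
def cohens_kappa(actual, predicted):
    """Kappa = (Agreement - Expected Agreement) / (1 - Expected Agreement).

    Assumes expected agreement is below 1, otherwise the ratio is undefined.
    """
    n = len(actual)
    agreement = sum(a == p for a, p in zip(actual, predicted)) / n
    categories = set(actual) | set(predicted)
    # Chance agreement: probability both sides pick the same category
    # if each picked independently with its own category proportions.
    expected = sum(
        (actual.count(c) / n) * (predicted.count(c) / n) for c in categories
    )
    return (agreement - expected) / (1 - expected)

# Hypothetical labels: 92 Pass, 8 Fail, and a model that always says Pass.
actual = ["Pass"] * 92 + ["Fail"] * 8
kappa_always_pass = cohens_kappa(actual, ["Pass"] * 100)

# Balanced toy labels: identical predictions vs. fully flipped predictions.
labels = [1, 0, 1, 0]
kappa_perfect = cohens_kappa(labels, [1, 0, 1, 0])
kappa_inverse = cohens_kappa(labels, [0, 1, 0, 1])
print(kappa_always_pass, kappa_perfect, kappa_inverse)
```

In this sketch the always-Pass model, despite 92% accuracy, gets a Kappa at chance level.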
ROC:
The Receiver Operating Characteristic Curve (ROC) is used when a model predicts something having two values (e.g. correct/incorrect, dropout/not dropout) and outputs a probability or other real value (e.g. Student will drop out with 73% probability).
Any number can be taken as the cut-off (threshold): some number of predictions (possibly 0) are then classified as 1s and the rest as 0s. At a given threshold, there are four possible outcomes for each prediction:
True Positive (TP) - Model and the Data say 1
False Positive (FP) - Data says 0, Model says 1
True Negative (TN) - Model and the Data say 0
False Negative (FN) - Data says 1, Model says 0
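The ROC Curve plots, for every threshold, the percent of False Positives (vs. True Negatives), i.e. FP / (FP+TN), on its X axis and the percent of True Positives (vs. False Negatives), i.e. TP / (TP+FN), on its Y axis. The diagonal is the chance line; the model is better than chance if its curve lies above the diagonal.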
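A small Python sketch of how the four counts and the ROC points could be computed. The labels and scores are invented for illustration, and treating a score at or above the threshold as a 1 is an assumption.

```python
def confusion_counts(actual, scores, threshold):
    """Return (TP, FP, TN, FN) when scores >= threshold are classified as 1."""
    tp = fp = tn = fn = 0
    for y, s in zip(actual, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def roc_points(actual, scores):
    """(false positive rate, true positive rate) at each distinct threshold.

    Assumes both 1s and 0s occur in the data, so neither rate divides by zero.
    """
    points = [(0.0, 0.0)]  # threshold above every score: nothing classified as 1
    for t in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion_counts(actual, scores, t)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

# Hypothetical data: actual 0/1 labels and the model's predicted probabilities.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

counts_at_half = confusion_counts(actual, scores, 0.5)
points = roc_points(actual, scores)
print(counts_at_half)
print(points)
```

Each point with a true positive rate above its false positive rate lies above the diagonal.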
A':
A' is the probability that if the model is given an example from each category, it will accurately identify which is which. It is a close relative of ROC and mathematically equivalent to the Wilcoxon statistic. It is useful because statistical tests can be computed for:
- whether two A' values are significantly different in the same or different data sets.
- whether an A' value is significantly different from chance.
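The definition of A' can be read directly as a pairwise comparison. The sketch below is one straightforward way to compute it; the data are hypothetical, and counting ties as half a win is an assumption.

```python
def a_prime(actual, scores):
    """Probability that a randomly chosen 1 gets a higher score than a randomly chosen 0.

    Ties count as half (an assumption of this sketch).
    """
    positives = [s for y, s in zip(actual, scores) if y == 1]
    negatives = [s for y, s in zip(actual, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and model confidences.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

a = a_prime(actual, scores)
print(a)
```

Here 8 of the 9 possible (1, 0) pairs are ordered correctly, so A' is 8/9. A value of 0.5 would mean the model does no better than chance.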
A' Vs Kappa:
Compared with Kappa, A' is more difficult to compute and works only for 2 categories. Its meaning is invariant across data sets, i.e. A'=0.6 is always better than A'=0.5. It is easy to interpret statistically, and its values are almost always higher than Kappa values. It also takes confidence into account.
Precision and Recall:
Precision is the probability that a data point classified as true is actually true.
Precision = TP / (TP+FP)
Recall is the probability that a data point that is actually true is classified as true.
Recall = TP / (TP+FN)
They don't take confidence into account.
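As a quick worked example in Python, using invented counts (30 true positives, 10 false positives, 20 false negatives):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP+FP); Recall = TP / (TP+FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical counts at one fixed threshold.
precision, recall = precision_recall(tp=30, fp=10, fn=20)
print(precision, recall)
```

Both values depend only on the counts at one threshold, which is why the model's confidence plays no role.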
Metrics for Regressors
Linear Correlation (Pearson correlation):
r(A,B) asks: when A's value changes, does B change in the same direction?
It assumes a linear relationship.
If correlation value is
1.0 : perfect
0.0 : none
-1.0 : perfectly negatively correlated
In between 0 and 1 : Depends on the field
0.3 is good enough in education since a lot of factors contribute to just about any dependent measure.
Data sets with very different shapes (e.g. a curve, or a line plus outliers) may also have the same correlation.
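A minimal Python sketch of the Pearson correlation, written out from the standard formula rather than taken from any particular library. The two toy data sets are assumptions chosen to show the linearity point: a perfect straight line, and a perfect but non-linear (quadratic) relationship.

```python
import math

def pearson_r(a, b):
    """Pearson correlation: covariance of a and b scaled by both spreads."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

x = [-2, -1, 0, 1, 2]
r_line = pearson_r(x, [2 * v + 1 for v in x])  # B rises steadily with A
r_curve = pearson_r(x, [v * v for v in x])     # B depends on A, but not linearly
print(r_line, r_curve)
```

The quadratic case gets a correlation of zero even though B is fully determined by A, because the measure only captures linear relationships.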
R square:
R square is correlation squared. It is the measure of what percentage of variance in the dependent measure is explained by a model. If predicting A with B, C, D, E, it is often used as the measure of model goodness rather than r.
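A one-line arithmetic check helps put correlation values in perspective. The value r = 0.3 is just an example input:

```python
r = 0.3                         # example correlation
r_squared = r ** 2              # R square is the correlation squared
percent_explained = 100 * r_squared
print(round(percent_explained, 1))
```

So a correlation of 0.3 corresponds to a model explaining about 9% of the variance in the dependent measure.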
MAE/MAD:
Mean Absolute Error/Deviation is the average of the absolute value of actual value minus predicted value, i.e. the average of each data point's absolute difference between actual and predicted value. It tells the average amount by which the predictions deviate from the actual value and is very interpretable.
RMSE:
Root Mean Square Error (RMSE) is the square root of the average of (actual value minus predicted value)^2. It can be interpreted similarly to MAD, but it penalizes large deviations more than small deviations. It is largely preferred to MAD. Low RMSE is good.
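The difference between the two measures shows up when the same total error is spread differently. A small Python sketch with made-up values:

```python
import math

def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error: square root of the average squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [1.0, 2.0, 3.0, 4.0]
evenly_off = [2.0, 3.0, 4.0, 5.0]  # every prediction is off by 1
one_big_miss = [1.0, 2.0, 3.0, 8.0]  # three exact predictions, one off by 4

print(mae(actual, evenly_off), rmse(actual, evenly_off))
print(mae(actual, one_big_miss), rmse(actual, one_big_miss))
```

Both sets of predictions have the same MAE (1.0), but the single large miss doubles the RMSE (2.0 vs. 1.0).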
| RMSE/MAD | Correlation | Model |
| --- | --- | --- |
| Low | High | Good |
| High | Low | Bad |
| High | High | Goes in the right direction, but systematically biased |
| Low | Low | Values are in the right range, but doesn’t capture relative change |
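The "High RMSE, High Correlation" case can be illustrated with a hypothetical model whose predictions are always exactly 10 too high. This is a sketch with invented numbers, and the helper functions are written out from the standard formulas.

```python
import math

def pearson_r(a, b):
    """Pearson correlation from the standard formula."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [1.0, 2.0, 3.0, 4.0, 5.0]
biased = [a + 10.0 for a in actual]  # right direction, constant offset of 10

r_biased = pearson_r(actual, biased)
rmse_biased = rmse(actual, biased)
print(r_biased, rmse_biased)
```

The correlation is perfect while the RMSE equals the offset, which is why the two kinds of metric are worth reporting together.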
Information Criteria:
BiC:
Bayesian Information Criterion (BiC) makes a trade-off between goodness of fit and flexibility of fit (number of parameters). The formula for linear regression:
BiC' = n log (1-r^2) + p log n 
where n - number of students, p - number of variables
If value > 0, the model is worse than expected, given the number of variables
If value < 0, the model is better than expected, given the number of variables
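The BiC' formula is easy to try out in Python. The sketch assumes that "log" means the natural logarithm, and the inputs (100 students, r^2 = 0.09) are made up for illustration.

```python
import math

def bic_prime(r_squared, n, p):
    """BiC' = n log(1 - r^2) + p log n, with natural logs (an assumption)."""
    return n * math.log(1 - r_squared) + p * math.log(n)

# Same fit (r^2 = 0.09) on 100 students, reached with 1 variable vs. 5 variables.
one_variable = bic_prime(0.09, n=100, p=1)
five_variables = bic_prime(0.09, n=100, p=5)
print(one_variable, five_variables)
```

With one variable the value is negative (better than expected), while the same fit obtained with five variables gives a positive value (worse than expected), showing how extra parameters are penalized.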
It can be used to understand the significance of the difference between models (e.g. a difference of 6 between two models' values implies a statistically significant difference).
AiC:
An Information Criterion/Akaike's Information Criterion (AiC) is an alternative to BiC. It makes a slightly different trade-off between goodness of fit and flexibility of fit.
Note: There is no single measure to choose between classifiers. We have to understand multiple dimensions and use multiple metrics.
 