Then, let’s take a look at the mathematical definitions. Assume a relationship between a covariate X and a response Y, Y = f(X) + ϵ, where the noise ϵ follows a normal distribution ϵ ∼ N(0, σϵ²), and let f̂(X) denote the estimated model. In this case, the mean squared error at a point x can be decomposed as follows.
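Writing Err(x) for that expected squared error (a symbol introduced here for convenience), the standard decomposition consistent with the definitions above is:

```latex
% Expected squared error at x, split into squared bias, variance,
% and irreducible noise.
\mathrm{Err}(x) = \mathbb{E}\big[(Y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma_\epsilon^2}_{\text{Irreducible error}}
```

So the equation is composed of three parts: variance, squared bias, and the irreducible error, which is the noise term in the true relationship and cannot be reduced by any model.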
Tradeoff between variance and bias: as model complexity increases, bias drops and variance goes up. Too much complexity leads to overfitting, meaning your model performs very well on the training set but poorly on the test data. Keeping the model simple or introducing regularization limits complexity and keeps overfitting in check. To reduce variance specifically, you can resample your data, fit multiple models on the resampled datasets, and average their predictions (see the sketch after Figure 2).
Figure 2. Bias & Var Tradeoff
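A minimal sketch of that resampling-and-averaging idea (bagging) using scikit-learn. The dataset and model choices here are illustrative assumptions, not something from the original post:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

# Illustrative data: a noisy regression problem.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single deep tree: low bias, high variance -> tends to overfit.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# Bagging: fit many trees on bootstrap resamples and average their predictions,
# which mainly reduces the variance component of the error.
bag = BaggingRegressor(DecisionTreeRegressor(random_state=0),
                       n_estimators=50, random_state=0).fit(X_train, y_train)

print("single tree test MSE:", mean_squared_error(y_test, tree.predict(X_test)))
print("bagged trees test MSE:", mean_squared_error(y_test, bag.predict(X_test)))
```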
Cost Function & Loss Function
Both functions deal with training error. In Andrew Ng’s Neural Networks and Deep Learning course, he gives a very specific definition of the two: the loss function computes the error for a single training example, while the cost function is the average of the loss over the entire training set. The objective function we actually optimize combines empirical risk (the loss/cost term) with structural risk (the regularization term).
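A minimal sketch of the distinction, assuming squared-error loss and an L2 penalty (both are illustrative choices, not prescribed by the course):

```python
import numpy as np

def loss(y_hat, y):
    """Loss: error of a single training example (squared error here)."""
    return (y_hat - y) ** 2

def cost(y_hat, y):
    """Cost: average of the per-example losses over the whole training set."""
    return np.mean(loss(y_hat, y))

def objective(y_hat, y, w, lam=0.1):
    """Objective = empirical risk (cost) + structural risk (regularization)."""
    return cost(y_hat, y) + lam * np.sum(w ** 2)

# Tiny made-up example.
y     = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.8, 0.2, 0.6])
w     = np.array([0.5, -0.3])
print(cost(y_hat, y), objective(y_hat, y, w))
```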
L1 & L2
The L1-norm loss function minimizes the sum of absolute errors between the targets and predictions, while the L2-norm loss function replaces the absolute value with its square. When an error is greater than 1, squaring makes it larger than its absolute value, so the L2 loss is more sensitive to outliers than L1.
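A tiny numeric illustration of that sensitivity (the residual values are made up):

```python
import numpy as np

errors = np.array([0.5, 0.8, 6.0])   # the last residual is an outlier (> 1)

l1 = np.abs(errors).sum()            # sum of absolute errors
l2 = (errors ** 2).sum()             # sum of squared errors

# Squaring blows up the outlier's contribution (6 -> 36),
# which is why the L2 loss reacts much more strongly to outliers than L1.
print("L1 loss:", l1, " L2 loss:", l2)
```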
Regularization is used to prevent overfitting by adding a penalty term to the objective function. The L1 (lasso) penalty is the sum of the absolute values of the weights, and the L2 (ridge) penalty is the sum of their squares. L1 can shrink large coefficients exactly to 0, while L2 only pushes coefficients close to 0; hence L1 can also work as a feature selection process (see the sketch after Figure 3).
Figure 3. L1 & L2
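A quick scikit-learn sketch of that difference; the dataset and the alpha values are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Illustrative data where only a few of the 20 features are truly informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: sum of |w|
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: sum of w^2

# L1 drives many coefficients exactly to 0 (implicit feature selection);
# L2 only shrinks them toward 0.
print("zero coefficients (lasso):", np.sum(lasso.coef_ == 0))
print("zero coefficients (ridge):", np.sum(ridge.coef_ == 0))
```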
Classification Evaluation
Figure 4. Diagnostic Testing Matrix
Specificity & Sensitivity
Sensitivity (recall, TPR) measures the proportion of actual positives that are correctly identified. Specificity measures the proportion of actual negatives that are correctly identified, i.e., how good a test is at avoiding false alarms.
Precision & Recall
Precision tells the fraction of retrieved instances that are relevant (Precision = TP / predicted positives). Recall calculates the fraction of relevant instances which have been retrieved (Recall = TP / actual positives).
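The same four quantities computed from raw confusion-matrix counts (the counts below are made up for illustration):

```python
# Counts from a hypothetical confusion matrix.
TP, FP, FN, TN = 40, 10, 5, 45

sensitivity = TP / (TP + FN)   # recall / TPR: positives correctly identified
specificity = TN / (TN + FP)   # negatives correctly identified (avoiding false alarms)
precision   = TP / (TP + FP)   # fraction of predicted positives that are correct
recall      = sensitivity      # same quantity as sensitivity

print(sensitivity, specificity, precision, recall)
```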
Figure 5. Precision & Recall Curve
So when predicting earthquakes, we don’t mind raising too many alarms: because the goal is to minimize casualties, we want higher recall at the expense of precision. The situation is different when identifying criminals, where we try to keep precision high, which is understandable because we don’t want to cause trouble for innocent people. Of course, we would like both precision and recall to be high, but there is a tradeoff between the two, so the decision depends on the specific situation: do we need high precision, or do we need to retrieve more of the relevant instances? In the precision/recall curve, the closer the curve gets to the point (1, 1), the better the performance. Precision (P) and recall (R) can be combined into one measure (Fβ): as β → 0, more importance is given to P; as β grows, the focus shifts to R.
Fβ = (β² + 1)PR / (β²P + R)
When β = 1, we get F1, a single measure of the performance of the test: F1 = 2PR / (P + R).
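The same formulas as a tiny helper; the precision/recall values used in the call are arbitrary:

```python
def f_beta(p, r, beta=1.0):
    """F-beta as defined above: (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

# beta < 1 emphasizes precision, beta > 1 emphasizes recall; beta = 1 gives F1.
print(f_beta(0.8, 0.5, beta=0.5), f_beta(0.8, 0.5, beta=1.0), f_beta(0.8, 0.5, beta=2.0))
```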
Micro-averaging & Macro-averaging
For multi-class problems, let C1, …, Ck denote the k classes. For each class Ci: Pi = TPi / (TPi + FPi); Ri = TPi / (TPi + FNi)
Micro-averaging pools the counts over all classes first: Pμ = ∑i TPi / ∑i (TPi + FPi), Rμ = ∑i TPi / ∑i (TPi + FNi)
Macro-averaging averages the per-class scores: PM = (1/k) ∑i Pi; RM = (1/k) ∑i Ri
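A short numeric sketch of the two averaging schemes (the per-class counts are invented for illustration):

```python
import numpy as np

# Per-class confusion counts for k = 3 classes.
TP = np.array([30, 5, 10])
FP = np.array([10, 2, 20])
FN = np.array([ 5, 8,  4])

# Macro-averaging: compute P_i, R_i per class, then average them.
P_i = TP / (TP + FP)
R_i = TP / (TP + FN)
P_macro, R_macro = P_i.mean(), R_i.mean()

# Micro-averaging: pool the counts over all classes first, then compute P, R.
P_micro = TP.sum() / (TP.sum() + FP.sum())
R_micro = TP.sum() / (TP.sum() + FN.sum())

print(P_macro, R_macro, P_micro, R_micro)
```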
ROC & AUC
The ROC curve evaluates the performance of a binary classifier, with TPR on the y-axis and FPR on the x-axis. TPR = TP / (TP + FN) is the fraction of positive examples correctly classified; FPR = FP / (FP + TN) is the fraction of negative examples incorrectly classified.
The vertical line in the figure below represents a threshold θ, which is used when drawing the curve. If a prediction falls to the right of the line, it is classified as positive; otherwise negative. When drawing the ROC curve, we start with the largest θ (the point (0, 0), where all predictions are negative) and gradually decrease it until we reach the end point (1, 1), where all predictions are positive.
Figure 6. ROC Curve
Selecting the best classifier: first, let’s do a small formula transformation. Let N be the number of samples, NEG and POS the numbers of negatives and positives, and neg and pos their ratios. Then acc = (TP + TN) / N = tpr·pos + neg − neg·fpr, which we can rewrite as tpr = (neg/pos)·fpr + (acc − neg)/pos. This is a line in ROC space with slope neg/pos and intercept (acc − neg)/pos. Since the slope neg/pos is fixed, we can slide this line until it is tangent to the ROC curve; among the resulting intercepts we pick the highest one, which corresponds to the highest accuracy and gives the blue line in Fig. 7. To read off that accuracy, note that the blue line intersects the descending diagonal; projecting this intersection point onto the y-axis gives the accuracy score (a small code sketch follows Figure 7).
Figure 7. Best Classifier
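The same idea in code: for a fixed neg/pos ratio, the accuracy at each ROC point is acc = tpr·pos + (1 − fpr)·neg, so the best threshold is simply the ROC point that maximizes this value. The ROC points and class ratios below are made up for illustration:

```python
import numpy as np

# Hypothetical ROC points (fpr, tpr), e.g. one per threshold.
fpr = np.array([0.0, 0.1, 0.2, 0.4, 0.7, 1.0])
tpr = np.array([0.0, 0.5, 0.7, 0.85, 0.95, 1.0])

pos, neg = 0.3, 0.7            # class ratios, pos + neg = 1

# acc = tpr*pos + neg - neg*fpr, i.e. the rearranged line from the text.
acc = tpr * pos + (1.0 - fpr) * neg
best = np.argmax(acc)
print("best point:", (fpr[best], tpr[best]), "accuracy:", acc[best])
```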
AUC (Area Under the ROC Curve) also measures the performance of a classifier. AUC ∈ [0, 1], and typically AUC > 0.5 for a useful classifier; the larger the AUC, the better the classifier performs.
Attention: if you are still confused about how the ROC curve comes about, don’t worry. I will walk you through the ROC generation process step by step, and after reading this section you will be able to draw an ROC curve by hand.
Let’s take a binary classification problem as the example. Given the left two columns (predicted probability and true y), I start by setting the threshold (the term ‘cutoff’ shown in the figure) θ at 0.9: TP is 1, FP is 0, TP + FN = 5 (which will always be 5 in this example), and FP + TN = 5.
So TPR = 1/5 = 0.2 and FPR = 0/5 = 0, giving my first point (0, 0.2). Continuing to decrease the cutoff and recomputing TPR and FPR, I get (0, 0.4), (0.2, 0.4), … as the following points. After computing all 10 points, I can draw the whole curve (shown in the bottom section of the figure). See, it’s not hard. I strongly recommend that you draw this curve by hand at least once (a code version of the same sweep follows Figure 8).
Figure 8. Draw ROC by Hand
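The same by-hand procedure as a sketch in code. The ten (probability, label) pairs below are made-up stand-ins for the two columns in the figure, not the actual values:

```python
import numpy as np

# Hypothetical (predicted probability, true label) pairs: 5 positives, 5 negatives.
probs  = np.array([0.95, 0.85, 0.78, 0.66, 0.60, 0.55, 0.42, 0.38, 0.21, 0.10])
labels = np.array([   1,    1,    0,    1,    0,    1,    0,    1,    0,    0])

P, N = labels.sum(), (1 - labels).sum()   # P = TP + FN = 5, N = FP + TN = 5

points = []
for cutoff in sorted(probs, reverse=True):        # sweep the threshold downwards
    pred = (probs >= cutoff).astype(int)          # at/above the cutoff -> positive
    TP = ((pred == 1) & (labels == 1)).sum()
    FP = ((pred == 1) & (labels == 0)).sum()
    points.append((FP / N, TP / P))               # (FPR, TPR)

print(points)   # connect (0,0) -> these points -> (1,1) to get the ROC curve
```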
PR Curve
The PR curve plots precision against recall. Different from ROC, the closer the curve reaches toward the top-right corner, the better the model performance.
ROC & PR
ROC is insensitive to the class distribution: if the ratio of positives to negatives changes, the ROC curve barely changes, while the PR curve can change dramatically. So if the pos/neg ratio shifts over time, the ROC stays very stable, and in that sense it reflects the performance of a classifier more objectively.
But when there are far more negative than positive instances (#neg » #pos) and you care more about the positive instances (e.g., predicting the morbidity rate of heart disease), it is better to use the precision-recall curve than ROC to evaluate your classifier. This is because neither precision nor recall includes TN in its formula, so however large the negative class is, the evaluation stays focused on how well the positives are handled (a quick simulation sketch follows).
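A quick simulation sketch of that contrast; the score distributions and counts are invented for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

def fake_scores(n_pos, n_neg):
    """Scores drawn from two fixed distributions; only the class ratio changes."""
    pos = rng.normal(1.0, 1.0, n_pos)
    neg = rng.normal(0.0, 1.0, n_neg)
    y = np.r_[np.ones(n_pos), np.zeros(n_neg)]
    return y, np.r_[pos, neg]

for n_neg in (1_000, 100_000):                 # make negatives 100x more common
    y, s = fake_scores(1_000, n_neg)
    print(f"neg={n_neg:>7}  ROC-AUC={roc_auc_score(y, s):.3f}  "
          f"PR-AUC={average_precision_score(y, s):.3f}")
# ROC-AUC stays roughly the same, while PR-AUC drops as negatives flood in.
```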
Example: a highly skewed dataset:
Tips: in such a case, accuracy is not a good indicator, because the majority class dominates the accuracy score. Suppose we have 90 positives and 10 negatives. If the classifier predicts that all 100 records belong to the positive class, the accuracy is 90%. It looks like a good classifier, but it actually misses all the negative samples.
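The same 90/10 example checked with scikit-learn metrics (a minimal sketch):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 90 positives, 10 negatives, and a classifier that predicts everything as positive.
y_true = np.array([1] * 90 + [0] * 10)
y_pred = np.ones(100, dtype=int)

print("accuracy:", accuracy_score(y_true, y_pred))                        # 0.90, looks good
print("recall on negatives:", recall_score(y_true, y_pred, pos_label=0))  # 0.0, misses every negative
```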