Figure 1. Sigmoid Function & Decision Boundary (Image source: ML Cheatsheet)
Parameter Estimation: MLE
Assuming the N samples in the training set were generated independently, the likelihood of the parameters can be written as:

$$L(w, b) = \prod_{n=1}^{N} \hat{y}_n^{\,y_n} (1 - \hat{y}_n)^{1 - y_n}, \quad \text{where } \hat{y}_n = \sigma(w^T x_n + b).$$

Then we want to obtain the w, b which maximize the log likelihood:

$$\log L(w, b) = \sum_{n=1}^{N} \left[ y_n \log \hat{y}_n + (1 - y_n) \log (1 - \hat{y}_n) \right].$$

Correspondingly, we try to find the w, b that minimize the negative log-likelihood:

$$E(w, b) = -\sum_{n=1}^{N} \left[ y_n \log \hat{y}_n + (1 - y_n) \log (1 - \hat{y}_n) \right],$$

which is exactly the cross-entropy error function.
Loss Function & Cost Function
In linear regression, the squared error measures the loss between $y$ and $\hat{y}$ for each sample, and the sum of squared errors gives the cost function for the whole dataset. In logistic regression, cross entropy takes their place.
Loss function (for a single sample):

$$\ell(y, \hat{y}) = -\left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right]$$

Cost function (averaged over the whole dataset):

$$E_{in}(w, b) = \frac{1}{N} \sum_{n=1}^{N} \ell(y_n, \hat{y}_n)$$
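As a quick illustration, here is a minimal NumPy sketch of the loss and cost above (the function names and the `eps` clipping constant are my own additions for numerical stability):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(y, y_hat, eps=1e-12):
    """Per-sample cross-entropy loss; eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(w, b, X, y):
    """E_in: average cross-entropy loss over the whole dataset."""
    y_hat = sigmoid(X @ w + b)
    return np.mean(cross_entropy_loss(y, y_hat))
```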
In the training process, the goal is to minimize $E_{in}(w)$, which means finding the point where $\nabla E_{in}(w) = 0$, since $E_{in}(w)$ is a continuous convex function. But unlike linear regression, where $\nabla E_{in}(w)$ is linear in $w$ and the equation can be solved in closed form, setting $\nabla E_{in}(w) = 0$ for logistic regression has no closed-form solution. That is where gradient descent comes in.
Gradient Descent
To obtain the optimal w, b by gradient descent, we iteratively take steps in the direction of the negative gradient and update the weights until a stopping criterion is met (e.g. $\|w^{(k+1)} - w^{(k)}\| < \epsilon$ or $|J(w^{(k+1)}) - J(w^{(k)})| < \epsilon$). Because the negative log-likelihood of logistic regression is convex (equivalently, the conditional likelihood is concave), gradient descent finds the global minimum rather than a local one.
Repeat until convergence:

$$w^{(k+1)} = w^{(k)} - \alpha \nabla J(w^{(k)}),$$

where $\alpha$ is a positive constant learning rate that controls the size of each step.
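A minimal NumPy sketch of this update loop, using the cost defined earlier; the learning rate, iteration cap, and tolerance are placeholder values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, max_iter=1000, eps=1e-6):
    """Batch gradient descent on the average cross-entropy cost."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_iter):
        y_hat = sigmoid(X @ w + b)
        grad_w = X.T @ (y_hat - y) / n        # dE_in/dw
        grad_b = np.mean(y_hat - y)           # dE_in/db
        w_new, b_new = w - alpha * grad_w, b - alpha * grad_b
        converged = np.linalg.norm(w_new - w) < eps  # ||w(k+1) - w(k)|| < eps
        w, b = w_new, b_new
        if converged:
            break
    return w, b
```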
Regularization
Regularization is used to reduce overfitting by adding a penalty for large values of w, i.e. minimizing $E_{in}(w) + \lambda \Phi(w)$.
One way is to add the L2 (ridge) norm to the negative log-likelihood:

$$\Phi(w) = \|w\|_2^2 = w^T w$$

(in practice $\lambda$ is often written as $\lambda/2$ so the derivative is cleaner).
The other is to use the L1 (lasso) norm:

$$\Phi(w) = \|w\|_1 = \sum_j |w_j|$$
The L2-regularized loss function is smooth, so the optimum is a stationary point (zero-derivative point). L2 regularization can be regarded as weight decay: it shrinks the coefficients but does not drive them exactly to zero. L1 regularization, on the contrary, can drive coefficients to exactly zero, which gives sparse estimates.
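To see the difference in practice, here is a small scikit-learn comparison on synthetic data (the dataset, C value, and solver are arbitrary choices for illustration; scikit-learn's C is 1/λ):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Smaller C means stronger regularization (C = 1/lambda).
l2 = LogisticRegression(penalty="l2", C=0.1, solver="liblinear").fit(X, y)
l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

print("zero coefficients with L2:", np.sum(l2.coef_ == 0))  # typically 0
print("zero coefficients with L1:", np.sum(l1.coef_ == 0))  # typically > 0 (sparse)
```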
Derivative Calculation
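The derivation is not reproduced here, so below is a sketch of the standard calculation for the cross-entropy cost with a sigmoid output, using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. For one sample with $z = w^T x + b$ and $\hat{y} = \sigma(z)$:

$$\frac{\partial \ell}{\partial z} = \left( -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} \right) \hat{y}(1 - \hat{y}) = \hat{y} - y.$$

Averaging over the dataset gives

$$\frac{\partial E_{in}}{\partial w} = \frac{1}{N} \sum_{n=1}^{N} (\hat{y}_n - y_n)\, x_n, \qquad \frac{\partial E_{in}}{\partial b} = \frac{1}{N} \sum_{n=1}^{N} (\hat{y}_n - y_n),$$

which are the gradients used in the update rule of the Gradient Descent section.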
Model Interpretation
Before talking about how to interpret LR, let’s take a look at a few terms first.
Odds vs. Odds Ratio:
Odds = probability of the event / probability of the non-event. So if the odds are 3, the probability of the event happening is three times the probability that it won't happen.
Odds ratio = the ratio of two odds. The odds express how likely an event is relative to it not happening, while the odds ratio explains the influence of a single variable on the odds of the event happening or not.
Suppose

$$\log(\text{odds}) = \log\frac{p}{1 - p} = b + w_1 x_1 + \dots + w_d x_d.$$

With a 1-unit increase in $x_1$ (all other variables held fixed), the odds ratio is

$$\frac{\text{odds}(x_1 + 1)}{\text{odds}(x_1)} = e^{w_1},$$

meaning the odds are multiplied by a factor of $e^{w_1}$.
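A quick numeric check (the coefficient value 0.7 is just a hypothetical example):

```python
import numpy as np

w1 = 0.7                    # hypothetical coefficient for x1
odds_ratio = np.exp(w1)     # odds(x1 + 1) / odds(x1)
print(odds_ratio)           # ~2.01: one extra unit of x1 roughly doubles the odds
```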
Logit Function vs. Sigmoid Function:
The logit function is the log of the odds: $\text{logit}(p) = \log\frac{p}{1-p}$. Its output ranges from $-\infty$ to $+\infty$. The sigmoid function is its inverse, mapping any real value back into $(0, 1)$.
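A small sketch showing the two functions are inverses of each other:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))   # log of the odds

p = np.array([0.1, 0.5, 0.9])
print(logit(p))                    # maps (0, 1) to (-inf, +inf)
print(sigmoid(logit(p)))           # recovers [0.1, 0.5, 0.9]
```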
The dataset contains around 420,000 URL records (see Table 1). The goal is to identify whether a given URL is a bad one.
Table 1

url                     label
iamagameaddict.com      bad
slightlyoffcenter.net   bad
So the first step is to cut each URL into tokens. Since the format of a URL is fairly fixed (it contains a protocol, a host name with a primary domain, and so on), we can first split the URL on '.' and '/', and then remove common, uninformative parts such as 'http', 'com', and 'net'. Then we use logistic regression to do the prediction.
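A minimal sketch of such a tokenizer; the exact delimiters and the extra stop tokens ('https', 'www') are my own guesses at what the post describes:

```python
import re

COMMON_PARTS = {"http", "https", "www", "com", "net"}  # 'https'/'www' are my additions

def tokenize_url(url):
    """Cut a URL into tokens on '.' and '/', dropping common uninformative parts."""
    tokens = re.split(r"[./]", url.lower())
    return [t for t in tokens if t and t not in COMMON_PARTS]

print(tokenize_url("iamagameaddict.com"))      # ['iamagameaddict']
print(tokenize_url("slightlyoffcenter.net"))   # ['slightlyoffcenter']
```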
I used GridSearchCV to find the optimal parameters, which turned out to be the L2 penalty with C (equal to 1/λ) set to 100. The classification accuracy on the training set is around 97%, and on the test set it reaches about 95%.
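The exact feature pipeline is not given in the post; a plausible scikit-learn version, reusing `tokenize_url` from the sketch above with a TF-IDF vectorizer and a grid search over the penalty and C, might look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

def train_url_classifier(urls, labels):
    """Fit a tuned logistic-regression URL classifier on the ~420k-record dataset."""
    X_train, X_test, y_train, y_test = train_test_split(
        urls, labels, test_size=0.2, random_state=0)
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(tokenizer=tokenize_url)),  # tokenizer from the sketch above
        ("clf", LogisticRegression(solver="liblinear")),
    ])
    param_grid = {"clf__penalty": ["l1", "l2"], "clf__C": [0.1, 1, 10, 100]}
    search = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)
    print(search.best_params_)              # the post reports l2 with C = 100
    print(search.score(X_train, y_train))   # ~0.97 reported on the training set
    print(search.score(X_test, y_test))     # ~0.95 reported on the test set
    return search
```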