Let’s take a two-class classification problem using linear models as the example.
y(x) = w^T φ(x) + b
w: the weight vector; φ(x): a fixed feature transformation of the input x; b: the bias. The training set consists of N input vectors x_1, …, x_N with corresponding target values t_1, …, t_N, where t_n ∈ {−1, 1}. The decision boundary can therefore be represented as w^T φ(x) + b = 0. One class satisfies y(x_n) > 0, corresponding to t_n = 1; the other satisfies y(x_n) < 0, corresponding to t_n = −1, so that correctly classified points satisfy t_n y(x_n) > 0. In this way, we define the functional margin of a point as γ = t_n (w^T φ(x_n) + b), and we are interested in the minimum functional margin over the whole training set.
However, if we rescale w and b (say w → κw, b → κb), the functional margin is rescaled as well, even though the decision boundary itself does not move. So we introduce the geometric margin, which is the perpendicular distance from a point to the hyperplane.
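In the notation above, the geometric margin of a point x_n can be written as

$$ \gamma_n = \frac{t_n \left( w^T \phi(x_n) + b \right)}{\lVert w \rVert} $$

Rescaling w → κw and b → κb multiplies the numerator and the denominator by the same factor κ, so the geometric margin, unlike the functional margin, is unchanged.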
For the training set, the margin is the geometric margin of the closest point, γ = min_n t_n (w^T φ(x_n) + b) / ‖w‖, and we want to choose w and b to maximize it.
Maximum Margin Classifier
So the problem becomes: arg max_{w,b} { (1/‖w‖) min_n [ t_n (w^T φ(x_n) + b) ] }.
As mentioned before, if we rescale w → κw and b → κb, the distance from x_n to the decision boundary is unchanged. We can use this freedom to set t_n (w^T φ(x_n) + b) = 1 for the point closest to the boundary, so that every point in the training set satisfies the constraint t_n (w^T φ(x_n) + b) ≥ 1, n = 1, …, N.
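Explicitly, if x_{n*} denotes the training point closest to the boundary, the rescaling factor can be chosen as

$$ \kappa = \frac{1}{t_{n*} \left( w^T \phi(x_{n*}) + b \right)} $$

so that after rescaling the closest point has functional margin exactly 1; this is sometimes called the canonical representation of the decision hyperplane.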
Then the problem can be changed into:
min_{w,b} (1/2) ‖w‖²
s.t. t_n (w^T φ(x_n) + b) ≥ 1, n = 1, …, N,
since maximizing 1/‖w‖ is equivalent to minimizing (1/2) ‖w‖².
To make this constrained optimization problem easier to solve, we use Lagrange duality (a_n denotes the Lagrange multiplier for the n-th constraint, with a_n ≥ 0). This procedure also lets us handle non-linear classification problems by introducing a kernel function k(x_n, x_m) = φ(x_n)^T φ(x_m).
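As a concrete illustration (my own example; this post does not fix a particular kernel), a commonly used non-linear kernel is the Gaussian kernel

$$ k(x, x') = \exp\left( -\frac{\lVert x - x' \rVert^2}{2\sigma^2} \right) $$

which corresponds to an implicit feature map φ that never has to be computed explicitly.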
The problem becomes: min over w, b of max over a (with a_n ≥ 0) of L(w, b, a), where L(w, b, a) = (1/2) ‖w‖² − Σ_n a_n [ t_n (w^T φ(x_n) + b) − 1 ].
The dual problem is: max over a (with a_n ≥ 0) of min over w, b of L(w, b, a).
Details of this transformation are given at the end of this post.
Then, setting the derivatives of L(w, b, a) with respect to w and b to zero, we get two conditions: w = Σ_n a_n t_n φ(x_n) and Σ_n a_n t_n = 0.
Substituting these two conditions back into the previous formula gives the dual representation, in which we maximize
Σ_n a_n − (1/2) Σ_n Σ_m a_n a_m t_n t_m k(x_n, x_m)
with respect to a,
s.t. a_n ≥ 0, n = 1, …, N, and Σ_n a_n t_n = 0.
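Once the optimal a_n are found, classifying a new point only requires kernel evaluations; in the notation above the prediction function becomes

$$ y(x) = \sum_{n=1}^{N} a_n t_n \, k(x, x_n) + b $$

where most a_n turn out to be zero and the remaining points are the support vectors.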
Soft Margin: we introduce slack variables ξ_n to allow misclassified data: ξ_n = 0 represents a point on the correct side of the margin (correctly classified); 0 < ξ_n ≤ 1 means the point lies inside the margin but on the correct side of the decision boundary; ξ_n > 1 means the point is on the wrong side of the decision boundary. Thus, we want to minimize:
C Σ_n ξ_n + (1/2) ‖w‖²
where C > 0 controls the trade-off between the slack-variable penalty and the margin. Then the Lagrangian is defined as:
L(w, b, a) = (1/2) ‖w‖² + C Σ_n ξ_n − Σ_n a_n [ t_n y(x_n) − 1 + ξ_n ] − Σ_n μ_n ξ_n,
where a_n ≥ 0 and μ_n ≥ 0 are Lagrange multipliers.
With KKT conditions:
a_n ≥ 0, t_n y(x_n) − 1 + ξ_n ≥ 0, a_n ( t_n y(x_n) − 1 + ξ_n ) = 0,
μ_n ≥ 0, ξ_n ≥ 0, μ_n ξ_n = 0.
Again setting the derivatives with respect to w, b, and ξ_n to zero, we get w = Σ_n a_n t_n φ(x_n), Σ_n a_n t_n = 0, and a_n = C − μ_n. Substituting these back, we obtain the new form of the Lagrangian, which is the same dual objective as before,
Σ_n a_n − (1/2) Σ_n Σ_m a_n a_m t_n t_m k(x_n, x_m),
maximized with respect to a,
s.t. 0 ≤ a_n ≤ C, n = 1, …, N, and Σ_n a_n t_n = 0.
Then we can obtain b by averaging over the points with 0 < a_n < C (call this set M, and let S be the set of support vectors):
b = (1/N_M) Σ_{n∈M} ( t_n − Σ_{m∈S} a_m t_m k(x_n, x_m) ).
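For intuition about the role of C, here is a minimal sketch using scikit-learn's SVC (the toy data, kernel choice, and C values are my own assumptions, not from the derivation above):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Hypothetical toy data: two overlapping 2-D blobs, labels in {-1, +1}.
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
t = np.hstack([np.ones(50), -np.ones(50)])

for C in [0.01, 1.0, 100.0]:
    clf = SVC(C=C, kernel="linear")  # soft-margin SVM with a linear kernel
    clf.fit(X, t)
    # Small C tolerates more slack (wider margin, more support vectors);
    # large C penalizes slack heavily (narrower margin, fewer support vectors).
    print(f"C={C}: {len(clf.support_)} support vectors")
```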
Inner product: in the previous formula, k(x_n, x_m) can be viewed as the inner product of two feature vectors. If φ(x_n) and φ(x_m) are parallel, their inner product is maximized (for vectors of fixed length); if they are perpendicular, their inner product is 0.
Let’s take a look back at the dual objective. If two feature vectors are perpendicular, the two points are treated as completely unrelated: their product is 0, so the pair does not influence the value of L. If they are parallel, the two feature vectors are very similar. When such a pair predicts the same class, t_n t_m = +1, so a_n a_m t_n t_m k(x_n, x_m) is positive and, because the double sum enters L with a minus sign, L decreases; when the pair gives opposite predictions, the term is negative, so L increases. The latter is the situation we care about most, since such pairs help tell the two classes apart.
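A tiny numerical illustration of this point (the vectors and multiplier values below are arbitrary choices of mine):

```python
import numpy as np

def pair_term(phi_n, phi_m, t_n, t_m, a_n=1.0, a_m=1.0):
    """Contribution of one (n, m) pair to the double sum in the dual objective,
    i.e. a_n * a_m * t_n * t_m * k(x_n, x_m) with a linear kernel."""
    return a_n * a_m * t_n * t_m * float(np.dot(phi_n, phi_m))

parallel      = np.array([1.0, 0.0])
perpendicular = np.array([0.0, 1.0])

print(pair_term(parallel, parallel,      +1, +1))  #  1.0 -> positive, decreases L (same class, similar features)
print(pair_term(parallel, parallel,      +1, -1))  # -1.0 -> negative, increases L (opposite classes, similar features)
print(pair_term(parallel, perpendicular, +1, -1))  #  0.0 -> no influence on L
```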
Lagrange Duality
To close this post, here is a brief summary of Lagrange duality. Assume the original (primal) optimization problem is:
min_w f(w)
s.t. g_i(w) ≤ 0, i = 1, …, k, and h_j(w) = 0, j = 1, …, l.
The corresponding Lagrangian is:
L(w, α, β) = f(w) + Σ_i α_i g_i(w) + Σ_j β_j h_j(w), with α_i ≥ 0.
The problem becomes: θ_P(w) = max over α, β (with α_i ≥ 0) of L(w, α, β),
with θ_P(w) = f(w) if w satisfies all the constraints, and θ_P(w) = ∞ otherwise. So our goal is to minimize θ_P(w), which is
p* = min_w θ_P(w) = min_w max_{α, β : α_i ≥ 0} L(w, α, β).
The dual problem is defined by swapping the order of min and max: θ_D(α, β) = min_w L(w, α, β), and
d* = max_{α, β : α_i ≥ 0} θ_D(α, β) = max_{α, β : α_i ≥ 0} min_w L(w, α, β).
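The key relationship between the two problems (stated here without proof) is weak duality:

$$ d^* = \max_{\alpha, \beta :\, \alpha_i \ge 0} \; \min_{w} \; L(w, \alpha, \beta) \;\le\; \min_{w} \; \max_{\alpha, \beta :\, \alpha_i \ge 0} \; L(w, \alpha, \beta) = p^* $$

Under suitable convexity and constraint-qualification conditions (satisfied by the SVM quadratic programs above when they are feasible), the two optima coincide and are characterized by the KKT conditions, which is what justifies working with the dual form earlier in the post.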
References
[1] Bishop, C. M. (2016). Pattern Recognition and Machine Learning. New York: Springer-Verlag.
[2] Murphy, K. P. (2013). Machine Learning: A Probabilistic Perspective. Cambridge, MA: MIT Press.
For the past 4 months, I have been working on cardiovascular disease risk prediction. Through this, I came up with an idea to use GANs to learn in a progressive way, and decided to write a paper on this topic (sorry, I can't go into the details of the idea here). I then began doing background research and found three related topics. In this post, I will give a summary of each of them.
NLP algorithms are designed to learn from language, which is usually unstructured and of arbitrary length. Even worse, different language families follow different rules, and applying different sentence-segmentation methods may cause ambiguity. So it is necessary to transform this information into an appropriate, computer-readable representation. To enable such a transformation, multiple tokenization and embedding strategies have been invented. This post gives a brief summary of these terms. (I assume readers already know some basic concepts, such as tokenization and n-grams; I will mainly talk about word embedding methods in this post.)
Recently, I have been working on NER projects. As a newcomer, I have spent a lot of time researching current NER methods and summarizing them. In this post, I will share my summaries (NER in deep learning); I hope this is helpful for readers who are interested in, and also new to, this area. Before reading, I assume you already know some basic concepts (e.g., sequential neural networks, POS and IOB tagging, word embeddings, conditional random fields).
This post explains some basic statistical concepts used in disease morbidity risk prediction. As a CS student, I had a hard time figuring out these statistical concepts; I hope my summary is helpful for you.