Figure 1. Customized Text Input Format and Model Topology with LSTM/CNN
The authors also tried replacing the LSTM with a CNN, adding a fifth component (ctx-a, the context of the argument) to the previous four-component input format. To be more specific, each four-component input is passed into a hidden unit, and that hidden unit is then added together with mr to form the next-layer hidden unit. In my opinion, they added the fifth component, ctx-a, to the input because a CNN cannot handle long-range dependencies as well as an LSTM.
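Below is a minimal sketch of how I read that combination step, assuming each component (and the region mark mr) has already been embedded as a vector; the class name, dimensions, and activation are my own placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ComponentCombiner(nn.Module):
    """Hypothetical sketch: project each input component, then add the mr projection."""
    def __init__(self, emb_dim=50, hidden_dim=100, n_components=4):
        super().__init__()
        # one hidden projection per component (e.g. word, predicate, ctx-p, ctx-a)
        self.proj = nn.ModuleList([nn.Linear(emb_dim, hidden_dim) for _ in range(n_components)])
        self.mr_proj = nn.Linear(emb_dim, hidden_dim)  # projection of the region mark mr

    def forward(self, components, mr):
        # components: list of n_components tensors of shape (batch, emb_dim)
        # mr: (batch, emb_dim) embedding of the predicate-region mark
        hidden = sum(p(c) for p, c in zip(self.proj, components))
        return torch.tanh(hidden + self.mr_proj(mr))   # next-layer hidden unit
```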
Paper: End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF
In this paper, the authors argue that existing sequence labeling systems, though they achieve good performance, still rely on hand-crafted information (e.g. word spelling, capitalization patterns) to augment their results. Especially when encountering rare words, pre-trained embeddings will probably not work well; in that case, carrying information about capitalization, prefixes, or suffixes is more helpful. The authors therefore combine character-level and word-level representations, which enables the model to be truly free from feature engineering, pre-trained embeddings, and task-specific resources (although Section 4.4 shows that NER with pre-trained word embeddings still performs better than with random embeddings).
Figure 2. Model Topology and Char Representation Layer Structure
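To make the char-level plus word-level idea concrete, here is a minimal sketch of a CNN-based character representation concatenated with a word embedding, roughly following Figure 2; the hyper-parameters (embedding sizes, number of filters, kernel width) are placeholders rather than the paper's values.

```python
import torch
import torch.nn as nn

class CharCNNWordRepr(nn.Module):
    def __init__(self, n_chars, n_words, char_dim=30, word_dim=100, n_filters=30, kernel=3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.char_cnn = nn.Conv1d(char_dim, n_filters, kernel_size=kernel, padding=1)
        self.word_emb = nn.Embedding(n_words, word_dim)

    def forward(self, char_ids, word_ids):
        # char_ids: (batch, max_word_len) character indices of one word per row
        # word_ids: (batch,) word indices
        c = self.char_emb(char_ids).transpose(1, 2)          # (batch, char_dim, max_word_len)
        c = torch.relu(self.char_cnn(c)).max(dim=2).values   # max pooling over characters
        w = self.word_emb(word_ids)                          # (batch, word_dim)
        return torch.cat([c, w], dim=-1)                     # char-level + word-level representation
```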
Paper: Neural Architectures for Named Entity Recognition
In this paper, the authors stress two points:
1) a named entity usually consists of multiple tokens, so reasoning jointly over the tagging decisions is important;
2) a name often shows orthographic evidence (what does the name look like?) and distributional evidence (where does it tend to appear in a corpus?); the orthographic sensitivity is captured with a character-based word representation model, and the distributional sensitivity with pre-trained distributional word embeddings.
Two model frameworks are tested: one is a bi-LSTM+CRF tagger, the other is a transition-based chunking model. Both models use the same input embedding strategy, which stays sensitive to both spelling and word order, and dropout training is added to prevent the model from depending too much on either representation. (These two kinds of sensitivity exist across many language systems, so such features make the model suitable for many different languages.)
Figure 3 shows the embedding topology: character-level information is gathered in both forward and backward order with a bi-LSTM. Details: 1) the character lookup table is initialized at random and contains an embedding for every character; 2) since an LSTM is biased toward its most recent inputs, the forward LSTM representation captures the suffix of the word while the backward LSTM representation captures the prefix; 3) previous models used a CNN for character embedding, but convnets are designed to find position-invariant features, which works well for image recognition; prefixes, suffixes, and stems in words seem to me far less position-invariant than edges or objects in an image, so I would vote for the LSTM over the CNN for character embedding.
The final step is to derive a word-level embedding from a lookup table and concatenate it with the character-level representation (a small sketch follows Figure 3).
Figure 3. Input Word Embedding
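A minimal sketch of this embedding layer, assuming PyTorch and placeholder dimensions: the final state of the forward character LSTM (suffix-oriented) and of the backward character LSTM (prefix-oriented) are concatenated with the word-level lookup embedding.

```python
import torch
import torch.nn as nn

class CharBiLSTMWordRepr(nn.Module):
    def __init__(self, n_chars, n_words, char_dim=25, char_hidden=25, word_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)  # randomly initialized lookup table
        self.char_lstm = nn.LSTM(char_dim, char_hidden, bidirectional=True, batch_first=True)
        self.word_emb = nn.Embedding(n_words, word_dim)

    def forward(self, char_ids, word_ids):
        # char_ids: (batch, max_word_len) characters of each word; word_ids: (batch,)
        _, (h_n, _) = self.char_lstm(self.char_emb(char_ids))
        # h_n: (2, batch, char_hidden) -> final states of the forward and backward char LSTMs
        char_repr = torch.cat([h_n[0], h_n[1]], dim=-1)
        return torch.cat([char_repr, self.word_emb(word_ids)], dim=-1)
```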
Now let's go back to the model architecture: LSTM+CRF is shown in the left part of Figure 4. First, pre-trained embeddings are applied; then each word is passed into both a forward LSTM layer and a backward LSTM layer, and the outputs of the two layers are concatenated (so both the left and right context of the word are captured); finally, the output is passed to a CRF layer. (The process is quite similar to the models introduced before.)
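Here is a rough PyTorch sketch of that tagger: a bi-LSTM turns the word representations into per-tag emission scores, and a learned transition matrix is used for Viterbi decoding. I only sketch decoding; training would maximize the CRF log-likelihood, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class BiLSTMCRFTagger(nn.Module):
    def __init__(self, input_dim, hidden_dim, n_tags):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.emissions = nn.Linear(2 * hidden_dim, n_tags)
        self.transitions = nn.Parameter(torch.randn(n_tags, n_tags))  # score of moving from tag i to tag j

    def forward(self, word_reprs):
        # word_reprs: (batch, seq_len, input_dim) concatenated char+word embeddings
        out, _ = self.bilstm(word_reprs)     # left and right context of each word
        return self.emissions(out)           # (batch, seq_len, n_tags) emission scores

    def viterbi_decode(self, emissions):
        # emissions: (seq_len, n_tags) for a single sentence
        seq_len, n_tags = emissions.shape
        score = emissions[0]                 # best score ending at each tag so far
        backptrs = []
        for t in range(1, seq_len):
            total = score.unsqueeze(1) + self.transitions + emissions[t].unsqueeze(0)
            score, idx = total.max(dim=0)    # best previous tag for each current tag
            backptrs.append(idx)
        best = [int(score.argmax())]
        for idx in reversed(backptrs):       # follow back-pointers to recover the tag sequence
            best.append(int(idx[best[-1]]))
        return list(reversed(best))
```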
About the transition-based model: a named entity tends to span multiple tokens, so the authors use a stack structure to gradually collect tokens, build a chunk, and pop the chunk, repeating until the whole sentence has been processed. A stack LSTM is suitable for this task because it supports push and pop operations while maintaining an embedding of its current contents.
Objects needed: two stacks (a working stack and an output stack), a buffer that contains the words still to be processed, and a transition inventory that contains the following transitions.
Operation explanation: the SHIFT transition moves a word from the buffer to the stack; the OUT transition moves a word from the buffer directly to the output stack; the REDUCE(y) transition pops all items from the top of the stack, forms them into a chunk labeled y, and pushes a representation of this labeled chunk onto the output stack. (The process is shown in the bottom part of Figure 4, with the sentence 'Mark Watney visited Mars' as the example; a toy walk-through follows Figure 4.)
Figure 4. LSTM+CRF VS Transition-based Chunking Model
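To make the bookkeeping concrete, here is a toy, non-neural walk-through of the SHIFT / OUT / REDUCE(y) transitions on the example sentence; in the real model the stack LSTMs score each action, whereas here the action sequence is simply hard-coded.

```python
def run_transitions(buffer, actions):
    buffer = list(buffer)          # words waiting to be processed
    stack, output = [], []         # working stack and output stack
    for action in actions:
        if action == "SHIFT":                      # buffer -> stack
            stack.append(buffer.pop(0))
        elif action == "OUT":                      # buffer -> output (word outside any entity)
            output.append(buffer.pop(0))
        elif action.startswith("REDUCE"):          # pop the whole stack into one labeled chunk
            label = action[len("REDUCE("):-1]
            output.append((label, tuple(stack)))
            stack.clear()
    return output

print(run_transitions(
    ["Mark", "Watney", "visited", "Mars"],
    ["SHIFT", "SHIFT", "REDUCE(PER)", "OUT", "SHIFT", "REDUCE(LOC)"],
))
# [('PER', ('Mark', 'Watney')), 'visited', ('LOC', ('Mars',))]
```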
Summary
In summary, the trend in NER is to implement end-to-end, task- and language-independent models, with traditional ML models such as CRFs or HMMs usually kept as the final tagging layer. Sequential neural networks are used to capture context information in sentences, thanks to their ability to model long-range dependencies. The key to making one model fit multiple tasks or languages is to find a proper embedding format (at both the character and word level) that captures prefix/suffix and word-order information, which helps the deep layers generate better word representations. However, if the task requires strong domain knowledge, e.g. EHR/EMR, a dictionary of the relevant professional terms is still needed.
To be continued…
Reference
[1] Jie Zhou and Wei Xu. End-to-end Learning of Semantic Role Labeling Using Recurrent Neural Networks. ACL-IJCNLP 2015, pp. 1127-1137.
[2] Xuezhe Ma and Eduard Hovy. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. ACL 2016.
[3] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. Neural Architectures for Named Entity Recognition. NAACL 2016.