Published:
For the past 4 months, I have been working on cardiovascular disease risk prediction. Through this work, I came up with an idea to utilize GANs to learn in a progressive way, and decided to write a paper on this topic (sorry, I can't talk about my idea in much detail). I then began doing background research and found three related topics. In this post, I will give summaries of these topics.
Published:
NLP algorithms are designed to learn from language, which is usually unstructured and of arbitrary length. Even worse, different language families follow different rules, and applying different sentence segmentation methods may cause ambiguity. So it is necessary to transform this information into an appropriate, computer-readable representation. To enable such transformation, multiple tokenization and embedding strategies have been invented. This post gives a brief summary of these terms. (I assume readers are already familiar with some basic concepts, like tokenization, n-grams, etc. I will mainly talk about word embedding methods in this blog.)
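As a minimal sketch of that transformation (whitespace tokenization plus a randomly initialized embedding table; the corpus and dimension below are made up, and real tokenizers also handle punctuation, subwords, and out-of-vocabulary words):

```python
import random

# Whitespace tokenization plus a vocabulary lookup; the corpus is made up.
corpus = ["the cat sat", "the dog sat down"]
vocab = {}
for sentence in corpus:
    for token in sentence.split():
        vocab.setdefault(token, len(vocab))

# Each token id indexes one row of an embedding table.
# Random initialization stands in for a table before any training.
random.seed(0)
dim = 4
embeddings = [[random.uniform(-1, 1) for _ in range(dim)] for _ in vocab]

def embed(sentence):
    # sentence -> list of dense vectors, one per token
    return [embeddings[vocab[t]] for t in sentence.split()]
```

Training (word2vec, GloVe, etc.) then adjusts those rows so that similar words get similar vectors.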
Published:
Recently, I have been working on NER projects. As a newcomer, I spent a lot of time researching current NER methods and compiled a summary. In this post, I will list my summaries (NER in DL); I hope this is helpful for readers who are interested and also new to this area. Before reading, I assume you already know some basic concepts (e.g. sequential neural networks, POS and IOB tagging, word embeddings, conditional random fields).
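As a quick refresher on IOB tagging, here is how entities are recovered from a tagged sequence (the sentence and tags below are made up for illustration):

```python
# IOB scheme: B- begins an entity, I- continues it, O = outside any entity
tagged = [("Barack", "B-PER"), ("Obama", "I-PER"), ("visited", "O"),
          ("New", "B-LOC"), ("York", "I-LOC"), (".", "O")]

def extract_entities(tagged):
    entities, current, etype = [], [], None
    for token, tag in tagged:
        if tag.startswith("B-"):
            if current:                      # flush the previous entity
                entities.append((" ".join(current), etype))
            current, etype = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)            # continue the current entity
        else:
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities
```

An NER model's job is to predict those tags; this decoding step is the same regardless of the model.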
Published:
This post explains some basic statistical concepts used in disease morbidity risk prediction. As a CS student, I had a hard time figuring out these statistical concepts; I hope my summary is helpful for you.
Published:
Finally, I’m writing something about neural networks. I will start with the branch I’m most familiar with: sequential neural networks. In this post, I won’t talk about forward/backward propagation, as there are plenty of excellent blogs and online courses on those. My motivation is to give a clear comparison between RNN, LSTM and GRU, because I find it very important to bear in mind the structural differences between these models, and the cause and effect between model structure and function. This knowledge helps me understand why a certain type of model works well on certain kinds of problems. Then, when working on real-world problems (e.g. time series, named entity recognition), I am more confident in choosing the most appropriate model.
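One structural difference you can check by hand is the per-layer parameter count: a vanilla RNN has one candidate transform, a GRU has three (reset, update, candidate), and an LSTM has four (input, forget, output gates plus the cell candidate). A rough sketch, assuming one weight matrix over [input; hidden] and one bias per transform (some library implementations use two bias vectors, so their exact counts differ):

```python
def recurrent_params(input_size, hidden_size, transforms):
    # each transform: weights on [input; hidden] plus a bias vector
    return transforms * (hidden_size * (input_size + hidden_size) + hidden_size)

x, h = 10, 20                        # arbitrary sizes for illustration
vanilla = recurrent_params(x, h, 1)  # one candidate transform
gru     = recurrent_params(x, h, 3)  # reset, update, candidate
lstm    = recurrent_params(x, h, 4)  # input, forget, output, cell candidate
```

So for the same sizes, a GRU layer costs 3x and an LSTM 4x the parameters of a vanilla RNN, which is part of the capacity/cost trade-off between them.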
Published:
Both bagging and boosting are designed to ensemble weak estimators into a stronger one. The difference is that bagging trains its estimators in parallel to decrease variance, while boosting learns from the mistakes made in previous rounds and tries to correct them in new rounds, i.e. in sequential order. GBDT belongs to the boosting family, with various siblings, e.g. AdaBoost, LightGBM, XGBoost, CatBoost. In this post, I will mainly explain the principles of GBDT, LightGBM, XGBoost and CatBoost, make comparisons, and elaborate on how to fine-tune these models.
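A minimal sketch of the sequential boosting idea for squared loss: each round fits a weak learner to the residuals (the negative gradient) of the ensemble so far. The data, learning rate and round count below are arbitrary, and real GBDT libraries use full regression trees rather than one-split stumps:

```python
def fit_stump(xs, residuals):
    # one-split regression stump: pick the threshold minimizing squared error
    best = None
    for t in sorted(set(xs)):
        left  = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=10, lr=0.5):
    f0 = sum(ys) / len(ys)               # F0: the global mean
    pred = [f0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]   # negative gradient
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: f0 + lr * sum(s(x) for s in stumps)

xs, ys = [1, 2, 3, 4], [1, 1, 3, 3]      # made-up 1D regression data
model = boost(xs, ys, rounds=20)
```

Bagging would instead train every stump independently on a bootstrap sample and average them; here each stump only makes sense given the ones before it.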
Published:
The tree is one of the most widely used models, with a large family (regression trees, classification trees, bagging: RF, boosting: GBDT) and many implementations (classification: [ID3, C4.5, CART], regression: CART). This series of posts will start with a brief introduction to the basic principles of DT, RF and GBDT, then go into details of GBDT and other boosting techniques (e.g. LightGBM, XGBoost, CatBoost), and dive deeper by making comparisons.
Published:
In human learning, we have teachers who have mastered the knowledge. Taught by teachers, we students can quickly get to the point of a question, filter out false leads, learn very fast, and achieve good marks on the test by ourselves. That’s why PI (privileged information) is introduced: to help ML models learn faster and more accurately on the training set, and work well independently on the test set. So, where the normal classification paradigm gives you a set of pairs (xi, yi), after introducing PI each pair becomes a triple (xi, yi*, yi).
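In code form, the only change is the shape of a training example (the feature values and teacher annotations below are entirely made up; the point is that yi* exists at training time only and is absent at test time):

```python
# Standard supervised pairs: (x_i, y_i)
standard = [([1.2, 0.7], 1),
            ([0.3, 2.1], 0)]

# With privileged information, each example is a triple (x_i, y*_i, y_i);
# y*_i is the teacher's extra annotation, usable only during training.
privileged = [([1.2, 0.7], "high-risk markers", 1),
              ([0.3, 2.1], "benign profile", 0)]
```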
Published:
Stacked generalization (stacking) is composed of two types of layers:
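As a toy sketch of that two-layer structure (the base learners and blend weights below are arbitrary stand-ins; in practice both layers are trained models, and the meta learner is fit on out-of-fold base predictions):

```python
# Level-0: the base learners (trivial functions stand in for trained models)
base_models = [lambda x: 2 * x, lambda x: x + 3]

def level1_features(x):
    # the base learners' predictions become the meta learner's input features
    return [m(x) for m in base_models]

# Level-1: the meta learner; a fixed linear blend stands in for a trained one
meta_weights = [0.6, 0.4]

def stacked_predict(x):
    return sum(w * f for w, f in zip(meta_weights, level1_features(x)))
```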
Published:
Recently, I have read some papers about Reinforcement Learning (RL). To me, it’s quite an interesting topic, but it’s also very complex, as it involves so many terms, definitions and methods. So I wrote this post as an introductory overview of RL.
Published:
Before introducing density clustering algorithms, I will first talk about the shortcomings of other clustering methods. Partitioning algorithms, such as k-means, require k to be declared as the first step of clustering. Moreover, they place restrictions on the shape of clusters (convex, roughly Gaussian-shaped clusters).
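Density-based clustering avoids both problems: the number of clusters falls out of the data, and clusters can take arbitrary shapes. A bare-bones sketch of DBSCAN (the eps/min_pts values and points below are made up, and real implementations use spatial indexes instead of this O(n²) neighbor scan):

```python
import math

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)          # None = unvisited, -1 = noise

    def neighbors(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1                 # noise (may later become a border point)
            continue
        labels[i] = cid                    # i is a core point: start a cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cid            # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_pts:         # j is itself core: keep expanding
                seeds.extend(jn)
        cid += 1
    return labels

# two dense groups plus one isolated point
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

Notice no k was supplied, and the isolated point is labeled noise rather than forced into a cluster.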
Published:
Clustering is the process of grouping similar objects together. It belongs to unsupervised learning, as the data is unlabeled. There is a set of different clustering methods, including partitioning methods (flat clustering), hierarchical clustering and density-based methods.
Published:
This post elaborates on the details of, and makes comparisons between, some pairs of concepts that are very similar and were confusing to me. I will also show you how to draw an ROC curve and compute AUC by hand (the most exciting part to me).
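The by-hand procedure can be written down directly: sort by score descending, step up for each positive and right for each negative, then integrate with the trapezoidal rule. A sketch (the labels and scores below are made up, and tied scores are ignored for simplicity):

```python
def roc_points(y, s):
    # sort indices by score, descending; assumes no tied scores
    order = sorted(range(len(y)), key=lambda i: -s[i])
    P, N = sum(y), len(y) - sum(y)
    tp = fp = 0
    pts = [(0.0, 0.0)]
    for i in order:
        if y[i] == 1:
            tp += 1                       # step up
        else:
            fp += 1                       # step right
        pts.append((fp / N, tp / P))      # (FPR, TPR)
    return pts

def auc(pts):
    # trapezoidal rule over the ROC polyline
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

y = [1, 0, 1, 0]                          # made-up labels
s = [0.9, 0.6, 0.4, 0.2]                  # made-up scores
area = auc(roc_points(y, s))              # 0.75 for this example
```

The 0.75 matches the pairwise view of AUC: of the four (positive, negative) pairs, three are ranked correctly.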
Published:
Similarities
Published:
The key to SVM is to find a hyperplane, built on some important instances (the support vectors), that separates the data instances correctly. Constructing the plane involves a seemingly contradictory process: the margin of the hyperplane is defined as the smallest distance between the decision boundary and the support vectors; at the same time, the decision boundary must be the one for which this margin is maximized. This is because there can be many hyperplanes (Fig. 1) that separate the data correctly; choosing the one that leads to the largest gap between the two classes makes the classifier more resistant to perturbations of the training data.
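The margin itself is just a point-to-hyperplane distance, |w·x + b| / ||w||, minimized over the training points; SVM training searches for the w, b that maximize that minimum. A sketch with a hypothetical hyperplane and made-up points:

```python
import math

w, b = [3.0, 4.0], -5.0                         # hypothetical hyperplane w·x + b = 0

def distance(x):
    # point-to-hyperplane distance: |w·x + b| / ||w||
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / math.hypot(*w)

points = [[3.0, 1.0], [0.0, 0.0], [2.0, 2.0]]   # made-up training points
margin = min(distance(p) for p in points)       # smallest distance = the margin
```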
Published:
Logistic regression uses the sigmoid function to estimate the probability of a sample belonging to a certain class, and obtains the unknown parameters using maximum likelihood estimation. It assumes the data is linearly separable, as linear regression does. For example, a 2D dataset can be separated by a linear decision boundary wX+b=0: if a point gives wx+b>0, it more likely belongs to class 1; otherwise, class 0.
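That rule falls out of the sigmoid directly, since sigmoid(z) > 0.5 exactly when z = w·x + b > 0. A sketch with hypothetical parameters (in practice w and b come from maximum likelihood estimation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b = [2.0, -1.0], 0.5                           # hypothetical learned parameters

def prob_class1(x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # the linear score w·x + b
    return sigmoid(z)

# sigmoid(z) > 0.5 exactly when z > 0, i.e. on the class-1 side of wX + b = 0
```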
Published:
When building a recommendation system, you mainly deal with problems such as: how to collect data (known ratings) for the utility matrix; how to estimate unknown ratings from the known ones; and how to evaluate the approach.
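For the second problem, one classic approach is a similarity-weighted average over the users who did rate the item (user-based collaborative filtering). A sketch on a tiny made-up utility matrix:

```python
# Made-up utility matrix: user -> {item: rating}; carol's m3 is unknown
ratings = {
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 4, "m2": 2, "m3": 5},
    "carol": {"m1": 5, "m2": 3},
}

def cosine(u, v):
    # cosine similarity over the items both users rated
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    du = sum(u[i] ** 2 for i in common) ** 0.5
    dv = sum(v[i] ** 2 for i in common) ** 0.5
    return num / (du * dv) if du and dv else 0.0

def predict(user, item):
    # similarity-weighted average of the ratings other users gave the item
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        sim = cosine(ratings[user], r)
        num += sim * r[item]
        den += sim
    return num / den if den else None
```

Since carol's known ratings closely track both alice's and bob's, her estimated m3 rating lands between their m3 ratings of 4 and 5.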
Published:
Rules of cellular automata:
Published:
In a metropolis like NYC, public transportation plays a vital role in the daily lives of both citizens and visitors. However, during the summer peak tourist season, a major issue has occurred frequently with subway cars over the years: so-called "hotcars", cars whose air conditioning has broken down, with breakdowns happening about 10 times a day during the summer. In total, there were nearly 6,500 "hotcars" from 2010 to 2014, and the majority of those incidents occurred during the hottest months. This problem creates a terrible experience for riders, so my friends and I created this interactive visualization in D3.js to summarize the past "hotcar" history.
Published:
Published in BMC Systems Biology, 2016
Background: plant microRNAs have been found to remain active after digestion. Goal: exploratory research on whether exogenous miRNAs derived from vegetables affect humans at the RNA-interaction level. More specifically: computational prediction of the influence of plant microRNAs on human RNA expression levels in different organs (stomach, kidney, liver, etc.). Future exploration: may aim at exploring the influence of GMO (genetically modified organism) food on humans. Role: three-year research project, working as the main contributor.
Download here
Published in AMIA 2019 Summit, 2018
Goal: infectious disease (flu, HFRS, mumps, etc.) morbidity rate prediction. More specifically: compared to traditional infectious disease prediction, which mainly focuses on historical morbidity incidence, our research uses multimodal deep learning, combining information from morbidity history, weather, air quality, and search-engine/social-network trends. The results (avg. MAPE ~12%) greatly outperform traditional ML methods (ARIMA, XGBoost, etc.). Role: three-month research project, working as the major contributor.