An RL problem can be defined as a Markov Decision Process (MDP), because each state satisfies the Markov property: given the current state, future states are independent of all previous states and actions. The environment can be fully observable or partially observable. When the agent only partially observes the environment, i.e., the observation it receives depends on the current state and the previous action, the problem becomes a partially observable Markov Decision Process (POMDP).
Note: a POMDP is still an MDP in the sense that it can be converted into one (for example, an MDP over belief states).
An MDP can be represented as a 5-tuple (S, A, P, R, γ): S is the set of states, A is the set of actions, P is the transition probability function, R is the reward function, and γ is a discount factor ranging from 0 to 1.
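For concreteness, here is a minimal sketch of such a 5-tuple for a made-up two-state problem; the state names, dynamics, and rewards below are purely illustrative, not from the post:

```python
# A tiny finite MDP written out as its five components (S, A, P, R, gamma).
# The two-state dynamics and rewards are invented for illustration only.

states = ["s0", "s1"]
actions = ["stay", "move"]

# P[(s, a)] maps each successor state s' to its transition probability P(s' | s, a).
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# R[(s, a)] is the expected immediate reward for taking action a in state s.
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "move"): 0.0,
}

gamma = 0.9  # discount factor in [0, 1)
```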
Why γ < 1: it prevents the infinite accumulation of rewards in a non-episodic MDP, where there is no terminal state (in an episodic MDP, the state is reset after each episode of a certain length). A γ close to 0 leads to short-sighted evaluation, while a γ close to 1 leads to far-sighted evaluation.
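To see why, suppose every reward is bounded in magnitude by some R_max; then the discounted sum of rewards is bounded by a geometric series:

$$\sum_{k=0}^{\infty} \gamma^k \,|R_{t+k+1}| \;\le\; \sum_{k=0}^{\infty} \gamma^k R_{\max} \;=\; \frac{R_{\max}}{1-\gamma} \;<\; \infty \qquad (0 \le \gamma < 1).$$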
To solve an MDP, there are two fundamental families of methods: value-based and policy-based.
The value function comes in two forms: the state-value function and the action-value function. The state-value function is the expected return when selecting actions with a given policy π: Vπ(s) = E[Gt | St = s]. There is, of course, an optimal one: V*(s) = maxπ Vπ(s). Here Gt denotes the total sum of discounted rewards from time t:
$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
The action-value function gives the expected return when being in state s and taking action a: qπ(s, a) = E[Gt | St = s, At = a]. The optimal q*(s, a) measures how good it is for the agent to take action a in state s, and the relationship between V* and q* is V*(s) = maxa q*(s, a).
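Written out, the two optimal functions determine each other: q* looks one reward and one transition ahead of V*, while V* simply takes the best action under q*:

$$q^*(s,a) = \mathbb{E}\big[R_{t+1} + \gamma\, V^*(S_{t+1}) \mid S_t = s,\, A_t = a\big], \qquad V^*(s) = \max_a q^*(s,a).$$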
To solve these equations, we can apply the Bellman equation, which rewrites the value function in terms of the payoff from one initial state and its successors (the immediate reward plus the discounted accumulated reward from successor states). Take the state-value function as an example:
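Expanding the expectation over the policy and the transition probabilities (writing R(s, a) for the expected immediate reward) gives the Bellman expectation equation:

$$V_\pi(s) = \mathbb{E}_\pi\big[R_{t+1} + \gamma\, V_\pi(S_{t+1}) \mid S_t = s\big] = \sum_{a} \pi(a \mid s)\Big[R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_\pi(s')\Big].$$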
Value iteration begins with an arbitrary value function and then iteratively updates the q and V values until they converge to the optimal ones.
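As a minimal sketch (my own illustration, not from the post), value iteration over the toy MDP dictionaries defined earlier could look like this; the function name and the `tol` parameter are invented for the example:

```python
def value_iteration(states, actions, P, R, gamma, tol=1e-8):
    """Repeat the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in states}  # start from an arbitrary (here all-zero) value function
    while True:
        delta = 0.0
        for s in states:
            # One-step look-ahead: q(s, a) = R(s, a) + gamma * sum_s' P(s'|s,a) * V(s')
            q = {
                a: R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in actions
            }
            best = max(q.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:  # values have converged
            break
    # Extract a greedy policy from the converged values.
    policy = {
        s: max(actions, key=lambda a: R[(s, a)]
               + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
        for s in states
    }
    return V, policy
```

With the toy MDP above, `V, policy = value_iteration(states, actions, P, R, gamma)` returns the converged values together with the greedy policy read off from them.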
Policy iteration searches for the optimal policy π* that maximizes the expected return. It alternates two steps. Policy evaluation: with the current policy fixed, calculate the values, iterating until they converge. Policy improvement: with the values fixed, find the best action in each state using a one-step look-ahead.
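A matching sketch of policy iteration over the same toy components (again, the function name and structure are my own illustration under the same assumptions):

```python
def policy_iteration(states, actions, P, R, gamma, tol=1e-8):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    policy = {s: actions[0] for s in states}  # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: iterate the Bellman expectation update for the fixed policy.
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                v = R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Policy improvement: one-step look-ahead with the values held fixed.
        stable = True
        for s in states:
            best = max(actions, key=lambda a: R[(s, a)]
                       + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:  # the policy no longer changes, so it is optimal
            return V, policy
```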
As for value iteration, it must evaluate every action in each iteration, which can be slow, and the policy often converges before the values do; every pass updates both the values (explicitly) and the policy (possibly implicitly). Policy iteration, in contrast, performs several passes that update the values under a fixed policy, and only then updates the policy.
To be continued…
For the past 4 months, I have been working on cardiovascular disease risk prediction. Through this work, I came up with an idea to use a GAN to learn in a progressive way and decided to write a paper on the topic (sorry, I can't go into the details of the idea). I then began doing background research and found three related topics. In this post, I will give summaries of these topics.
NLP algorithms are designed to learn from language, which is usually unstructured and of arbitrary length. Even worse, different language families follow different rules, and applying different sentence segmentation methods may cause ambiguity. It is therefore necessary to transform this information into an appropriate, computer-readable representation. To enable such a transformation, multiple tokenization and embedding strategies have been invented. This post gives a brief summary of these terms. (I assume readers already know some basic concepts, such as tokenization and n-grams; I will mainly talk about word embedding methods in this post.)
Recently, I have been working on NER projects. As a newcomer, I have spent a lot of time researching current NER methods and summarizing them. In this post, I will list my summaries of NER in deep learning; I hope this is helpful for readers who are interested and also new to this area. Before reading, I assume you already know some basic concepts (e.g., sequential neural networks, POS and IOB tagging, word embeddings, conditional random fields).
This post explains some basic statistical concepts used in disease morbidity risk prediction. As a CS student, I had a hard time figuring out these statistical concepts; I hope my summary is helpful for you.