


Chapter 4

Project 3: An introduction to reinforcement learning I

4.1 Introduction

Reinforcement learning is a learning paradigm that states that one learns by interacting with the environment one is in, rather than by being provided explicit rules to follow. Specifically, reinforcement learning states that an agent (e.g. a human or a robot) learns which actions to take in its environment so as to maximize a specific reward signal. That is, the agent initially has no information about the set of desirable actions and only learns them gradually by interacting with its environment, which periodically issues rewards to the agent. Reinforcement learning is thus fundamentally different from supervised learning, where the agent would be explicitly taught which behaviour to adopt [6].

Thus, to cast a problem as a reinforcement learning problem, one needs to define an environment, an agent acting in this environment and a reward signal. Among the many problems that can be cast as a reinforcement learning problem is learning to drive a car. Here the environment is a network of roads, the agent is a human or a robot (such as a self-driving vehicle), the actions are accelerating forward and backward, stopping, and turning left and right, and the reward signal could be defined as the time that has passed without crashing the car. Each action taken by the agent brings it from one state of the environment to another. For driving a car, say, your initial state is the parking spot in front of a house (which belongs to the network of roads), and once you start to drive (i.e. accelerate and turn left or right) you find yourself in another state of the environment (another part of the network of roads). You maximize your reward simply by not crashing your car.
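To make the agent-environment-reward framing concrete, here is a minimal Python sketch of the generic interaction loop described above. The class and method names (Environment, Agent, reset, step, select_action) are illustrative assumptions made for this sketch, not an interface prescribed by the project.

```python
class Environment:
    """Abstract environment: holds the state and issues rewards."""

    def reset(self):
        """Return the initial state (e.g. the parking spot)."""
        raise NotImplementedError

    def step(self, action):
        """Apply an action; return (next_state, reward, done)."""
        raise NotImplementedError


class Agent:
    """Abstract agent: chooses actions based on the current state."""

    def select_action(self, state):
        raise NotImplementedError


def run_episode(env, agent):
    """Let the agent interact with the environment for one episode."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = agent.select_action(state)     # the agent acts ...
        state, reward, done = env.step(action)  # ... the environment responds
        total_reward += reward                   # ... and issues a reward
    return total_reward
```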

While the concept is astonishingly simple, reinforcement learning (and its combination with deep neural networks, so-called deep reinforcement learning) is responsible for many of the recent breakthroughs in artificial intelligence, such as self-driving cars and computers beating humans at the games of Chess and Go [5] as well as at several video games such as the Atari games, StarCraft II or Dota 2 [4]. What makes reinforcement learning superior to other learning approaches is that we do not teach the agent which particular rules to follow. In the case of (video) games, we do not explicitly tell the agent which behaviour we think is best. The agent learns only by maximizing its expected reward (e.g. how to win the game as fast as possible). This has led the various agents to discover surprising new strategies for solving problems, strategies that were previously unknown to humans. In fact, reinforcement learning is currently believed to be our best shot at realizing true artificial intelligence.

Remark 4. This entire course can be regarded as an example of reinforcement learning. You are told which results are expected for each project, and your task is to figure out how to obtain these results by trial and error. Doing so as well as possible then maximizes your cumulative reward (your final grade in this course).

4.2 Background

In this first part of the introduction to reinforcement learning, we consider the problem of a multi-armed bandit: we are presented with k slot machines, each of which produces a particular numerical reward. The problem is that we do not know ahead of time which slot machine (or bandit) pays the highest reward, i.e. which bandit we should be playing all the time. This setting simplifies the reinforcement learning problem in that there is only a single state of the environment, and we aim to learn which action, i.e. which bandit to play, will yield the highest payout in the long run.

Each of the k actions, i.e. each of the k bandits we can play, has an expected (mean) reward given that this action has been selected. We call this the value of that action. The action selected at time step t is denoted by A_t, and the corresponding reward obtained is denoted by R_t. The value q_*(a) of the action a is the expected reward given that we have selected action a at time step t,

q_*(a) = E(R_t | A_t = a).

The above equation reads: the value of the action a is the expected value of the reward R_t given that we select action a at step t.
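As a quick illustration (not part of the project text), one can think of each bandit as a random number generator whose mean is the unknown value q_*(a). In the sketch below the rewards are drawn from a Gaussian distribution; this choice of distribution, and the use of NumPy, are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 4                                             # number of bandits (actions)
q_star = rng.normal(loc=1.0, scale=1.0, size=k)   # hidden true values q_*(a)

def play(action):
    """Sample one reward R_t for the chosen bandit (Gaussian noise assumed)."""
    return rng.normal(loc=q_star[action], scale=1.0)

# The average of many rewards from one bandit approaches its true value q_*(a).
samples = [play(0) for _ in range(10_000)]
print(f"q_*(0) = {q_star[0]:.3f},  sample average = {np.mean(samples):.3f}")
```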

The goal of the reinforcement learning problem is to learn the true value q_*(a), i.e. to obtain as good an estimate Q_t(a) of the true value q_*(a) of the action a as possible, having played the bandits t − 1 times. The following example provides some clarification.

Example 1. Assume that there is only one bandit, i.e. k = 1. You play the bandit 9 times and obtain the following rewards:

R_1 = 0.79110577, R_2 = 1.58662319, R_3 = 1.83898341,

R_4 = 1.93110208, R_5 = 1.28558733, R_6 = 1.88514116,

R_7 = 0.24560206, R_8 = 2.25286816, R_9 = 1.51292982.

The mean reward after playing this bandit 9 times is then

Q_10 = (1/9) (R_1 + ··· + R_9) = (1/9) Σ_{i=1}^{9} R_i ≈ 1.4811.

Note that we call this value estimate Q_10 instead of Q_9, since this is the expected reward we will receive when playing the bandit a 10th time. Thus, after playing the bandit 9 times, we estimate that the value of the single action a = 1 (we only have one bandit to play, so a = 1 means playing bandit number 1) is

q_*(1) ≈ Q_10(1).

The more often we play the bandit, the more accurate this estimate will become.
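For completeness, the sample-average estimate from Example 1 can be reproduced with a few lines of Python (the use of NumPy here is just a convenience; a plain sum(...)/9 works equally well):

```python
import numpy as np

# The nine rewards listed in Example 1.
rewards = [0.79110577, 1.58662319, 1.83898341,
           1.93110208, 1.28558733, 1.88514116,
           0.24560206, 2.25286816, 1.51292982]

Q_10 = np.mean(rewards)         # sample-average estimate of q_*(1)
print(f"Q_10(1) = {Q_10:.4f}")  # prints Q_10(1) = 1.4811
```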

Example 2. If there is only one bandit, there is no learning problem since there is no choice to be made. Now consider that there are two bandits. You play the first bandit at t = 1 and obtain

R_1 = 2.12948391.

You then try the second bandit at t = 2 and obtain

R_2 = 1.36126959.

Disappointed with the lower payout of the second bandit, you return to the first bandit and play it exclusively from then on. Is this a smart choice?

The previous example illustrates what is known as the exploration-exploitation problem. Exploitation refers to using your current knowledge to choose the action that maximizes your reward over a single step. Exploration, in turn, refers to choosing an action that gives you a lower reward over a single step but may lead to a higher reward in the long run.
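One simple and widely used way to trade off exploration and exploitation is an epsilon-greedy rule: with a small probability epsilon the agent explores a random bandit, and otherwise it exploits the bandit with the highest current estimate. The sketch below is offered only as an illustration of this idea; the project may call for a different strategy, and the function names and toy reward means are my own assumptions.

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng):
    """Pick an action given the current value estimates Q (length-k array)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore: try a random bandit
    return int(np.argmax(Q))               # exploit: play the best-looking bandit

def update_estimate(Q, counts, action, reward):
    """Running form of the sample average used in Example 1."""
    counts[action] += 1
    Q[action] += (reward - Q[action]) / counts[action]

# Toy usage: two bandits paying Gaussian rewards with means 2.0 and 1.5
# (both the distribution and the means are made up for this illustration).
rng = np.random.default_rng(1)
true_means = [2.0, 1.5]
Q, counts = np.zeros(2), np.zeros(2, dtype=int)

for t in range(1000):
    a = epsilon_greedy(Q, epsilon=0.1, rng=rng)
    r = rng.normal(true_means[a], 1.0)
    update_estimate(Q, counts, a, r)

print("estimates:", Q)   # should approach [2.0, 1.5]
```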
