TRANSCRIPT
Reinforcement Learning
LU 1 - Introduction
Dr. Joschka Boedecker
AG Maschinelles Lernen und Natürlichsprachliche Systeme
Albert-Ludwigs-Universität Freiburg
Acknowledgement: slides courtesy of Martin Riedmiller and Martin Lauer
Prof. Dr. M. Riedmiller, Dr. M. Lauer, Dr. J. Boedecker, Machine Learning Lab, University of Freiburg. Reinforcement Learning.
Organisational issues
Dr. Joschka Boedecker
Room 00010, building [email protected]
Office hours: Tuesday 2 - 3 pm
no script - slides available online:
http://ml.informatik.uni-freiburg.de/teaching/ws1516/rl
Dates winter term 2015/2016
Lecture (3+1):
Monday, 14:00 (c.t.) - 15:30, SR 02-017, building 052
Wednesday, 16:00 (s.t.) - 17:30, SR 02-017, building 052
Exercise sessions on Wednesday, 16:00 - 17:30, interleaved with the lecture
starting Oct. 28
held by Jan Wülfing, [email protected]
Goal of this lecture
Introduction to the learning problem type Reinforcement Learning, and to the mathematical foundations of an independently learning system.
Goal of the first unit
Motivation, definition and differentiation
Outline
- Examples
- Solution approaches
- Machine Learning
- Reinforcement Learning
- Overview
Example Backgammon
Can a program independently learn Backgammon?
Learning from success (win) and failure (loss)
Neuro-Backgammon: playing at world-champion level (Tesauro, 1992)
Example pole balancing (control engineering)
Can a program independently learn balancing?
Learning from success and failure
Neural RL controller: noise, inaccuracies, unknown behaviour, non-linearities, ... (Riedmiller et al.)
Example robot soccer
Can programs independently learn how to cooperate?
Learning from success and failure
Cooperative RL agents: complexity, distributed intelligence, ... (Riedmiller et al.)
Example: Autonomous (e.g. humanoid) robots
Task: movement control similar to humans (walking, running, playing soccer, cycling, skiing, ...)
Input: image from a camera
Output: control signals to the joints
Problems:
- very complex
- consequences of actions hard to predict
- interference / noise
Example: Maze
The ’Agent Concept’
[Russell and Norvig 1995, page 33]: "An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through effectors."
examples:
- a human
- a robot arm
- an autonomous car
- a motor controller
- ...
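The perceive-act cycle in this definition can be sketched as a minimal loop. The `Agent` and `Environment` classes, the two actions, and the random placeholder policy below are purely illustrative assumptions, not part of the lecture:

```python
import random

class Agent:
    """Minimal agent in the Russell/Norvig sense: perceives, then acts."""
    def act(self, percept):
        # Placeholder policy: a random choice between two hypothetical actions.
        return random.choice(["left", "right"])

class Environment:
    """Toy environment: a position on a line that actions move by one step."""
    def __init__(self):
        self.position = 0
    def percept(self):
        return self.position               # what the sensors deliver
    def apply(self, action):               # what the effectors change
        self.position += 1 if action == "right" else -1

env, agent = Environment(), Agent()
for _ in range(10):                        # the perceive-act cycle
    env.apply(agent.act(env.percept()))
```

Any of the examples above fits this scheme once `percept` and `apply` are filled in with real sensors and effectors.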
Solution approaches in ’Artificial Intelligence’ (AI)
- Planning / search (e.g. A*, backtracking)
- Deduction (e.g. logic programming, predicate logic)
- Expert systems (knowledge provided by experts)
- Fuzzy control systems (fuzzy logic)
- Genetic algorithms (evolution of solutions)
- Machine Learning (e.g. reinforcement learning)
Types of learning (in humans)
- Learning from a teacher
- Structuring of objects
- Learning from experience
Types of Machine Learning (ML)
- Learning with a teacher. Supervised learning: examples of input / (target) output. Goal: generalization (in general not simply memorization).
- Structuring / recognition of correlations. Unsupervised learning: goal is clustering of similar data points, e.g. for preprocessing.
- Learning through reward / penalty. Reinforcement learning: prerequisite is the specification of a target goal (or of events to be avoided).
Machine Learning: ’ingredients’
1. Type of the learning problem (what is given / what is sought)
2. Representation of the learned solution knowledge: table, rules, linear mapping, neural network, ...
3. Solution process (observed data → solution): (heuristic) search, gradient descent, optimization techniques, ...
Not at all: ’For this problem I need a neural network’
Emphasis of the lecture: Reinforcement Learning
- No information regarding the solution strategy is required
- Independent learning of a strategy by smart trial of solutions ('trial and error')
- The biggest challenge for a learning system
- Representation of the solution knowledge using a function approximator (e.g. tables, linear models, neural networks)
RL using the example of autonomous robots
bad: damage (fall, ...)
good: task done successfully
better: fast / low-energy / smooth movements / ...
⇒ optimization!
Reinforcement Learning (RL)
Also called: learning from evaluations, autonomous learning, neuro-dynamic programming
- Defines a learning type, not a method! Central feature: an evaluating training signal, e.g. 'good' / 'bad'
- RL with immediate evaluation: decision → evaluation. Example: parameters for a basketball throw
- RL with rewards delayed in time: decision, decision, ..., decision → evaluation. Substantially harder, and interesting because of its versatile applications
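RL with immediate evaluation can be sketched as a simple action-value learner. The two 'throw parameter settings' and their reward values below are hypothetical, and epsilon-greedy action selection is just one common choice, not the lecture's prescription:

```python
import random

random.seed(0)
# Two hypothetical parameter settings for a basketball throw; each choice
# is immediately evaluated with a (here deterministic) reward.
rewards = [0.2, 0.8]
estimates = [0.0, 0.0]     # learned value estimate per action
counts = [0, 0]
epsilon = 0.1              # exploration rate

for _ in range(1000):
    if random.random() < epsilon:
        a = random.randrange(2)                        # explore
    else:
        a = max(range(2), key=lambda i: estimates[i])  # exploit
    r = rewards[a]                                     # immediate evaluation
    counts[a] += 1
    estimates[a] += (r - estimates[a]) / counts[a]     # incremental mean
```

Because the evaluation arrives right after each single decision, no credit has to be spread over a sequence; the delayed case below is what makes RL hard.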
Delayed RL
- Decision, decision, ..., decision → evaluation
- Examples: robotics, control systems, games (chess, backgammon)
- Basic problem: temporal credit assignment
- Basic architecture: actor-critic system
Multistage decision problems
Actor-critic system (Barto, Sutton, 1983)
Actor: in situation s, choose action u (strategy π : S → U)
Critic: 'distribution' of the external evaluation signal onto the single actions
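A minimal sketch of the critic's role, assuming a tabular TD(0) value estimate on a five-state chain with a single delayed reward at the end; the actor is omitted, and this is far simpler than the original AHC/ACE architecture:

```python
# Chain of 5 states; only reaching the final state yields reward 1.
# A tabular critic learns state values with the TD(0) update
#   V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)),
# thereby distributing the delayed evaluation onto earlier decisions.
n_states, alpha, gamma = 5, 0.1, 0.9
V = [0.0] * n_states

for _ in range(500):                       # repeated episodes along the chain
    for s in range(n_states - 1):
        s_next = s + 1
        terminal = (s_next == n_states - 1)
        r = 1.0 if terminal else 0.0
        target = r + (0.0 if terminal else gamma * V[s_next])
        V[s] += alpha * (target - V[s])    # TD error as training signal
```

After training, states closer to the reward carry higher values, which is exactly the temporal credit assignment the delayed setting requires.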
Reinforcement Learning
- 1959 Samuel's checker player: temporal difference (TD) methods
- 1968 Michie and Chambers: BOXES
- 1983 Barto and Sutton's AHC/ACE; 1987 Sutton's TD(λ)
- Early '90s: connection between dynamic programming (DP) and RL: Werbos, Sutton, Barto, Watkins, Singh, Bertsekas
- DP: classic optimization technique (late '50s: Bellman); too expensive for large tasks. Advantage: clean mathematical formulation, convergence results
- 2000 policy gradient methods (Sutton et al., Peters et al., ...)
- 2005 Fitted Q (batch DP method) (Ernst et al., Riedmiller, ...)
- Many examples of successful, or at least practically relevant, applications since
Other examples
field              | input           | output (actions)  | goal            | example
games              | board situation | valid move        | winning         | backgammon, chess
robotics           | sensor data     | control variable  | reference value | pendulum, robot soccer
sequence planning  | state           | candidate         | gain            | assembly line, mobile network
benchmark          | state           | direction         | goal position   | maze
Goal: Autonomous learning system
Approach - rough outline
- Formulation of the learning problem as an optimization task
- Solution by learning, based on the optimization technique of Dynamic Programming
- Difficulties:
  - very large state space
  - process behaviour unknown
- Application of approximation techniques (e.g. neural networks, ...)
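The DP building block of this approach can be sketched as value iteration on a tiny deterministic 'maze'. The five-state corridor, the step cost of -1, and the two actions below are illustrative assumptions, chosen so the whole computation fits in a few lines:

```python
# Value iteration (classic DP, Bellman) on a tiny deterministic 1-D 'maze':
# states 0..4, actions left/right, goal at state 4, cost -1 per step.
n, gamma = 5, 1.0
V = [0.0] * n

def step(s, a):                  # deterministic transition model
    return min(n - 1, s + 1) if a == "right" else max(0, s - 1)

for _ in range(50):              # Bellman backups until the values settle
    for s in range(n - 1):       # state 4 is terminal, its value stays 0
        V[s] = max(-1.0 + gamma * V[step(s, a)] for a in ("left", "right"))
```

The converged values count the (negated) distance to the goal, and the greedy action with respect to V is the optimal strategy. The two difficulties above are exactly what breaks this scheme in practice: the table V does not fit in memory for large state spaces, and `step` is unknown, which motivates the approximation techniques of the later parts.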
Outline of lecture
Part 1: Introduction
Part 2: Dynamic Programming
Markov Decision Problems, backwards DP, value iteration, policy iteration
Part 3: Approximate DP / Reinforcement Learning
Monte Carlo methods, stochastic approximation, TD(λ), Q-learning
Part 4: Advanced methods of Reinforcement Learning
Policy gradient methods, hierarchical methods, POMDPs, relational Reinforcement Learning
Part 5: Applications of Reinforcement Learning
Robot soccer, pendulum, RL competition
Further courses on machine learning
- Lecture: Machine Learning (summer term)
- Lab course: Deep Learning (Wed., 10-12)
- Bachelor / Master theses, team projects
Further readings
D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts, 1996.
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts, 1998.
M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, New York, 1994.
L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.
M. Wiering (ed.). Reinforcement Learning: State-of-the-Art. Springer, 2012.
WWW:
- http://www-all.cs.umass.edu/rlr/
- http://richsutton.com/RL-FAQ.html