# Reinforcement Learning with Generalized Feedback: Beyond Numeric Rewards

This **workshop** will be held on **Monday, September 23rd, 2013**, as part of the ECML/PKDD 2013 conference.

Please note that the submission deadline has been extended to **July 5th, 2013!**

## Background

### Motivation

Reinforcement learning is traditionally formalized within the *Markov Decision Process* (MDP) framework: by taking actions in a stochastic and possibly unknown environment, an agent moves between states of this environment and, after each action, receives a numeric, possibly delayed reward signal. The agent's learning task is to act optimally, that is, to devise a policy (a mapping from states to actions) that maximizes its long-term (cumulative) reward.
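In symbols, one common way to write this objective is the discounted infinite-horizon form below. The discount factor $\gamma$, the transition kernel $P$, and the rest of the notation are assumptions made for illustration; the text above only speaks of long-term cumulative reward.

$$
\pi^{*} \in \arg\max_{\pi} \; \mathbb{E}\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \;\middle|\; a_t = \pi(s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t) \right], \qquad 0 \le \gamma < 1 .
$$

The generalized settings discussed next keep the states, actions, and transitions, but replace the numeric reward $r$ by other forms of feedback.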

In recent years, different generalizations of the standard setting of reinforcement learning have emerged; in particular, several attempts have been made to relax the quite restrictive requirement for numeric feedback and to learn from different types of more flexible training information. Examples of generalized settings of that kind include:

- **Learning from Expert Demonstration:** The training information consists of the action traces of an expert demonstrating the task, and the learner is supposed to devise a policy so as to imitate the expert. A specific instantiation of this setting is *apprenticeship learning*, which can be realized, for example, through *inverse reinforcement learning*.
- **Learning from Qualitative Feedback:** In this setting, the agent is not (necessarily) provided with a numeric reward signal. Instead, it is supposed to learn from more general types of feedback, such as ordinal rewards [Weng11] or qualitative comparisons between trajectories or policies, as in *preference-based reinforcement learning*.
- **Learning from Multiple Feedback Signals:** Here, feedback is provided in the form of multiple, possibly conflicting reward signals. The task of *multi-objective reinforcement learning* is to learn a policy that optimizes all of them at the same time, or at least finds a good compromise solution.
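To make the qualitative setting concrete, here is a minimal sketch of comparing two policies using only pairwise comparisons between their trajectories. Everything in it is an illustrative assumption rather than a method from the workshop: the function `preference_rate`, the toy judge `shorter_is_better` (standing in for a human or expert), and the example trajectories are all hypothetical.

```python
from itertools import product
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def preference_rate(
    trajs_a: Sequence[T],
    trajs_b: Sequence[T],
    prefer: Callable[[T, T], bool],
) -> float:
    """Fraction of (a, b) pairs in which the trajectory from policy A is preferred."""
    pairs = list(product(trajs_a, trajs_b))
    wins = sum(1 for a, b in pairs if prefer(a, b))
    return wins / len(pairs)


# Toy stand-in for a qualitative judge: shorter trajectories are preferred.
# No numeric reward is ever observed, only the outcome of each comparison.
def shorter_is_better(a: Sequence[str], b: Sequence[str]) -> bool:
    return len(a) < len(b)


# Hypothetical action traces sampled from two policies.
policy_a = [["left", "goal"], ["left", "left", "goal"]]
policy_b = [["right", "right", "goal"], ["right", "right", "right", "goal"]]

rate = preference_rate(policy_a, policy_b, shorter_is_better)
print(rate)  # 0.75: policy A wins three of the four comparisons
```

A rate above 0.5 suggests that policy A tends to be preferred, but this is only one of several possible ways to aggregate comparisons into a judgment about policies.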

Learning in generalized frameworks like those mentioned above can be considerably harder than learning
in MDPs. In qualitative settings, for example, where rewards cannot be easily aggregated over different states, policy evaluation becomes a non-trivial task.
Many approaches assume a *hidden* numeric reward function and interpret qualitative feedback as indirect or implicit information about that function. This assumption is already quite restrictive, however, and immediately imposes a total order on trajectories, which is not very natural in the settings of preference-based and multi-objective reinforcement learning. Purely qualitative approaches, on the other hand, completely give up the assumption of an underlying numeric reward function. This makes them more general but comes with a loss of properties that are crucial for standard reinforcement learning techniques (such as policy and value iteration).
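The difference between the two views can be sketched in a few lines. The example below is hypothetical throughout: the two-objective returns, the weights, and the helper names `pareto_dominates` and `scalarize` are assumptions made for illustration. It shows that a hidden scalar reward (here a fixed weighted sum) ranks every pair of trajectories, whereas Pareto dominance, one possible purely qualitative relation, leaves some pairs incomparable.

```python
from typing import Sequence


def pareto_dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def scalarize(returns: Sequence[float], weights: Sequence[float]) -> float:
    """Hypothetical hidden reward: a fixed weighted sum of the objectives."""
    return sum(w * r for w, r in zip(weights, returns))


# Made-up two-objective returns of three trajectories.
trajectories = {"t1": (3.0, 1.0), "t2": (1.0, 3.0), "t3": (0.5, 0.5)}

# Under Pareto dominance, t1 and t2 are incomparable (each wins on one objective).
incomparable = (
    not pareto_dominates(trajectories["t1"], trajectories["t2"])
    and not pareto_dominates(trajectories["t2"], trajectories["t1"])
)

# A hidden scalar reward forces a total order, which depends on the chosen weights.
weights = (0.7, 0.3)
ranking = sorted(
    trajectories,
    key=lambda name: scalarize(trajectories[name], weights),
    reverse=True,
)

print(incomparable)  # True
print(ranking)       # ['t1', 't2', 't3']
```

Swapping the weights to `(0.3, 0.7)` would reverse the order of `t1` and `t2`, which is one way to see why committing to a single hidden reward function is a strong assumption.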

The above extensions and variants of reinforcement learning are closely connected to, and largely intersect with, *preference learning*, a new subfield of machine learning that deals with the learning of (predictive) preference models from observed, revealed, or automatically extracted preference information. For example, inverse reinforcement learning and apprenticeship learning can be seen as a specific type of preference learning in dynamic environments. Likewise, preference-based and multi-objective reinforcement learning make use of generalized formalisms for representing preferences as well as learning techniques from the fields of preference learning and *learning-to-rank*.

### Goals and Objectives

The most important goal of this workshop is to help in unifying and streamlining research on generalizations of standard reinforcement learning, which, for the time being, seem to be pursued in a rather disconnected manner. Indeed, many of the extensions and generalizations discussed above are still lacking a sound theoretical foundation, let alone a generally accepted underlying framework comparable to Markov Decision Processes for conventional reinforcement learning. Besides, many of the commonalities shared by these generalizations have apparently not been recognized or explored so far. A formalization in terms of preferences may provide such a theoretical underpinning. Ideally, the workshop will help the participants to identify some common ground of their work, thereby helping the field move toward a theoretical foundation of reinforcement learning with generalized feedback.

Apart from fostering theoretical developments of that kind, we are also interested in identifying and exchanging interesting applications and problems that may serve as benchmarks for qualitative or preference-based reinforcement learning, in the way that cart-pole balancing or the mountain car do for classical reinforcement learning.

### Topics of Interest

Topics of interest include, but are not limited to:

- novel frameworks for reinforcement learning beyond MDPs
- algorithms for learning from preferences and non-numeric, qualitative, or structured feedback
- theoretical results on the learnability of optimal policies, convergence of algorithms in qualitative settings, etc.
- applications and benchmark problems for reinforcement learning in non-standard settings.

## Program

### 9:30 - 10:30 Session 1

9:30 - 9:40 | Eyke Hüllermeier, Johannes Fürnkranz: Opening Remarks |

9:40 - 10:30 | Invited Talk by Michele Sebag |

### 10:30 - 11:00 Coffee break

### 11:00 - 12:40 Session 2: Interactive Reinforcement Learning

11:00 - 11:25 | L. Adrian Leon, Ana C. Tenorio, Eduardo F. Morales: Human Interaction for Effective Reinforcement Learning |

11:25 - 11:50 | Riad Akrour, Marc Schoenauer, and Michele Sebag: Interactive Robot Education |

11:50 - 12:15 | Paul Weng, Robert Busa-Fekete and Eyke Hüllermeier: Interactive Q-Learning with Ordinal Rewards and Unreliable Tutor |

12:15 - 12:40 | Omar Zia Khan, Pascal Poupart, and John Mark Agosta: Iterative Model Refinement of Recommender MDPs based on Expert Feedback |

### 12:40 - 14:00 Lunch break

### 14:00 - 15:30 Session 3: RL with Non-numerical Feedback

14:00 - 14:25 | Christian Wirth, Johannes Fürnkranz: Preference-Based Reinforcement Learning: A Preliminary Survey |

14:25 - 14:50 | Robert Busa-Fekete, Balazs Szörenyi, Paul Weng, Weiwei Cheng and Eyke Hüllermeier: Preference-based Evolutionary Direct Policy Search |

14:50 - 15:15 | Daniel Bengs, Ulf Brefeld: A Learning Agent for Parameter Estimation in Speeded Tests |

15:15 - 15:30 | Discussion |

### 15:30 - 16:00 Coffee break

### 16:00 - 17:15 Session 4: Inverse RL and Multi-Dimensional Feedback

16:00 - 16:25 | Hideki Asoh, Masanori Shiro, Shotaro Akaho, Toshihiro Kamishima, Koiti Hasida, Eiji Aramaki, and Takahide Kohro: Applying Inverse Reinforcement Learning to Medical Records of Diabetes |

16:25 - 16:50 | Mohamed Oubbati, Timo Oess, Christian Fischer, and Günther Palm: Multiobjective Reinforcement Learning Using Adaptive Dynamic Programming And Reservoir Computing |

16:50 - 17:15 | Petar Kormushev, Darwin G. Caldwell: Comparative Evaluation of Reinforcement Learning with Scalar Rewards and Linear Regression with Multidimensional Feedback |

17:15 - 17:30 | Discussion |

## Organization

### Workshop Chairs

- Johannes Fürnkranz (TU Darmstadt)
- Eyke Hüllermeier (Universität Marburg)

### Programme Committee

- Riad Akrour, INRIA Saclay
- Robert Busa-Fekete, University of Marburg
- Damien Ernst, University of Liège
- Raphael Fonteneau, INRIA Lille
- Levente Kocsis, Hungarian Academy of Sciences
- Francis Maes, K.U. Leuven
- Jan Peters, TU Darmstadt
- Constantin Rothkopf, Frankfurt Institute for Advanced Studies
- Csaba Szepesvári, University of Alberta
- Christian Wirth, TU Darmstadt
- Paul Weng, Université Pierre et Marie Curie, Paris
- Bruno Zanuttini, Université de Caen