Distributional Inverse Reinforcement Learning
Most inverse reinforcement learning methods ask for one reward number per state-action pair. That is already hard, but it is also a little too tidy. In many settings, the reward behind a behavior is not a fixed scalar. A robot making contact with a fragile object can receive variable outcomes from almost identical states. An animal’s dopamine response can fluctuate from trial to trial even for the same behavioral transition.
This project asks what happens if IRL learns the reward as a distribution, $R(s,a) \sim p_\theta(\cdot \mid s, a)$, rather than as a single scalar $r(s,a)$.
The goal is not only to imitate an expert policy, but to recover a richer object: the reward distribution and the induced return distribution. That gives us a way to talk about variance, skew, tails, and risk preference from demonstrations alone.
Why Mean Matching Is Not Enough
Classical maximum-entropy IRL can be written as a game that matches expected returns:

$$
\max_{r}\;\min_{\pi}\;\;
\mathbb{E}_{\pi_E}\!\left[\sum_t \gamma^t r(s_t, a_t)\right]
- \mathbb{E}_{\pi}\!\left[\sum_t \gamma^t r(s_t, a_t)\right]
- \mathcal{H}(\pi).
$$
This works well when reward is deterministic, but it discards higher-order information when reward is random. Two reward models can agree on the mean while disagreeing completely about variance or tail behavior. For risk-aware behavior, that distinction is often the whole point.
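A tiny numeric sketch of that point (the numbers are illustrative, not from the paper): two reward models can share a mean while implying very different lower tails.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical reward models for the same (state, action):
# both have mean 1.0, but very different spread and lower tail.
r_narrow = rng.normal(loc=1.0, scale=0.1, size=100_000)
r_wide   = rng.normal(loc=1.0, scale=2.0, size=100_000)

print(r_narrow.mean(), r_wide.mean())   # ~1.0 and ~1.0: mean matching cannot tell them apart
print(np.quantile(r_narrow, 0.05))      # ~0.84: mild worst case
print(np.quantile(r_wide, 0.05))        # ~-2.3: the worst case a risk-averse expert would avoid
```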
DistIRL replaces mean matching with a first-order stochastic dominance (FSD) violation loss. If the expert return distribution $Z^{E}$ is to dominate the policy return distribution $Z^{\pi}$, then every threshold should look at least as favorable under the expert:

$$
F_{Z^{E}}(z) \;\le\; F_{Z^{\pi}}(z) \qquad \text{for all } z .
$$
The reward-learning loss measures the region where this ordering is violated:

$$
\mathcal{L}_{\mathrm{FSD}} \;=\; \int \big(F_{Z^{E}}(z) - F_{Z^{\pi}}(z)\big)_{+}\, dz ,
\qquad (x)_{+} = \max(x, 0).
$$
The computational trick is to move this objective into quantile space:

$$
\mathcal{L}_{\mathrm{FSD}} \;=\; \int_0^1 \big(F_{Z^{\pi}}^{-1}(\tau) - F_{Z^{E}}^{-1}(\tau)\big)_{+}\, d\tau ,
$$

where $F^{-1}(\tau)$ denotes the $\tau$-quantile of the return distribution.
That form makes the objective compatible with Monte Carlo return samples and quantile regression.
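A minimal sketch of that sample-based form, assuming only Monte Carlo return samples for the expert and the current policy are available (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def fsd_violation(expert_returns, policy_returns, n_taus=64):
    """One-sided quantile-space FSD loss:
    integral over tau of max(0, Q_policy(tau) - Q_expert(tau)).
    Zero when the expert return distribution first-order dominates the policy's."""
    taus = (np.arange(n_taus) + 0.5) / n_taus            # midpoint quantile levels
    q_expert = np.quantile(expert_returns, taus)         # empirical expert quantiles
    q_policy = np.quantile(policy_returns, taus)         # empirical policy quantiles
    return np.maximum(q_policy - q_expert, 0.0).mean()   # Monte Carlo estimate of the integral

# Toy check: a policy whose returns sit below the expert's incurs no violation...
rng = np.random.default_rng(1)
print(fsd_violation(rng.normal(10, 1, 5000), rng.normal(8, 1, 5000)))   # ~0
# ...while a policy whose quantiles exceed the expert's is penalized.
print(fsd_violation(rng.normal(10, 1, 5000), rng.normal(12, 1, 5000)))  # > 0
```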
The Algorithmic Picture
The algorithm has three moving parts.
- A reward distribution model $p_\theta(r \mid s, a)$, implemented with a differentiable parametric family such as a Gaussian, skew-normal, or quantile parameterization.
- A distributional critic, trained with quantile regression, that estimates the return distribution induced by the current policy and reward model (a quantile-regression sketch follows this list).
- A policy update that optimizes a distortion risk measure (DRM), so the learned policy can express risk-averse or risk-seeking behavior instead of collapsing everything to expected return.
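To make the second piece concrete, here is a minimal sketch of the quantile-regression update such a critic typically uses, in the style of a QR-DQN pinball/Huber loss; the exact parameterization in the paper may differ.

```python
import torch

def quantile_regression_loss(pred_quantiles, target_samples, kappa=1.0):
    """Quantile-regression (pinball/Huber) loss for a distributional critic.

    pred_quantiles: (n_quantiles,) current estimates of the return quantiles
    target_samples: (n_targets,)  bootstrapped return samples r + gamma * Z'(s', a')
    """
    n = pred_quantiles.shape[0]
    taus = (torch.arange(n, dtype=torch.float32) + 0.5) / n      # fixed quantile levels
    td = target_samples[None, :] - pred_quantiles[:, None]       # pairwise TD errors
    huber = torch.where(td.abs() <= kappa,
                        0.5 * td.pow(2),
                        kappa * (td.abs() - 0.5 * kappa))
    # Asymmetric weighting: over- and under-estimation are penalized according to tau.
    weight = (taus[:, None] - (td.detach() < 0).float()).abs()
    return (weight * huber / kappa).mean()
```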
The reward model is learned through an energy-based view: the FSD loss scores how compatible a proposed reward distribution is with the demonstrations,

$$
p(\mathcal{D} \mid \theta) \;\propto\; \exp\!\big(-\mathcal{L}_{\mathrm{FSD}}(\theta)\big).
$$
Using variational inference with a prior $p(\theta)$, the reward update minimizes

$$
\mathcal{J}(q) \;=\; \mathbb{E}_{\theta \sim q}\!\big[\mathcal{L}_{\mathrm{FSD}}(\theta)\big]
\;+\; \mathrm{KL}\!\big(q(\theta)\,\|\,p(\theta)\big)
$$

over the variational posterior $q(\theta)$.
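A schematic sketch of that update, assuming a diagonal-Gaussian variational posterior and a standard-normal prior over the reward parameters (these choices and the function names are illustrative assumptions, not the paper's exact implementation):

```python
import torch

def variational_reward_loss(q_mean, q_log_std, fsd_loss_fn, n_samples=8):
    """E_q[L_FSD(theta)] + KL(q || p), with p(theta) = N(0, I).

    fsd_loss_fn maps one parameter sample theta to a scalar loss tensor.
    """
    std = q_log_std.exp()
    # Reparameterized samples from the variational posterior q(theta).
    eps = torch.randn(n_samples, *q_mean.shape)
    thetas = q_mean + std * eps
    expected_fsd = torch.stack([fsd_loss_fn(t) for t in thetas]).mean()
    # Closed-form KL between diagonal Gaussians: KL(N(mu, std^2) || N(0, 1)).
    kl = 0.5 * (std.pow(2) + q_mean.pow(2) - 1.0 - 2.0 * q_log_std).sum()
    return expected_fsd + kl
```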
For policy learning, DistIRL optimizes a DRM of the return distribution:

$$
\max_{\pi}\;\; \int_0^1 F_{Z^{\pi}}^{-1}(\tau)\; dg(\tau),
$$

where the distortion function $g$ selects which parts of the return distribution should matter. A concave dual distortion emphasizes lower-tail outcomes and produces more conservative policies. The theory shows that, under regularity assumptions, the algorithm reaches an $\epsilon$-stationary point with an explicit iteration-complexity bound.
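As an illustration of how a distortion reweights the return distribution, here is a sketch that evaluates a DRM from estimated return quantiles using a CVaR-style concave distortion; the specific distortion family used in the paper may differ.

```python
import numpy as np

def distortion_risk_measure(quantiles, g_prime):
    """Approximate integral_0^1 Q(tau) dg(tau) with midpoint quantile levels."""
    n = len(quantiles)
    taus = (np.arange(n) + 0.5) / n
    weights = g_prime(taus) / n                 # dg(tau) ~ g'(tau) dtau
    return float(np.sum(weights * quantiles))

# CVaR_alpha uses g(tau) = min(tau / alpha, 1), so g'(tau) = (1/alpha) * 1[tau < alpha].
def cvar_weights(taus, alpha=0.25):
    return (taus < alpha) / alpha

quantiles = np.sort(np.random.default_rng(2).normal(100, 20, 64))
print(distortion_risk_measure(quantiles, lambda t: np.ones_like(t)))   # identity distortion: the mean
print(distortion_risk_measure(quantiles, cvar_weights))                # concave distortion: lower-tail focus
```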
What It Recovers
The smallest sanity check is a 5×5 gridworld with two rewarding goal states. One goal pays a reliable, deterministic reward; the other pays a stochastic, high-variance reward. A risk-averse expert chooses the more reliable goal in 9 out of 10 demonstrations.
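A minimal sketch of that environment's reward structure, with hypothetical goal locations and a Gaussian stand-in for the stochastic goal (the experiment's actual distribution is not specified here):

```python
import numpy as np

rng = np.random.default_rng(3)

SAFE_GOAL, RISKY_GOAL = (0, 4), (4, 4)   # hypothetical goal cells on the 5x5 grid

def reward(state):
    """Illustrative reward model: one reliable goal, one high-variance goal.
    The Gaussian below is purely a placeholder for the stochastic reward."""
    if state == SAFE_GOAL:
        return 1.0                        # deterministic payoff
    if state == RISKY_GOAL:
        return rng.normal(1.0, 2.0)       # placeholder: comparable mean, high variance
    return 0.0
```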
DistIRL recovers both high-reward locations and identifies the high-variance stochastic goal. Bayesian IRL, which models uncertainty but still matches expected return, recovers a plausible mean yet misses the variance structure.
A Neuroscience Use Case
The most interesting testbed is spontaneous mouse behavior. The dataset records mice freely exploring an arena, converts depth-camera video into discrete behavioral syllables, and includes a time-aligned dopamine trace from the dorsolateral striatum. We model each syllable as a state and the next syllable as the action, giving a compact 10-state, 10-action MDP.
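Concretely, turning a syllable sequence into demonstrations for this MDP is just a matter of pairing each syllable with its successor. A small sketch under the stated 10-syllable discretization (the depth-video preprocessing itself is not shown):

```python
import numpy as np

def syllables_to_transitions(syllables):
    """Convert a sequence of behavioral syllable IDs (0..9) into (state, action) pairs,
    where the state is the current syllable and the action is the next syllable."""
    syllables = np.asarray(syllables)
    states, actions = syllables[:-1], syllables[1:]
    return list(zip(states.tolist(), actions.tolist()))

def empirical_policy(transitions, n_syllables=10):
    """Expert 'policy' as the empirical next-syllable distribution per current syllable."""
    counts = np.zeros((n_syllables, n_syllables))
    for s, a in transitions:
        counts[s, a] += 1
    return counts / counts.sum(axis=1, keepdims=True).clip(min=1)
```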
Prior work showed that dopamine can act as a reward signal for reproducing the observed transitions. DistIRL asks a sharper question: can the reward distribution inferred from behavior resemble the measured dopamine fluctuation distribution?
With a skew-normal reward family, S-DistIRL recovers reward distributions that track the empirical dopamine distributions more closely than deterministic rewards or Bayesian IRL. This is exciting because it suggests that behavior alone can carry information about the variability of an internal neuromodulatory signal, not just its average level.
Offline Control Results
The continuous-control experiments use risk-sensitive D4RL-style benchmarks. The reward includes stochastic penalties for unsafe conditions, such as high-speed failures in HalfCheetah or large pitch angles in Hopper and Walker2d. Experts are trained with a risk-averse distributional SAC variant, and each method receives only 10 demonstration trajectories.
| Method | HalfCheetah | Hopper | Walker2d |
|---|---|---|---|
| DistIRL | 3469 ± 59 | 886 ± 1 | 1526 ± 148 |
| DistIRL-qtr | 3294 ± 172 | 747 ± 79 | 1211 ± 182 |
| Offline ML-IRL | 826 ± 231 | 192 ± 56 | 240 ± 50 |
| ValueDICE | 1259 ± 78 | 260 ± 10 | 798 ± 311 |
| Behavior Cloning | 2828 ± 281 | 346 ± 1 | 1321 ± 26 |
| Expert | 3540 ± 44 | 892 ± 3 | 1478 ± 200 |
DistIRL consistently outperforms the offline IRL baselines in the stochastic-reward setting and approaches expert performance. The HalfCheetah return distributions show why: the learned DistIRL return CDF stays much closer to the expert distribution, while the mean-matching baseline spreads mass into the low-return region.
Takeaway
Distributional IRL is useful because it changes the object being inferred. Instead of asking for a single reward explanation, it recovers a distributional explanation of behavior. That extra structure matters for at least three reasons:
- It can recover stochastic reward structure, including variance and skewness.
- It supports risk-aware imitation learning without requiring online interaction with the environment.
- It can turn behavior into a probe of latent internal signals, as in the dopamine experiment.
The broader lesson is simple: if expert behavior is shaped by uncertainty, then the reward model should be allowed to be uncertain in the same language. Mean reward is a useful summary, but sometimes the interesting science is in the distribution.