Distributional Inverse Reinforcement Learning
Most inverse reinforcement learning methods ask for one reward number per state-action pair. That is already hard, but it is also a little too tidy. In many settings, the reward behind a behavior is not a fixed scalar. A robot making contact with a fragile object can receive variable outcomes from almost identical states. An animal’s dopamine response can fluctuate from trial to trial even for the same behavioral transition.
This project asks what happens if IRL learns the reward as a distribution, $R(s,a) \sim p_\theta(\cdot \mid s, a)$, rather than as a single scalar $r(s,a)$.
The goal is not only to imitate an expert policy, but to recover a richer object: the reward distribution and the induced return distribution. That gives us a way to talk about variance, skew, tails, and risk preference from demonstrations alone.
Why Mean Matching Is Not Enough
Classical maximum-entropy IRL can be written as a game that matches expected returns:

$$
\max_{r}\;\min_{\pi}\;\;
\mathbb{E}_{\pi_E}\!\left[\sum_t \gamma^t r(s_t, a_t)\right]
- \mathbb{E}_{\pi}\!\left[\sum_t \gamma^t r(s_t, a_t)\right]
- \mathcal{H}(\pi).
$$
This works well when reward is deterministic, but it discards higher-order information when reward is random. Two reward models can agree on the mean while disagreeing completely about variance or tail behavior. For risk-aware behavior, that distinction is often the whole point.
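A tiny numeric sketch of that point (the numbers are illustrative, not from the paper): two reward models can share a mean while implying very different lower tails.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical reward models for the same (state, action):
# both have mean 1.0, but very different spread and lower tail.
r_narrow = rng.normal(loc=1.0, scale=0.1, size=100_000)
r_wide   = rng.normal(loc=1.0, scale=2.0, size=100_000)

print(r_narrow.mean(), r_wide.mean())   # ~1.0 and ~1.0: mean matching cannot tell them apart
print(np.quantile(r_narrow, 0.05))      # ~0.84: mild worst case
print(np.quantile(r_wide, 0.05))        # ~-2.3: the worst case a risk-averse expert would avoid
```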
DistIRL replaces mean matching with a first-order stochastic dominance (FSD) violation loss. If the expert return distribution $Z^{E}$ is to dominate the policy return distribution $Z^{\pi}$, then every threshold should look at least as favorable under the expert:

$$
F_{Z^{E}}(z) \;\le\; F_{Z^{\pi}}(z) \qquad \text{for all } z .
$$
The reward-learning loss measures the region where this ordering is violated:

$$
\mathcal{L}_{\mathrm{FSD}} \;=\; \int \big(F_{Z^{E}}(z) - F_{Z^{\pi}}(z)\big)_{+}\, dz ,
\qquad (x)_{+} = \max(x, 0).
$$
The computational trick is to move this objective into quantile space:

$$
\mathcal{L}_{\mathrm{FSD}} \;=\; \int_0^1 \big(F_{Z^{\pi}}^{-1}(\tau) - F_{Z^{E}}^{-1}(\tau)\big)_{+}\, d\tau ,
$$

where $F^{-1}(\tau)$ denotes the $\tau$-quantile of the return distribution.
That form makes the objective compatible with Monte Carlo return samples and quantile regression.
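A minimal sketch of that sample-based form, assuming only Monte Carlo return samples for the expert and the current policy are available (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def fsd_violation(expert_returns, policy_returns, n_taus=64):
    """One-sided quantile-space FSD loss:
    integral over tau of max(0, Q_policy(tau) - Q_expert(tau)).
    Zero when the expert return distribution first-order dominates the policy's."""
    taus = (np.arange(n_taus) + 0.5) / n_taus            # midpoint quantile levels
    q_expert = np.quantile(expert_returns, taus)         # empirical expert quantiles
    q_policy = np.quantile(policy_returns, taus)         # empirical policy quantiles
    return np.maximum(q_policy - q_expert, 0.0).mean()   # Monte Carlo estimate of the integral

# Toy check: a policy whose returns sit below the expert's incurs no violation...
rng = np.random.default_rng(1)
print(fsd_violation(rng.normal(10, 1, 5000), rng.normal(8, 1, 5000)))   # ~0
# ...while a policy whose quantiles exceed the expert's is penalized.
print(fsd_violation(rng.normal(10, 1, 5000), rng.normal(12, 1, 5000)))  # > 0
```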
The Algorithmic Picture
The algorithm has three moving parts.
- A reward distribution model $p_\theta(r \mid s, a)$, implemented with a differentiable parametric family such as a Gaussian, skew-normal, or quantile parameterization.
- A distributional critic, trained with quantile regression, that estimates the return distribution induced by the current policy and reward model (a quantile-regression sketch follows this list).
- A policy update that optimizes a distortion risk measure (DRM), so the learned policy can express risk-averse or risk-seeking behavior instead of collapsing everything to expected return.
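To make the second piece concrete, here is a minimal sketch of the quantile-regression update such a critic typically uses, in the style of a QR-DQN pinball/Huber loss; the exact parameterization in the paper may differ.

```python
import torch

def quantile_regression_loss(pred_quantiles, target_samples, kappa=1.0):
    """Quantile-regression (pinball/Huber) loss for a distributional critic.

    pred_quantiles: (n_quantiles,) current estimates of the return quantiles
    target_samples: (n_targets,)  bootstrapped return samples r + gamma * Z'(s', a')
    """
    n = pred_quantiles.shape[0]
    taus = (torch.arange(n, dtype=torch.float32) + 0.5) / n      # fixed quantile levels
    td = target_samples[None, :] - pred_quantiles[:, None]       # pairwise TD errors
    huber = torch.where(td.abs() <= kappa,
                        0.5 * td.pow(2),
                        kappa * (td.abs() - 0.5 * kappa))
    # Asymmetric weighting: over- and under-estimation are penalized according to tau.
    weight = (taus[:, None] - (td.detach() < 0).float()).abs()
    return (weight * huber / kappa).mean()
```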
The reward model is learned through an energy-based view: the FSD loss scores how compatible a proposed reward distribution is with the demonstrations,

$$
p(\mathcal{D} \mid \theta) \;\propto\; \exp\!\big(-\mathcal{L}_{\mathrm{FSD}}(\theta)\big).
$$
Using variational inference with a prior $p(\theta)$, the reward update minimizes

$$
\mathcal{J}(q) \;=\; \mathbb{E}_{\theta \sim q}\!\big[\mathcal{L}_{\mathrm{FSD}}(\theta)\big]
\;+\; \mathrm{KL}\!\big(q(\theta)\,\|\,p(\theta)\big)
$$

over the variational posterior $q(\theta)$.
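A schematic sketch of that update, assuming a diagonal-Gaussian variational posterior and a standard-normal prior over the reward parameters (these choices and the function names are illustrative assumptions, not the paper's exact implementation):

```python
import torch

def variational_reward_loss(q_mean, q_log_std, fsd_loss_fn, n_samples=8):
    """E_q[L_FSD(theta)] + KL(q || p), with p(theta) = N(0, I).

    fsd_loss_fn maps one parameter sample theta to a scalar loss tensor.
    """
    std = q_log_std.exp()
    # Reparameterized samples from the variational posterior q(theta).
    eps = torch.randn(n_samples, *q_mean.shape)
    thetas = q_mean + std * eps
    expected_fsd = torch.stack([fsd_loss_fn(t) for t in thetas]).mean()
    # Closed-form KL between diagonal Gaussians: KL(N(mu, std^2) || N(0, 1)).
    kl = 0.5 * (std.pow(2) + q_mean.pow(2) - 1.0 - 2.0 * q_log_std).sum()
    return expected_fsd + kl
```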
For policy learning, DistIRL optimizes a DRM of the return distribution:

$$
\max_{\pi}\;\; \int_0^1 F_{Z^{\pi}}^{-1}(\tau)\; dg(\tau),
$$

where the distortion function $g$ selects which parts of the return distribution should matter. A concave dual distortion emphasizes lower-tail outcomes and produces more conservative policies. The theory shows that, under regularity assumptions, the algorithm reaches an $\epsilon$-stationary point with an explicit iteration-complexity bound.
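As an illustration of how a distortion reweights the return distribution, here is a sketch that evaluates a DRM from estimated return quantiles using a CVaR-style concave distortion; the specific distortion family used in the paper may differ.

```python
import numpy as np

def distortion_risk_measure(quantiles, g_prime):
    """Approximate integral_0^1 Q(tau) dg(tau) with midpoint quantile levels."""
    n = len(quantiles)
    taus = (np.arange(n) + 0.5) / n
    weights = g_prime(taus) / n                 # dg(tau) ~ g'(tau) dtau
    return float(np.sum(weights * quantiles))

# CVaR_alpha uses g(tau) = min(tau / alpha, 1), so g'(tau) = (1/alpha) * 1[tau < alpha].
def cvar_weights(taus, alpha=0.25):
    return (taus < alpha) / alpha

quantiles = np.sort(np.random.default_rng(2).normal(100, 20, 64))
print(distortion_risk_measure(quantiles, lambda t: np.ones_like(t)))   # identity distortion: the mean
print(distortion_risk_measure(quantiles, cvar_weights))                # concave distortion: lower-tail focus
```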
What It Recovers
The smallest sanity check is a 5×5 gridworld with two rewarding goal states. One goal pays a reliable, deterministic reward; the other pays a stochastic, high-variance reward. A risk-averse expert chooses the more reliable goal in 9 out of 10 demonstrations.
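A minimal sketch of that environment's reward structure, with hypothetical goal locations and a Gaussian stand-in for the stochastic goal (the experiment's actual distribution is not specified here):

```python
import numpy as np

rng = np.random.default_rng(3)

SAFE_GOAL, RISKY_GOAL = (0, 4), (4, 4)   # hypothetical goal cells on the 5x5 grid

def reward(state):
    """Illustrative reward model: one reliable goal, one high-variance goal.
    The Gaussian below is purely a placeholder for the stochastic reward."""
    if state == SAFE_GOAL:
        return 1.0                        # deterministic payoff
    if state == RISKY_GOAL:
        return rng.normal(1.0, 2.0)       # placeholder: comparable mean, high variance
    return 0.0
```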
DistIRL recovers both high-reward locations and identifies the high-variance stochastic goal. Bayesian IRL, which models uncertainty but still matches expected return, recovers a plausible mean yet misses the variance structure.
A Neuroscience Use Case
The most interesting testbed is spontaneous mouse behavior. The dataset records mice freely exploring an arena, converts depth-camera video into discrete behavioral syllables, and includes a time-aligned dopamine trace from the dorsolateral striatum. We model each syllable as a state and the next syllable as the action, giving a compact 10-state, 10-action MDP.
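Concretely, turning a syllable sequence into demonstrations for this MDP is just a matter of pairing each syllable with its successor. A small sketch under the stated 10-syllable discretization (the depth-video preprocessing itself is not shown):

```python
import numpy as np

def syllables_to_transitions(syllables):
    """Convert a sequence of behavioral syllable IDs (0..9) into (state, action) pairs,
    where the state is the current syllable and the action is the next syllable."""
    syllables = np.asarray(syllables)
    states, actions = syllables[:-1], syllables[1:]
    return list(zip(states.tolist(), actions.tolist()))

def empirical_policy(transitions, n_syllables=10):
    """Expert 'policy' as the empirical next-syllable distribution per current syllable."""
    counts = np.zeros((n_syllables, n_syllables))
    for s, a in transitions:
        counts[s, a] += 1
    return counts / counts.sum(axis=1, keepdims=True).clip(min=1)
```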
Prior work showed that dopamine can act as a reward signal for reproducing the observed transitions. DistIRL asks a sharper question: can the reward distribution inferred from behavior resemble the measured dopamine fluctuation distribution?
With a skew-normal reward family, S-DistIRL recovers reward distributions that track the empirical dopamine distributions more closely than deterministic rewards or Bayesian IRL. This is exciting because it suggests that behavior alone can carry information about the variability of an internal neuromodulatory signal, not just its average level.
Offline Control Results
The continuous-control experiments use risk-sensitive D4RL-style benchmarks. The reward includes stochastic penalties for unsafe conditions, such as high-speed failures in HalfCheetah or large pitch angles in Hopper and Walker2d. Experts are trained with a risk-averse distributional SAC variant, and each method receives only 10 demonstration trajectories.
| Method | HalfCheetah | Hopper | Walker2d |
|---|---|---|---|
| DistIRL | 3469 ± 59 | 886 ± 1 | 1526 ± 148 |
| DistIRL-qtr | 3294 ± 172 | 747 ± 79 | 1211 ± 182 |
| Offline ML-IRL | 826 ± 231 | 192 ± 56 | 240 ± 50 |
| ValueDICE | 1259 ± 78 | 260 ± 10 | 798 ± 311 |
| Behavior Cloning | 2828 ± 281 | 346 ± 1 | 1321 ± 26 |
| Expert | 3540 ± 44 | 892 ± 3 | 1478 ± 200 |
DistIRL consistently outperforms the offline IRL baselines in the stochastic-reward setting and approaches expert performance. The HalfCheetah return distributions show why: the learned DistIRL return CDF stays much closer to the expert distribution, while the mean-matching baseline spreads mass into the low-return region.
Takeaway
Distributional IRL is useful because it changes the object being inferred. Instead of asking for a single reward explanation, it recovers a distributional explanation of behavior. That extra structure matters for at least three reasons:
- It can recover stochastic reward structure, including variance and skewness.
- It supports risk-aware imitation learning without requiring online interaction with the environment.
- It can turn behavior into a probe of latent internal signals, as in the dopamine experiment.
The broader lesson is simple: if expert behavior is shaped by uncertainty, then the reward model should be allowed to be uncertain in the same language. Mean reward is a useful summary, but sometimes the interesting science is in the distribution.