
Two papers presented at AAAI-26!

Our lab presented two papers at the 40th Annual AAAI Conference on Artificial Intelligence (AAAI-26) in Singapore. The papers cover enhancing reinforcement learning performance via variance reduction by training a behaviour policy, and synthesising reward monitors from expressive quantitative temporal logic specifications.

AAAI · RL · ML · Conference | 28 Jan 2026

We are excited to share our members' two papers below, with short explanations and poster links.

Alex Goodall: Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning

Authors: Alexander W. Goodall, Edwin Hamel-De le Court, and Francesco Belardinelli.

This work proposes Behaviour Policy Optimization (BPO), a variance-reduction regime for RL that replaces purely on-policy data collection with a learned behaviour policy $\mu$ designed to yield provably lower-variance return estimates for improving a target policy $\pi$. Building on recent off-policy evaluation theory, the authors show that a one-step variance-optimal behaviour policy has the form $\hat\mu(a \mid s) \propto \pi(a \mid s)\sqrt{\hat q_\pi(s,a)}$, where $\hat q_\pi$ is an auxiliary quantity linked to the variance of importance-sampling estimators.
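To make the proportionality concrete, here is a minimal sketch of how such a behaviour policy could be computed for a single state with a discrete action set. The arrays `pi_probs` and `q_hat` are illustrative placeholders, not values or code from the paper.

```python
import math

def behaviour_policy(pi_probs, q_hat):
    """mu(a|s) proportional to pi(a|s) * sqrt(q_hat(s, a)), normalised over actions."""
    unnorm = [p * math.sqrt(q) for p, q in zip(pi_probs, q_hat)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

pi_probs = [0.5, 0.3, 0.2]   # target policy pi(.|s)
q_hat = [1.0, 4.0, 9.0]      # assumed nonnegative auxiliary values
mu = behaviour_policy(pi_probs, q_hat)
# mu is a valid distribution; actions with larger q_hat are boosted relative to pi
```

Sampling actions from `mu` instead of `pi`, and reweighting returns with importance ratios $\pi/\mu$, is the standard mechanism by which such a behaviour policy reduces estimator variance.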

Paper | Poster

Omar Adalat: Expressive Temporal Specifications for Reward Monitoring

Authors: Omar Adalat and Francesco Belardinelli.

The paper studies how to specify and monitor rewards using richer temporal logic descriptions, going beyond simple Boolean reward functions, which yield sparse rewards and handle long-horizon tasks poorly. A quantitative version of LTL on finite traces, LTLf[F], is used: each proposition can be true to a degree in $[0,1]$, so a whole formula also evaluates to a satisfaction score in $[0,1]$. Composing reward monitors with the MDP retains Markovian policies for optimality, and the formal logic not only makes temporal reward functions interpretable and easy to specify, but also allows safety and reachability properties to be identified syntactically — so rewards can be blocked globally for the remainder of an episode whenever a safety property is violated.
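To illustrate how a formula can evaluate to a satisfaction score rather than a Boolean, here is a small sketch using common fuzzy semantics for two temporal operators (max over the trace for "eventually", min for "globally"); the precise LTLf[F] semantics in the paper may differ, and the trace values are invented for illustration.

```python
def eventually(trace, prop):
    """F prop: the best degree to which prop holds at any step of the finite trace."""
    return max(step[prop] for step in trace)

def globally(trace, prop):
    """G prop: the worst degree to which prop holds across all steps."""
    return min(step[prop] for step in trace)

# Each step assigns every proposition a truth degree in [0, 1].
trace = [
    {"goal": 0.1, "safe": 1.0},
    {"goal": 0.6, "safe": 0.8},
    {"goal": 0.9, "safe": 0.9},
]

reach_score = eventually(trace, "goal")   # reachability: 0.9
safety_score = globally(trace, "safe")    # safety: 0.8
reward = min(reach_score, safety_score)   # conjunction of the two: 0.8
```

The syntactic split is visible here: `globally` expresses a safety property, so a monitor that sees its score drop to 0 can zero out all remaining reward for the episode, as the paper describes.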

Paper | Poster