Making Sense of Active Inference: Optimal Control Without Cost Function

Jan 06, 2024

Here is my attempt to introduce the mechanics of active inference, a framework for modeling and understanding sentient agents originated in neuroscience. My main interest will be centered on active inference's approach to unify perception and control under the same optimization objective. As we will explain below, a unified objective likely has many advantages, including simpler agent design and better performance. We will see that active inference approaches the unified objective from a very perception-centric perspective rather than a control-centric perspective, which is itself a recent movement in reinforcement learning and optimal control.

Why unify perception and control?

We will consider agents that build an explicit model of the environment and behave in a way such that a reward function is optimized (a.k.a., model-based reinforcement learning; MBRL). Perception can be understood as making predictions about the future states of the environment.

The traditional recipe for building such agents is to let them interact with the environment and collect data, optimize the predictive accuracy of the model, and use the model to plan sequences of reward-maximizing actions. As the model becomes more accurate, the agent's ability to achieve high rewards should increase correspondingly.

Unfortunately, a recent paper titled Objective Mismatch in Model-based Reinforcement Learning found that the correspondence between model accuracy and achieved rewards is often violated in practice, if not non-existent at all. There are numerous ways why this could be the case. For example, in model-base RL, there is a well-known phenomenon that inaccuracies in the model (due to overfitting on training data and lack of regularization on out-of-distribution data) tends to be exploited by the planner, which has been studied extensively (for example, see When to Trust Your Model). Model exploitation can lead to premature convergence, but more often catastrophic failure. The bottom line here is that the conflict arises because the model is not trained for how it is supposed to be used: optimizing rewards.

To better understand and address the objective mismatch problem, a subfield has emerged within the MBRL community called decision-aware MBRL which aims to develop unified objective functions for model learning and action selection so that they are aware of each other's role in the optimization process. In a recent survey, we studied all existing decision-aware MBRL approaches and found that these unified objective functions are developed based on the principle of either better estimating or directly optimizing rewards with respect to a pair of model and action selection policy. Notable approaches include value-prediction which trains the model to predict values (cumulative rewards) instead of environment observations (e.g., see this paper) and distribution correction which corrects for model error in reward estimation or optimization (e.g., see this paper). Experimental results have shown increased robustness of agents trained with these unified objective functions to model misspecification, distracting observations, and other perceptual challenges. But it's clear that these approaches are very control-centric, models are developed solely for the purpose of control.

A handwavy introduction of active inference

Active inference can be seen as a way to develop agent objectives using ideas from the Free Energy Principle (FEP). The FEP roughly states that the agent has a probabilistic model of the sensory observations it is supposed to receive; both perception and control should gather evidence for the probabilistic model (i.e., maximize its likelihood).

The FEP is often brought up at the same time with predictive processing. In the RL context, predictive processing can be understood as the idea that both perception and control should suppress prediction error, where perception achieves this objective by building a better model of the environment, control achieves this objective by changing the environment such that environment-generated signals are better predicted by the model. Under this view, perception and control work towards the same goal: minimize prediction error. In contrast to the control-centric approach to unified objectives in RL, models are developed for the purpose of prediction, and so is control.

There are solid cognitive and neuroscience motivations for this perception-centric view of active inference over utility-maximizing optimal control, mostly claiming advantages of active inference in agents with limited information storage and processing capacity. For example, in a paper titled Nonmodular Architecture of Cognitive Systems Based On Active Inference, the authors found that an active inference controller is robust to model misspecification.

Most relevant to the present discussion is a paper titled Learning Action-oriented Models Through Active Inference, where the authors showed that an active inference agent with a misspecified model succeeds at a task while learning model parameters that deviate significantly from the actual environment statistics. And if the agent were to accurately capture the environment statistics in this setting, it would not solve the task. Furthermore, the parameters learned by the active inference agent are optimistic in the sense that the learned transition dynamics is biased towards solving the task.

Despite these promising results, how to properly formulate perception and control as prediction error minimization has been an active research effort: active inference formulations have gone through drastic changes over the years and I think it is still not fully settled (see this paper on active inference and different kinds of free energy).

I will unpack two versions of active inference in detail, an old and a new one, to illustrate the conceptual underpinning.

Optimal control without cost function

A paper published in 2012 titled Active Inference And Agency: Optimal Control Without Cost Functions really embodies the idea that optimal control should be based on prediction rather than reward or cost functions. The authors proposed two agents: an agency-free agent and an agency-based agent. I will discuss the former agent and contrast it with the standard approach to (partially observable) Markov decision process.

In partially observable Markov decision process (POMDP), we model the environment using a set of hidden states s ∈ 𝒮, actions a ∈ 𝒜, and observations o ∈ 𝒪. Upon taking action aₜ at time step t, the environment transitions into a new state sₜ₊₁ according to probability distribution P(sₜ₊₁|sₜ, aₜ). However, the agent cannot directly observe the environment states but only observations sampled from P(oₜ|sₜ).

The most established approach to planning in POMDP is based on performing backward dynamic programming in the belief space. Basically, the planner tries to think about "given that I believe the environment is in a certain state, what will my next belief be under the POMDP model of the environment, and what value is associated with that belief". At every time step of interaction with the environment, the agent first updates its belief about the environment (e.g., using a Kalman filter), and then uses the updated belief to find the optimal plan.

Active inference tries to formulate both the belief update and the planning process as inference or prediction (i.e., inference about the future). Similar to the optimal control formulation, the agent knows that observations are sampled from P(oₜ|sₜ). However, the active inference agent models the environment transitions as P(sₜ₊₁|sₜ), eschewing the action variable in the optimal control formulation.

To be more precise, this paper considers solving episodic tasks with a maximum of T time steps. The agent explicitly represents the hidden states at all time steps, and observations until the current time step τ. The probabilistic model is defined as follows:

\(P(o_{1:\tau}, s_{1:T}) = \prod_{t=1}^{\tau}P(o_t\vert s_t)\prod_{t=1}^{T}P(s_t\vert s_{t-1})\)

where P(s₁| s₀) = P(s₁) is the initial state distribution.

Perception as hidden state inference: As said before, the agent performs all mental activities for the purpose of maximizing the likelihood of the probabilistic model. For perception in the current context, this corresponds to updating beliefs about the hidden states by minimizing variational free energy, an upper bound on the (negative) model evidence.

Specifically, the agent tries to form its belief about the hidden states by optimizing a parameterized distribution

\(Q(s_{1:\tau}) = \prod_{t=1}^{\tau}Q(s_t)\)

towards minimal free energy (we assume beliefs about states at adjacent time steps are not correlated, a.k.a., the mean-field assumption). The free energy function is defined as:

\(\begin{align} \mathcal{F}(o_{1:\tau}, Q) &= \mathbb{E}_{Q(s_{1:T})}[\log Q(s_{1:T}) - \log P(o_{1:\tau}, s_{1:T})]\\ &= \mathbb{E}_{Q(s_{1:T})}[\sum_{t=1}^{T}\log Q(s_{t}) - \sum_{t=1}^{\tau}\log P(o_t\vert s_t) - \sum_{t=1}^{T}\log P(s_t\vert s_{t-1})] \\ &= - \underbrace{\sum_{t=1}^{\tau}\mathbb{E}_{Q(s_t)}[\log P(o_t\vert s_t)]}_{\text{Past observations}} + \underbrace{\sum_{t=1}^{T}\mathbb{E}_{Q(s_{t-1:t})}[\log Q(s_{t}) - \log P(s_t\vert s_{t-1})]}_{\text{Past and future states}} \end{align}\)

Taking the derivative of F w.r.t. Q, we can show that the optimal Q has the following form (see derivation at the end):

\(Q^*(s_t) \propto \exp\left(\mathbb{I}[t \leq \tau]\log P(o_t\vert s_t) + \mathbb{E}_{Q^*(s_{t-1})}[\log P(s_t\vert s_{t-1})] + \mathbb{E}_{Q^*(s_{t+1})}[\log P(s_{t+1}\vert s_t)]\right)\)

Arrange slightly differently, we have:

\(Q^*(s_t) \propto \left\{\begin{array}{ll}\exp\left(\mathbb{E}_{Q^*(s_{t-1})}[\log P(o_t, s_t\vert s_{t-1})] + c_1\right) & t \leq \tau \\ \exp\left(\mathbb{E}_{Q^*(s_{t-1})}[\log P(s_t\vert s_{t-1})] + c_2\right) & t > \tau \end{array}\right.\)

where c₁, c₂ correspond to the last term in the previous equation. We see that these terms highly resemble Bayesian posterior and posterior predictive distributions.

Control as observation inference: This part is where active inference deviates from optimal control. First, the authors introduce an additional mapping P(oₜ₊₁|oₜ, aₜ). I like to think about this as "reflex", where actions are quickly mapped to observations (or the other way around assuming inverse dynamics uniqueness). They then require the agent to select actions such that the next observation under the reflex distribution has the least free energy under the current updated belief about hidden states:

\(a_\tau = \arg\min_{a} \sum_{o_{\tau+1}}P(o_{\tau+1}\vert o_\tau, a)\mathcal{F}(o_{1:\tau+1}, Q^*)\)

where

\(\mathcal{F}(o_{1:t+1}, Q^*)\)

is the free energy after observing the hypothetical next observation, evaluated using the current belief.

Connections to optimal control with cost function

Let's now try to understand what is meant by control as observation inference. The magic lies in not explicitly representing actions in the transition model.

Let us assume we have access to an (reward-maximizing) optimal policy π*(a|s) and a regular transition model P(s'|s, a). Assuming we always choose actions from the optimal policy, the transition model eschewing the action variable can be constructed from:

\(P^{\pi^*}(s'\vert s) = \sum_{a}P(s'\vert s, a)\pi^*(a\vert s)\)

Let us also find the optimal control counterpart of the reflex model. Defining

\(b(s_\tau\vert o_{\tau}) = P(s_t\vert o_{\tau}, o_{1:\tau-1})\)

as the Bayesian posterior, we can construct the reflex model as:

\(\begin{align} P(o_{\tau+1}\vert o_{\tau}, a) &= \sum_{s_{\tau+1}}P(o_{\tau+1}\vert s_{\tau+1})\sum_{s_{\tau}}P(s_{\tau+1}\vert s_{\tau}, a)b(s_{\tau}\vert o_{\tau}) \\ &\triangleq P(o_{\tau+1}\vert b, a) \end{align}\)

We are now able to simplify the hypothetical free energy

\(\mathcal{F}(o_{1:\tau+1}, Q^*)\)

We are currently at time step t = τ + 1. For future time steps t > τ + 1, since we have minimized the corresponding terms in F (let's denote them with F˲), we assume that it is approximately zero:

\(\begin{align} \mathcal{F}_{>}(Q^*) &= \sum_{t=\tau+1}^{T}\mathbb{E}_{Q^*(s_{t-1:t})}[\log Q^*(s_t) - \log P(s_t\vert s_{t-1})] \\ &= \sum_{t=\tau+1}^{T}\mathbb{E}_{Q^*(s_{t-1})}D_{KL}[Q^*(s_t) \vert \vert P(s_t\vert s_{t-1})] \\ &\approx 0 \end{align}\)

This is simply saying we can make good prediction of future states.

For past time steps t ≤ τ, we know that the corresponding terms in F (let's denote them with F˱) are approximately equal to the likelihood of past observations evaluated under the model:

\(\mathcal{F}_{<}(Q^*) \approx \log P(o_{1:\tau})\)

And this does not change no matter what the next observation o_{τ+1} will be.

Thus, the only term that differentiates F(o_{1:τ+1},Q*) for different o_{τ+1} is the current step, which equals:

\(\begin{align} \mathcal{F}_{\tau+1} &= -\mathbb{E}_{Q^*(s_{\tau+1})}[\log P(o_{\tau+1}\vert s_{\tau+1})] \\ &\approx -\mathbb{E}_{P(s_{\tau+1}\vert Q^*)}[\log P(o_{\tau+1}\vert s_{\tau+1})] \\ &\approx -\log \sum_{s_{\tau+1}}P(o_{\tau+1}\vert s_{\tau+1})\sum_{s_{\tau}}P^{\pi^*}(s_{\tau+1}\vert s_{\tau})Q^*(s_{\tau}) \\ &\triangleq -\log P(o_{t+1}\vert \pi^*) \end{align}\)

where

\(P(s_{\tau+1}|Q^*) = \sum_{s_{\tau}}Q^*(s_{\tau})P(s_{\tau+1}|\tau)\)

is the posterior predictive distribution.

Thus, F_{τ+1} represents the (negative) predictive likelihood of the next observation under the optimal policy, and selecting actions comes down to finding:

\(\begin{align} a_{\tau+1} &= \arg\max_{a}\sum_{o_{\tau+1}}P(o_{\tau+1}\vert b, a)\log P(o_{t+1}\vert \pi^*) \end{align}\)

which is simply saying: let's find an action such that the next predicted observation coincide with the one that would be generated by the optimal policy.

An illustration of optimal control as observation inference (2012) and optimal control as prior inference (2015), which we will unpack below.

Modern active inference

In 2015, a paper titled Active Inference And Epistemic Value marks the beginning of a new era for active inference, where the ability to handle epistemic uncertainty is claimed as a central property and a natural consequence of active inference. This new version was later refined in a paper titled Active Inference: A Process Theory and the most updated version (the modern version) of active inference was comprehensively reviewed in this paper (highly recommended).

The difference between the modern version and the previous version is that the agent now explicitly represents actions with no additional reflex mapping. We still consider an episodic setting with a maximum of T time steps. We denote the action sequence to be modeled as π=a1:T−1. The probabilistic model is defined as:

\(P(o_{1:\tau}, s_{1:T}, \pi) = \prod_{t=1}^{\tau}P(o_t\vert s_t)\prod_{t=1}^{T}P(s_t\vert s_{t-1}, \pi)P(\pi)\)

where P(s₁|s₀, π) = P(s₁).

The agent now represents beliefs about not only the hidden states but also the action sequence. This is captured in the distribution

\(Q(s_{1:T}, \pi) = \prod_{t=1}^{T}Q(s_t|\pi)Q(\pi)\)

The free energy function is defined as:

\(\begin{align} \mathcal{F}(o_{1:\tau}, Q) &= \mathbb{E}_{Q}[\log Q(s_{1:T}, \pi) - \log P(o_{1:\tau}, s_{1:T}, \pi)] \\ &= D_{KL}[Q(\pi)\vert \vert P(\pi)] + \mathbb{E}_{Q}[\log Q(s_{1:T}\vert \pi) - \log P(o_{1:\tau}, s_{1:T}\vert \pi)] \\ &\triangleq D_{KL}[Q(\pi)\vert \vert P(\pi)] + \mathbb{E}_{Q(\pi)}[\mathcal{F}(o_{1:\tau}, Q\vert \pi)] \end{align}\)

where F(o_{1:τ},Q|π) denotes action-conditioned free energy.

Perception as hidden state inference: Perception in the modern version is very similar to the previous version, except that we now have to find the optimal state estimates for each π. Borrowing the previous results, we have:

\(Q^*(s_t\vert \pi) \propto \left\{\begin{array}{ll}\exp\left(\mathbb{E}_{Q^*(s_{t-1})}[\log P(o_t, s_t\vert s_{t-1}, \pi)] + c_1\right) & t \leq \tau \\ \exp\left(\mathbb{E}_{Q^*(s_{t-1})}[\log P(s_t\vert s_{t-1}, \pi)] + c_2\right) & t > \tau \end{array}\right.\)

Control as prior inference: We saw in the previous version of active inference that goal-directed actions are induced by generating optimistic predictions - predictions about sensory consequences that would have been observed if the agent were to act optimally. This version of active inference induces such behavior using a goal-directed prior over actions sequences.

Specifically, the prior over π is defined as:

\(P(\pi) \propto \exp\left(-\mathcal{G}(\pi\vert Q^*)\right)\)

where G(π|Q*) is known as the expected free energy (EFE) defined as:

\(\mathcal{G}(\pi\vert Q^*) \triangleq \mathbb{E}_{Q^*(o_{\tau+1:T}, s_{\tau+1:T}\vert \pi)}[\log Q^*(s_{\tau+1:T}\vert \pi) - \log \tilde{P}(o_{\tau+1:T}, s_{\tau+1:T}\vert \pi)]\)

and

\(Q^*(o_{\tau+1:T}, s_{\tau+1:T}|\pi) = \prod_{t=\tau+1}^{T}P(o_t|s_t)Q^*(s_t|\pi)\)

is the joint predictive distribution.

The optimal posterior over action sequences under this prior is:

\(Q^*(\pi) \propto \exp\left(-\mathcal{G}(\pi\vert Q^*) - \mathcal{F}(o_{1:\tau}, Q^*\vert \pi)\right)\)

Thus, the posterior is simply a minor modification of the prior based on which action sequence likely generated the observed signals. If we assume the dynamics model fits the data well so that F≈0, then we can see that the prior is doing the majority of the heavy-lifting.

Control prior design choices

The choice of the EFE prior is usually justified as "a free energy minimizing agent should a priori believe they will choose actions that minimize free energy". Note, however, the "probabilistic model" P̃(o_{τ₊₁:T}, s_{τ₊₁:T}|π) in the EFE prior is not necessarily the same as the probabilistic model used to perform state estimation. There are thus many design decisions in defining P̃.

We can definitely choose them to be the same:

\(\tilde{P}(o_{\tau+1:T}, s_{\tau+1:T}|\pi) = \prod_{t=\tau+1}^{T}P(o_t|s_t)Q^*(s_t|\pi)\)

Then the EFE becomes:

\(\begin{align} \mathcal{G}(\pi\vert Q^*) &= \sum_{t=\tau+1}^{T}\mathbb{E}_{Q^*(o_t, s_t\vert \pi)}[ \log Q^*(s_t\vert \pi) - \log P(o_t\vert s_t) - \log Q^*(s_t\vert \pi)] \\ &= \sum_{t=\tau+1}^{T}\mathbb{E}_{Q^*(o_t, s_t\vert \pi)}[- \log P(o_t\vert s_t)] \\ &= \sum_{t=\tau+1}^{T}\mathbb{E}_{Q^*(s_t\vert \pi)}\mathcal{H}[P(o_t\vert s_t)] \end{align}\)

In other words, the EFE becomes the expected entropy H of future observations.

We can alternatively choose:

\(\tilde{P}(o_{\tau+1:T}, s_{\tau+1:T}|\pi) = \prod_{t=\tau+1}^{T}P(o_t|s_t)\tilde{P}(s_t)\)

where

\(\tilde{P}(s_t) \propto \exp(R(s_t))\)

is a desired distribution over states, defined using a state-based reward function. Then, the EFE becomes:

\(\begin{align} \mathcal{G}(\pi\vert Q^*) &= \sum_{t=\tau+1}^{T}\mathbb{E}_{Q^*(o_t, s_t\vert \pi)}[ \log Q^*(s_t\vert \pi) - \log P(o_t\vert s_t) - \log \tilde{P}(s_t)] \\ &= \sum_{t=\tau+1}^{T} \underbrace{D_{KL}[Q^*(s_t\vert \pi)\vert \vert \tilde{P}(s_t)]}_{\text{Risk}} + \underbrace{\mathbb{E}_{Q^*(s_t\vert \pi)}\mathcal{H}[P(o_t\vert s_t)]}_{\text{Ambiguity}} \end{align}\)

This gives us the well-known risk-ambiguity decomposition.

To get the final decompositions, we will choose:

\(\tilde{P}(o_{\tau+1:T}, s_{\tau+1:T}|\pi) = \prod_{t=\tau+1}^{T}\tilde{P}(o_t)\tilde{P}(s_t|o_t)\)

where

\(\tilde{P}(o_t) \propto \exp(R(o_t))\)

We will then add and subtract the EFE with

\(\sum_{t=\tau+1}^{T}\mathbb{E}_{Q^*(o_t, s_t|\pi)}[\log Q^*(s_t|o_t, \pi)]\)

where Q*(sₜ|oₜ, π) ∝ Q*(sₜ|π)P(oₜ|sₜ). The EFE becomes the following:

\(\begin{align} \mathcal{G}(\pi\vert Q^*) &= \sum_{t=\tau+1}^{T}\mathbb{E}_{Q^*(o_t, s_t\vert \pi)}[ \log Q^*(s_t\vert \pi) - \log \tilde{P}(o_t) - \log \tilde{P}(s_t\vert o_t)] \\ &= \sum_{t=\tau+1}^{T}\mathbb{E}_{Q^*(o_t\vert \pi)}[- \log \tilde{P}(o_t)] + \mathbb{E}_{Q^*(o_t\vert \pi)}D_{KL}[Q^*(s_t\vert o_t, \pi) \vert \vert \tilde{P}(s_t\vert o_t)] \\ &\quad - \mathbb{E}_{Q^*(o_t\vert \pi)}D_{KL}[Q^*(s_t\vert o_t, \pi) \vert \vert Q^*(s_t\vert \pi)] \\ &\geq \sum_{t=\tau+1}^{T}\underbrace{\mathbb{E}_{Q^*(o_t\vert \pi)}[- \log \tilde{P}(o_t)]}_{\text{Pragmatic value}} - \underbrace{\mathbb{E}_{Q^*(o_t\vert \pi)}D_{KL}[Q^*(s_t\vert o_t, \pi) \vert \vert Q^*(s_t\vert \pi)]}_{\text{Epistemic value}} \end{align}\)

where the last line is obtained by dropping the second term in the second line, because KL divergence is non-negative. This gives us the well-known pragmatic-epistemic value decomposition. But note that it is a bound on the EFE and not the EFE itself as defined in the prior.

If we care doing a quick manipulation on the epistemic value as follows:

\(\begin{align} &\mathbb{E}_{Q^*(o_t\vert \pi)}D_{KL}[Q^*(s_t\vert o_t, \pi) \vert \vert Q^*(s_t\vert \pi)] \\ &= \mathbb{E}_{Q^*(o_t, s_t\vert \pi)}[\log Q^*(s_t\vert \pi) + \log P(o_t\vert s_t) - \log Q^*(o_t\vert \pi) - Q^*(s_t\vert \pi)] \\ &= \mathbb{E}_{Q^*(s_t\vert \pi)P(o_t\vert s_t)}[\log P(o_t\vert s_t)] - \mathbb{E}_{Q^*(o_t, s_t\vert \pi)}[\log Q^*(o_t\vert \pi)] \\ &= -\mathbb{E}_{Q^*(s_t\vert \pi)}\mathcal{H}[P(o_t\vert s_t)] - \mathbb{E}_{Q^*(o_t\vert \pi)}[\log Q^*(o_t\vert \pi)] \end{align}\)

Plug this back to the previous decompositions, we have:

\(\begin{align} \mathcal{G}(\pi\vert Q^*) &\geq \sum_{t=\tau+1}^{T}\mathbb{E}_{Q^*(o_t\vert \pi)}[- \log \tilde{P}(o_t)] + \mathbb{E}_{Q^*(s_t\vert \pi)}\mathcal{H}[P(o_t\vert s_t)] + \mathbb{E}_{Q^*(o_t\vert \pi)}[\log Q^*(o_t\vert \pi)] \\ &= \sum_{t=\tau+1}^{T}\underbrace{D_{KL}[Q^*(o_t\vert \pi) \vert \vert \tilde{P}(o_t)]}_{\text{Risk}} + \underbrace{\mathbb{E}_{Q^*(s_t\vert \pi)}\mathcal{H}[P(o_t\vert s_t)]}_{\text{Ambiguity}} \end{align}\)

This gives us a risk-ambiguity decomposition in the observation space.

Overall, this exercise suggests that whether an active inference will perform well on an actual (reward-seeking) task depends on whether the designer can specify a good prior for the task.

Connections to optimal control without cost function

A salient question at this point is whether modern active inference has lost its root of "optimal control without cost function"?

To see the connection, let's modify the generative model slightly into the following factorized format:

\(P(o_{1:\tau}, s_{1:T}, a_{1:T}) = \prod_{t=1}^{\tau}P(o_t|s_t)\prod_{t=1}^{T}P(s_t|s_{t-1}, a_{t-1})\pi(a_t|s_t)\)

where P(s₁|s₀, a₀) = P(s₁). We will correspondingly modify the belief distribution as

\(Q(s_{1:T}, a_{1:T}) = \prod_{t=1}^{T}Q(s_t)Q(a_t|s_t)\)

The free energy function is defined as:

\(\begin{align} \mathcal{F}(o_{1:\tau}, Q) &= \mathbb{E}_{Q}[\log Q(s_{1:T}, a_{1:T}) - \log P(o_{1:\tau}, s_{1:T}, a_{1:T})] \\ &= \sum_{t=1}^{T}\mathbb{E}_{Q}[\log Q(s_t) + \log Q(a_t|s_t) - \log P(o_t, s_t|s_{t-1}, a_{t-1}) - \log \pi(a_t|s_t)] \\ &= \sum_{t=1}^{T}\mathbb{E}_{Q}[D_{KL}[Q(a_t|s_t)||\pi(a_t|s_t)]] + \mathbb{E}_{Q}[\log Q(s_t) - \log P(o_t, s_t|s_{t-1}, a_{t-1})]\\ &\triangleq \sum_{t=1}^{T} \mathbb{E}_{Q(s_t)}[D_{KL}[Q(a_t|s_t)||\pi(a_t|s_t)]] + \mathbb{E}_{Q(s_{t-1}, a_{t-1})}[\mathcal{F}(o_{t}, Q|a_{t-1})] \end{align}\)

where for t > τ, oₜ does not exist.

Perception as hidden state inference: It's easy to see that the optimal state estimates remain as approximate Bayesian posteriors:

\(Q^*(s_t) \propto \left\{\begin{array}{ll}\exp\left(\mathbb{E}_{Q^*(s_{t-1}, a_{t-1})}[\log P(o_t, s_t|s_{t-1}, a_{t-1})] + c_1\right) & t \leq \tau \\ \exp\left(\mathbb{E}_{Q^*(s_{t-1}, a_{t-1})}[\log P(s_t|s_{t-1}, a_{t-1})] + c_2\right) & t > \tau \end{array}\right.\)

Control as prior inference: Defining a similar notion of EFE for this factorized generative model:

\(\pi(a_t|s_t) \propto \exp\left(-\mathcal{G}(a_t, s_t|Q^*)\right)\)

where

\(\mathcal{G}(a_t, s_t|Q^*) \triangleq \mathbb{E}_{Q^*(o_{\tau+1:T}, s_{\tau+1:T}|s_t, a_t)}[\log Q^*(s_{\tau+1:T}, a_{\tau+1:T}) - \log \tilde{P}(o_{\tau+1:T}, s_{\tau+1:T}, a_{\tau+1:T}|s_t, a_t)]\)

we obtain a similar optimal policy where the posterior over actions roughly corresponds to the prior:

\(Q^*(a_t|s_t) \propto \exp\left(-\mathcal{G}(a_t, s_t|Q^*) - \mathbb{E}_{Q^{*}(s_{t-1}, a_{t-1})}[\mathcal{F}(o_{t}, Q^*|a_{t-1})]\right)\)

Familiar readers might recognize a lot of similarity between this factorization and standard notations in RL. This is indeed the factorization underlying most deep RL-based implementations of active inference.

To see the connection with "optimal control without cost function", let us define a new state variable s̃ = [s, a] which is the concatenation of the original state and action variables. We can then rewrite the factorized generative model as:

\(P(o_{1:\tau}, \tilde{s}_{1:T}) = \prod_{t=1}^{\tau}P(o_t|\tilde{s}_t)\prod_{t=1}^{T}P(\tilde{s}_t|\tilde{s}_{t-1})\)

where P(oₜ|s̃ₜ) = P(oₜ|sₜ, aₜ) = P(oₜ|sₜ). We have visualized this and the 2012 model in the figure in the previous section.

Under this representation, we dismantle the conventional treatment of states and actions as separate objects and instead understand actions as just a subset of states which we can directly affect or control if we put our mind to. In this way, we realize active inference's motivation to unify both perception and control as hidden states inference, except that inference of active states is minimally affected by observations but rather largely modulated by the prior. But this is exactly the vision of active inference: actions are just precise predictions.

Closing thoughts

We started with a motivation to understand how active inference formulates a single objective function for perception and control, which may in turn address the objective mismatch problem in RL. We have seen that active inference adopts a very perception-centric approach to the development of the unified objective as opposed to the control-centric approach in RL.

On the surface, both versions of active inference appear to unify perception and control under a single objective of minimizing free energy, and yet the devil is in the detail of the generative model. Given the intricacies of the prior design choices, one may ask whether active inference still conforms to the mismatched perception-control paradigm, where the perception model is trained to predict environment observations and the action selection policy is optimized for some other criteria. I think the key here is that among all possible prior design choices, there is only a subset that is correct. As I discuss in another post, while the EFE objective may seem ad-hoc, it actually gives rise to a key property which makes agents robust to model error. In this sense, the EFE objective can be seen as a type of distribution correction objective (and can in fact be derived from this perspective). This further aligns the perception-centric approach with the control-centric approach in RL.

Other resources and code examples

Most of the contents in this post were taken from my thesis and previous open sourced materials. In section 2.4.3 of the thesis, I gave an attempt to connect EFE and expected value and tried to understand the implicit assumptions made by EFE (although the connections were never validated). In section 2.4.5, I gave a brief overview of the history and neuroscience motivations behind active inference.

You can checkout my implementation of the old version (i.e., optimal control without cost) here and a slightly variation of the new version here (which implements the action oriented model paper). Both notebooks contain explanation of the formulations and experiments on concrete examples.

Appendix

Derivation for optimal Q(s_{1:T}) in optimal control without cost

We start by finding the derivative of F w.r.t. to a specific element of vector Q(sₜ), for t ≤ τ:

\(\begin{align} \nabla_{Q(s_t)}\mathcal{F}(o_{1:\tau}, Q) &= \nabla_{Q(s_t)}\mathbb{E}_{Q(s_t)}[\log Q(s_t)] -\nabla_{Q(s_t)}\mathbb{E}_{Q(s_t)}[\log P(o_t\vert s_t)] \\ &\quad - \nabla_{Q(s_t)}\mathbb{E}_{Q(s_{t-1})Q(s_t)}[\log P(s_t\vert s_{t-1})] - \nabla_{Q(s_t)}\mathbb{E}_{Q(s_{t})Q(s_{t+1})}[\log P(s_{t+1}\vert s_t)] \\ &= \log Q(s_t) + 1 - \log P(o_t\vert s_t) - \mathbb{E}_{Q(s_{t-1})}[\log P(s_t\vert s_{t-1})] - \mathbb{E}_{Q(s_{t+1})}[\log P(s_{t+1}\vert s_t)] \end{align}\)

Setting the derivative to zero, we have:

\(\log Q(s_t) \propto \log P(o_t\vert s_t) + \mathbb{E}_{Q(s_{t-1})}[\log P(s_t\vert s_{t-1})] + \mathbb{E}_{Q(s_{t+1})}[\log P(s_{t+1}\vert s_t)]\)

We know that for t > τ, we have not observed any oₜ, thus the first term doesn't exist and we represent it using an indicator function.

LatentObservations

Discussion about this post