
Dynamic Programming for Discrete-Time Systems

This document presents the mathematical framework of dynamic programming for discrete-time optimal control problems. We derive the recursive Bellman equations that form the foundation of optimal control theory.

Discrete-Time System Formulation

System Dynamics

Consider a general discrete-time system:

x_{k+1} = f(x_k, u_k)

Where:

  • x_k \in \mathbb{R}^n: system state at time k
  • u_k \in \mathbb{R}^m: control input at time k
  • f(\cdot, \cdot): system dynamics (linear or nonlinear)

Performance Function

The general optimal control objective is to minimize:

J = h(x_N) + \sum_{k=0}^{N-1} g(x_k, u_k)

Where:

  • h(x_N): terminal cost function
  • g(x_k, u_k): stage cost function
  • x_0: initial state (given)
  • x_N: terminal state
  • N: finite time horizon

Control Task

Find the optimal control sequence:

u_0^*, u_1^*, \ldots, u_{N-1}^*

that minimizes the performance function J subject to the system dynamics and constraints.
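To make the formulation concrete, the sketch below encodes a hypothetical toy instance of this problem class in Python. The scalar dynamics, quadratic costs, and horizon are illustrative assumptions, not part of the formulation above; they only make f, g, and h explicit.

```python
import numpy as np

# Hypothetical toy instance: scalar linear dynamics with quadratic costs.
N = 5          # finite time horizon
x0 = 2.0       # given initial state

def f(x, u):
    """System dynamics x_{k+1} = f(x_k, u_k)."""
    return x + u

def g(x, u):
    """Stage cost g(x_k, u_k)."""
    return x**2 + u**2

def h(x):
    """Terminal cost h(x_N)."""
    return 5.0 * x**2

def total_cost(x0, u_seq):
    """Evaluate J = h(x_N) + sum_k g(x_k, u_k) for a candidate control sequence."""
    x, J = x0, 0.0
    for u in u_seq:
        J += g(x, u)
        x = f(x, u)
    return J + h(x)

print(total_cost(x0, np.zeros(N)))   # cost of the do-nothing sequence
```

Dynamic programming avoids searching over all control sequences directly; the backward recursion derived next reuses optimal costs-to-go instead.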

Backward Multi-Stage Dynamic Programming

The key insight of dynamic programming is to solve the problem backwards in time, starting from the final stage and working towards the initial time.

Stage N: Terminal Stage (k = N)

At the terminal stage, there are no more control decisions to make:

J_{N \to N}(x_N) = h(x_N)

Since this is the final stage:

J_{N \to N}^*(x_N) = h(x_N)

The optimal cost-to-go from the terminal stage is simply the terminal cost.

Stage N-1: One Step to Go (k = N-1 \to N)

For the second-to-last stage, we consider both the immediate cost and the terminal cost:

\begin{aligned} J_{N-1 \to N}(x_N, x_{N-1}, u_{N-1}) &= h(x_N) + \sum_{k=N-1}^{N-1} g(x_k, u_k) \\ &= J_{N \to N}(x_N) + g(x_{N-1}, u_{N-1}) \end{aligned}

Since x_N = f(x_{N-1}, u_{N-1}), we can express the cost in terms of the decision variables:

J_{N-1 \to N}(x_{N-1}, u_{N-1}) = J_{N \to N}^*(f(x_{N-1}, u_{N-1})) + g(x_{N-1}, u_{N-1})

Optimal cost-to-go:

J_{N-1 \to N}^*(x_{N-1}) = \min_{u_{N-1}} \left[ J_{N \to N}^*(f(x_{N-1}, u_{N-1})) + g(x_{N-1}, u_{N-1}) \right]

Optimal control: When the cost is differentiable and the minimum is attained at an interior (unconstrained) point, the optimal control u_{N-1}^* satisfies the first-order condition:

\frac{\partial J_{N-1 \to N}(x_{N-1}, u_{N-1})}{\partial u_{N-1}} = 0
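As a hedged scalar illustration (assumed dynamics and weights, not part of the original derivation), take x_{k+1} = a x_k + b u_k, g(x, u) = q x^2 + r u^2, and h(x) = s x^2 with r, s > 0. The first-order condition then reads

\frac{\partial}{\partial u_{N-1}} \left[ s (a x_{N-1} + b u_{N-1})^2 + q x_{N-1}^2 + r u_{N-1}^2 \right] = 2 s b (a x_{N-1} + b u_{N-1}) + 2 r u_{N-1} = 0

u_{N-1}^* = -\frac{s a b}{r + s b^2} \, x_{N-1}

so the optimal one-step-to-go control is linear in the state, foreshadowing the LQR example later in this document.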

Stage N-2: Two Steps to Go (k = N-2 \to N)

For the third-to-last stage:

\begin{aligned} &J_{N-2 \to N}(x_N, x_{N-1}, x_{N-2}, u_{N-1}, u_{N-2}) \\ &= h(x_N) + g(x_{N-1}, u_{N-1}) + g(x_{N-2}, u_{N-2}) \end{aligned}

This can be rewritten as:

\begin{aligned} &J_{N-2 \to N}(x_{N-2}, u_{N-1}, u_{N-2}) \\ &= J_{N-1 \to N}(x_{N-1}, u_{N-1}) + g(x_{N-2}, u_{N-2}) \end{aligned}

Key insight from Bellman's principle: Whatever the states and controls were before stage N-2, the remaining control sequence from stage N-2 to N must be optimal. Therefore, u_{N-1}^* is already determined (as a function of x_{N-1}) from the previous stage, and only u_{N-2} remains to be chosen.

\begin{aligned} &J_{N-2 \to N}^*(x_{N-2}) \\ &= \min_{u_{N-2}} \left[ J_{N-1 \to N}^*(f(x_{N-2}, u_{N-2})) + g(x_{N-2}, u_{N-2}) \right] \end{aligned}

Optimal control:

\frac{\partial J_{N-2 \to N}(x_{N-2}, u_{N-2})}{\partial u_{N-2}} = 0 \Rightarrow u_{N-2}^*

General Bellman Recursion

Bellman Equation

For any stage k = N-j (with j = 1, 2, \ldots, N, i.e., k = N-1, N-2, \ldots, 0):

J_k^*(x_k) = \min_{u_k} \left[ J_{k+1}^*(f(x_k, u_k)) + g(x_k, u_k) \right]

With boundary condition:

J_N^*(x_N) = h(x_N)

Optimal Control Policy

The optimal control at each stage is:

u_k^*(x_k) = \arg\min_{u_k} \left[ J_{k+1}^*(f(x_k, u_k)) + g(x_k, u_k) \right]

Value Function Interpretation

  • J_k^*(x_k): Value function - minimum cost-to-go from state x_k at time k
  • u_k^*(x_k): Policy function - optimal control as a function of the current state

Properties of Dynamic Programming

1. Principle of Optimality

The Bellman equation embodies the principle of optimality:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

2. Backward Induction

The solution proceeds backward in time:

  1. Start with the terminal condition at k = N
  2. Solve backward for k = N-1, N-2, \ldots, 1, 0
  3. At each stage, use the previously computed value function J_{k+1}^*

3. Markov Property

The optimal control depends only on the current state, not on the path taken to reach that state:

u_k^* = u_k^*(x_k)

Algorithm Summary

Step 1: Initialize the terminal condition:

J_N^*(x_N) = h(x_N)

Step 2: For k = N-1, N-2, \ldots, 0, compute:

J_k^*(x_k) = \min_{u_k} \left[ J_{k+1}^*(f(x_k, u_k)) + g(x_k, u_k) \right]

u_k^*(x_k) = \arg\min_{u_k} \left[ J_{k+1}^*(f(x_k, u_k)) + g(x_k, u_k) \right]

Step 3: Forward simulation using the optimal policy. Starting from x_0, apply:

u_k = u_k^*(x_k), \quad x_{k+1} = f(x_k, u_k^*(x_k))
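A minimal numerical sketch of these three steps, assuming a discretized scalar toy problem (the grids, dynamics, and costs are illustrative, and the value function is interpolated crudely by the nearest grid point):

```python
import numpy as np

# Illustrative discretization (all numerical choices are assumptions).
N = 5
x_grid = np.linspace(-3.0, 3.0, 61)     # discretized state space
u_grid = np.linspace(-1.0, 1.0, 21)     # discretized control space

f = lambda x, u: np.clip(x + u, x_grid[0], x_grid[-1])   # dynamics (kept on the grid)
g = lambda x, u: x**2 + u**2                             # stage cost
h = lambda x: 5.0 * x**2                                 # terminal cost
nearest = lambda x: np.abs(x_grid - x).argmin()          # crude interpolation

# Step 1: terminal condition J_N^*(x) = h(x)
J = np.zeros((N + 1, len(x_grid)))
J[N] = h(x_grid)
policy = np.zeros((N, len(x_grid)), dtype=int)           # index into u_grid

# Step 2: backward recursion J_k^*(x) = min_u [ J_{k+1}^*(f(x, u)) + g(x, u) ]
for k in range(N - 1, -1, -1):
    for i, x in enumerate(x_grid):
        costs = [J[k + 1, nearest(f(x, u))] + g(x, u) for u in u_grid]
        policy[k, i] = int(np.argmin(costs))
        J[k, i] = costs[policy[k, i]]

# Step 3: forward simulation from x_0 using the tabulated optimal policy
x = 2.0
for k in range(N):
    u = u_grid[policy[k, nearest(x)]]
    print(f"k={k}  x={x:+.3f}  u={u:+.3f}")
    x = f(x, u)
print(f"k={N}  x={x:+.3f}  (terminal)")
```

The nested loops over the state and control grids are where the curse of dimensionality discussed below appears: the work per stage grows with the product of the grid sizes.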

Examples of Cost Functions

1. Quadratic Cost (LQR)

System: x_{k+1} = A x_k + B u_k

Cost:

J = x_N^T S x_N + \sum_{k=0}^{N-1} (x_k^T Q x_k + u_k^T R u_k)

Bellman equation:

J_k^*(x_k) = \min_{u_k} \left[ J_{k+1}^*(A x_k + B u_k) + x_k^T Q x_k + u_k^T R u_k \right]
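Substituting the quadratic ansatz J_k^*(x) = x^T P_k x into this Bellman equation gives the standard finite-horizon Riccati recursion and a linear feedback law u_k^* = -K_k x_k. The sketch below implements that recursion; the double-integrator matrices and weights are hypothetical values chosen for illustration.

```python
import numpy as np

def finite_horizon_lqr(A, B, Q, R, S, N):
    """Backward Riccati recursion under the ansatz J_k^*(x) = x^T P_k x."""
    P = [None] * (N + 1)
    K = [None] * N
    P[N] = S                                           # terminal condition P_N = S
    for k in range(N - 1, -1, -1):
        BtP = B.T @ P[k + 1]
        K[k] = np.linalg.solve(R + BtP @ B, BtP @ A)   # feedback gain K_k
        P[k] = Q + A.T @ P[k + 1] @ (A - B @ K[k])     # cost-to-go matrix P_k
    return P, K

# Hypothetical discretized double integrator (assumed values).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q, R, S = np.eye(2), np.array([[0.1]]), 10.0 * np.eye(2)

P, K = finite_horizon_lqr(A, B, Q, R, S, N=20)
x = np.array([[1.0], [0.0]])
for k in range(20):
    u = -K[k] @ x                   # u_k^* = -K_k x_k
    x = A @ x + B @ u
print("final state:", x.ravel())
```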

2. Minimum Time

Cost: J = N (number of steps)

Stage cost: g(x_k, u_k) = 1

Terminal cost: h(x_N) = 0

3. Minimum Energy

Cost:

J = \sum_{k=0}^{N-1} u_k^T R u_k

Stage cost: g(x_k, u_k) = u_k^T R u_k

Terminal cost: h(x_N) = 0

Computational Considerations

Advantages

  1. Global optimality: Guaranteed to find global optimum for the discretized problem
  2. Systematic approach: Provides complete optimal policy
  3. Handles nonlinearity: Works for nonlinear systems
  4. Constraint handling: Can incorporate state and input constraints

Limitations

  1. Curse of dimensionality: Computational complexity grows exponentially with state dimension
  2. Discretization: Continuous problems must be discretized
  3. Storage requirements: Must store value function over entire state space
  4. Offline computation: Value function must be computed before implementation

Extensions

1. Stochastic Dynamic Programming

For systems affected by a random disturbance w_k:

x_{k+1} = f(x_k, u_k, w_k)

The Bellman equation becomes:

J_k^*(x_k) = \min_{u_k} \left[ \mathbb{E}_{w_k}\left[ J_{k+1}^*(f(x_k, u_k, w_k)) \right] + g(x_k, u_k) \right]
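A brief sketch of the only change this makes to the backward recursion, assuming a simple discrete disturbance distribution (the values and probabilities are illustrative):

```python
import numpy as np

# Assumed discrete disturbance model: w_k in w_vals with probabilities w_prob.
w_vals = np.array([-0.2, 0.0, 0.2])
w_prob = np.array([0.25, 0.5, 0.25])

N = 5
x_grid = np.linspace(-3.0, 3.0, 61)
u_grid = np.linspace(-1.0, 1.0, 21)
f = lambda x, u, w: np.clip(x + u + w, x_grid[0], x_grid[-1])
g = lambda x, u: x**2 + u**2
h = lambda x: 5.0 * x**2
nearest = lambda x: np.abs(x_grid - x).argmin()

J = np.zeros((N + 1, len(x_grid)))
J[N] = h(x_grid)
for k in range(N - 1, -1, -1):
    for i, x in enumerate(x_grid):
        # Expected cost-to-go E_w[J_{k+1}^*(f(x, u, w))] replaces the deterministic term.
        costs = [
            sum(p * J[k + 1, nearest(f(x, u, w))] for w, p in zip(w_vals, w_prob)) + g(x, u)
            for u in u_grid
        ]
        J[k, i] = min(costs)
```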

2. Infinite Horizon

For N \to \infty, the value function becomes stationary:

J^*(x) = \min_u \left[ J^*(f(x, u)) + g(x, u) \right]
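Iterating this stationary equation as a fixed-point update is value iteration. A minimal sketch on an assumed toy grid follows; a discount factor gamma < 1 is added so the iteration converges, which is a modification of the undiscounted equation above:

```python
import numpy as np

# Assumed toy problem and grids; gamma < 1 added for convergence.
gamma = 0.95
x_grid = np.linspace(-3.0, 3.0, 61)
u_grid = np.linspace(-1.0, 1.0, 21)
f = lambda x, u: np.clip(x + u, x_grid[0], x_grid[-1])
g = lambda x, u: x**2 + u**2
nearest = lambda x: np.abs(x_grid - x).argmin()

J = np.zeros(len(x_grid))
for it in range(1000):
    # One Bellman update: J(x) <- min_u [ gamma * J(f(x, u)) + g(x, u) ]
    J_new = np.array([
        min(gamma * J[nearest(f(x, u))] + g(x, u) for u in u_grid)
        for x in x_grid
    ])
    if np.max(np.abs(J_new - J)) < 1e-6:    # value function has become stationary
        J = J_new
        break
    J = J_new
```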

3. Approximate Dynamic Programming

Use function approximation to handle high-dimensional problems:

J_k^*(x_k) \approx \hat{J}_k(x_k, \theta_k)

where \theta_k are parameters to be learned.
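One common realization is fitted value iteration: perform Bellman backups at sampled states and fit the parameters by least squares. The sketch below uses a stationary, discounted variant with quadratic features; the features, discount factor, and sampling range are all illustrative assumptions, not a method prescribed here.

```python
import numpy as np

rng = np.random.default_rng(0)
u_grid = np.linspace(-1.0, 1.0, 21)
f = lambda x, u: x + u
g = lambda x, u: x**2 + u**2

phi = lambda x: np.array([1.0, x, x**2])    # features: J_hat(x, theta) = phi(x) @ theta
theta = np.zeros(3)

for it in range(50):
    xs = rng.uniform(-3.0, 3.0, size=200)   # sampled states
    targets = np.array([                    # Bellman backups evaluated with J_hat
        min(g(x, u) + 0.95 * phi(f(x, u)) @ theta for u in u_grid)
        for x in xs
    ])
    Phi = np.stack([phi(x) for x in xs])
    theta, *_ = np.linalg.lstsq(Phi, targets, rcond=None)   # least-squares fit of theta
```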

Connection to Other Methods

1. Model Predictive Control (MPC)

MPC repeatedly solves a finite-horizon optimal control problem of this form over a receding horizon, applying only the first control of each solution.

2. Reinforcement Learning

RL algorithms such as Q-learning approximate the dynamic programming recursion when the model f is unknown, estimating value functions from sampled transitions.

3. Pontryagin's Maximum Principle

For continuous-time problems, PMP provides necessary conditions for optimality that are closely related to the Hamilton-Jacobi-Bellman equation, the continuous-time counterpart of the Bellman recursion.

References

  1. DR_CAN. Optimal Control.
  2. Wang, T. (2023). 控制之美 (Vol. 2). Tsinghua University Press.
  3. Bellman, R. (1957). Dynamic Programming. Princeton University Press.
  4. Bertsekas, D. P. (2017). Dynamic Programming and Optimal Control (4th ed.). Athena Scientific.
  5. Rakovic, S. V., & Levine, W. S. (2018). Handbook of Model Predictive Control.