Johannes Karl Arnold

Markov Decision Problems


Markov Decision Processes consist of:

- a set of states \(s \in S\),
- a set of actions \(a \in A\),
- a transition function \(T(s, a, s^\prime)\), the probability of reaching state \(s^\prime\) when taking action \(a\) in state \(s\),
- a reward function \(R(s, a, s^\prime)\),
- a start state (and possibly a terminal state).

MDP quantities are

- the policy \(\pi(s)\): a choice of action for each state,
- the utility (value): the sum of rewards, discounted by a factor \(\gamma\) per step.

Value Iteration

Start with \(V_0(s) = 0\) and repeatedly apply the Bellman update until the values converge: \begin{align} V_{k+1} (s) = \max_a \sum_{s^\prime} T(s, a, s^\prime) \left[ R(s, a, s^\prime) + \gamma V_k (s^\prime) \right] \end{align}
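A minimal runnable sketch of value iteration on an invented two-state toy MDP; the state names, actions, rewards, and discount factor below are all assumptions for illustration, not part of the notes:

```python
GAMMA = 0.9  # assumed discount factor for this toy example

STATES = ["cool", "warm"]
ACTIONS = ["slow", "fast"]

# Transition model: T[(s, a)] = list of (s_prime, P(s_prime | s, a))
T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "slow"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "fast"): [("warm", 1.0)],
}

# Rewards R(s, a), independent of s_prime in this toy example
R = {("cool", "slow"): 1.0, ("cool", "fast"): 2.0,
     ("warm", "slow"): 1.0, ("warm", "fast"): -10.0}

def value_iteration(eps=1e-6):
    """Apply the Bellman update until no value changes by more than eps."""
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {
            s: max(
                sum(p * (R[(s, a)] + GAMMA * V[sp]) for sp, p in T[(s, a)])
                for a in ACTIONS
            )
            for s in STATES
        }
        if max(abs(V_new[s] - V[s]) for s in STATES) < eps:
            return V_new
        V = V_new

V_star = value_iteration()
```

For this particular MDP the iteration converges to \(V^*(\text{cool}) = 15.5\) and \(V^*(\text{warm}) = 14.5\): being cool is worth exactly one extra unit of reward per episode.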

Optimal Quantities

The optimal value \(V^* (s)\) is the expected utility when starting in \(s\) and acting optimally; the optimal \(q\)-value \(Q^* (s, a)\) is the expected utility when taking action \(a\) in state \(s\) and acting optimally thereafter. They satisfy \begin{align} V^* (s) &= \max_a Q^* (s, a)\\ Q^* (s, a) &= \sum_{s^\prime} T(s, a, s^\prime) \left[ R(s, a, s^\prime) + \gamma V^* (s^\prime) \right] \end{align}

Policy Extraction

Compute the policy given the optimal (\(q\)-)values: \begin{align} \pi^* (s) &= \argmax_a \sum_{s^\prime} T(s, a, s^\prime) \left[ R(s, a, s^\prime) + \gamma V^* (s^\prime) \right]\\ &= \argmax_a Q^* (s,a) \end{align}
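A sketch of this one-step lookahead in code, using an invented two-state toy MDP; the states, rewards, discount factor, and the precomputed optimal values are all assumptions for illustration:

```python
GAMMA = 0.9  # assumed discount factor
ACTIONS = ["slow", "fast"]

# Toy transition model: T[(s, a)] = list of (s_prime, P(s_prime | s, a))
T = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "slow"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "fast"): [("warm", 1.0)],
}

# Toy rewards R(s, a), independent of s_prime here
R = {("cool", "slow"): 1.0, ("cool", "fast"): 2.0,
     ("warm", "slow"): 1.0, ("warm", "fast"): -10.0}

# Optimal values for this toy MDP (the exact fixed point of value iteration)
V_star = {"cool": 15.5, "warm": 14.5}

def extract_policy(V):
    """pi(s) = argmax_a  sum_{s'} T(s, a, s') [R(s, a) + gamma * V(s')]."""
    return {
        s: max(
            ACTIONS,
            key=lambda a: sum(p * (R[(s, a)] + GAMMA * V[sp])
                              for sp, p in T[(s, a)]),
        )
        for s in ("cool", "warm")
    }

pi_star = extract_policy(V_star)
```

Note that only an expectation over next states is computed per action; extraction is a single sweep, much cheaper than value iteration itself.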

Example: a statistical policy for when to keep hitting and when to stand in a game of blackjack


  1. “Value” and “utility” are typically regarded as synonyms. ↩︎