Week 11: Exercises

1. Simple Environment

Consider the following simple environment:

An agent starts in state S. At every time step they have two choices: to move up or down. If they enter state ST1 or state, the episode ends. They agent receives a reward of 1 if it enters state ST2 ; otherwise there are no rewards.

a) Consider three possible discount rates γ = 1, γ = 0, γ = 0.5. For each discount rate, what is the value of the optimal state-action function Q*(s, a) for each state and action? Which discount rate(s) will lead to a state-action function Q* that is useful in practice?

b) Choose the discount rate γ = 0.5, and suppose that we use tabular Q-learning to attempt to learn an optimal policy for this environment. We will initially set all Q-values to 0. We'll use a greedy policy for exploring the environment, breaking any ties by moving downward. We'll use a learning rate α = 1. Will the agent successfully learn? If so, after how many episodes will it learn an optimal policy?

c) Now consider the same policy but breaking any ties by moving upward. Will the agent successfully learn? If so, after how many episodes will it learn an optimal policy?

d) Consider the same policy, but breaking any ties by randomly upward or downward. What is the expected number of episodes after which the agent will learn an optimal policy?

2. Blackjack

In some casinos you can play the card game Blackjack. The goal of the game is to score as many points as possible without exceeding 21. Every face card (Jack, Queen, King) counts as 10. An Ace may count as 1 or 11.

The game is played against a dealer with a fixed strategy. At the beginning the dealer has two cards, one face up, one face down. The player also receives two cards initially. After that, the player may receive more cards (hit) until they either stick (decide to stop taking cards) or go bust (exceeding 21). If the player goes bust, he loses. Otherwise, the dealer takes more cards: according to his fixed strategy, he sticks if his sum is 17 or greater (counting an Ace as 11 unless it would bring the sum over 21), and otherwise takes another card. If the dealer busts, he loses. Otherwise the game is a win, lose, or draw according to the final sums of the dealer's and player's cards.

We will assume that cards are sampled from an infinite deck without replacement, so card counting is not possible.

a) Consider a game of Blackjack as an episode in an environment, with rewards of +1, 0, or -1 for a win, a draw, or a loss. Is the game a finite Markov decision process? Is it deterministic or stochastic?

b) Is the game fully observable? If not, can we transform it into an equivalent stochastic game that is fully observable?

c) Here is an implementation of Blackjack in about 60 lines of Python. Study the code to understand how it works. Which do you think will perform better - a strategy that always sticks, or a strategy that randomly sticks or hits with a 50% probability of each? Run the code to find out.

d) Suppose we wish to use tabular Q-learning to learn an optimal policy for the game. How many states exist? (Do not forget to include a final state in which the game has ended.) How many state-action pairs exist?

e) Write a function encode(cards: list[int], dealer: int) that takes a list of the cards the player is holding plus the dealer's visible card, and returns an encoded state that we can use for learning.

f) Write a function sars(states, actions: list[bool], outcome: int) that takes a list of states, a list of actions, and an outcome, all returned by the game() function. The function should return a list of experience tuples (s, a, r, s'), where s and s' are encoded states, a is an action, and r is a reward that was received.

g) Write a function train(episodes: int) that learns a policy using tabular Q-learning, training for the given number of episodes. After training, the function should call tournament() to play a tournament with 10,000 games to evaluate the policy that was learned. Try to learn a policy that will win against the dealer at least 45% of the time. You will need to make various decisions: how will you initialize the Q-values? What exploration policy, discount rate, and learning rate will you use? Will the learning rate change over time? Hint: Write a helper function q_strategy(q) that takes a table of Q-values and returns a strategy that uses those Q-values to select moves.

3. Deep Q-Networks in Classic Control Environments

a) Gymnasium includes a set of classic control environments for reinforcement learning experiments. Examine these environments. Which of these might DQN be able to solve? Which of them might be easiest or most difficult? Why?

b) The CleanRL library contains implementations of DQN and other reinforcement learning algorithms. Clone this library to your machine, and follow the installation instructions to install the necessary Python packages.

Inside CleanRL, dqn.py implements DQN in about 250 lines of code. I have simplified this code a bit to reduce it to about 160 lines in dqn_simple.py. Download this file.

Run dqn_simple.py to attempt to solve the classic control environments. Use Tensorboard to view statistics about the training runs, and watch videos of the results. Which environments is DQN able to solve?

c) How many hidden layers are there in the neural network in dqn_simple.py? What is the total number of parameters (weights) in the neural network?

d) What are the precise meanings of td_loss, q_values, and SPS in the Tensorboard statistics? Explain the observed shape of the curves of these statistics.

e) Modify the code to remove a layer from the neural network. How does this affect learning?

f) Modify the code so that a target network is not used. How does this affect learning?

4. Deep Q-Networks for Atari Games

a) Gymnasium includes many Atari games. Choose a few of these games and play them using the Stella Atari emulator. Of the games you chose, which do you think might be easiest or most difficult for DQN to learn to play? Why?

b) Choose one of these games and start a training run using CleanRL's dqn_atari.py on one of the machines in our Gpulab high-performance cluster. A full training run might take many hours. Is DQN able to make any progress at all in training during the first 5 or 10 minutes?

c) Study the differences between dqn.py and dqn_atari.py. How is dqn_atari.py specialized for Atari games, exactly?