|
3 | 3 | "nbformat_minor": 0, |
4 | 4 | "metadata": { |
5 | 5 | "colab": { |
6 | | - "name": "RL_Solution.ipynb", |
| 6 | + "name": "RL.ipynb", |
7 | 7 | "provenance": [], |
8 | 8 | "collapsed_sections": [ |
9 | 9 | "jrI6q7RmWQam" |
|
28 | 28 | " Visit MIT Deep Learning</a></td>\n", |
29 | 29 | " <td align=\"center\"><a target=\"_blank\" href=\"https://colab.research.google.com/github/aamini/introtodeeplearning/blob/master/lab3/RL.ipynb\">\n", |
30 | 30 | " <img src=\"https://i.ibb.co/2P3SLwK/colab.png\" style=\"padding-bottom:5px;\" />Run in Google Colab</a></td>\n", |
31 | | - " <td align=\"center\"><a target=\"_blank\" href=\"https://github.com/aamini/introtodeeplearning/blob/master/lab3/solutions/RL_Solution.ipynb\">\n", |
| 31 | + " <td align=\"center\"><a target=\"_blank\" href=\"https://github.com/aamini/introtodeeplearning/blob/master/lab3/RL.ipynb\">\n", |
32 | 32 | " <img src=\"https://i.ibb.co/xfJbPmL/github.png\" height=\"70px\" style=\"padding-bottom:5px;\" />View Source on GitHub</a></td>\n", |
33 | 33 | "</table>\n", |
34 | 34 | "\n", |
|
232 | 232 | " # First Dense layer\n", |
233 | 233 | " tf.keras.layers.Dense(units=32, activation='relu'),\n", |
234 | 234 | "\n", |
235 | | - " # TODO: Define the last Dense layer, which will provide the network's output.\n", |
236 | | - " # Think about the space the agent needs to act in!\n", |
237 | | - " tf.keras.layers.Dense(units=n_actions, activation=None) # TODO\n", |
238 | | - " # [TODO Dense layer to output action probabilities]\n", |
| 235 | + " '''TODO: Define the last Dense layer, which will provide the network's output.\n", |
| 236 | + " # Think about the space the agent needs to act in!'''\n", |
| 237 | + " # [TODO: Dense layer]\n", |
239 | 238 | " ])\n", |
240 | 239 | " return model\n", |
241 | 240 | "\n", |
|
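For reference, the removed solution line shows the intended output layer: one unit per possible action with no activation, so the network emits raw logits over the action space. A minimal sketch of the completed model (CartPole has two discrete actions, so `n_actions = 2`):

    import tensorflow as tf

    def create_cartpole_model(n_actions=2):
        model = tf.keras.models.Sequential([
            # First Dense layer
            tf.keras.layers.Dense(units=32, activation='relu'),
            # Output layer: one raw logit per possible action (no activation)
            tf.keras.layers.Dense(units=n_actions, activation=None)
        ])
        return model
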
276 | 275 | " # add batch dimension to the observation if only a single example was provided\n", |
277 | 276 | " observation = np.expand_dims(observation, axis=0) if single else observation\n", |
278 | 277 | "\n", |
279 | | - " '''TODO: feed the observations through the model to predict the log probabilities of each possible action.'''\n", |
280 | | - " logits = model.predict(observation) # TODO\n", |
281 | | - " # logits = model.predict('''TODO''')\n", |
| 278 | + " '''TODO: feed the observations through the model to predict the log \n", |
| 279 | + " probabilities of each possible action.'''\n", |
| 280 | + " logits = model.predict('''TODO''')\n", |
282 | 281 | " \n", |
283 | 282 | " '''TODO: Choose an action from the categorical distribution defined by the log \n", |
284 | 283 | " probabilities of each possible action.'''\n", |
285 | | - " action = tf.random.categorical(logits, num_samples=1)\n", |
286 | | - " # action = ['''TODO''']\n", |
| 284 | + " action = ['''TODO''']\n", |
287 | 285 | "\n", |
288 | 286 | " action = action.numpy().flatten()\n", |
289 | 287 | "\n", |
|
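Per the removed solution lines, the observation is fed through the model to produce logits, and an action is sampled from the categorical distribution those logits define. A minimal single-observation sketch (the notebook's version also handles batched input):

    import numpy as np
    import tensorflow as tf

    def choose_action(model, observation):
        observation = np.expand_dims(observation, axis=0)  # add batch dimension
        logits = model.predict(observation)  # unnormalized log-probabilities per action
        # sample one action from the categorical distribution defined by the logits
        action = tf.random.categorical(logits, num_samples=1)
        return action.numpy().flatten()[0]
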
331 | 329 | " # Add observations, actions, rewards to memory\n", |
332 | 330 | " def add_to_memory(self, new_observation, new_action, new_reward): \n", |
333 | 331 | " self.observations.append(new_observation)\n", |
| 332 | + "\n", |
334 | 333 | " '''TODO: update the list of actions with new action'''\n", |
335 | | - " self.actions.append(new_action) # TODO\n", |
336 | | - " # ['''TODO''']\n", |
| 334 | + " # TODO: your update code here\n", |
| 335 | + "\n", |
337 | 336 | " '''TODO: update the list of rewards with new reward'''\n", |
338 | | - " self.rewards.append(new_reward) # TODO\n", |
339 | | - " # ['''TODO''']\n", |
| 337 | + " # TODO: your update code here\n", |
340 | 338 | "\n", |
341 | 339 | "# Helper function to combine a list of Memory objects into a single Memory.\n", |
342 | 340 | "# This will be very useful for batching.\n", |
|
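The removed solution lines simply append the new action and reward, mirroring how the observation is stored; the completed method reads:

    def add_to_memory(self, new_observation, new_action, new_reward):
        self.observations.append(new_observation)
        self.actions.append(new_action)
        self.rewards.append(new_reward)
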
363 | 361 | "source": [ |
364 | 362 | "## 3.3 Reward function\n", |
365 | 363 | "\n", |
366 | | - "We're almost ready to begin the learning algorithm for our agent! The next step is to compute the rewards of our agent as it acts in the environment. Since we (and the agent) is uncertain about if and when the game or task will end (i.e., when the pole will fall), it is useful to emphasize getting rewards **now** rather than later in the future -- this is the idea of discounting. This is a similar concept to discounting money in the case of interest. Recall from lecture, we use reward discount to give more preference at getting rewards now rather than later in the future. The idea of discounting rewards is similar to discounting money in the case of interest.\n", |
| 364 | + "We're almost ready to begin the learning algorithm for our agent! The next step is to compute the rewards of our agent as it acts in the environment. Since we (and the agent) is uncertain about if and when the game or task will end (i.e., when the pole will fall), it is useful to emphasize getting rewards **now** rather than later in the future -- this is the idea of discounting. This is a similar concept to discounting money in the case of interest. ecall from lecture, we use reward discount to give more preference at getting rewards now rather than later in the future. The idea of discounting rewards is similar to discounting money in the case of interest.\n", |
367 | 365 | "\n", |
368 | 366 | "To compute the expected cumulative reward, known as the **return**, at a given timestep in a learning episode, we sum the discounted rewards expected at that time step $t$, within a learning episode, and projecting into the future. We define the return (cumulative reward) at a time step $t$, $R_{t}$ as:\n", |
369 | 367 | "\n", |
|
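To make the discounting concrete: because R_t = r_t + gamma * R_{t+1}, the returns for a whole episode can be computed in a single reverse pass over the reward list. A minimal sketch; the helper name and the discount factor gamma = 0.95 are illustrative assumptions, not necessarily what the notebook uses:

    import numpy as np

    def discount_rewards(rewards, gamma=0.95):
        discounted = np.zeros_like(rewards, dtype=np.float32)
        R = 0.0
        for t in reversed(range(len(rewards))):
            # accumulate backwards: R_t = r_t + gamma * R_{t+1}
            R = rewards[t] + gamma * R
            discounted[t] = R
        return discounted
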
439 | 437 | "def compute_loss(logits, actions, rewards): \n", |
440 | 438 | " '''TODO: complete the function call to compute the negative log probabilities'''\n", |
441 | 439 | " neg_logprob = tf.nn.sparse_softmax_cross_entropy_with_logits(\n", |
442 | | - " logits=logits, labels=actions) # TODO\n", |
443 | | - " # neg_logprob = tf.nn.sparse_softmax_cross_entropy_with_logits(\n", |
444 | | - " # logits='''TODO''', labels='''TODO''')\n", |
| 440 | + " logits='''TODO''', labels='''TODO''')\n", |
445 | 441 | " \n", |
446 | 442 | " '''TODO: scale the negative log probability by the rewards'''\n", |
447 | | - " loss = tf.reduce_mean( neg_logprob * rewards ) # TODO\n", |
448 | | - " # loss = tf.reduce_mean('''TODO''')\n", |
| 443 | + " loss = tf.reduce_mean('''TODO''')\n", |
| 444 | + " \n", |
449 | 445 | " return loss" |
450 | 446 | ], |
451 | 447 | "execution_count": null, |
|
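The removed solution lines show the policy-gradient loss in full: the negative log-probability of each taken action, weighted by the reward it earned, averaged over the batch:

    def compute_loss(logits, actions, rewards):
        # negative log probability of each chosen action under the current policy
        neg_logprob = tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits=logits, labels=actions)
        # scale by rewards so well-rewarded actions are reinforced, and average
        loss = tf.reduce_mean(neg_logprob * rewards)
        return loss
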
474 | 470 | " logits = model(observations)\n", |
475 | 471 | "\n", |
476 | 472 | " '''TODO: call the compute_loss function to compute the loss'''\n", |
477 | | - " loss = compute_loss(logits, actions, discounted_rewards) # TODO\n", |
478 | | - " # loss = compute_loss('''TODO''', '''TODO''', '''TODO''')\n", |
| 473 | + " loss = compute_loss('''TODO''', '''TODO''', '''TODO''')\n", |
479 | 474 | "\n", |
480 | 475 | " '''TODO: run backpropagation to minimize the loss using the tape.gradient method'''\n", |
481 | | - " grads = tape.gradient(loss, model.trainable_variables) # TODO\n", |
482 | | - " # grads = tape.gradient('''TODO''', model.trainable_variables)\n", |
| 476 | + " grads = tape.gradient('''TODO''', model.trainable_variables)\n", |
483 | 477 | " optimizer.apply_gradients(zip(grads, model.trainable_variables))\n" |
484 | 478 | ], |
485 | 479 | "execution_count": null, |
|
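Assembled from the removed solution lines, the full training step computes the loss under the gradient tape and backpropagates it through the model's trainable variables:

    def train_step(model, optimizer, observations, actions, discounted_rewards):
        with tf.GradientTape() as tape:
            logits = model(observations)
            loss = compute_loss(logits, actions, discounted_rewards)
        # differentiate the loss w.r.t. the trainable weights and apply the update
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
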
696 | 690 | " # First, 32 5x5 filters and 2x2 stride\n", |
697 | 691 | " Conv2D(filters=32, kernel_size=5, strides=2),\n", |
698 | 692 | "\n", |
699 | | - " # TODO: define convolutional layers with 48 5x5 filters and 2x2 stride\n", |
700 | | - " Conv2D(filters=48, kernel_size=5, strides=2), # TODO\n", |
701 | | - " # Conv2D('''TODO'''),\n", |
| 693 | + " '''TODO: define convolutional layers with 48 5x5 filters and 2x2 stride'''\n", |
| 694 | + " # [your Conv layer here]\n", |
702 | 695 | "\n", |
703 | | - " # TODO: define two convolutional layers with 64 3x3 filters and 2x2 stride\n", |
704 | | - " Conv2D(filters=64, kernel_size=3, strides=2), # TODO\n", |
705 | | - " Conv2D(filters=64, kernel_size=3, strides=2),\n", |
706 | | - " # Conv2D('''TODO'''),\n", |
| 696 | + " '''TODO: define two convolutional layers with 64 3x3 filters and 2x2 stride'''\n", |
| 697 | + " # [your Conv layers here]\n", |
707 | 698 | "\n", |
708 | 699 | " Flatten(),\n", |
709 | 700 | " \n", |
710 | 701 | " # Fully connected layer and output\n", |
711 | 702 | " Dense(units=128, activation='relu'),\n", |
712 | | - " # TODO: define the output dimension of the last Dense layer. \n", |
713 | | - " # Pay attention to the space the agent needs to act in\n", |
714 | | - " Dense(units=n_actions, activation=None) # TODO\n", |
715 | | - " # Dense('''TODO''')\n", |
| 703 | + "\n", |
| 704 | + " '''TODO: define the output dimension of the last Dense layer. \n", |
| 705 | + " Pay attention to the space the agent needs to act in'''\n", |
| 706 | + " # [TODO: your Dense layer here]\n", |
716 | 707 | " \n", |
717 | 708 | " ])\n", |
718 | 709 | " return model\n", |
|
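Assembled from the removed solution lines, the completed Pong network stacks the convolutions exactly as the comments describe (48 5x5 filters, then two layers of 64 3x3 filters, all with stride 2) and ends with one logit per action:

    import tensorflow as tf
    from tensorflow.keras.layers import Conv2D, Flatten, Dense

    def create_pong_model(n_actions):
        model = tf.keras.models.Sequential([
            # First, 32 5x5 filters and 2x2 stride
            Conv2D(filters=32, kernel_size=5, strides=2),
            # 48 5x5 filters and 2x2 stride
            Conv2D(filters=48, kernel_size=5, strides=2),
            # Two convolutional layers with 64 3x3 filters and 2x2 stride
            Conv2D(filters=64, kernel_size=3, strides=2),
            Conv2D(filters=64, kernel_size=3, strides=2),
            Flatten(),
            # Fully connected layer and output
            Dense(units=128, activation='relu'),
            Dense(units=n_actions, activation=None)  # one raw logit per action
        ])
        return model
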
904 | 895 | "\n", |
905 | 896 | " '''TODO: determine the observation change.\n", |
906 | 897 | " Hint: this is the difference between the past two frames'''\n", |
907 | | - " frame_diff = mdl.lab3.pong_change(previous_frame, current_frame) # TODO\n", |
908 | | - " # frame_diff = # TODO\n", |
| 898 | + " frame_diff = # TODO\n", |
909 | 899 | "\n", |
910 | 900 | " '''TODO: choose an action for the pong model, using the frame difference, and evaluate'''\n", |
911 | | - " action = choose_action(model, frame_diff) # TODO \n", |
912 | | - " # action = # TODO\n", |
| 901 | + " action = # TODO\n", |
| 902 | + "\n", |
913 | 903 | " # Take the chosen action\n", |
914 | 904 | " next_observation, reward, done, info = env.step(action)\n", |
915 | 905 | "\n", |
916 | 906 | " '''TODO: save the observed frame difference, the action that was taken, and the resulting reward!'''\n", |
917 | | - " memory.add_to_memory(frame_diff, action, reward) # TODO\n", |
| 907 | + " memory.add_to_memory('''TODO''', '''TODO''', '''TODO''')\n", |
918 | 908 | "\n", |
919 | 909 | " previous_frame = current_frame\n", |
920 | 910 | " \n", |
|
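Collected from the removed solution lines, one step of the episode loop: compute the frame difference with the lab's `mdl.lab3.pong_change` helper, choose an action from it, step the environment, and store the transition:

    # inside the episode loop
    frame_diff = mdl.lab3.pong_change(previous_frame, current_frame)
    action = choose_action(model, frame_diff)
    next_observation, reward, done, info = env.step(action)
    memory.add_to_memory(frame_diff, action, reward)
    previous_frame = current_frame
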
1104 | 1094 | "* How does the complexity of Pong relative to CartPole alter the rate at which the agent learns and its performance? \n", |
1105 | 1095 | "* What are some things you could change about the agent or the learning process to potentially improve performance?\n", |
1106 | 1096 | "\n", |
1107 | | - "Try to optimize your **Pong** model and algorithm to achieve improved performance. **MIT students and affiliates will be eligible for prizes during the IAP offering.** To enter the competition, please [email us](mailto:introtodeeplearning-staff@mit.edu) with your name and the following:\n", |
| 1097 | + "Try to optimize your **Pong** model and algorithm to achieve improved performance. **MIT students and affiliates will be eligible for prizes during the IAP offering.** To enter the competition, MIT students and affiliates should upload the following to the course Canvas:\n", |
1108 | 1098 | "\n", |
1109 | 1099 | "* Jupyter notebook with the code you used to generate your results, **with the Pong training executed**;\n", |
1110 | 1100 | "* saved video of your Pong agent competing;\n", |
|