Incorrect calculation of generalized advantage estimates in PPO #953

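The following code in `PPOAgent.compute_advantages` discards the value prediction for the final observation of each trajectory. Because `value_preds` is stripped before `final_value` is read, `value_preds[:, -1]` is the second-to-last prediction, which is also the last entry of `values`, so the same value is passed to `generalized_advantage_estimation` twice: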

```python
    # Arg value_preds was appended with final next_step value. Make tensors
    #   next_value_preds by stripping first and last elements respectively.
    value_preds = value_preds[:, :-1]
    if self._use_gae:
      advantages = value_ops.generalized_advantage_estimation(
          values=value_preds,
          final_value=value_preds[:, -1],
          rewards=rewards,
          discounts=discounts,
          td_lambda=self._lambda,
          time_major=False,
      )
```

Instead, `final_value` should be extracted before `value_preds` is stripped, e.g.:

```python
    final_value_preds = value_preds[:, -1]
    value_preds = value_preds[:, :-1]
    if self._use_gae:
      advantages = value_ops.generalized_advantage_estimation(
          values=value_preds,
          final_value=final_value_preds,
          rewards=rewards,
          discounts=discounts,
          td_lambda=self._lambda,
          time_major=False,
      )
```

Also, the comment refers to `next_value_preds`, which does not appear in the code, so it should be updated to describe what the slicing actually does.
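To show how much the slicing order matters, here is a minimal NumPy sketch. The `gae` helper is a hypothetical stand-in written for this issue, not the TF-Agents implementation; it assumes the usual recursion `delta_t = r_t + discount_t * V(s_{t+1}) - V(s_t)` and `A_t = delta_t + discount_t * lambda * A_{t+1}` with batch-major inputs. The numbers are made up for illustration.

```python
import numpy as np


def gae(values, final_value, rewards, discounts, td_lambda):
    """Batch-major GAE in plain NumPy (illustrative stand-in only).

    values, rewards, discounts have shape [B, T]; final_value has shape [B].
    """
    # V(s_{t+1}) for every step: shift values left, then append the bootstrap value.
    next_values = np.concatenate([values[:, 1:], final_value[:, None]], axis=1)
    deltas = rewards + discounts * next_values - values
    advantages = np.zeros_like(deltas)
    running = np.zeros(values.shape[0])
    # Accumulate the lambda-weighted sum of TD errors backwards in time.
    for t in reversed(range(values.shape[1])):
        running = deltas[:, t] + discounts[:, t] * td_lambda * running
        advantages[:, t] = running
    return advantages


# One trajectory with T = 3 steps; value_preds carries T + 1 entries,
# the last one being the prediction for the final observation.
value_preds = np.array([[1.0, 2.0, 3.0, 10.0]])
rewards = np.ones((1, 3))
discounts = np.full((1, 3), 0.9)

# Order used in the reported code: strip first, then read the "final" value.
stripped = value_preds[:, :-1]
buggy = gae(stripped, stripped[:, -1], rewards, discounts, td_lambda=0.95)

# Proposed order: read the final value first, then strip.
final_value_preds = value_preds[:, -1]
fixed = gae(value_preds[:, :-1], final_value_preds, rewards, discounts,
            td_lambda=0.95)

print("buggy:", buggy)
print("fixed:", fixed)
```

With these inputs the last-step advantage is `1 + 0.9 * 3 - 3 = 0.7` in the first case and `1 + 0.9 * 10 - 3 = 7.0` in the second, and the error propagates to earlier steps scaled by powers of `discount * lambda`.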

