
Reduce backward memory to ~half #218

Open
jasam-sheja wants to merge 5 commits into mapillary:main from jasam-sheja:inplace_opt

Conversation


@jasam-sheja jasam-sheja commented Dec 10, 2021

Reuse the grad and input tensors in the backward pass instead of creating new ones.
Mainly, reuse y_act for xhat and dy_act for dy.
Ensure every function supports in-place operation (ELU is modified accordingly).
Ensure the tensors allow in-place operation (dy_act has to be contiguous).

Needs more testing. However, there are no unit tests.

- reuse y_act_ and dy_act_
- use in-place calculations in `forward_cpu` and `backward_cpu`
- make sure dy_act has no overlapping memory
- reflect the in-place operations in the docs and comments
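The buffer-reuse idea above can be sketched in pure Python (a hypothetical scalar version for illustration; the real code operates on CHW tensors and also inverts the activation in this step):

```python
def backward_reduce_inplace(y_act, dy_act, gamma, beta):
    """Sketch of the in-place backward-reduce step.

    Instead of allocating two fresh buffers for xhat and dy, overwrite
    y_act with xhat and dy_act with dy, roughly halving the temporary
    memory needed by the backward pass.
    """
    sum_dy = 0.0
    sum_xhat_dy = 0.0
    for i in range(len(y_act)):
        xhat = (y_act[i] - beta) / gamma   # invert the affine part of BN
        dy = dy_act[i]                     # identity activation for brevity
        y_act[i] = xhat                    # reuse the y_act buffer for xhat
        dy_act[i] = dy                     # reuse the dy_act buffer for dy
        sum_dy += dy
        sum_xhat_dy += xhat * dy
    return y_act, dy_act, sum_dy, sum_xhat_dy
```

The returned xhat/dy are the same objects as the inputs, which is exactly why downstream code must not assume y_act and dy_act are left untouched.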
Copilot AI review requested due to automatic review settings March 18, 2026 08:23

Copilot AI left a comment


Pull request overview

This PR reduces memory usage during the backward pass of InPlaceABN by reusing existing activation/gradient buffers (overwriting y_act with xhat and dy_act with dy) and adjusting code paths to support in-place behavior (including an ELU backward tweak and ensuring dy_act is contiguous).

Changes:

  • Reuse y_act/dy_act as xhat/dy in backward-reduce (CPU/CUDA) to cut temporary allocations.
  • Switch several CPU forward/backward intermediate computations to in-place ops to reduce transient allocations.
  • Update Python/C++ binding notes and make dy_act contiguous to support in-place writes.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

File summary:

  • src/inplace_abn_cuda.cu — Reuses y_act/dy_act buffers for backward outputs in the CUDA implementation.
  • src/inplace_abn_cpu.cpp — Reuses y_act_/dy_act_ buffers and increases in-place usage for CPU forward/backward.
  • src/inplace_abn.cpp — Updates the pybind docstring to mention in-place behavior for backward_reduce.
  • inplace_abn/functions.py — Makes dy_act contiguous and notes that backward-reduce overwrites tensors in-place.
  • include/inplace_abn.h — Adjusts ELU backward ordering to support in-place overwrite safely.
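The ELU ordering point can be illustrated with a scalar sketch (hypothetical names; the real code works on tensors). With y = elu(x), the gradient can be expressed from the output alone, but it must be read from y before y is overwritten by the inverted activation:

```python
import math

def elu_backward_inplace(y, dy, alpha=1.0):
    """Sketch of read-before-write ordering for in-place ELU backward.

    elu'(x) = 1 for y >= 0, else y + alpha (using y = elu(x)).
    Because the buffers are reused, dx must be computed from y BEFORE
    y is replaced by the inverted activation; doing these two steps in
    the other order would read an already-overwritten value.
    """
    dy = dy * (1.0 if y >= 0.0 else y + alpha)        # read y first
    y = y if y >= 0.0 else math.log(y / alpha + 1.0)  # then invert elu
    return y, dy
```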
Comments suppressed due to low confidence (1)

src/inplace_abn_cuda.cu:156

  • y_act/dy_act are being reused as xhat/dy, but the CUDA kernel is launched with at::RestrictPtrTraits for both input and output accessors. When y_act_accessor aliases xhat_accessor (and dy_act_accessor aliases dy_accessor), this violates the __restrict__ aliasing assumption and can lead to miscompilation/incorrect results. To support in-place reuse safely, use non-restrict pointer traits (e.g., at::DefaultPtrTraits) for these accessors / kernel params, or keep separate output tensors (or provide a separate non-restrict kernel for the in-place path).
  auto &xhat = y_act; // reuse
  auto &dy = dy_act; // reuse
  auto sum_dy = at::empty({chn}, acc_options);
  auto sum_xhat_dy = at::empty({chn}, acc_options);

  // Make accessors
  auto y_act_accessor = y_act.packed_accessor<scalar_t, 3, at::RestrictPtrTraits, index_t>();
  auto dy_act_accessor = dy_act.packed_accessor<scalar_t, 3, at::RestrictPtrTraits, index_t>();
  auto xhat_accessor = xhat.packed_accessor<scalar_t, 3, at::RestrictPtrTraits, index_t>();
  auto dy_accessor = dy.packed_accessor<scalar_t, 3, at::RestrictPtrTraits, index_t>();
  auto weight_accessor = packed_accessor_or_dummy<prmscalar_t, 1, at::RestrictPtrTraits, index_t>(weight);
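The aliasing hazard can be shown with a plain-Python analogue (conceptual only, not CUDA: `__restrict__` lets the compiler assume buffers do not overlap, so passing the same tensor as both input and output silently breaks that contract):

```python
def running_sum(src, dst):
    """dst[i] = src[i-1] + src[i]; written assuming src and dst are
    distinct buffers -- the same promise __restrict__ makes to the
    compiler. Calling it with src is dst gives different results."""
    dst[0] = src[0]
    for i in range(1, len(src)):
        dst[i] = src[i - 1] + src[i]

a, out = [1, 2, 3, 4], [0, 0, 0, 0]
running_sum(a, out)   # distinct buffers: out == [1, 3, 5, 7]

b = [1, 2, 3, 4]
running_sum(b, b)     # aliased call reads already-overwritten values:
                      # b == [1, 3, 6, 10], not [1, 3, 5, 7]
```

In CUDA the failure mode is worse: with `__restrict__` the compiler may reorder or cache loads across the aliased stores, so even kernels whose read/write order looks safe can miscompile.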


Comment thread src/inplace_abn_cpu.cpp
Comment on lines 25 to 35
std::tuple<at::Tensor, at::Tensor, at::Tensor, at::Tensor> backward_reduce_impl(
const at::Tensor& y_act_,
const at::Tensor& dy_act_,
const std::optional<at::Tensor>& weight_,
const std::optional<at::Tensor>& bias_,
float eps,
float activation_param) {
  // Initialize output tensors
- auto xhat_ = at::empty_like(y_act_);
- auto dy_ = at::empty_like(y_act_);
+ auto &xhat_ = y_act_; // reuse
+ auto &dy_ = dy_act_; // reuse
  auto sum_dy_ = at::zeros({y_act_.size(1)}, y_act_.options());
Comment thread inplace_abn/functions.py

# Call backward_reduce if we need to compute at least one of the gradients
if any(ctx.needs_input_grad):
# remove memory overlapping to allow for in-place operation
Comment thread src/inplace_abn.cpp

  // Backward methods
- m.def("backward_reduce", &backward_reduce, "First step of the backward pass");
+ m.def("backward_reduce", &backward_reduce, "First step of the backward pass. This is an in-place operation w.r.t. y_act, dy_act,");
Comment thread inplace_abn/functions.py
Comment on lines 117 to 125
y_act, var, count, weight, bias = ctx.saved_tensors

# Call backward_reduce if we need to compute at least one of the gradients
if any(ctx.needs_input_grad):
# remove memory overlapping to allow for in-place operation
dy_act = dy_act.contiguous()
# This overwrites y_act with xhat and dy_act with dy
xhat, dy, sum_dy_local, sum_xhat_dy_local = _backend.backward_reduce(
y_act,
dy_act,
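The `dy_act.contiguous()` call guards against incoming gradients that share storage (for example, tensors produced by broadcasting/`expand`), since in-place writes to overlapping memory are unsafe. A rough pure-Python analogue (hypothetical: real tensors overlap via strides, not shared list objects):

```python
# Three "rows" backed by one shared buffer, like an expand()ed tensor.
row = [1.0, 2.0, 3.0]
grad = [row, row, row]

grad[0][0] = 0.0                  # an in-place write to one row...
assert grad[1][0] == 0.0          # ...silently changes the other rows

# Analogue of .contiguous(): give every row its own storage first.
safe = [list(r) for r in grad]
safe[0][1] = 9.0
assert safe[1][1] == 2.0          # other rows are now unaffected
```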

3 participants