Skip to content

Bug/pd np version#24

Open
jamesETsmith wants to merge 2 commits intodevelopfrom
bug/pd_np_version
Open

Bug/pd np version#24
jamesETsmith wants to merge 2 commits intodevelopfrom
bug/pd_np_version

Conversation

@jamesETsmith
Copy link
Copy Markdown
Collaborator

Description

I noticed in some local testing that the following tests were failing:

  • test_OnDiskDataset_heterogeneous[csv-True-True]
  • test_OnDiskDataset_heterogeneous[csv-True-False]
  • test_OnDiskDataset_heterogeneous[csv-False-True]
  • test_OnDiskDataset_heterogeneous[csv-False-False]
  • test_OnDiskDataset_preprocess_graph_with_single_type
  • test_record_stream_graph_negative

Details

The OnDiskDataset issues are caused by behavior in the pandas > 3 releases (which explains why they didn't show up in the last release of DGL. You can't modify values for certain things in place, so we just have to tweak two lines:

# Before (in-place, fails on read-only arrays):
src += node_type_offset[node_type_to_id[src_type]]
dst += node_type_offset[node_type_to_id[dst_type]]

# After (out-of-place, works regardless of read-only flag):
src = src + node_type_offset[node_type_to_id[src_type]]
dst = dst + node_type_offset[node_type_to_id[dst_type]]

The test_record_stream_graph_negative test, I'm less sure about and it seems like it's relying on the underlying allocator to reuse blocks of memory, which doesn't seem to happen for us in my local testing. I'm testing with the same dependencies we supported in the last release so I'm not sure that anything changed on the rocm side. It seems like a potentially flaky test, and I'm not sure it's worth our time to dig into. @GMNGeoffrey what do you think?

Test results

Quick summary of the test results:

571:[DGL-ROCm-CI/Build+Test]   | ============================= test session starts ==============================
580:[DGL-ROCm-CI/Build+Test]   | ============================ slowest 100 durations =============================
584:[DGL-ROCm-CI/Build+Test]   | ========================= 1 passed, 1 warning in 3.09s =========================
585:[DGL-ROCm-CI/Build+Test]   | ============================= test session starts ==============================
677:[DGL-ROCm-CI/Build+Test]   | ============================ slowest 100 durations =============================
778:[DGL-ROCm-CI/Build+Test]   | ================ 2729 passed, 205 skipped, 7 warnings in 37.17s ================
779:[DGL-ROCm-CI/Build+Test]   | ============================= test session starts ==============================
960:[DGL-ROCm-CI/Build+Test]   | ============================ slowest 100 durations =============================
1061:[DGL-ROCm-CI/Build+Test]   | ========= 7452 passed, 653 skipped, 5477 warnings in 817.97s (0:13:37) =========

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature]])
  • I've leverage the tools to beautify the python and c++ code.
  • The PR is complete and small, read the Google eng practice (CL equals to PR) to understand more about small PR. In DGL, we consider PRs with less than 200 lines of core code change are small (example, test and documentation could be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referred in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

…llocator behavior of the remote_stream CUDA impl which may differ from the ROCm impl.
@jamesETsmith jamesETsmith self-assigned this Mar 25, 2026
@jamesETsmith jamesETsmith added the bug Something isn't working label Mar 25, 2026
@jamesETsmith
Copy link
Copy Markdown
Collaborator Author

Just a nudge on this @GMNGeoffrey and @gcapodagAMD

Copy link
Copy Markdown
Collaborator

@GMNGeoffrey GMNGeoffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed this before

# Install basic tools
RUN apt-get update \
&& apt install -y clang-format-15 ninja-build tree
&& apt install -y clang-format-15 git ninja-build tree
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this related?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strangely enough, git was missing and installing the graphbolt deps would crash because hipcollections would pull in it's own deps like libhipcxx with git and crash. I'm not sure why this wasn't popping up before. Maybe I didn't catch it with #16?

Comment on lines 196 to 198
# g3 will reuse g2's memory block, resulting a wrong result
g3 = rand_graph(10, 20, device=F.ctx())
g3.create_formats_()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this looks like it's verifying that a bug exists or something... I agree it's probably not worth trying to replicate it, although it's not clear why it wasn't failing before...

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed about it not being clear. Something smells fishy, but I think it's small enough that we can skip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants