Skip to content

Feature: DeltaSpin for LCAO and PW base and DFTU for PW, both collinear and noncollinear spin (code change and UTs)#7384

Open
dyzheng wants to merge 34 commits into
deepmodeling:developfrom
dyzheng:feat/port-dftu-ds-code-v4
Open

Feature: DeltaSpin for LCAO and PW base and DFTU for PW, both collinear and noncollinear spin (code change and UTs)#7384
dyzheng wants to merge 34 commits into
deepmodeling:developfrom
dyzheng:feat/port-dftu-ds-code-v4

Conversation

@dyzheng
Copy link
Copy Markdown
Collaborator

@dyzheng dyzheng commented May 26, 2026

Reminder

  • Have you linked an issue with this pull request?
  • Have you added adequate unit tests and/or case tests for your pull request?
  • Have you noticed possible changes of behavior below or in the linked issue?
  • Have you explained the changes of codes in core modules of ESolver, HSolver, ElecState, Hamilt, Operator or Psi? (ignore if not applicable)

Linked Issue

Fix #...

Unit Tests and/or Case Tests for my changes

  • A unit test is added for each new feature or bug fix.

What's changed?

  • Example: My changes might affect the performance of the application under certain conditions, and I have tested the impact on various scenarios...

Any changes of core modules? (ignore if not applicable)

  • Example: I have added a new virtual function in the esolver base class in order to ...

dyzheng and others added 7 commits May 26, 2026 16:02
…ing develop)

Direct sync of source/ directory from feat/dftu-pw-port-v2
which has been merged with origin/develop.
101 files changed, +10659/-1262 lines.
Reverted infrastructure files to origin/develop state to clean up the PR.
Files reverted:
- source/source_cell/read_atoms_helper.cpp
- source/source_base/parallel_global.cpp
- source/source_base/timer.cpp
- source/source_base/kernels/cuda/math_kernel_op.cu
- source/source_base/module_device/device_check.h
- source/source_base/module_container/base/macros/cuda.h
- source/source_base/module_container/base/third_party/cusolver.h
- source/source_esolver/lcao_others.cpp
- source/source_hsolver/kernels/cuda/diag_cusolvermp.cu
- source/source_io/module_unk/berryphase.cpp
- source/source_lcao/module_rt/solve_propagation.cpp
- source/source_main/main.cpp
- source/source_lcao/module_ri/RPA_LRI.hpp

Kept:
- DFT+U/DeltaSpin core functionality (esolver, kernels with npol, dftu modules)
- Input parameters
- Build/Test system updates
Restored the following files from port-v2 as they contain essential
functional changes and bug fixes:

- source/source_esolver/lcao_others.cpp: Passes sc_direction_only parameter to init_deltaspin_lcao.
- source/source_base/parallel_global.cpp: Fixes MPI_Comm_free logic to avoid freeing MPI_COMM_WORLD.
- source/source_main/main.cpp: Reorders fftw_cleanup_threads before MPI_Finalize to avoid segfault.
- source/source_base/kernels/cuda/math_kernel_op.cu: Adds cublas_handle null check.

Other infrastructure files (timer.cpp, device_check.h, berryphase.cpp, etc.) remain reverted to develop.
Restoring all code-related changes as previous partial revert caused
compilation errors in CI.
Reverted 15 files with modifications that are completely unrelated to
the DFT+U and DeltaSpin functionality:

- Pure comment/formatting changes: timer.cpp, diago_david.cpp,
  esolver_sdft_pw.cpp, hsolver_lcao.cpp, berryphase.cpp, RPA_LRI.hpp,
  solve_propagation.cpp, diag_cusolvermp.cu, read_sep_test.cpp,
  cusolver.h (3 comment-only lines restored, CUDA version guards kept)
- Unrelated code refactor: elecstate_pw.cpp (vkb.nc -> vkbnc rename)
- Unused variable: forces.cpp (forcepaw declaration)
- Unrelated test: nscf_utils_test.cpp (deleted) and its CMakeLists entry
- operator_lcao.cpp: restored original comments and removed extra blank lines
@mohanchen mohanchen requested a review from dzzz2001 May 28, 2026 08:41
Copy link
Copy Markdown
Collaborator

@dzzz2001 dzzz2001 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Detailed inline review covering blockers and should-fix items, organized by sub-area (DeltaSpin core, DFTU core, PW kernels + onsite, infra/wiring, tests). Each inline comment is anchored to the most representative diff line. Happy to discuss any of these.

Comment thread source/source_pw/module_pwdft/kernels/cuda/onsite_op.cu
Comment thread source/source_pw/module_pwdft/kernels/cuda/force_op.cu Outdated
Comment thread source/source_pw/module_pwdft/kernels/cuda/stress_op.cu
Comment thread source/source_base/module_container/base/third_party/cusolver.h Outdated
Comment thread source/source_pw/module_pwdft/kernels/force_op.cpp
Comment thread source/source_base/module_container/base/third_party/cusolver.h Outdated
Comment thread source/source_pw/module_pwdft/onsite_proj.cpp Outdated
Comment thread source/source_pw/module_pwdft/kernels/force_op.h Outdated
Comment thread source/source_pw/module_pwdft/kernels/stress_op.h Outdated
Comment thread source/source_estate/elecstate_lcao.h Outdated
@mohanchen mohanchen added Features Needed The features are indeed needed, and developers should have sophisticated knowledge Refactor Refactor ABACUS codes DFT+U Issues related to DFT plus U function collinear/non-collinear/SOC Issues related to SOC labels Jun 1, 2026
dyzheng added 16 commits June 3, 2026 20:14
The second __global__ void onsite_op(...) variant (DFT+U kernel taking
vu_iat) was missing the npol==1 branch, while the CPU version had it.
For npol==1 this caused OOB reads on becp, writes to ps[psind+1], and
indexing off-diagonal vu entries that don't exist.

Also hoisted iat/vu_iat/orb_l/tlp1 lookups outside the loop (like the
first DeltaSpin variant) for consistency.

Addresses reviewer comment:
deepmodeling#7384 (comment)
Mirror of the CUDA fix for the ROCm/HIP backend. The second
__global__ void onsite_op(...) variant (DFT+U kernel taking vu_iat)
was missing the npol==1 branch.

For npol==1 this caused OOB reads on becp, writes to ps[psind+1],
and indexing off-diagonal vu entries that don't exist.

Addresses reviewer comment:
deepmodeling#7384 (comment)
The npol==1 branch in cal_force_nl_op (DeltaSpin force) and
cal_stress_nl_op (DeltaSpin stress) was missing the spin_sign
that op_pw_proj.cpp applies when building tmp_lambda_coeff.

For nspin==2, the force/stress from spin-down k-points (isk==1)
was added with the wrong sign, so force != -dE/dR and stress
tensor inconsistent with dE/dε.

Addresses reviewer comments:
deepmodeling#7384 (comment)
deepmodeling#7384 (comment)
In direction_only mode, the code was permanently mutating
this->target_mag_[ia] += parallel * dir on every inner step
without ever restoring it. After the first inner step,
target_mag_ no longer reflected user input, and the corruption
persisted across SCF iterations.

Fixed by using a local copy (target_mag_adj) for the adjusted
target direction, leaving the original target_mag_ unchanged.

Addresses reviewer comment:
deepmodeling#7384 (comment)
The early continue for atoms without correlated orbitals ran before
locale[iat].resize(), iatlnmipol2iwt[iat].resize(), and
eff_pot_pw_index[iat] assignment, leaving these per-atom structures
default-constructed (size 0). While current readers are guarded by
orbital_corr[it] == -1, any future code accessing these for an
arbitrary atom would segfault.

Fixed by moving the per-atom resize/assignment before the guard,
so only the Hubbard-specific work (pot_index accumulation and
locale.create() block) is skipped for non-correlated atoms.

Addresses reviewer comment:
deepmodeling#7384 (comment)
The [DS-DIAG] cout fired on every STRU read, on every MPI rank,
cluttering every production log.

Addresses reviewer comment:
deepmodeling#7384 (comment)
Two problems fixed:
1. Timer was empty (start/end called twice with no work between).
   Moved end() to after the actual work.
2. No MPI synchronization of mixed uom_in. After a few Pulay
   steps, occupation matrices on different ranks would silently
   diverge. Added Parallel_Common::bcast_double after mixing
   (same pattern as mix_dmr).

Addresses reviewer comment:
deepmodeling#7384 (comment)
These files were marked "INCOMPLETE" and "will not compile as-is".
They reference strategy_type_, strategy_, LambdaStrategyType,
set_strategy_type, and set_strategy_params — none declared in
spin_constrain.h. lambda_update_strategies.{h,cpp} (~700 lines)
were labelled "NOT compiled into the library."

If compiled, the build breaks. If not, they ship as dead source
that confuses every future reader.

Dropped from this PR; can be re-introduced in a follow-up when
finished, registered with CMake, and unit-tested.

Addresses reviewer comment:
deepmodeling#7384 (comment)
The CUDA < 11.0 fallback called cusolverDn{S,D,C,Z}trtri_bufferSize
and cusolverDn{S,D,C,Z}trtri — these symbols do not exist in any
cuSOLVER release. Only cusolverDnXtrtri (CUDA 11.0+, already used
by the >= 11000 branch) and cublas{S,D,C,Z}trtri_batched exist.

Dropped the fallback entirely and require CUDA >= 11.0, consistent
with module_container/base/macros/cuda.h:25-31 which already guards
GetTypeCuda<int64_t> on CUDA_VERSION >= 11000.

Addresses reviewer comment:
deepmodeling#7384 (comment)
D_I was filled row-major as D_I[lm * nbands_global + jb_global]
but passed to column-major BLAS zgemm with lda=r. The column-major
view of D_I with leading dimension r corresponds to D_col[lm + jb*r],
which is the transpose of the data actually stored. Result: computed
D · D^H instead of D^H · D.

Fixed to fill D_I column-major: D_I[lm + jb_global * r].

Addresses reviewer comment:
deepmodeling#7384 (comment)
DeltaSpin<...>::set_current_spin(int) was shadowing (not overriding)
OperatorLCAO::set_current_spin, which was not declared virtual. The
dispatch chain in OperatorLCAO::set_current_spin walks next_op via
dynamic_cast<OperatorLCAO<TK,TR>*> and calls the base method, so
the derived reset this->sc_hr_done = false never fired when DeltaSpin
was wired into the operator stack via next_op.

Made the base method virtual so the override in DeltaSpin is properly
dispatched, ensuring sc_hr_done is reset on spin switch.

Addresses reviewer comment:
deepmodeling#7384 (comment)
…o.cpp

The old code used get_locale(iat0, target_L, 0, 0, m0_all, m1_all)
which treated locale[iat][L][0][0] as a (2*tlp1)x(2*tlp1) matrix
indexed as a 2x2 block of Pauli channels. But every writer of
nspin=4 locale (cal_occ_pw in dftu_pw.cpp and the LCAO writers)
stores the four Pauli channels as four stacked tlp1^2 blocks
at offsets 0, tlp1^2, 2*tlp1^2, 3*tlp1^2 inside c[].

The old code read from the wrong stacked block, causing silent
corruption on restarts with init_chg=file + nspin=4.

Fixed to use get_locale_flat which reads the stacked blocks
directly in the same format they were written.

Addresses reviewer comment:
deepmodeling#7384 (comment)
Functionally harmless (idempotent) but a clear copy-paste accident
that shouldn't pass review.

Addresses reviewer comment:
deepmodeling#7384 (comment)
…files

Pulling parameter.h into kernel-op headers means every TU that
includes these headers transitively includes the whole parameter
parser — meaningful compile-time bloat. The parameter values
needed in the kernel implementation can be passed as function
arguments (as npol already is).

Addresses reviewer comments:
deepmodeling#7384 (comment)
deepmodeling#7384 (comment)
These includes are only used in the .cpp implementation file.
Moving them out of the header avoids compile-time bloat.

Addresses reviewer comment:
deepmodeling#7384 (comment)
… test CMakeLists

The lambda_update_strategies.{h,cpp} and lambda_update_strategies_test.cpp
files were removed (incomplete implementation). Remove the corresponding
AddTest entry and the TODO comment about lambda_loop_helper_test
(which also depended on the deleted files).

Addresses reviewer comment:
deepmodeling#7384 (comment)
dyzheng added 7 commits June 3, 2026 20:32
…dy, invalidate_becp)

The becp cache was set/invalidated but never read anywhere.
force_onsite / stress_onsite still recompute becp unconditionally.
Remove the dead state entirely to avoid confusion.

If caching is desired in the future, the consumer needs to be wired
to check is_becp_ready() and skip recomputation.

Addresses reviewer comment:
deepmodeling#7384 (comment)
cal_VU_pot_pw was a (void)spin; stub that did nothing. The public
declaration in dftu.h advertised 'calculate the local DFT+U
effective potential matrix for PW base' but the function was never
called anywhere. VU potential for PW is computed via cal_eff_pot_mat
in the onsite projector path.

Remove the dead declaration and stub to avoid confusion.

Addresses reviewer comment:
deepmodeling#7384 (comment)
The GPU kernels for DFT+U and DeltaSpin force/stress computation
hardcoded npol=2 (noncollinear) indexing: ib2 = ib*2, 4-component
Pauli matrix coefficients, and spinor becp/dbecp access patterns.
This caused incorrect results or crashes for collinear (npol=1) runs.

Add npol as a template parameter to cal_force_onsite and
cal_stress_onsite kernels, with runtime dispatch via if (npol == 1)
vs if (npol == 2) branching inside the kernel body.

For npol==1 DeltaSpin, compute spin_sign from isk[ik] on the host
side and pass it as a kernel argument, avoiding the need for a device
isk array.

Applies to both CUDA (.cu) and ROCm (.hip.cu) implementations.

Addresses reviewer comments (Blockers 2 and 3):
deepmodeling#7384
The nspin validation check was commented out, allowing DeltaSpin
to be used with nspin=1 which is physically meaningless. Restore
the check that requires nspin==2 or nspin==4 when sc_mag_switch
is true.

Addresses reviewer comment:
deepmodeling#7384
For nspin==2 (collinear), the magnetic moment constraint direction is
z, but the autoset code was setting m_loc_[ia].x instead of
m_loc_[ia].z. This caused target_mag_[iat].z to be zero in
set_target_mag(), making DeltaSpin constraints ineffective.

Addresses reviewer comment:
deepmodeling#7384
For nspin==1, the PW occupation matrix already includes spin degeneracy
(factor of 2) from the k-point weights. The Hubbard energy formula
E_U = U * weight_eu * occ^2 yields U * 4 * occ_half^2 with weight_eu=1.0,
but the correct result should be U * n^2 / 2 (matching LCAO).
Use weight_eu=0.25 for nspin==1 to get U * occ_full^2 * 0.25
= U * (2*occ_half)^2 * 0.25 = U * occ_half^2, matching the LCAO
half-occupation convention before set_double_energy().

Addresses reviewer comment:
deepmodeling#7384
…rrors

The DeltaSpin npol==1 spin_sign calculation referenced this->isk/this->ik
on structs that don't have these members. Fix by:
- Adding const int* isk parameter to DeltaSpin force/stress kernel signatures
  in force_op.h, stress_op.h, and all CPU/GPU implementations
- Passing kv_->isk.data() from Onsite_Proj_tools cal_force_dspin and
  cal_stress_dspin callers
- Adding missing parallel_common.h include to charge_mixing.cpp
- Fixing get_locale_flat call in dftu_lcao.cpp (was called per-element
  inside loop but function fills entire array at once)

Build verified passing for CPU-only build.

Addresses reviewer comments:
deepmodeling#7384
@dyzheng dyzheng force-pushed the feat/port-dftu-ds-code-v4 branch from ab3736b to 5539972 Compare June 3, 2026 14:07
The old non-templated DeltaSpin kernel body was not fully replaced
during the npol template refactoring, leaving ~27 lines of orphaned
code outside any function, causing 57 CUDA compilation errors.

Build verified passing with CUDA enabled.
@dyzheng dyzheng requested a review from dzzz2001 June 3, 2026 14:44
dyzheng added 2 commits June 3, 2026 23:13
…errors

The virtual set_current_spin method had its definition only in
operator_lcao.cpp which was compiled into hamilt_lcao but not linked
by test executables, causing undefined reference errors.

Provide a simple inline implementation in the header that sets
current_spin. Remove the duplicate out-of-line definition from
operator_lcao.cpp. The next_op chain propagation was redundant
since init() already handles next_op traversal.

Also fixed orphaned DeltaSpin kernel code in CUDA stress_op.cu.

Build verified passing with CUDA and BUILD_TESTING=ON.
The autoset_magnetization function was changed to set m_loc_[ia].z
instead of m_loc_[ia].x for nspin==2, because set_target_mag reads
target_mag_[iat].z for collinear spin constraints. Update the test
to match the corrected behavior.

Addresses CI test failure in MODULE_CELL_unitcell_test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

collinear/non-collinear/SOC Issues related to SOC DFT+U Issues related to DFT plus U function Features Needed The features are indeed needed, and developers should have sophisticated knowledge Refactor Refactor ABACUS codes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants