Feature: DeltaSpin for LCAO and PW base and DFTU for PW, both collinear and noncollinear spin (code change and UTs)#7384
Open
dyzheng wants to merge 34 commits into
Open
Conversation
…ing develop) Direct sync of source/ directory from feat/dftu-pw-port-v2 which has been merged with origin/develop. 101 files changed, +10659/-1262 lines.
Reverted infrastructure files to origin/develop state to clean up the PR. Files reverted: - source/source_cell/read_atoms_helper.cpp - source/source_base/parallel_global.cpp - source/source_base/timer.cpp - source/source_base/kernels/cuda/math_kernel_op.cu - source/source_base/module_device/device_check.h - source/source_base/module_container/base/macros/cuda.h - source/source_base/module_container/base/third_party/cusolver.h - source/source_esolver/lcao_others.cpp - source/source_hsolver/kernels/cuda/diag_cusolvermp.cu - source/source_io/module_unk/berryphase.cpp - source/source_lcao/module_rt/solve_propagation.cpp - source/source_main/main.cpp - source/source_lcao/module_ri/RPA_LRI.hpp Kept: - DFT+U/DeltaSpin core functionality (esolver, kernels with npol, dftu modules) - Input parameters - Build/Test system updates
Restored the following files from port-v2 as they contain essential functional changes and bug fixes: - source/source_esolver/lcao_others.cpp: Passes sc_direction_only parameter to init_deltaspin_lcao. - source/source_base/parallel_global.cpp: Fixes MPI_Comm_free logic to avoid freeing MPI_COMM_WORLD. - source/source_main/main.cpp: Reorders fftw_cleanup_threads before MPI_Finalize to avoid segfault. - source/source_base/kernels/cuda/math_kernel_op.cu: Adds cublas_handle null check. Other infrastructure files (timer.cpp, device_check.h, berryphase.cpp, etc.) remain reverted to develop.
Restoring all code-related changes as previous partial revert caused compilation errors in CI.
Reverted 15 files with modifications that are completely unrelated to the DFT+U and DeltaSpin functionality: - Pure comment/formatting changes: timer.cpp, diago_david.cpp, esolver_sdft_pw.cpp, hsolver_lcao.cpp, berryphase.cpp, RPA_LRI.hpp, solve_propagation.cpp, diag_cusolvermp.cu, read_sep_test.cpp, cusolver.h (3 comment-only lines restored, CUDA version guards kept) - Unrelated code refactor: elecstate_pw.cpp (vkb.nc -> vkbnc rename) - Unused variable: forces.cpp (forcepaw declaration) - Unrelated test: nscf_utils_test.cpp (deleted) and its CMakeLists entry - operator_lcao.cpp: restored original comments and removed extra blank lines
dzzz2001
reviewed
May 28, 2026
Collaborator
dzzz2001
left a comment
There was a problem hiding this comment.
Detailed inline review covering blockers and should-fix items, organized by sub-area (DeltaSpin core, DFTU core, PW kernels + onsite, infra/wiring, tests). Each inline comment is anchored to the most representative diff line. Happy to discuss any of these.
… feat/port-dftu-ds-code-v4
The second __global__ void onsite_op(...) variant (DFT+U kernel taking vu_iat) was missing the npol==1 branch, while the CPU version had it. For npol==1 this caused OOB reads on becp, writes to ps[psind+1], and indexing off-diagonal vu entries that don't exist. Also hoisted iat/vu_iat/orb_l/tlp1 lookups outside the loop (like the first DeltaSpin variant) for consistency. Addresses reviewer comment: deepmodeling#7384 (comment)
Mirror of the CUDA fix for the ROCm/HIP backend. The second __global__ void onsite_op(...) variant (DFT+U kernel taking vu_iat) was missing the npol==1 branch. For npol==1 this caused OOB reads on becp, writes to ps[psind+1], and indexing off-diagonal vu entries that don't exist. Addresses reviewer comment: deepmodeling#7384 (comment)
The npol==1 branch in cal_force_nl_op (DeltaSpin force) and cal_stress_nl_op (DeltaSpin stress) was missing the spin_sign that op_pw_proj.cpp applies when building tmp_lambda_coeff. For nspin==2, the force/stress from spin-down k-points (isk==1) was added with the wrong sign, so force != -dE/dR and stress tensor inconsistent with dE/dε. Addresses reviewer comments: deepmodeling#7384 (comment) deepmodeling#7384 (comment)
In direction_only mode, the code was permanently mutating this->target_mag_[ia] += parallel * dir on every inner step without ever restoring it. After the first inner step, target_mag_ no longer reflected user input, and the corruption persisted across SCF iterations. Fixed by using a local copy (target_mag_adj) for the adjusted target direction, leaving the original target_mag_ unchanged. Addresses reviewer comment: deepmodeling#7384 (comment)
The early continue for atoms without correlated orbitals ran before locale[iat].resize(), iatlnmipol2iwt[iat].resize(), and eff_pot_pw_index[iat] assignment, leaving these per-atom structures default-constructed (size 0). While current readers are guarded by orbital_corr[it] == -1, any future code accessing these for an arbitrary atom would segfault. Fixed by moving the per-atom resize/assignment before the guard, so only the Hubbard-specific work (pot_index accumulation and locale.create() block) is skipped for non-correlated atoms. Addresses reviewer comment: deepmodeling#7384 (comment)
The [DS-DIAG] cout fired on every STRU read, on every MPI rank, cluttering every production log. Addresses reviewer comment: deepmodeling#7384 (comment)
Two problems fixed: 1. Timer was empty (start/end called twice with no work between). Moved end() to after the actual work. 2. No MPI synchronization of mixed uom_in. After a few Pulay steps, occupation matrices on different ranks would silently diverge. Added Parallel_Common::bcast_double after mixing (same pattern as mix_dmr). Addresses reviewer comment: deepmodeling#7384 (comment)
These files were marked "INCOMPLETE" and "will not compile as-is".
They reference strategy_type_, strategy_, LambdaStrategyType,
set_strategy_type, and set_strategy_params — none declared in
spin_constrain.h. lambda_update_strategies.{h,cpp} (~700 lines)
were labelled "NOT compiled into the library."
If compiled, the build breaks. If not, they ship as dead source
that confuses every future reader.
Dropped from this PR; can be re-introduced in a follow-up when
finished, registered with CMake, and unit-tested.
Addresses reviewer comment:
deepmodeling#7384 (comment)
The CUDA < 11.0 fallback called cusolverDn{S,D,C,Z}trtri_bufferSize
and cusolverDn{S,D,C,Z}trtri — these symbols do not exist in any
cuSOLVER release. Only cusolverDnXtrtri (CUDA 11.0+, already used
by the >= 11000 branch) and cublas{S,D,C,Z}trtri_batched exist.
Dropped the fallback entirely and require CUDA >= 11.0, consistent
with module_container/base/macros/cuda.h:25-31 which already guards
GetTypeCuda<int64_t> on CUDA_VERSION >= 11000.
Addresses reviewer comment:
deepmodeling#7384 (comment)
D_I was filled row-major as D_I[lm * nbands_global + jb_global] but passed to column-major BLAS zgemm with lda=r. The column-major view of D_I with leading dimension r corresponds to D_col[lm + jb*r], which is the transpose of the data actually stored. Result: computed D · D^H instead of D^H · D. Fixed to fill D_I column-major: D_I[lm + jb_global * r]. Addresses reviewer comment: deepmodeling#7384 (comment)
DeltaSpin<...>::set_current_spin(int) was shadowing (not overriding) OperatorLCAO::set_current_spin, which was not declared virtual. The dispatch chain in OperatorLCAO::set_current_spin walks next_op via dynamic_cast<OperatorLCAO<TK,TR>*> and calls the base method, so the derived reset this->sc_hr_done = false never fired when DeltaSpin was wired into the operator stack via next_op. Made the base method virtual so the override in DeltaSpin is properly dispatched, ensuring sc_hr_done is reset on spin switch. Addresses reviewer comment: deepmodeling#7384 (comment)
…o.cpp The old code used get_locale(iat0, target_L, 0, 0, m0_all, m1_all) which treated locale[iat][L][0][0] as a (2*tlp1)x(2*tlp1) matrix indexed as a 2x2 block of Pauli channels. But every writer of nspin=4 locale (cal_occ_pw in dftu_pw.cpp and the LCAO writers) stores the four Pauli channels as four stacked tlp1^2 blocks at offsets 0, tlp1^2, 2*tlp1^2, 3*tlp1^2 inside c[]. The old code read from the wrong stacked block, causing silent corruption on restarts with init_chg=file + nspin=4. Fixed to use get_locale_flat which reads the stacked blocks directly in the same format they were written. Addresses reviewer comment: deepmodeling#7384 (comment)
Functionally harmless (idempotent) but a clear copy-paste accident that shouldn't pass review. Addresses reviewer comment: deepmodeling#7384 (comment)
…files Pulling parameter.h into kernel-op headers means every TU that includes these headers transitively includes the whole parameter parser — meaningful compile-time bloat. The parameter values needed in the kernel implementation can be passed as function arguments (as npol already is). Addresses reviewer comments: deepmodeling#7384 (comment) deepmodeling#7384 (comment)
These includes are only used in the .cpp implementation file. Moving them out of the header avoids compile-time bloat. Addresses reviewer comment: deepmodeling#7384 (comment)
… test CMakeLists
The lambda_update_strategies.{h,cpp} and lambda_update_strategies_test.cpp
files were removed (incomplete implementation). Remove the corresponding
AddTest entry and the TODO comment about lambda_loop_helper_test
(which also depended on the deleted files).
Addresses reviewer comment:
deepmodeling#7384 (comment)
…dy, invalidate_becp) The becp cache was set/invalidated but never read anywhere. force_onsite / stress_onsite still recompute becp unconditionally. Remove the dead state entirely to avoid confusion. If caching is desired in the future, the consumer needs to be wired to check is_becp_ready() and skip recomputation. Addresses reviewer comment: deepmodeling#7384 (comment)
cal_VU_pot_pw was a (void)spin; stub that did nothing. The public declaration in dftu.h advertised 'calculate the local DFT+U effective potential matrix for PW base' but the function was never called anywhere. VU potential for PW is computed via cal_eff_pot_mat in the onsite projector path. Remove the dead declaration and stub to avoid confusion. Addresses reviewer comment: deepmodeling#7384 (comment)
The GPU kernels for DFT+U and DeltaSpin force/stress computation hardcoded npol=2 (noncollinear) indexing: ib2 = ib*2, 4-component Pauli matrix coefficients, and spinor becp/dbecp access patterns. This caused incorrect results or crashes for collinear (npol=1) runs. Add npol as a template parameter to cal_force_onsite and cal_stress_onsite kernels, with runtime dispatch via if (npol == 1) vs if (npol == 2) branching inside the kernel body. For npol==1 DeltaSpin, compute spin_sign from isk[ik] on the host side and pass it as a kernel argument, avoiding the need for a device isk array. Applies to both CUDA (.cu) and ROCm (.hip.cu) implementations. Addresses reviewer comments (Blockers 2 and 3): deepmodeling#7384
The nspin validation check was commented out, allowing DeltaSpin to be used with nspin=1 which is physically meaningless. Restore the check that requires nspin==2 or nspin==4 when sc_mag_switch is true. Addresses reviewer comment: deepmodeling#7384
For nspin==2 (collinear), the magnetic moment constraint direction is z, but the autoset code was setting m_loc_[ia].x instead of m_loc_[ia].z. This caused target_mag_[iat].z to be zero in set_target_mag(), making DeltaSpin constraints ineffective. Addresses reviewer comment: deepmodeling#7384
For nspin==1, the PW occupation matrix already includes spin degeneracy (factor of 2) from the k-point weights. The Hubbard energy formula E_U = U * weight_eu * occ^2 yields U * 4 * occ_half^2 with weight_eu=1.0, but the correct result should be U * n^2 / 2 (matching LCAO). Use weight_eu=0.25 for nspin==1 to get U * occ_full^2 * 0.25 = U * (2*occ_half)^2 * 0.25 = U * occ_half^2, matching the LCAO half-occupation convention before set_double_energy(). Addresses reviewer comment: deepmodeling#7384
…rrors The DeltaSpin npol==1 spin_sign calculation referenced this->isk/this->ik on structs that don't have these members. Fix by: - Adding const int* isk parameter to DeltaSpin force/stress kernel signatures in force_op.h, stress_op.h, and all CPU/GPU implementations - Passing kv_->isk.data() from Onsite_Proj_tools cal_force_dspin and cal_stress_dspin callers - Adding missing parallel_common.h include to charge_mixing.cpp - Fixing get_locale_flat call in dftu_lcao.cpp (was called per-element inside loop but function fills entire array at once) Build verified passing for CPU-only build. Addresses reviewer comments: deepmodeling#7384
ab3736b to
5539972
Compare
The old non-templated DeltaSpin kernel body was not fully replaced during the npol template refactoring, leaving ~27 lines of orphaned code outside any function, causing 57 CUDA compilation errors. Build verified passing with CUDA enabled.
…errors The virtual set_current_spin method had its definition only in operator_lcao.cpp which was compiled into hamilt_lcao but not linked by test executables, causing undefined reference errors. Provide a simple inline implementation in the header that sets current_spin. Remove the duplicate out-of-line definition from operator_lcao.cpp. The next_op chain propagation was redundant since init() already handles next_op traversal. Also fixed orphaned DeltaSpin kernel code in CUDA stress_op.cu. Build verified passing with CUDA and BUILD_TESTING=ON.
The autoset_magnetization function was changed to set m_loc_[ia].z instead of m_loc_[ia].x for nspin==2, because set_target_mag reads target_mag_[iat].z for collinear spin constraints. Update the test to match the corrected behavior. Addresses CI test failure in MODULE_CELL_unitcell_test.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Reminder
Linked Issue
Fix #...
Unit Tests and/or Case Tests for my changes
What's changed?
Any changes of core modules? (ignore if not applicable)