[1/N] add fp8 fp32 scale support for custom RL model #368

Open
yiakwy-xpu-ml-framework-team wants to merge 2 commits into antirez:main from yiakwy-xpu-ml-framework-team:add_fp8_fp32_scale_support

yiakwy-xpu-ml-framework-team commented Jun 9, 2026

Background

We added an FP8 RL+SFT version of DeepSeek V4 with week-0 support, and it surpassed the DeepSeek V4 baseline in all major dimensions in our internal evaluation.

Hence we want to add 2-bit support for DeepSeek V4 with our Expert Pruning technology:
[Screenshot 2026-06-09 15 40 50]

Note that on H100/H800 we usually don't use E8M0 for the scale, since it introduces runtime overhead. An FP32 scale works best there.
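
To make the difference between the two scale formats concrete, here is a minimal sketch, not code from this PR. It assumes per-block absmax scaling onto the E4M3 maximum of 448, and it assumes the E8M0 scale is rounded up to the next power of two. The function names `fp32_scale` and `e8m0_scale` are hypothetical.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def fp32_scale(amax: float) -> float:
    """Scale stored as an ordinary float: maps amax exactly onto E4M3_MAX."""
    if amax <= 0.0:
        raise ValueError("amax must be positive")
    return amax / E4M3_MAX


def e8m0_scale(amax: float) -> float:
    """Exponent-only scale: a power of two, rounded up (assumed convention)
    so that amax / scale never exceeds E4M3_MAX."""
    if amax <= 0.0:
        raise ValueError("amax must be positive")
    return 2.0 ** math.ceil(math.log2(amax / E4M3_MAX))


amax = 3.0               # hypothetical largest |weight| in one block
s32 = fp32_scale(amax)   # arbitrary float scale
s8 = e8m0_scale(amax)    # power-of-two scale
```

With `amax = 3.0`, the FP32 scale maps the block exactly onto the E4M3 range, while the power-of-two scale is slightly larger, so part of the FP8 range goes unused.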

@yiakwy-xpu-ml-framework-team

Author

@antirez could you have a look at it?

antirez commented Jun 9, 2026

Owner

Hi, the PR itself has a few quality issues, but above all it is not clear why it would be useful for the project as a whole, given that we convert from the DS4 Hugging Face formats.

yiakwy-xpu-ml-framework-team commented Jun 10, 2026

Author
[Screenshot 2026-06-10 09 14 48]

Quantization is successful.

@antirez Thank you for the quick response, let me explain.

  • Our SFT/RL model of DeepSeek V4 has an embedding layer of type bf16 or int32, while the DeepSeek model has an embedding of type int64.

  • Since we run on the Hopper platform, our expert weights are stored as E4M3 FP8 and the weight scales as FP32 for best performance (which can be verified in SGLang):

    [Image: sglang fp8 serving]

    Customer DSV4 SGLang FP8 serving on the Hopper platform with identity injection, private/public knowledge injection, and an enhanced security shield module
  • The Hugging Face model is not an SGLang-compatible version, while our version is; and the Hugging Face release does not cover converting SFT/RL models from BF16 to FP8 variants.
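
The storage scheme described in the list above (E4M3 FP8 weights with FP32 scales) can be sketched as follows. This is an illustrative reconstruction, not the PR's code: the names `decode_e4m3` and `dequantize` are hypothetical, and a flat 1-D block layout is assumed for brevity, since the thread does not state the actual tile shape.

```python
def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    mant = byte & 0x07
    if exp == 0x0F and mant == 0x07:
        return float("nan")                      # the only NaN encoding; E4M3 has no infinities
    if exp == 0:
        return sign * (mant / 8.0) * 2.0 ** -6   # subnormal values
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)


def dequantize(codes: bytes, scales: list[float], block: int) -> list[float]:
    """Multiply each decoded FP8 value by the FP32 scale of its block.

    Assumes len(scales) * block >= len(codes)."""
    return [decode_e4m3(c) * scales[i // block] for i, c in enumerate(codes)]


# Four FP8 codes in two blocks of two, each block with its own scale.
weights = dequantize(bytes([0x38, 0x7E, 0xB8, 0x38]), [0.5, 2.0], block=2)
# weights == [0.5, 224.0, -2.0, 2.0]
```

Here `0x38` decodes to 1.0 and `0x7E` to 448.0, the largest finite E4M3 value, so the scale alone determines the dynamic range each block can cover.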

The model is tuned specifically to handle Cantonese, Mandarin Chinese, and English efficiently.
