Hi there,
First of all, thank you for your work on obs-rvc. Implementing RVC in Rust is a challenging path, as mentioned in your README, but I believe it's a vital step for high-performance real-time voice conversion.
I have been developing a separate Rust-based RVC engine and wanted to share some technical findings regarding stability and "fluency." I must admit that my understanding of the RVC architecture is still limited, and I am eager to learn from people with more experience, like you. I would truly appreciate any guidance or feedback you could offer.
Here are the specific architectural insights I've implemented:
1. The "320-sample Alignment" for HuBERT (ONNX)
I found that the HuBERT ONNX model (16 kHz input) effectively processes audio in units of 320 samples (20 ms). If the input length, including padding, is not an exact multiple of 320, ONNX Runtime often throws ReduceSum or Where dimension-mismatch errors, or the output quality degrades. Strictly aligning the source window to 320 × n samples significantly improved stability.
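As a rough sketch of the alignment step (illustrative only, not code from either project): the snippet below zero-pads the tail of a 16 kHz buffer so its length becomes a multiple of 320. The names `HOP` and `pad_to_hop` are hypothetical, and padding at the end rather than truncating or padding at the front is an assumption.

```rust
/// Samples per 20 ms block at 16 kHz, per the observation above.
const HOP: usize = 320;

/// Zero-pads `samples` at the end so that its length is a multiple of `HOP`.
/// Returns the number of whole 320-sample blocks in the padded buffer.
/// Assumption: trailing zero-padding is acceptable for the model input.
fn pad_to_hop(samples: &mut Vec<f32>) -> usize {
    let rem = samples.len() % HOP;
    if rem != 0 {
        // Extend with silence up to the next multiple of HOP.
        let padded_len = samples.len() + (HOP - rem);
        samples.resize(padded_len, 0.0);
    }
    samples.len() / HOP
}
```

Calling `pad_to_hop(&mut window)` right before building the input tensor keeps the length check in one place; a buffer that is already aligned is left untouched.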
2. Multi-stage RMVPE Fallback Logic
To handle near-silence or low-gain inputs, I implemented a recursive fallback strategy. If the initial threshold fails, the decoder retries with lower thresholds (down to 0.01). This reduces "pitch flickering" while maintaining detection in quiet segments.
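The fallback idea can be sketched roughly as follows, written as a loop rather than a recursion. This is a simplified illustration under several assumptions: the per-frame pitch candidates and confidences are taken as already extracted from the RMVPE output, "fails" is interpreted as "no voiced frame found", and the threshold ladder is hypothetical apart from the 0.01 floor. All function names are made up for the example.

```rust
/// Hypothetical threshold ladder; only the 0.01 floor comes from the text above.
const THRESHOLDS: [f32; 3] = [0.03, 0.02, 0.01];

/// Keeps a frame's pitch when its confidence reaches `threshold`,
/// otherwise marks the frame as unvoiced (0.0).
fn decode_with_threshold(f0_candidates: &[f32], confidence: &[f32], threshold: f32) -> Vec<f32> {
    f0_candidates
        .iter()
        .zip(confidence)
        .map(|(&f0, &c)| if c >= threshold { f0 } else { 0.0 })
        .collect()
}

/// Tries each threshold from highest to lowest and returns the first decode
/// that contains at least one voiced frame, together with the threshold used.
/// Assumption: "failure" means that no frame was detected as voiced.
fn decode_with_fallback(f0_candidates: &[f32], confidence: &[f32]) -> (Vec<f32>, Option<f32>) {
    for &t in THRESHOLDS.iter() {
        let f0 = decode_with_threshold(f0_candidates, confidence, t);
        if f0.iter().any(|&v| v > 0.0) {
            return (f0, Some(t));
        }
    }
    // Every threshold failed: treat the whole segment as unvoiced.
    let n = f0_candidates.len().min(confidence.len());
    (vec![0.0; n], None)
}
```

Returning the threshold that succeeded makes it easy to log how often the lower stages are reached, which helps when tuning the ladder.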
3. F0 Stabilization (Despiking & Interpolation)
I noticed that the "metallic" noise often stems from 1-frame pitch spikes or very short unvoiced gaps. Adding a simple despiking step (removing 1-frame voiced islands) and linearly interpolating across short gaps (≤ 2 frames) made the voice output much smoother.
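A minimal sketch of both steps, assuming the F0 track is a slice of Hz values where 0.0 means unvoiced. The function names are illustrative. Treating the start and end of the buffer as unvoiced neighbours, interpolating linearly in Hz, and running the despike pass before the gap fill are all assumptions of this example.

```rust
/// Zeroes voiced frames whose neighbours on both sides are unvoiced
/// (1-frame voiced islands). Buffer edges count as unvoiced neighbours.
fn despike(f0: &mut [f32]) {
    // Decide from a snapshot so earlier edits do not affect later frames.
    let snapshot = f0.to_vec();
    for i in 0..snapshot.len() {
        let prev_unvoiced = i == 0 || snapshot[i - 1] <= 0.0;
        let next_unvoiced = i + 1 == snapshot.len() || snapshot[i + 1] <= 0.0;
        if snapshot[i] > 0.0 && prev_unvoiced && next_unvoiced {
            f0[i] = 0.0;
        }
    }
}

/// Linearly interpolates unvoiced runs of at most `max_gap` frames that have
/// a voiced frame on both sides. Longer runs and edge runs stay unvoiced.
fn fill_short_gaps(f0: &mut [f32], max_gap: usize) {
    let n = f0.len();
    let mut i = 0;
    while i < n {
        if f0[i] > 0.0 {
            i += 1;
            continue;
        }
        // Frames i..j form one unvoiced run.
        let mut j = i;
        while j < n && f0[j] <= 0.0 {
            j += 1;
        }
        let gap = j - i;
        if i > 0 && j < n && gap <= max_gap {
            let (left, right) = (f0[i - 1], f0[j]);
            for k in 0..gap {
                let t = (k + 1) as f32 / (gap + 1) as f32;
                f0[i + k] = left + (right - left) * t;
            }
        }
        i = j;
    }
}
```

The order matters here: with `despike` first, an isolated spike is removed before `fill_short_gaps(&mut f0, 2)` runs, so the spike can never serve as an interpolation endpoint.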
I am still refining the output quality (currently fighting some scaling and gain issues), but I wanted to share these findings because they seem to be common bottlenecks in Rust implementations. I would be honored to discuss these points further and borrow your wisdom to make Rust-based RVC more practical.
Best regards,