Audio playback hitches every ~500ms during voice reception
Summary
When receiving live voice from a remote OPV station, playback exhibits a perceptible hitch approximately twice per second. The hitches do not appear in recorded playback at all; replaying the same audio from the UI bubble sounds clean. Only live audio is affected. The issue is reproducible on every received call from KB5MU. Voice intelligibility is preserved, but the audio is unpleasant to listen to for long periods.
Root Cause Was Unexpected
The current _audio_playback_loop (in enhanced_receiver.py) feeds ALSA from a FIFO queue without any timing awareness. The drain logic is:
```python
chunks = []
try:
    while True:
        audio_packet = self.playback_queue.get_nowait()
        chunks.append(audio_packet['pcm_data'])
        ...
except Empty:
    pass

if chunks:
    pcm_data = b''.join(chunks)
else:
    pcm_data = silence_packet  # <-- silence inserted here
    consecutive_silence += 1
```
This silence insertion was added because queue underruns produced severe popping. The logic is flawed, though: it conflates at least three distinct conditions into a single "queue is empty" response, without checking which condition actually occurred. At the time that distinction did not seem important; having since learned more about live audio playout, I can see that all of the following conditions produce the same outcome:
- Real voice continuity — packets arriving roughly every 40 ms, and the queue oscillates between 0 and 1
- Dummy frames from TX — the remote transmitter sends an all-zeros frame in slots where no voice is available; the receiver discards them at _handle_audio_packet line 299, and the playback thread sees the resulting gap as "queue empty"
- Late or lost packets — true network jitter or packet loss
I wrote a buffer that handles every case the same way, but each case should be handled differently. The current code treats them all as "feed a 40 ms silence buffer to ALSA," which causes audible discontinuities. Empirically, the dummy-frame case dominates: running Interlocutor with --verbose, the logs show dummy frames discarded at ~12-packet (480 ms) intervals during active voice, matching the perceived hitch rate.
Why the existing RTP header is the answer, even though @MustBeArt is Skeptical
OPV voice packets carry a standard 12-byte RTP header before the Opus payload. The header contains:
- 16-bit sequence number which detects loss and reordering
- 32-bit timestamp which gives sample-domain time of the first sample in this packet
- 32-bit SSRC which is the synchronization source identifier
The receiver currently strips these bytes and discards them (line 470: rtp_payload = udp_payload[12:] # Skip RTP header). The values are present in every packet but currently unused.
With RTP timestamps available to the playback thread, the three cases above can be distinguished cleanly:
- Consecutive voice: timestamps differ by exactly the frame size (1920 samples at 48 kHz)
- Dummy-induced gap: timestamps differ by 2× or more frame sizes — exact silence duration is known and can be inserted as a single clean buffer
- Late packet: expected timestamp has not arrived; playback can wait briefly before declaring loss
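The timestamp-delta classification can be sketched as follows. This is an illustrative sketch (the function name is mine, not from the codebase), assuming 1920 samples per 40 ms frame at 48 kHz and handling 32-bit timestamp wrap:

```python
SAMPLES_PER_FRAME = 1920  # 40 ms at 48 kHz (OPULENT_VOICE_SAMPLES_PER_FRAME)

def classify_gap(prev_ts: int, new_ts: int) -> str:
    """Classify the relationship between two consecutive RTP timestamps.

    Timestamps are 32-bit and wrap, so the delta is taken modulo 2**32.
    """
    delta = (new_ts - prev_ts) % (2**32)
    if delta == SAMPLES_PER_FRAME:
        return "consecutive"  # normal voice continuity
    if delta > SAMPLES_PER_FRAME and delta % SAMPLES_PER_FRAME == 0:
        # Dummy-induced gap: the exact silence duration is known up front
        # and can be inserted as a single clean buffer
        return "gap"
    return "irregular"  # reordering, loss, or a clock anomaly
```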
This is the standard RTP receiver model described in RFC 3550.
Is TX-side RTP populated correctly in the first place?
Yes! Confirmed by reading radio_protocol.py:RTPHeader (lines 689-783):
- Sequence numbers: randomly initialized, increment by 1 per frame
- Timestamps: initialized from the wall clock, increment by 1920 samples per frame (matching OPULENT_VOICE_SAMPLES_PER_FRAME)
- SSRC: derived from a hash of the station ID, with a zero guard
- Marker bit: set on the first packet of a talkspurt
- Payload type: 96 (dynamic, identifies Opus)
Confirmed by reading the TX path: RTPAudioFrameBuilder.create_rtp_audio_frame() calls self.rtp_header.create_header() for every voice frame. The resulting 12-byte header is prepended to the Opus payload before transmission. Similar pattern for all the headers of all the protocols we use.
The RX-side parser already exists as RTPHeader.parse_header() (lines 750-775). It is currently unused. enhanced_receiver.py strips the 12 header bytes at line 470 and discards them without parsing. This refactor will call this existing parser instead of discarding the header.
OPV's SSRC convention diverges from the standard
Per radio_protocol.py:793-795, OPV's SSRC is derived deterministically from the callsign.
```python
ssrc = hash(str(station_identifier)) % (2**32)
if ssrc == 0:
    ssrc = 1
```
This is a deliberate divergence from standard RTP, where the SSRC is randomly generated per session. OPV's choice makes the SSRC a stable per-station identifier, useful for receiver-side identification (with caveats), cross-session continuity, and per-station statistics aggregation. The probability of callsign-hash collisions in a 2^32 space is negligible for the amateur radio population. (One caveat worth verifying: Python's built-in hash() on strings is salted per process unless PYTHONHASHSEED is fixed, so if station_identifier is a string the SSRC may differ across runs; a stable digest such as zlib.crc32 would guarantee the determinism this design assumes.)
Implication for the refactor: SSRC change in the RTP stream means a different operator is talking, not just "same operator restarted." We can use SSRC change as a strong signal to reset the playout anchor and per-source state. Jitter estimators should be keyed on SSRC, allowing per-station statistics to accumulate stably across transmissions.
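A sketch of what SSRC-keyed state handling could look like. The class and attribute names here are illustrative, not from the codebase: the playout anchor resets on an SSRC change, while per-station statistics for a returning SSRC persist.

```python
class PlayoutState:
    """Per-source playout state, keyed on SSRC (illustrative names)."""

    def __init__(self):
        self.anchor_rtp_ts = None      # RTP timestamp of the playout anchor
        self.anchor_local_time = None  # local wall clock at the anchor
        self.jitter = 0.0              # per-station jitter estimate, persists


class SourceTracker:
    def __init__(self):
        self.current_ssrc = None
        self.states = {}  # SSRC -> PlayoutState; survives across transmissions

    def state_for(self, ssrc: int) -> PlayoutState:
        if ssrc != self.current_ssrc:
            # SSRC change = a different operator is talking: reset the
            # playout anchor, but keep accumulated per-station statistics.
            self.current_ssrc = ssrc
            state = self.states.setdefault(ssrc, PlayoutState())
            state.anchor_rtp_ts = None
            state.anchor_local_time = None
            return state
        return self.states[ssrc]
```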
Proposed phased solution
The refactor is staged so each phase is independently testable, and any phase can be deployed without depending on the next. This approach worked well on Arcanus.
Phase 1: Parse and propagate RTP header fields (no behavioral change)
Modify _handle_audio_packet to parse the 12-byte RTP header, extracting:
- Version, padding, extension, CSRC count (validation)
- Marker bit
- Payload type
- Sequence number
- Timestamp
- SSRC
Attach these fields to the packet dict passed to queue_audio_for_playback. The playback thread does not yet act on them. This phase only ensures the values flow through the system and are available for inspection. Then we test!
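For reference, the fixed-header layout being extracted looks like the following standalone sketch. The real code should call the existing RTPHeader.parse_header(); this just shows the RFC 3550 §5.1 field packing, with the function name and returned dict keys being illustrative:

```python
import struct

def parse_rtp_header(data: bytes) -> dict:
    """Parse the 12-byte fixed RTP header (RFC 3550 section 5.1).

    Returns the fields Phase 1 attaches to the packet dict. Standalone
    sketch; the project's RTPHeader.parse_header() is the real parser.
    """
    if len(data) < 12:
        raise ValueError("packet shorter than RTP fixed header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", data[:12])
    return {
        "version": b0 >> 6,          # must be 2
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": b0 & 0x0F,
        "marker": bool(b1 & 0x80),   # talkspurt start
        "payload_type": b1 & 0x7F,   # 96 = dynamic, Opus
        "sequence": seq,             # 16-bit, detects loss/reordering
        "timestamp": ts,             # 32-bit, sample-domain time
        "ssrc": ssrc,                # 32-bit source identifier
    }
```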
Gating test:
- All existing log lines still emitted; no behavior change
- Add a debug log line per packet showing the parsed (seq, ts) values
- Verify across a 30-second voice call that seq increments by 1 between consecutive packets (or by 2+ when a dummy frame is interposed), that ts increments by 1920 between consecutive packets (or by multiples of 1920 across dummy gaps), and that playback behavior is unchanged: same hitches as before, no new ones. Use the M1 Mac as the transmitter, since it is the machine that exhibits this behavior.
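The seq/ts invariant in this gating test can be checked mechanically over logged (seq, ts) pairs. A sketch, with an illustrative function name and 16-/32-bit wraparound handled:

```python
SAMPLES_PER_FRAME = 1920  # 40 ms frame at 48 kHz

def check_increments(packets):
    """Verify the Phase 1 gating invariant over a list of (seq, ts) pairs.

    Each step must advance seq by n >= 1 frames and ts by n * 1920 samples
    for the same n (n > 1 when dummy frames were interposed).
    Returns a list of (index, seq_step, ts_delta) violations.
    """
    violations = []
    for i in range(1, len(packets)):
        prev_seq, prev_ts = packets[i - 1]
        seq, ts = packets[i]
        n = (seq - prev_seq) % (2**16)       # seq is 16-bit and wraps
        ts_delta = (ts - prev_ts) % (2**32)  # ts is 32-bit and wraps
        if n < 1 or ts_delta != n * SAMPLES_PER_FRAME:
            violations.append((i, n, ts_delta))
    return violations
```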
Phase 2: Timestamp-aware playout scheduling
Now, change the playback thread's drain logic from "play whatever is queued" to "play the packet whose RTP timestamp says now is its time." The thread maintains an anchor point. It is the local wall-clock time corresponding to the first packet's RTP timestamp. All subsequent packets are scheduled relative to this anchor:
playout_time(packet) = anchor_local_time + (packet.rtp_ts - anchor_rtp_ts) / sample_rate + target_delay
Replace the silence-on-empty-queue logic with three cases:
- If the next packet's playout time is in the future: wait (with a queue.get timeout, not a sleep)
- If the next packet's playout time is now: play it
- If the expected packet has not arrived by its playout time plus a tolerance: declare loss and insert one frame of PLC (initially zeros; later, Opus PLC)
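A minimal sketch of the playout-time mapping and the three-case decision. Function names, the 10 ms tolerance, and the string return values are illustrative; timestamps are assumed to advance monotonically from the anchor, with 32-bit wrap handled modulo 2^32:

```python
SAMPLE_RATE = 48000  # OPV voice sample rate

def playout_time(rtp_ts, anchor_rtp_ts, anchor_local_time, target_delay=0.080):
    """Map an RTP timestamp to a local wall-clock playout deadline."""
    delta_samples = (rtp_ts - anchor_rtp_ts) % (2**32)
    return anchor_local_time + delta_samples / SAMPLE_RATE + target_delay

def playout_decision(now, next_pkt_time, expected_time, tolerance=0.010):
    """Pick one of the three Phase 2 cases.

    now            -- current local clock reading
    next_pkt_time  -- playout deadline of the head-of-queue packet, or None
    expected_time  -- deadline of the packet we are still waiting for
    """
    if next_pkt_time is not None:
        if next_pkt_time > now:
            return "wait"     # case 1: block on queue.get(timeout=...), not sleep
        return "play"         # case 2: its time has come
    if now > expected_time + tolerance:
        return "conceal"      # case 3: declare loss, insert one PLC frame
    return "wait"
```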
Initial target_delay value: 80 ms (two frame durations). This is sufficient cushion for typical local-network jitter without adding excessive latency, can be tuned later, and is less than the current implementation's ~120 ms of effective buffering.
Gating tests (all must pass):
- Continuous voice test: 30-second call with no dummy frames (voice activity throughout). Must hear no hitches. Audio quality subjectively equivalent to a recorded playback of the same content.
- Dummy interspersed test: 30-second call with KB5MU's typical dummy pattern (~12 voice frames between dummies). Must hear no hitches. Brief silences during dummy intervals must be smooth and inaudible as discontinuities. Use the M1 computer for this.
- Latency measurement: end-to-end latency from microphone input on the TX side to speaker output on the RX side must increase by no more than target_delay (80 ms) compared to the current implementation.
- Stream end recovery: After a transmission ends and a new one begins, the second transmission must play correctly without the playback thread getting stuck waiting for old timestamps.
- No regressions in existing log output: all existing debug lines still fire correctly. Web UI notifications still work.
Phase 3: Adaptive playout delay (RFC 3550 §6.4.1)
Implement the standard RTP jitter estimator:
J(i) = J(i-1) + (|D(i-1, i)| - J(i-1)) / 16
Now we are using RTP in anger: D(i-1, i) is the difference between the actual and the expected inter-arrival spacing of packets i-1 and i, computed from RTP timestamps and the local clock. This produces an exponentially weighted moving average (EMA) of observed jitter, the same form we use in the modem.
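A self-contained sketch of the estimator, working in RTP timestamp units (samples; divide by the sample rate for seconds). Class and method names are illustrative, and one instance would be kept per SSRC:

```python
class JitterEstimator:
    """Inter-arrival jitter per RFC 3550 section 6.4.1 / Appendix A.8."""

    def __init__(self, sample_rate=48000):
        self.sample_rate = sample_rate
        self.prev_transit = None  # transit of the previous packet
        self.jitter = 0.0         # J, in sample units

    def update(self, rtp_ts, arrival_time):
        """Feed one packet: its RTP timestamp and local arrival time (s)."""
        # Transit = arrival time (in sample units) minus RTP timestamp.
        transit = arrival_time * self.sample_rate - rtp_ts
        if self.prev_transit is not None:
            d = abs(transit - self.prev_transit)
            # J(i) = J(i-1) + (|D(i-1, i)| - J(i-1)) / 16
            self.jitter += (d - self.jitter) / 16.0
        self.prev_transit = transit
        return self.jitter
```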
Use the jitter estimate to adjust target_delay adaptively:
- Stable link: target_delay shrinks toward a minimum (e.g., 40 ms = one frame)
- Jittery link: target_delay grows up to a configured maximum (e.g., 200 ms = five frames)
Update target_delay no more than once per second to avoid oscillation.
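The clamped adaptation policy can be as simple as the following sketch. The function name and the 3x headroom factor are illustrative assumptions, not from the codebase:

```python
def adapt_target_delay(jitter_seconds, min_delay=0.040, max_delay=0.200,
                       headroom=3.0):
    """Choose a new target_delay from the current jitter estimate.

    Called at most once per second to avoid oscillation. Aims for
    `headroom` times the estimated jitter, clamped to one frame (40 ms)
    minimum and five frames (200 ms) maximum.
    """
    return max(min_delay, min(max_delay, headroom * jitter_seconds))
```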
Gating tests:
- Stable link test: with a low-jitter local link, target_delay should converge to a low value (≤2 frames) within 5 seconds of sustained voice. Steady-state latency should be lower than Phase 2's fixed 80 ms.
- Jittery link test: inject artificial jitter (e.g., delay packets randomly by 0-100 ms before processing). target_delay must grow to absorb the jitter, with no additional hitches compared to Phase 2.
- Step-change test: start with a stable link, then introduce sustained 100 ms jitter. target_delay must adapt within 5 seconds. Hitches during the adaptation window are acceptable; hitches after adaptation has stabilized are not. (Similar to the cases in our lunar lander CTF.)
- No regressions in the Phase 2 gating tests: all of them must still pass.
What this issue does not fix
- The TX-side dummy frame rate. A transmitter sending dummy frames every ~480 ms even during active voice cannot be fixed by RTP on the receiver; that is a separate question. Whatever the answer, the receiver-side refactor described here will handle dummy frames cleanly regardless of their rate.
- Codec-internal PLC (Opus FEC). The Opus codec has built-in packet loss concealment via decode_fec=1. Integrating with it is a useful future enhancement, but it is not required for the basic refactor.
- Time-scale modification (audio acceleration / stretching). WebRTC's NetEq dynamically adjusts playback rate to maintain target latency. This is genuinely advanced stuff and not needed for the current problem.
- Multi-source mixing. Currently Interlocutor handles one remote talker at a time. Multi-source playout (multiple SSRCs, mixing) could be revisited later, but there is no compelling use case for it today.
References
Standards (we've got them!)
RFC 3550 RTP: A Transport Protocol for Real-Time Applications. H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson. July 2003.
- Section 5.1: RTP Fixed Header Fields
- Section 6.4.1: SR: Sender Report RTCP Packet (defines the inter-arrival jitter computation we use in Phase 3)
- Appendix A.8: Estimating the Interarrival Jitter (canonical algorithm reference)
- https://datatracker.ietf.org/doc/html/rfc3550
Book
RTP: Audio and Video for the Internet, by Colin Perkins.
Reference implementation
Speex JitterBuffer — Jean-Marc Valin (also the author of Opus).
Modern stuff just like this
WebRTC NetEq: Google's adaptive jitter buffer for WebRTC voice/video.
- Implements time-scale modification (PLC + acceleration) to maintain target latency.
- Out of scope for this issue, but the reference design if we ever implement PLC and time-scale modification
Acceptance criteria for closing this issue
This issue is closed when all gating tests pass: no hitches in a long conversation, even from hardware that emits dummy frames during active voice; end-to-end latency measured and documented, so there are no surprises; the implementation reviewed to make sure we didn't goof it up; and documentation up to date.
The phasing exists so we can ship working improvements incrementally rather than gating everything on the full adaptive design, and so nothing is left in a broken, half-finished state.