feat: Support video transcript chunking in the Arm Knowledge Base #100

Closed
NeethuESim wants to merge 3 commits into arm:main from NeethuESim:STESOL-526-video-transcript-chunking

Conversation

@NeethuESim

Collaborator

Copilot AI left a comment

Contributor

Pull request overview

Adds support for transcript-backed video sources in the embedding-generation pipeline by introducing an optional Transcript Source URL column in the sources CSV. When the column is populated, chunk generation fetches and chunks the transcript content while preserving the primary URL as the user-facing retrieval link.
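The behavior described above can be illustrated with a minimal sketch. It is not the PR's code: `Chunk`, `split_text`, `chunks_for_source`, the injected `fetch` callable, and the word-count splitting rule are all hypothetical names and choices made for illustration. The only point it shows is that content comes from the transcript URL when one is present, while every chunk keeps the primary URL.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Chunk:
    url: str      # user-facing link returned at retrieval time
    content: str  # text that gets embedded


def split_text(text: str, max_words: int = 200) -> List[str]:
    """Split text into pieces of at most max_words words (assumed chunking rule)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def chunks_for_source(
    url: str,
    transcript_source_url: Optional[str],
    fetch: Callable[[str], str],
    max_words: int = 200,
) -> List[Chunk]:
    """Build chunks for one source row (hypothetical helper, not from the PR)."""
    # Read content from the transcript when one is given, otherwise from the primary URL.
    content_url = transcript_source_url or url
    text = fetch(content_url)
    # Every chunk carries the primary URL, so retrieval links point at the video page.
    return [Chunk(url=url, content=piece) for piece in split_text(text, max_words)]
```

Passing `fetch` as a parameter keeps the sketch free of network calls; in a test it can be a dictionary lookup such as `pages.__getitem__`.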

Changes:

  • Extend sources CSV read/load/save logic to include Transcript Source URL / transcript_source_url.
  • Add transcript-aware chunking path (create_transcript_chunks) and plumb transcript URLs through main().
  • Update documentation and expand tests to cover transcript URL persistence + transcript chunking behavior.
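As a rough illustration of the first bullet, the sketch below round-trips a sources CSV with an optional transcript column. The column name `Transcript Source URL` and the key `transcript_source_url` come from the overview; the `URL` column name, the two-column layout, and the helper names `load_sources` and `save_sources` are assumptions, and the real file in the repository likely has more columns.

```python
import csv
import io
from typing import Dict, List

TRANSCRIPT_COLUMN = "Transcript Source URL"  # column name taken from the PR overview
URL_COLUMN = "URL"                           # assumed name of the primary URL column


def load_sources(csv_text: str) -> List[Dict[str, str]]:
    """Read source rows, tolerating older files that lack the transcript column."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            "url": (row.get(URL_COLUMN) or "").strip(),
            # A missing column or an empty cell both become an empty string.
            "transcript_source_url": (row.get(TRANSCRIPT_COLUMN) or "").strip(),
        })
    return rows


def save_sources(rows: List[Dict[str, str]]) -> str:
    """Write rows back out, always including the transcript column."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=[URL_COLUMN, TRANSCRIPT_COLUMN], lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow({
            URL_COLUMN: row["url"],
            TRANSCRIPT_COLUMN: row.get("transcript_source_url", ""),
        })
    return buffer.getvalue()
```

Treating an absent column the same as an empty cell is one way to keep existing CSV files loadable without modification; whether the PR takes that approach is not stated in the overview.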

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| embedding-generation/generate-chunks.py | Adds transcript URL handling in CSV flows and transcript-backed chunk creation. |
| embedding-generation/tests/test_generate_chunks.py | Updates CSV expectations and adds tests for transcript persistence + transcript chunking dispatch. |
| embedding-generation/README.md | Documents the new CSV column and how transcript-backed sources behave. |

Comment thread embedding-generation/generate-chunks.py Outdated
Comment thread embedding-generation/tests/test_generate_chunks.py Outdated
Comment thread embedding-generation/generate-chunks.py

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@NeethuESim marked this pull request as ready for review June 29, 2026 20:55
@NeethuESim

Collaborator Author

Closing this PR for now. Will open another after finishing https://jira.arm.com/browse/STESOL-542

@NeethuESim closed this Jun 30, 2026
2 participants