Selectively include developer.arm.com content #95

Open
apickard wants to merge 7 commits into main from Search-DeveloperArmCom
@apickard

Collaborator

Uses the developer.arm.com search API to retrieve search results, filters them down to the relevant ones, and then generates embeddings from them. This is implemented as a separate script, generate-vectors.py, which has the same command line as generate-chunks.py (it takes one argument, the vector CSV file). Functions common to both generate-chunks.py and generate-vectors.py have been moved into generate_common.py.
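
For readers unfamiliar with the shared command line, a minimal sketch of a parser that takes one required argument might look like the following. The names `parse_args`, `sources_file`, and the `--skip-discovery` flag are assumptions for illustration; the PR description only states that a single CSV argument is required, and the diff only shows a `skip_discovery` variable, not how it is set.

```python
import argparse


def parse_args(argv=None):
    """Parse a command line with one required positional argument (hypothetical names)."""
    parser = argparse.ArgumentParser(
        description="Discover developer.arm.com content and register it as sources."
    )
    # The single required argument mentioned in the description: the vector CSV file.
    parser.add_argument("sources_file", help="path to the vector sources CSV file")
    # Hypothetical optional flag; the PR does not show how skip_discovery is populated.
    parser.add_argument(
        "--skip-discovery",
        action="store_true",
        help="reuse the existing sources instead of querying the search API",
    )
    return parser.parse_args(argv)
```

Under these assumptions, `parse_args(["vector-db-sources.csv"])` yields a namespace with `sources_file` set and `skip_discovery` left at `False`.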

Copilot AI review requested due to automatic review settings June 18, 2026 17:55

Copilot AI left a comment

Contributor

Pull request overview

Adds a new discovery pipeline for selectively including developer.arm.com content by querying Arm’s search API, filtering results, and integrating the discovered sources into the existing embedding-generation workflow by factoring shared logic into a common module.

Changes:

  • Introduces generate_common.py to share source tracking, retryable HTTP session, and chunk save/tracking utilities between scripts.
  • Adds generate-vectors.py to discover/filter developer.arm.com search results and register them into the sources CSV.
  • Expands dependencies and data inputs (adds playwright; appends new Arm Developer entries to vector-db-sources.csv) and updates tests/fixtures accordingly.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| embedding-generation/vector-db-sources.csv | Adds new Arm Developer SME-related sources to the ingestion list. |
| embedding-generation/tests/test_generate_chunks.py | Updates tests to target functionality moved into generate_common.py. |
| embedding-generation/tests/conftest.py | Adds a fixture/module loader for generate_common.py with state reset. |
| embedding-generation/requirements.txt | Adds playwright dependency for browser-based capture of search requests. |
| embedding-generation/generate-vectors.py | New script to capture/replay Arm search API results and register relevant sources. |
| embedding-generation/generate-chunks.py | Refactors to import shared utilities from generate_common.py. |
| embedding-generation/generate_common.py | New shared module containing retry session, source tracking, and chunk persistence logic. |
| embedding-generation/Dockerfile | Copies generate_common.py into the build context. |

Comment on lines +253 to +264
except Exception as err:
    print(f"Other error occurred: {err}")
    with open('info/errors.csv', 'a', newline='') as csvfile:
        csv_writer = csv.writer(csvfile)
        csv_writer.writerow([url, str(err)])
    return None
except Exception as err:
    print(f"Other error occurred: {err}")
    with open('info/errors.csv', 'a', newline='') as csvfile:
        csv_writer = csv.writer(csvfile)
        csv_writer.writerow([url, str(err)])
    return False
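
The two handlers above repeat the same error-logging steps. One way to factor that out, sketched here under the assumption that a helper would be acceptable, is a small function that appends a row to the error CSV and creates the parent directory first. The name `log_error` and its `path` parameter are hypothetical and do not appear in the PR.

```python
import csv
import os


def log_error(url, err, path="info/errors.csv"):
    """Append one (url, error message) row to the error CSV (hypothetical helper)."""
    parent = os.path.dirname(path)
    if parent:
        # Opening the file in append mode fails if the directory is missing.
        os.makedirs(parent, exist_ok=True)
    with open(path, "a", newline="", encoding="utf-8") as csvfile:
        csv.writer(csvfile).writerow([url, str(err)])
```

Each `except` block could then call `log_error(url, err)` and return its own sentinel value (`None` or `False`).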
Comment on lines +353 to +356
# Overwrite csv with new info
with open(details_file, mode='w', newline='') as file:
    csv_writer = csv.writer(file, delimiter=',')
    csv_writer.writerows(new_rows)
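
Opening the details file with `mode='w'` truncates it before the new rows are written, so a failure partway through can leave it empty or partial. A common alternative, sketched here as an assumption rather than as what the PR does, writes to a temporary file in the same directory and then swaps it into place with `os.replace`. The function name `overwrite_csv_atomically` is hypothetical.

```python
import csv
import os
import tempfile


def overwrite_csv_atomically(path, rows):
    """Replace the CSV at `path` with `rows` without exposing a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the target directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="", encoding="utf-8") as tmp:
            csv.writer(tmp, delimiter=",").writerows(rows)
        os.replace(tmp_path, path)  # swap the finished file into place
    except BaseException:
        # Clean up the temporary file if writing or replacing failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this approach the previous contents stay intact until the new file has been fully written.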
Comment on lines +202 to +204
response = http_session.get(url, timeout=60)
soup = BeautifulSoup(response.text, 'html.parser')

Comment on lines +280 to +282
keywords = list(set([searchterm] +
    [key for key_list in (page["keywords"] or []) for key in key_list.split(sep="|")] +
    [key for key_list in (page["products"] or []) for key in key_list.split(sep="|")[2:]]))
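
To make the nested comprehensions easier to follow, the same flattening can be sketched as an explicit function. This is an illustrative rewrite, not the PR's code: the name `build_keywords` and the sample data in the usage note are hypothetical, and unlike the original it returns a sorted list (for a stable order) and drops empty strings.

```python
def build_keywords(searchterm, page):
    """Merge the search term with pipe-separated keyword and product entries."""
    keywords = {searchterm}
    # Each "keywords" entry may itself hold several values separated by "|".
    for key_list in page.get("keywords") or []:
        keywords.update(key_list.split("|"))
    # For "products", skip the first two segments of each "|"-separated path,
    # mirroring the [2:] slice in the snippet above.
    for key_list in page.get("products") or []:
        keywords.update(key_list.split("|")[2:])
    keywords.discard("")  # assumption: empty segments are not useful keywords
    return sorted(keywords)
```

For a made-up page such as `{"keywords": ["SME|SVE2", "Matrix"], "products": ["Architectures|CPU Architecture|A-Profile|Armv9-A"]}` and the search term `"SME"`, this returns `["A-Profile", "Armv9-A", "Matrix", "SME", "SVE2"]`.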
Comment on lines +311 to +318
# 0) Initialize files
os.makedirs(yaml_dir, exist_ok=True)  # create if doesn't exist
details_dir = os.path.dirname(details_file)
if details_dir:
    os.makedirs(details_dir, exist_ok=True)
for filename in os.listdir(yaml_dir):
    if filename.startswith('chunk_') and filename.endswith('.yaml'):
        os.remove(os.path.join(yaml_dir, filename))
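
The cleanup step above can also be expressed with `pathlib`, which makes the filename pattern explicit. This is a sketch of an equivalent approach under the assumption that only regular files named `chunk_*.yaml` should be removed; the function name `clear_stale_chunks` is hypothetical.

```python
from pathlib import Path


def clear_stale_chunks(yaml_dir):
    """Create yaml_dir if needed and delete leftover chunk_*.yaml files; return the count."""
    directory = Path(yaml_dir)
    directory.mkdir(parents=True, exist_ok=True)
    removed = 0
    # Materialize the matches first so files are not deleted during directory iteration.
    for chunk_file in list(directory.glob("chunk_*.yaml")):
        if chunk_file.is_file():
            chunk_file.unlink()
            removed += 1
    return removed
```

Returning the number of removed files gives the caller something to log, which helps when a run unexpectedly discards previous output.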
Comment on lines +323 to +328
# 0) Obtain full database information:
# a) Learning Paths & Install Guides
if not skip_discovery:
    # Developer.Arm.Com
    createDeveloperArmComChunks(emit_chunks=False)

sentence-transformers>=5.4
pypdf
rank-bm25
playwright
Comment on lines +218 to +222
def item_is_relevant(item) -> bool:
    if not item.get("url"):
        return False
    match item["type"]:
        case "Guide":
    print("Found "+str(len(all_rows))+" results")
    return all_rows

def processDeveloperArmCom(url, title, type, keywords, emit_chunks=True):