Enrich 6.12.0 including cloudfront/firehose adapter by istreeter · Pull Request #1823 · snowplow/documentation

Ian Streeter (istreeter) · 2026-06-10T10:52:57Z

- Bumps the Enrich component version to 6.12.0.
- Adds documentation for the CloudFront adapter, because users can start using it immediately.
- Defers documentation for the agent classification enrichment to "Agent classification enrichment" #1822, because it needs console support.

cloudflare-workers-and-pages · 2026-06-10T10:54:55Z

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ Deployment successful! View logs | documentation | `537f205` | Commit Preview URL, Branch Preview URL | Jun 10 2026, 11:07 AM |

claude · 2026-06-10T10:55:41Z

Documentation style review

Overall this is a well-structured, clear page that follows the repo conventions closely — good frontmatter, correct file structure (cloudfront/index.md), internal links ending in /index.md, sentence-case headings, present tense, and proper terminology (Collector, Enrich, page view event). The two external AWS links both resolve (HTTP 200). A few items to address:

1. Avoid "we" — refer to Snowplow as "Snowplow" (or rephrase)

The style guide states: "Refer to Snowplow as Snowplow, not we or our." Two instances:

For this reason, we recommend omitting c-ip from the fields you enable in CloudFront.

We recommend not including c-ip. See the note on IP address collection above.

Suggested fixes — drop the first person, e.g. "For this reason, omit c-ip from the fields you enable in CloudFront" and "Snowplow recommends not including c-ip." (Applies to both occurrences.)

2. Minor: borderline marketing phrasing

which integrates naturally with your existing web analytics

The style guide bans marketing language. "naturally" is not on the explicit banned-word list, but "integrates naturally" reads as a value claim. Consider a more concrete statement, e.g. "which appears alongside your existing web analytics events." Low priority.

Nothing else material — tables, admonitions, bold usage for UI elements (Standard logging, JSON, None), and link formatting all look correct.

Nick (stanch) · 2026-06-11T11:19:38Z

+    return {"records": output}
+```
+
+You can extend `AGENT_UA_SUBSTRINGS` to include additional user agents, or replace the filtering logic to suit your use case — for example, to also drop requests for static assets based on the `cs-uri-stem` field.


Could you add an example of URI filtering, same as what I did for Cloudflare?

Benjamin BENOIST (@benjben)

It's not easy to come up with URI filtering that would work for every website as they are all different.
I guess customers would need to fine-tune the filtering depending on their setup.

There are 2 approaches:

1. Deny all and allow a whitelist.
2. Allow all and deny a blacklist.

I went with the latter as I feel it is easier to determine what is a static asset that we don't want (e.g. css).

Please let me know if you don't agree.
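To make the two approaches concrete, here is a minimal sketch of each as a predicate over `cs-uri-stem`. The tuples `ALLOWED_PREFIXES` and `DENIED_SUFFIXES` and both function names are hypothetical and chosen only for illustration; they are not taken from the PR.

```python
# Hypothetical values for illustration only; neither tuple comes from the PR.
ALLOWED_PREFIXES = ("/docs/", "/blog/")
DENIED_SUFFIXES = (".css", ".js", ".png")


def keep_with_allow_list(uri_stem: str) -> bool:
    """Deny all, then keep only paths under a known prefix."""
    return uri_stem.lower().startswith(ALLOWED_PREFIXES)


def keep_with_deny_list(uri_stem: str) -> bool:
    """Allow all, then drop paths ending in a known asset suffix."""
    return not uri_stem.lower().endswith(DENIED_SUFFIXES)
```

With an allow list, any page outside the listed prefixes is silently lost until someone adds it. With a deny list, an unlisted asset type leaks through as an extra record, which is the failure mode the comment above prefers.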

What gets kept

HTML pages: /about.html, /index.htm

Site root: /

Dynamic backends: /login.php, /Page.aspx, /checkout.jsp, /cart.do

Extension-less paths: /about, /api/users, /blog/post-slug

Documents that act as content: /whitepaper.pdf, /report.docx, /feed.xml, /manifest.json, /robots.txt (debatable — these aren't in the asset list, so they pass)

What gets dropped

Styles: .css

Scripts: .js, .mjs, .map

Images: .png, .jpg, .jpeg, .gif, .svg, .webp, .ico, .bmp, .tif, .tiff, .avif, .heic

Fonts: .woff, .woff2, .ttf, .otf, .eot

Audio/Video: .mp3, .wav, .ogg, .m4a, .flac, .mp4, .webm, .mov, .m4v, .avi, .mkv, .flv

Archives: .zip, .tar, .gz, .tgz, .bz2, .7z, .rar

A few deliberate choices worth flagging

.pdf, .json, .xml, .txt are NOT in the asset list. PDFs are often customer-facing content (downloads they care about); JSON/XML are often API responses. Customers who want these dropped can edit the tuple. The conservative default is "keep it" — false-positive page views are easier to deal with than missing real traffic.

.map is dropped, since source-maps are dev tooling — but if a customer publishes user-facing tools that use .map extension for something else, they should remove it.

Case-insensitive matching means /IMG.JPG is dropped just like /img.jpg.

Path-style edge cases (/path;jsessionid=…html) match correctly via endswith against the lowercased stem.

If you want to be even more conservative (drop fewer things), the safest minimal set is (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico", ".woff", ".woff2").
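As a rough sketch of how the case-insensitive suffix match behaves with that minimal set, the snippet below lowercases the stem and passes a tuple to `str.endswith`. The name `MINIMAL_ASSET_EXTENSIONS` and the `extensions` parameter are assumptions added here so the set can be swapped; the function in the PR takes only the stem.

```python
# The minimal set quoted above; the constant name is hypothetical.
MINIMAL_ASSET_EXTENSIONS = (
    ".css", ".js", ".png", ".jpg", ".jpeg",
    ".gif", ".svg", ".ico", ".woff", ".woff2",
)


def is_asset_request(uri_stem, extensions=MINIMAL_ASSET_EXTENSIONS):
    """Return True if the lowercased stem ends with any listed extension."""
    # CloudFront writes "-" for an empty field, so treat it as "no stem".
    if not uri_stem or uri_stem == "-":
        return False
    # str.endswith accepts a tuple and checks every suffix in one call.
    return uri_stem.lower().endswith(extensions)


for stem in ("/IMG.JPG", "/img.jpg", "/about", "/whitepaper.pdf", "/clip.mp4"):
    print(stem, is_asset_request(stem))
```

Under this smaller set, `/clip.mp4` is kept, whereas the full list would drop it.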

```python
"""
Firehose data-transformation Lambda for the CloudFront ingestion adapter.

Keeps only access-log records that satisfy BOTH conditions:
  1. `cs(User-Agent)` looks like an AI agent (Claude, ChatGPT, etc.)
  2. `cs-uri-stem` is NOT an asset request (CSS, JS, images, fonts, media)

Page hits and any non-asset URL pass through: static HTML (.html / .htm),
the site root (/), dynamic backends (.php, .aspx, .jsp), REST/SPA routes
(extension-less, e.g. /about or /api/users). Only unambiguous static assets
are dropped.

Deployment:
  - Runtime: Python 3.12 (or any 3.x supported by Lambda)
  - Handler: firehose_filter_agent_traffic.lambda_handler
  - Timeout: 60s is plenty (no I/O, just parse + match)
  - Memory: 128 MB
  - Attach to the Firehose stream's "Transform source records with AWS Lambda" step.
"""

import base64
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Case-insensitive substrings used to identify AI-agent traffic in the User-Agent
# header. Extend as needed - each addition is a one-line tuple entry.
AGENT_UA_SUBSTRINGS = (
    "claude",      # Claude Code, Claude Desktop, anthropic-* clients
    "chatgpt",     # ChatGPT app
    "gptbot",      # OpenAI's crawler
    "openai",      # other OpenAI clients
    "gemini",      # Google Gemini
    "perplexity",  # Perplexity
    "copilot",     # GitHub Copilot / Microsoft Copilot
)

# Case-insensitive suffixes identifying static assets. URIs ending in any of
# these are considered asset fetches (CSS, JS, images, fonts, media, archives,
# source maps, favicons) and dropped. Everything else - including HTML pages,
# dynamic backends (.php, .aspx, .jsp), extension-less REST/SPA routes, and the
# site root `/` - passes through as a candidate page hit.
ASSET_EXTENSIONS = (
    # Styles
    ".css",
    # Scripts
    ".js", ".mjs", ".map",
    # Images
    ".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".ico", ".bmp",
    ".tif", ".tiff", ".avif", ".heic",
    # Fonts
    ".woff", ".woff2", ".ttf", ".otf", ".eot",
    # Audio / Video
    ".mp3", ".wav", ".ogg", ".m4a", ".flac",
    ".mp4", ".webm", ".mov", ".m4v", ".avi", ".mkv", ".flv",
    # Archives / downloads
    ".zip", ".tar", ".gz", ".tgz", ".bz2", ".7z", ".rar",
)

USER_AGENT_FIELD = "cs(User-Agent)"
URI_STEM_FIELD = "cs-uri-stem"


def is_agent(user_agent):
    """Return True iff `user_agent` looks like an AI agent."""
    if not user_agent or user_agent == "-":
        return False
    lower = user_agent.lower()
    return any(needle in lower for needle in AGENT_UA_SUBSTRINGS)


def is_asset_request(uri_stem):
    """Return True iff `uri_stem` ends with a known static-asset extension."""
    if not uri_stem or uri_stem == "-":
        return False
    return uri_stem.lower().endswith(ASSET_EXTENSIONS)


def classify(record):
    """Decide what to do with a single Firehose record."""
    try:
        payload = base64.b64decode(record["data"])
        log_entry = json.loads(payload)
    except Exception as e:
        # malformed Firehose record - flag for inspection
        logger.warning("Could not parse record %s: %s", record.get("recordId"), e)
        return "ProcessingFailed"

    ua = log_entry.get(USER_AGENT_FIELD, "")
    stem = log_entry.get(URI_STEM_FIELD, "")

    if is_agent(ua) and not is_asset_request(stem):
        return "Ok"
    return "Dropped"


def lambda_handler(event, context):
    """Firehose passes a batch of records; we tag each as Ok / Dropped / ProcessingFailed."""
    output = []
    kept = dropped = failed = 0

    for record in event["records"]:
        result = classify(record)
        if result == "Ok":
            kept += 1
        elif result == "Dropped":
            dropped += 1
        else:
            failed += 1

        # Firehose requires `data` on every output record, even for Dropped/ProcessingFailed
        # (it is ignored in those cases). We never rewrite the bytes, so just echo them back.
        output.append({
            "recordId": record["recordId"],
            "result": result,
            "data": record["data"],
        })

    logger.info("processed=%d kept=%d dropped=%d failed=%d",
                len(output), kept, dropped, failed)
    return {"records": output}
```
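For trying the handler locally, a small helper that builds a synthetic input event may be useful. This is only a sketch: `make_record`, `make_event`, and the sample user agents and paths are hypothetical, and the envelope is reduced to the two keys the handler above actually reads (`recordId` and base64-encoded `data`). Real Firehose events carry additional metadata that the handler ignores.

```python
import base64
import json


def make_record(record_id, log_entry):
    """Encode one JSON log entry the way the handler expects to decode it."""
    raw = json.dumps(log_entry).encode("utf-8")
    return {"recordId": record_id, "data": base64.b64encode(raw).decode("ascii")}


def make_event(log_entries):
    """Wrap log entries in the {"records": [...]} envelope the handler iterates over."""
    return {"records": [make_record(str(i), e) for i, e in enumerate(log_entries)]}


# Hypothetical sample traffic: an agent page hit, an agent asset fetch,
# and a browser page hit.
sample_event = make_event([
    {"cs(User-Agent)": "ClaudeBot/1.0", "cs-uri-stem": "/about"},
    {"cs(User-Agent)": "ClaudeBot/1.0", "cs-uri-stem": "/style.css"},
    {"cs(User-Agent)": "Mozilla/5.0", "cs-uri-stem": "/about"},
])
```

Calling `lambda_handler(sample_event, None)` with the function above should tag the first record `Ok` and the other two `Dropped`, since only the first is both agent traffic and a non-asset path.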

- Enrich 6.12.0 including cloudfront/firehose adapter (`0ad419d`)
- Address claude criticism (`537f205`)

Nick (stanch) reviewed Jun 11, 2026


Benjamin BENOIST (benjben) approved these changes Jun 11, 2026


Ian Streeter (istreeter) wants to merge 2 commits into main from enrich-6.12.0

3 participants
