Enrich 6.12.0 including cloudfront/firehose adapter #1823
Ian Streeter (istreeter) wants to merge 2 commits into
Ian Streeter (istreeter) commented Jun 10, 2026
- Bumps Enrich component version to 6.12.0
- Adds documentation for the CloudFront adapter, because users can start using it immediately
- Documentation for agent classification enrichment is deferred to Agent classification enrichment #1822 because it needs console support.
Deploying with

| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! View logs | documentation | 537f205 | Commit Preview URL, Branch Preview URL | Jun 10 2026, 11:07 AM |
Documentation style review

Overall this is a well-structured, clear page that follows the repo conventions closely: good frontmatter, correct file structure (cloudfront/index.md), internal links ending in /index.md, sentence-case headings, present tense, and proper terminology (Collector, Enrich, page view event). The two external AWS links both resolve (HTTP 200). A few items to address:

1. Avoid "we": refer to Snowplow as "Snowplow" (or rephrase). The style guide states: "Refer to Snowplow as Snowplow, not we or our." There are two instances. Suggested fixes: drop the first person, e.g. "For this reason, omit c-ip from the fields you enable in CloudFront" and "Snowplow recommends not including c-ip." (Applies to both occurrences.)
2. Minor: borderline marketing phrasing. The style guide bans marketing language. "naturally" is not on the explicit banned-word list, but "integrates naturally" reads as a value claim. Consider a more concrete statement, e.g. "which appears alongside your existing web analytics events." Low priority.

Nothing else material: tables, admonitions, bold usage for UI elements (Standard logging, JSON, None), and link formatting all look correct.
You can extend `AGENT_UA_SUBSTRINGS` to include additional user agents, or replace the filtering logic to suit your use case, for example to also drop requests for static assets based on the `cs-uri-stem` field.
Could you add an example of URI filtering, same as what I did for Cloudflare?
It's not easy to come up with URI filtering that would work for every website as they are all different.
I guess customers would need to fine-tune the filtering depending on their setup.
There are 2 approaches:
- Deny all and allow a whitelist.
- Allow all and deny a blacklist.
I went with the latter as I feel it is easier to determine what is a static asset that we don't want (e.g. css).
Please let me know if you don't agree.
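The two approaches above can be contrasted with a minimal sketch. The prefix and suffix tuples and both function names are hypothetical, chosen only for illustration; the real lists would depend on each site's layout.

```python
# Hypothetical values for illustration; neither tuple comes from the PR.
ALLOWED_PREFIXES = ("/blog/", "/docs/")
DENIED_SUFFIXES = (".css", ".js", ".png")


def keep_with_allowlist(uri_stem):
    """Deny all, then allow: keep only the site root and listed path prefixes."""
    stem = uri_stem.lower()
    return stem == "/" or stem.startswith(ALLOWED_PREFIXES)


def keep_with_denylist(uri_stem):
    """Allow all, then deny: keep everything except listed asset suffixes."""
    return not uri_stem.lower().endswith(DENIED_SUFFIXES)


# Print both decisions side by side for a few sample stems.
for stem in ("/", "/blog/launch", "/pricing", "/static/app.css"):
    print(stem, keep_with_allowlist(stem), keep_with_denylist(stem))
```

The stem `/pricing` shows the trade-off: the allowlist drops it until someone adds a matching prefix, while the denylist keeps it by default.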
What gets kept
- HTML pages: `/about.html`, `/index.htm`
- Site root: `/`
- Dynamic backends: `/login.php`, `/Page.aspx`, `/checkout.jsp`, `/cart.do`
- Extension-less paths: `/about`, `/api/users`, `/blog/post-slug`
- Documents that act as content: `/whitepaper.pdf`, `/report.docx`, `/feed.xml`, `/manifest.json`, `/robots.txt` (debatable: these aren't in the asset list, so they pass)
What gets dropped
- Styles: `.css`
- Scripts: `.js`, `.mjs`, `.map`
- Images: `.png`, `.jpg`, `.jpeg`, `.gif`, `.svg`, `.webp`, `.ico`, `.bmp`, `.tif`, `.tiff`, `.avif`, `.heic`
- Fonts: `.woff`, `.woff2`, `.ttf`, `.otf`, `.eot`
- Audio/Video: `.mp3`, `.wav`, `.ogg`, `.m4a`, `.flac`, `.mp4`, `.webm`, `.mov`, `.m4v`, `.avi`, `.mkv`, `.flv`
- Archives: `.zip`, `.tar`, `.gz`, `.tgz`, `.bz2`, `.7z`, `.rar`
A few deliberate choices worth flagging
- `.pdf`, `.json`, `.xml`, `.txt` are NOT in the asset list. PDFs are often customer-facing content (downloads they care about); JSON/XML are often API responses. Customers who want these dropped can edit the tuple. The conservative default is "keep it": false-positive page views are easier to deal with than missing real traffic.
- `.map` is dropped, since source maps are dev tooling, but if a customer publishes user-facing tools that use the `.map` extension for something else, they should remove it.
- Case-insensitive matching means `/IMG.JPG` is dropped just like `/img.jpg`.
- Path-style edge cases (`/path;jsessionid=…html`) match correctly via `endswith` against the lowercased stem.

If you want to be even more conservative (drop fewer things), the safest minimal set is `(".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico", ".woff", ".woff2")`.
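As a rough illustration of what switching to the minimal set changes, the sketch below reuses the same suffix check with a swappable tuple. The name `MINIMAL_ASSET_EXTENSIONS` and the `extensions` parameter are assumptions added for this example, and the sample paths are made up.

```python
# The minimal set quoted above; the constant name is an assumption for this sketch.
MINIMAL_ASSET_EXTENSIONS = (
    ".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico",
    ".woff", ".woff2",
)


def is_asset_request(uri_stem, extensions=MINIMAL_ASSET_EXTENSIONS):
    """Return True if `uri_stem` ends with one of `extensions` (case-insensitive)."""
    if not uri_stem or uri_stem == "-":
        return False
    # str.endswith accepts a tuple of suffixes, so one call covers the whole set.
    return uri_stem.lower().endswith(extensions)


# Made-up sample paths: media and documents now pass, core web assets do not.
for stem in ("/IMG.JPG", "/fonts/site.woff2", "/media/intro.mp4", "/whitepaper.pdf", "/about"):
    print(stem, "dropped" if is_asset_request(stem) else "kept")
```

With the minimal tuple, media files such as `/media/intro.mp4` are no longer treated as assets, so they would count as candidate page hits unless the extension is added back.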
"""
Firehose data-transformation Lambda for the CloudFront ingestion adapter.

Keeps only access-log records that satisfy BOTH conditions:
  1. `cs(User-Agent)` looks like an AI agent (Claude, ChatGPT, etc.)
  2. `cs-uri-stem` is NOT an asset request (CSS, JS, images, fonts, media)

Page hits and any non-asset URL pass through: static HTML (.html / .htm), the
site root (/), dynamic backends (.php, .aspx, .jsp), REST/SPA routes
(extension-less, e.g. /about or /api/users). Only unambiguous static assets are
dropped.

Deployment:
  - Runtime: Python 3.12 (or any 3.x supported by Lambda)
  - Handler: firehose_filter_agent_traffic.lambda_handler
  - Timeout: 60s is plenty (no I/O, just parse + match)
  - Memory: 128 MB
  - Attach to the Firehose stream's "Transform source records with AWS Lambda" step.
"""
import base64
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Case-insensitive substrings used to identify AI-agent traffic in the User-Agent
# header. Extend as needed - each addition is a one-line tuple entry.
AGENT_UA_SUBSTRINGS = (
    "claude",      # Claude Code, Claude Desktop, anthropic-* clients
    "chatgpt",     # ChatGPT app
    "gptbot",      # OpenAI's crawler
    "openai",      # other OpenAI clients
    "gemini",      # Google Gemini
    "perplexity",  # Perplexity
    "copilot",     # GitHub Copilot / Microsoft Copilot
)

# Case-insensitive suffixes identifying static assets. URIs ending in any of
# these are considered asset fetches (CSS, JS, images, fonts, media, archives,
# source maps, favicons) and dropped. Everything else - including HTML pages,
# dynamic backends (.php, .aspx, .jsp), extension-less REST/SPA routes, and the
# site root `/` - passes through as a candidate page hit.
ASSET_EXTENSIONS = (
    # Styles
    ".css",
    # Scripts
    ".js", ".mjs", ".map",
    # Images
    ".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".ico", ".bmp",
    ".tif", ".tiff", ".avif", ".heic",
    # Fonts
    ".woff", ".woff2", ".ttf", ".otf", ".eot",
    # Audio / Video
    ".mp3", ".wav", ".ogg", ".m4a", ".flac",
    ".mp4", ".webm", ".mov", ".m4v", ".avi", ".mkv", ".flv",
    # Archives / downloads
    ".zip", ".tar", ".gz", ".tgz", ".bz2", ".7z", ".rar",
)

USER_AGENT_FIELD = "cs(User-Agent)"
URI_STEM_FIELD = "cs-uri-stem"

def is_agent(user_agent):
    """Return True iff `user_agent` looks like an AI agent."""
    if not user_agent or user_agent == "-":
        return False
    lower = user_agent.lower()
    return any(needle in lower for needle in AGENT_UA_SUBSTRINGS)


def is_asset_request(uri_stem):
    """Return True iff `uri_stem` ends with a known static-asset extension."""
    if not uri_stem or uri_stem == "-":
        return False
    return uri_stem.lower().endswith(ASSET_EXTENSIONS)


def classify(record):
    """Decide what to do with a single Firehose record."""
    try:
        payload = base64.b64decode(record["data"])
        log_entry = json.loads(payload)
    except Exception as e:  # malformed Firehose record - flag for inspection
        logger.warning("Could not parse record %s: %s", record.get("recordId"), e)
        return "ProcessingFailed"

    ua = log_entry.get(USER_AGENT_FIELD, "")
    stem = log_entry.get(URI_STEM_FIELD, "")
    if is_agent(ua) and not is_asset_request(stem):
        return "Ok"
    return "Dropped"

def lambda_handler(event, context):
    """Firehose passes a batch of records; we tag each as Ok / Dropped / ProcessingFailed."""
    output = []
    kept = dropped = failed = 0
    for record in event["records"]:
        result = classify(record)
        if result == "Ok":
            kept += 1
        elif result == "Dropped":
            dropped += 1
        else:
            failed += 1
        # Firehose requires `data` on every output record, even for Dropped/ProcessingFailed
        # (it is ignored in those cases). We never rewrite the bytes, so just echo them back.
        output.append({
            "recordId": record["recordId"],
            "result": result,
            "data": record["data"],
        })
    logger.info("processed=%d kept=%d dropped=%d failed=%d", len(output), kept, dropped, failed)
    return {"records": output}
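
To try the handler locally before attaching it to a stream, one option is to build a synthetic batch in the shape the script reads: a `records` list whose `data` values are base64-encoded JSON. The helpers `make_record`, `make_event`, and `results_by_id` are hypothetical names introduced for this sketch, and the user-agent strings are illustrative rather than captured traffic.

```python
import base64
import json


def make_record(record_id, log_entry):
    """Encode one log entry as base64 JSON, matching what the handler decodes."""
    data = base64.b64encode(json.dumps(log_entry).encode("utf-8")).decode("ascii")
    return {"recordId": record_id, "data": data}


def make_event(log_entries):
    """Wrap log entries in a batch with sequential string record IDs."""
    return {"records": [make_record(str(i), entry) for i, entry in enumerate(log_entries)]}


def results_by_id(response):
    """Map each recordId in a handler response to its result string."""
    return {r["recordId"]: r["result"] for r in response["records"]}


# Illustrative entries: agent page hit, agent asset fetch, ordinary browser hit.
sample_event = make_event([
    {"cs(User-Agent)": "Mozilla/5.0 (compatible; GPTBot/1.0)", "cs-uri-stem": "/about"},
    {"cs(User-Agent)": "Mozilla/5.0 (compatible; GPTBot/1.0)", "cs-uri-stem": "/static/app.css"},
    {"cs(User-Agent)": "Mozilla/5.0 (Windows NT 10.0)", "cs-uri-stem": "/about"},
])

# Assuming the script above is saved as firehose_filter_agent_traffic.py:
#   from firehose_filter_agent_traffic import lambda_handler
#   print(results_by_id(lambda_handler(sample_event, None)))
```

Under the script's rules, the first record should come back as `Ok` and the other two as `Dropped`, since one is an asset fetch and the other does not match any agent substring.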