
Enrich 6.12.0 including cloudfront/firehose adapter #1823

Open
Ian Streeter (istreeter) wants to merge 2 commits into main from enrich-6.12.0


@istreeter

  • Bumps Enrich component version to 6.12.0
  • Adds documentation for the CloudFront adapter, because users can start using it immediately
  • Documentation for agent classification enrichment is deferred to Agent classification enrichment #1822 because it needs console support.

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 10, 2026


Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ Deployment successful! (View logs) | documentation | 537f205 | Commit Preview URL, Branch Preview URL | Jun 10 2026, 11:07 AM |

@claude

claude Bot commented Jun 10, 2026


Documentation style review

Overall this is a well-structured, clear page that follows the repo conventions closely — good frontmatter, correct file structure (cloudfront/index.md), internal links ending in /index.md, sentence-case headings, present tense, and proper terminology (Collector, Enrich, page view event). The two external AWS links both resolve (HTTP 200). A few items to address:

1. Avoid "we" — refer to Snowplow as "Snowplow" (or rephrase)

The style guide states: "Refer to Snowplow as Snowplow, not we or our." Two instances:

For this reason, we recommend omitting c-ip from the fields you enable in CloudFront.

We recommend not including c-ip. See the note on IP address collection above.

Suggested fixes — drop the first person, e.g. "For this reason, omit c-ip from the fields you enable in CloudFront" and "Snowplow recommends not including c-ip." (Applies to both occurrences.)

2. Minor: borderline marketing phrasing

which integrates naturally with your existing web analytics

The style guide bans marketing language. "naturally" is not on the explicit banned-word list, but "integrates naturally" reads as a value claim. Consider a more concrete statement, e.g. "which appears alongside your existing web analytics events." Low priority.

Nothing else material — tables, admonitions, bold usage for UI elements (Standard logging, JSON, None), and link formatting all look correct.

```python
return {"records": output}
```

You can extend `AGENT_UA_SUBSTRINGS` to include additional user agents, or replace the filtering logic to suit your use case — for example, to also drop requests for static assets based on the `cs-uri-stem` field.


Could you add an example of URI filtering, same as what I did for Cloudflare?


It's not easy to come up with URI filtering that would work for every website as they are all different.
I guess customers would need to fine-tune the filtering depending on their setup.

There are 2 approaches:

  • Deny all and allow a whitelist.
  • Allow all and deny a blacklist.

I went with the latter because I find it easier to identify the static assets we don't want (e.g. .css) than to list every URL we do want.

Please let me know if you don't agree.
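
To make the trade-off between the two approaches concrete, here is a minimal sketch. The two suffix tuples and both function names are hypothetical and deliberately short; they are not the lists used in the Lambda.

```python
# Hypothetical, shortened suffix lists for illustration only.
ALLOWED_PAGE_SUFFIXES = (".html", ".htm", ".php")   # approach 1: keep only these
DENIED_ASSET_SUFFIXES = (".css", ".js", ".png")     # approach 2: drop only these


def keep_with_allowlist(uri_stem):
    """Deny by default; keep only URIs ending in a known page suffix."""
    return uri_stem.lower().endswith(ALLOWED_PAGE_SUFFIXES)


def keep_with_denylist(uri_stem):
    """Allow by default; drop only URIs ending in a known asset suffix."""
    return not uri_stem.lower().endswith(DENIED_ASSET_SUFFIXES)


# Columns: URI stem, kept by allowlist?, kept by denylist?
for stem in ("/index.html", "/style.css", "/about", "/api/users"):
    print(stem, keep_with_allowlist(stem), keep_with_denylist(stem))
```

In this sketch the allowlist silently loses extension-less routes such as `/about` unless every route is enumerated, while the denylist keeps them by default.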


What gets kept

  • HTML pages: /about.html, /index.htm
  • Site root: /
  • Dynamic backends: /login.php, /Page.aspx, /checkout.jsp, /cart.do
  • Extension-less paths: /about, /api/users, /blog/post-slug
  • Documents that act as content: /whitepaper.pdf, /report.docx, /feed.xml, /manifest.json, /robots.txt (debatable — these aren't in the asset list, so they pass)

What gets dropped

  • Styles: .css
  • Scripts: .js, .mjs, .map
  • Images: .png, .jpg, .jpeg, .gif, .svg, .webp, .ico, .bmp, .tif, .tiff, .avif, .heic
  • Fonts: .woff, .woff2, .ttf, .otf, .eot
  • Audio/Video: .mp3, .wav, .ogg, .m4a, .flac, .mp4, .webm, .mov, .m4v, .avi, .mkv, .flv
  • Archives: .zip, .tar, .gz, .tgz, .bz2, .7z, .rar

A few deliberate choices worth flagging

  • .pdf, .json, .xml, .txt are NOT in the asset list. PDFs are often customer-facing content (downloads they care about); JSON/XML are often API responses. Customers who want these dropped can edit the tuple. The conservative default is "keep it" — false-positive page views are easier to deal with than missing real traffic.
  • .map is dropped, since source maps are dev tooling. If a customer publishes user-facing tools that use the .map extension for something else, they should remove it.
  • Case-insensitive matching means /IMG.JPG is dropped just like /img.jpg.
  • Path-style edge cases (/path;jsessionid=…html) match correctly via endswith against the lowercased stem.

If you want to be even more conservative (drop fewer things), the safest minimal set is (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico", ".woff", ".woff2").
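
As a quick check of the kept/dropped behaviour described above, the sketch below applies the minimal set to a few sample paths. The `extensions` parameter and the sample list are additions for illustration and are not part of the Lambda.

```python
# The "safest minimal set" quoted above, copied as-is.
MINIMAL_ASSET_EXTENSIONS = (
    ".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico",
    ".woff", ".woff2",
)


def is_asset_request(uri_stem, extensions=MINIMAL_ASSET_EXTENSIONS):
    """Return True if `uri_stem` ends with one of `extensions` (case-insensitive)."""
    if not uri_stem or uri_stem == "-":
        return False
    return uri_stem.lower().endswith(extensions)


# Illustrative sample paths, mostly taken from the lists above.
samples = ["/", "/about.html", "/api/users", "/whitepaper.pdf",
           "/IMG.JPG", "/app.js", "/intro.mp4"]
for stem in samples:
    verdict = "dropped" if is_asset_request(stem) else "kept"
    print(f"{stem:16} {verdict}")

# Opting in to dropping PDFs as well: extend the tuple by one entry.
with_pdf = MINIMAL_ASSET_EXTENSIONS + (".pdf",)
print(is_asset_request("/whitepaper.pdf", with_pdf))
```

With the minimal set, `/intro.mp4` is kept because video extensions are only in the full list; `/IMG.JPG` is dropped because matching is done on the lowercased stem.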

"""
Firehose data-transformation Lambda for the CloudFront ingestion adapter.

Keeps only access-log records that satisfy BOTH conditions:
  1. `cs(User-Agent)` looks like an AI agent (Claude, ChatGPT, etc.)
  2. `cs-uri-stem` is NOT an asset request (CSS, JS, images, fonts, media)

Page hits and any non-asset URL pass through: static HTML (.html / .htm), the
site root (/), dynamic backends (.php, .aspx, .jsp), REST/SPA routes
(extension-less, e.g. /about or /api/users). Only unambiguous static assets are
dropped.

Deployment:
- Runtime: Python 3.12 (or any 3.x supported by Lambda)
- Handler: firehose_filter_agent_traffic.lambda_handler
- Timeout: 60s is plenty (no I/O, just parse + match)
- Memory: 128 MB
- Attach to the Firehose stream's "Transform source records with AWS Lambda" step.
"""

import base64
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Case-insensitive substrings used to identify AI-agent traffic in the User-Agent
# header. Extend as needed - each addition is a one-line tuple entry.
AGENT_UA_SUBSTRINGS = (
    "claude",        # Claude Code, Claude Desktop, anthropic-* clients
    "chatgpt",       # ChatGPT app
    "gptbot",        # OpenAI's crawler
    "openai",        # other OpenAI clients
    "gemini",        # Google Gemini
    "perplexity",    # Perplexity
    "copilot",       # GitHub Copilot / Microsoft Copilot
)

# Case-insensitive suffixes identifying static assets. URIs ending in any of
# these are considered asset fetches (CSS, JS, images, fonts, media, archives,
# source maps, favicons) and dropped. Everything else - including HTML pages,
# dynamic backends (.php, .aspx, .jsp), extension-less REST/SPA routes, and the
# site root `/` - passes through as a candidate page hit.
ASSET_EXTENSIONS = (
    # Styles
    ".css",
    # Scripts
    ".js", ".mjs", ".map",
    # Images
    ".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".ico", ".bmp",
    ".tif", ".tiff", ".avif", ".heic",
    # Fonts
    ".woff", ".woff2", ".ttf", ".otf", ".eot",
    # Audio / Video
    ".mp3", ".wav", ".ogg", ".m4a", ".flac",
    ".mp4", ".webm", ".mov", ".m4v", ".avi", ".mkv", ".flv",
    # Archives / downloads
    ".zip", ".tar", ".gz", ".tgz", ".bz2", ".7z", ".rar",
)

USER_AGENT_FIELD = "cs(User-Agent)"
URI_STEM_FIELD = "cs-uri-stem"


def is_agent(user_agent):
    """Return True iff `user_agent` looks like an AI agent."""
    if not user_agent or user_agent == "-":
        return False
    lower = user_agent.lower()
    return any(needle in lower for needle in AGENT_UA_SUBSTRINGS)


def is_asset_request(uri_stem):
    """Return True iff `uri_stem` ends with a known static-asset extension."""
    if not uri_stem or uri_stem == "-":
        return False
    return uri_stem.lower().endswith(ASSET_EXTENSIONS)


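def classify(record):
    """Decide what to do with a single Firehose record."""
    try:
        payload = base64.b64decode(record["data"])
        log_entry = json.loads(payload)
    except Exception as e:  # malformed Firehose record - flag for inspection
        logger.warning("Could not parse record %s: %s", record.get("recordId"), e)
        return "ProcessingFailed"

    # Valid JSON that is not an object (e.g. a list or a bare string) has no
    # fields to inspect, so treat it as a failure rather than crash on .get().
    if not isinstance(log_entry, dict):
        logger.warning("Record %s is not a JSON object", record.get("recordId"))
        return "ProcessingFailed"

    ua = log_entry.get(USER_AGENT_FIELD, "")
    stem = log_entry.get(URI_STEM_FIELD, "")
    if is_agent(ua) and not is_asset_request(stem):
        return "Ok"
    return "Dropped"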


def lambda_handler(event, context):
    """Firehose passes a batch of records; we tag each as Ok / Dropped / ProcessingFailed."""
    output = []
    kept = dropped = failed = 0

    for record in event["records"]:
        result = classify(record)
        if result == "Ok":
            kept += 1
        elif result == "Dropped":
            dropped += 1
        else:
            failed += 1
        # Firehose requires `data` on every output record, even for Dropped/ProcessingFailed
        # (it is ignored in those cases). We never rewrite the bytes, so just echo them back.
        output.append({
            "recordId": record["recordId"],
            "result": result,
            "data": record["data"],
        })

    logger.info("processed=%d kept=%d dropped=%d failed=%d", len(output), kept, dropped, failed)
    return {"records": output}
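
For a local smoke test without deploying, a synthetic Firehose-style event can be built by hand. The sketch below is self-contained: `make_record` is a hypothetical helper, `classify` is a condensed stand-in for the function above with shortened lists, and the user-agent strings are made up for illustration.

```python
import base64
import json


def make_record(record_id, log_entry):
    """Build one Firehose-style input record: base64-encoded JSON in `data`."""
    data = base64.b64encode(json.dumps(log_entry).encode("utf-8")).decode("ascii")
    return {"recordId": record_id, "data": data}


def classify(record, agents=("claude", "gptbot"), assets=(".css", ".js", ".png")):
    """Condensed stand-in for the handler's classify(); lists are shortened."""
    try:
        entry = json.loads(base64.b64decode(record["data"]))
    except Exception:
        return "ProcessingFailed"
    if not isinstance(entry, dict):
        return "ProcessingFailed"
    ua = str(entry.get("cs(User-Agent)", "")).lower()
    stem = str(entry.get("cs-uri-stem", "")).lower()
    is_agent = any(name in ua for name in agents)
    return "Ok" if is_agent and not stem.endswith(assets) else "Dropped"


# Four synthetic records: agent page hit, agent asset hit, browser page hit,
# and a payload that is valid base64 but not JSON.
event = {"records": [
    make_record("1", {"cs(User-Agent)": "ClaudeBot/1.0", "cs-uri-stem": "/about"}),
    make_record("2", {"cs(User-Agent)": "ClaudeBot/1.0", "cs-uri-stem": "/style.css"}),
    make_record("3", {"cs(User-Agent)": "Mozilla/5.0", "cs-uri-stem": "/about"}),
    {"recordId": "4", "data": base64.b64encode(b"not json").decode("ascii")},
]}

results = {r["recordId"]: classify(r) for r in event["records"]}
print(results)
```

Only record 1 is tagged `Ok`; records 2 and 3 are `Dropped`, and record 4 is `ProcessingFailed`. To exercise the real handler, the same `event` dictionary could be passed to `lambda_handler(event, None)`.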
