WatchTree-19/README.md

Hi, I'm Sandeep

Quantitative engineer working on AI SRE evaluation and benchmarks. Independent researcher and Columbia University alumnus, based in the UK.

What I'm building

  • Independent writing on AI evaluation methodology, observability, and the structural overlap between quant trading and LLM eval.
  • Approaches to information asymmetry in ML evaluation: surfacing what labs know internally about benchmark noise and drift.
  • Calibration tooling for benchmark drift, to distinguish genuine model improvement from movement in the evaluation itself.

Currently working on

  • A foundational essay on production observability for LLM agents.
  • A weekly paper digest series on alignment, evaluation methodology, and AI safety research.
  • "Benchmark crowding": mapping factor decay in quant finance to benchmark saturation in LLM evaluation.

Pinned

  1. llm-judge-calibration (Public)

    Measure how much your LLM judges actually agree. Inter-judge agreement metrics for LLM-as-a-judge evaluations.

    Python

  2. UKGovernmentBEIS/inspect_ai (Public)

    Inspect: A framework for large language model evaluations

    Python, 2.1k stars, 515 forks

  3. Tracer-Cloud/opensre (Public)

    Build your own AI SRE agents. The open source toolkit for the AI era.

    Python, 5.3k stars, 673 forks

  4. EleutherAI/lm-evaluation-harness (Public)

    A framework for few-shot evaluation of language models.

    Python, 12.6k stars, 3.3k forks

  5. pola-rs/polars (Public)

    Extremely fast Query Engine for DataFrames, written in Rust

    Rust, 38.5k stars, 2.8k forks

  6. pixie-io/pixie (Public)

    Instant Kubernetes-Native Application Observability

    C++, 6.4k stars, 498 forks