USATLAS Face-to-Face · Throughput Computing Week 2026

LLM-Assisted Analysis
at the UChicago AF

Agentic tools your analysis can use today

Agentic intelligence for HEP infrastructure
Slides AI-assisted by Claude (Anthropic · Opus 4.8, 1M) · title illustration AI-generated.

The whole talk in one example

“Where do my TopCPToolkit outputs go?”

Your batch jobs emit big ntuples and small histograms. Where should each land on the AF?

A generic LLM

“Transfer the outputs back to submit node”, i.e. $HOME. The big ntuples blow the 100 GB /home quota almost immediately.

An LLM that knows the AF

big ntuples → /data (5 TB); small histograms → /home (100 GB). It knows the quotas, the layout, and the mounts.

The thesis

The model didn’t get smarter. We gave it facility context. That gap, between a generic assistant and one that knows your storage, scheduler, and data, is the key.
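
As a minimal sketch of the routing rule in this example: the two mount points come from the slide, while the 100 MB cutoff and the function name are illustrative assumptions, not facility policy.

```python
from pathlib import PurePosixPath

# Hypothetical cutoff between "small histograms" and "big ntuples".
SMALL_FILE_CUTOFF_MB = 100


def output_destination(filename: str, size_mb: float, user: str) -> PurePosixPath:
    """Pick a landing directory: small files to /home, everything else to /data."""
    base = "/home" if size_mb <= SMALL_FILE_CUTOFF_MB else "/data"
    return PurePosixPath(base) / user / filename
```

A generic assistant has no way to know this rule exists; a facility-aware one only needs it written down once.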

Grounding · the vocabulary

The vocabulary, fast

Agent

The engine doing the work: a loop that uses an LLM (the “brain”) to decide each step, call a tool, read the result, and repeat.

MCP server

Model Context Protocol: a standard adapter exposing a service (Rucio, AMI, your notebook) as tools and resources: live context any agent can call.

Skill

Static context the agent loads on demand: instructions, a reusable recipe (“how we do a pyhf fit here”).

Agent harness

The app that gives the agent hands: runs its commands, reads/writes files, wires MCPs, enforces permissions. e.g. Claude Code, Codex.

Bundle skills + MCPs + agents + hooks → a plugin. Many plugins in one place to discover & install → a marketplace (which can link other marketplaces).
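
The agent loop from the vocabulary above (decide, call a tool, read the result, repeat) can be sketched in a few lines. Everything here is a stand-in: `scripted_brain` replaces the LLM with fixed rules, and the two tools are hypothetical stubs rather than real MCP tools.

```python
from typing import Callable

# Hypothetical tool registry: name -> function taking and returning a string.
Tools = dict[str, Callable[[str], str]]


def scripted_brain(history: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM: picks the next (tool, argument) from the history."""
    if not history:
        return ("list_jobs", "")
    if "held" in history[-1] and len(history) == 1:
        return ("explain_hold", "job 42")
    return ("done", "")


def run_agent(brain, tools: Tools, max_steps: int = 10) -> list[str]:
    """Decide a step, call the tool, read the result, repeat until 'done'."""
    history: list[str] = []
    for _ in range(max_steps):
        tool, arg = brain(history)
        if tool == "done":
            break
        history.append(tools[tool](arg))
    return history


tools: Tools = {
    "list_jobs": lambda _: "job 42 is held",
    "explain_hold": lambda job: f"{job}: memory limit exceeded",
}
```

In this picture the harness owns `run_agent` and the permissions around it, and MCP servers and skills are what populate `tools` and the context the brain sees.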

Related work · the framing

The ecosystem framing (Watts)

“It’s not about a smarter LLM — it’s about smarter infrastructure around it.” G. Watts, “Beyond Code Generation,” CHEP 2026
  • His framing: data & tools → MCP → skills → agents; grounded in primary sources; “we need a pip-install.”
  • This talk: one facility’s instance, running now, and the claim that facility context is the decisive layer, built to port.
Gordon Watts' Analysis Ecosystem boat diagram

G. Watts’ CHEP 2026 slides · © Gordon Watts + AI

01

The present

What an analyzer
can use today

What you can use today · managed settings

You bring the agent; we ship the rules

  • You install your own harness on the login nodes; we don’t force one
  • Our config management (Puppet) ships system-wide managed settings: a curated allow-list of safe HEP commands + the ATLAS env
  • Any harness that reads them inherits the facility’s guardrails

Why it matters

The facility, not the user, decides what an agent may do here.

  Open question

Ship this with the facility (Puppet) or via a centralized marketplace? We’ll come back to it.

managed-settings.d/10-uchicago-af.json

{
  "permissions": { "allow": [
    "Bash(condor_watch_q*)", "Bash(condor_q*)",
    "Bash(condor_submit*)", "Bash(condor_release*)",
    "Bash(pixi*)", "Bash(xrdcp*)", "Bash(lsetup*)",
    "Bash(asetup*)", /* …safe HEP cmds */
  ] },
  "env": {
    "ATLAS_LOCAL_ROOT_BASE": "/cvmfs/atlas…",
    "SITE_NAME": "AF_200"
  }
}
Shipped system-wide by our config management (Puppet) to login01–04 · live since 2026‑05‑19. Excerpt; allow-list trimmed.
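
To illustrate what an allow-list like this buys, here is a simplified sketch of matching a command against the `Bash(...)` globs. The matching logic is an assumption for illustration: a real harness applies its own rules (for example around compound commands such as `a && b`), which this does not reproduce.

```python
import fnmatch
import json

# Trimmed copy of the excerpt above as valid JSON (the real file has more entries).
SETTINGS = json.loads("""
{"permissions": {"allow": [
  "Bash(condor_watch_q*)", "Bash(condor_q*)",
  "Bash(condor_submit*)", "Bash(condor_release*)",
  "Bash(pixi*)", "Bash(xrdcp*)", "Bash(lsetup*)", "Bash(asetup*)"
]}}
""")


def allowed_patterns(settings: dict) -> list[str]:
    """Extract the glob inside each Bash(...) rule."""
    rules = settings["permissions"]["allow"]
    return [r[len("Bash("):-1] for r in rules
            if r.startswith("Bash(") and r.endswith(")")]


def is_preapproved(command: str, settings: dict = SETTINGS) -> bool:
    """True if the command matches any allow-listed glob (simplified matching)."""
    return any(fnmatch.fnmatchcase(command, p) for p in allowed_patterns(settings))
```

Anything that does not match is not forbidden outright; it simply falls back to whatever prompt or policy the harness applies to non-pre-approved commands.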

What you can use today · the key idea

The facility teaches the agent about itself

Auto-loaded into every session: /etc/claude-code/CLAUDE.md

Path | Quota | Use for
/home/$USER | 100 GB, backed up | Code, scripts, condor files
/data/$USER | 5 TB, not backed up | ROOT files, datasets
/scratch | node-local | Ephemeral, copy out before exit
# live monitor (DO NOT use: watch condor_q)
condor_watch_q
# why is a job held?
condor_q -hold
# XCache-optimized reads (SITE_NAME=AF_200)
rucio list-file-replicas <scope>:<name> --protocol root

This is the whole thesis

A generic model can’t know that /home is backed up but tiny, that /scratch vanishes, or that XCache exists. We write it down once, and the agent inherits it.

  • Storage layout, quotas, backup policy
  • Scheduler etiquette & debugging recipes
  • Data access: XCache, grid proxies, Rucio
  • pixi for new work; how to get help

  The analyzer’s own CLAUDE.md stacks on top.

What you can use today · distribution

The USATLAS marketplace: installing the know-how

> /plugins marketplace add usatlas/marketplace

Three plugins (each bundles skills, subagents & hooks) at github.com/usatlas/marketplace:

analysis-facilities

Facility skills: HTCondor, JupyterLab, XCache, Rucio, ServiceX, Coffea-Casa, Triton (for UChicago, BNL, SLAC).

atlas

5 subagents + 25+ skills: Rucio/AMI/Open-Data MCPs, pyhf, cabinetry, TRExFitter, TopCPToolkit, FastFrames, Scikit-HEP.

hep-python-tools

Good-practice Python: building command-line tools, self-contained scripts, packaging, testing, and code quality.

  Plugins install in one line; the tools don’t

The agent know-how installs instantly; the tools come via setupATLAS + pixi. No /cvmfs? pixi + conda-forge covers most of them, incl. ARM builds & supply-chain security (e.g. pixi dependency cooldowns) via HEP Packaging Coordination (cf. Feickert, CHEP 2026).

  Open question

Where should AF knowledge live: shared marketplace or with the facility (Puppet)? Today: both.

What you can use today · data & metadata MCPs

Find datasets & metadata by asking

rucio-mcp: dataset access (find, inspect, replicas, download)

ami-mcp: metadata & cross-sections

also atlasopenmagic, glance via stare

  Run it locally (your x509)

pixi exec rucio-mcp serve --read-only
pixi exec ami-mcp serve
uvx atlasopenmagic-mcp serve

rucio-mcp alone exposes 50+ tools; --read-only blocks every write.
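
One way a read-only mode can work, sketched under assumptions: each tool is tagged as writing or not, and write tools are hidden when the flag is set. The tool names below are hypothetical, and how rucio-mcp actually enforces --read-only is not shown in this talk (it could equally reject write calls at call time).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    writes: bool  # True if the tool can modify state on the server


# Hypothetical tool names for illustration; not rucio-mcp's actual tool list.
ALL_TOOLS = [
    Tool("list_replicas", writes=False),
    Tool("get_metadata", writes=False),
    Tool("add_rule", writes=True),
    Tool("delete_rule", writes=True),
]


def exposed_tools(tools: list[Tool], read_only: bool) -> list[str]:
    """Return the tool names an agent would see under the given mode."""
    return [t.name for t in tools if not (read_only and t.writes)]
```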

  Or the hosted server (OIDC)

# multi-site, one HTTP endpoint
rucio-mcp.af.uchicago.edu
  · atlas · cms · dune · escape

Browser OIDC login via an OAuth bridge. Not every experiment supports OIDC yet (FTS/RSE-dependent); the bridge manages the flow.

  Still being figured out

OIDC UX still varies by VO: token lifetime & refresh differ; audiences differ (atlas/dune: no custom issuer; cms/escape: extra token-exchange). And on a hung call, who times out: MCP or LLM?

  Observability

rucio-mcp exports Prometheus metrics: LLM usage per-site & per-tool (Grafana).

Auth reuses grid creds (x509) or Keycloak OIDC · OAuth bridge docs · running workflows/jobs is panda-mcp / condor-mcp’s job (not yet at AF).

What you can use today · Jupyter MCP

Drive your AF notebook from anywhere

An MCP server runs inside your JupyterLab pod. Any agent that reaches it (Claude Code, Claude Desktop, even your phone) can call insert_execute_code_cell, read_cell (incl. plots), execute_code in the live kernel, and use_notebook.

Not the same as Jupyter AI (an agent inside the notebook server), which may lose out to the VS Code workflow many users prefer.

claude mcp add jupyter --transport http \
  .../user/<you>/mcp --header "Authorization: Bearer <tok>"

  The convenience/security tradeoff

Servers have finite lifetimes (equitable sharing) → you re-mint URL + token each respawn. Idea: a Keycloak-OIDC reverse-proxy → one token, pick your server in-browser (but one at a time).

What you can use today · OpenWebUI

A web chat that already knows the AF

Not everyone wants a terminal. OpenWebUI is a browser chat at af.uchicago.edu/chat, backed by a facility knowledge base.

  • Zero install, just a URL
  • Answers based on AF docs & capabilities
  • Same knowledge, friendlier UI/UX

Real exchange · user xju, Feb 2026

Q: “Are there NVIDIA Triton models available at AF?”

A: Yes. Triton in the AF k8s cluster, serving from CVMFS + an S3 repo (s3://triton-models/<user>/); upload yours, then ask admins to enable it.

“this… is not awful and it is correct” — Giordon

OpenWebUI AF assistant landing at af.uchicago.edu/chat

Even the suggested prompts are AF-specific: quota, GPUs, Varnish cache-hit at MWT2…

Assistant built by Ilija Vukotic (slides) · it also pointed xju to xAOD/coffea examples · Triton→athena: Vakho Tsulaia, CHEP 2026.

What you can use today · agents on your behalf

An agent watches the cluster for you

Daily HTCondor cluster report posted to Slack by the AI monitor agent

Every morning a scheduled agent scans HTCondor and posts a report to Slack: state summary, hold-reason breakdown, top users, jobs held >7 days, recommended actions.

Today: 448 held of 14,671 (3.1%), below the 20% alert line. Holds auto-categorized: wall-time, output-transfer, OOM, …
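
The arithmetic behind the headline number, plus a toy version of the hold-reason breakdown. The 20% alert line and the counts come from the report; the keyword map is a hypothetical stand-in for the bot's real taxonomy, which is not shown here.

```python
from collections import Counter

ALERT_FRACTION = 0.20  # the 20% alert line from the report

# Hypothetical keyword -> category map, for illustration only.
CATEGORIES = {
    "wall": "wall-time",
    "transfer": "output-transfer",
    "memory": "OOM",
}


def held_fraction(held: int, total: int) -> float:
    """Fraction of jobs that are held (0.0 for an empty queue)."""
    return held / total if total else 0.0


def categorize(hold_reason: str) -> str:
    """Bucket a hold-reason string by the first matching keyword."""
    text = hold_reason.lower()
    for keyword, category in CATEGORIES.items():
        if keyword in text:
            return category
    return "other"


def breakdown(reasons: list[str]) -> Counter:
    """Count held jobs per category."""
    return Counter(categorize(r) for r in reasons)
```

With the numbers above, `held_fraction(448, 14671)` is about 0.0305, i.e. 3.1% after rounding, well under `ALERT_FRACTION`.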

  No HTCondor MCP, just our own skills around the condor CLI.

Two lanes on OpenClaw: Shannon = privileged, trust-earned runbooks (this bot, human-gated); Elwood = user-space, where facility rules apply (framing). A dedicated HTCondor MCP exists (Bockelman); deeper → next talk.

What you can use today · agents on your behalf

…then drafts your fix, you approve

Drafted held-job email posted to Slack, approved by a human, then sent by the agent

When your jobs are stuck it drafts an email to you with a specific fix. Real OOM case: used the actual observed memory vs the limit → raise request_memory; write big outputs to /data.

  Human-in-the-loop

Drafted → a human replies approve in Slack → sent. Never autonomous (note the stand-down banner).
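
The draft → approve → send gate can be sketched as a tiny state machine. This is an illustration of the pattern, not the bot's code: the class and field names are hypothetical, and the Slack and mail plumbing is left out.

```python
from dataclasses import dataclass, field


@dataclass
class DraftEmail:
    recipient: str
    body: str
    status: str = "drafted"  # drafted -> approved -> sent
    log: list[str] = field(default_factory=list)

    def review(self, reply: str) -> None:
        """Record a human reply; only the literal word 'approve' unlocks sending."""
        if self.status == "drafted" and reply.strip().lower() == "approve":
            self.status = "approved"
        self.log.append(reply)

    def send(self) -> bool:
        """Send only after approval; otherwise do nothing and report False."""
        if self.status != "approved":
            return False
        self.status = "sent"  # a real bot would hand off to a mailer here
        return True
```

The point of the design is that `send` has no path that bypasses `review`: the agent can draft as much as it likes, but nothing leaves without a human reply.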

  Good advice because it carries AF knowledge (hold-reason taxonomy, c111/c113, /home…); same lesson as the facility CLAUDE.md.

  Thinking ahead

How do we tag a batch job with its experiment so the agent gives the right context? ATLAS-only today, but for FCC / EIC / DUNE / Belle‑II the same agent can’t assume everyone’s ATLAS.

02

The architecture

Why it works

The core insight

Swap the model freely: context makes it useful here

Model: Opus · GPT-5 · local / NRP-hosted open weights → cost control
Harness: Claude Code · Codex · Copilot
Facility context & ATLAS tools: storage, scheduler, Rucio, FastFrames, coffea, pyhf

An assistant that’s actually useful here: routes ntuples to /data, not /home

The first two blocks are interchangeable: pick any model (or self-host to cut cost), any harness. Every win today came from the third: the managed CLAUDE.md, the marketplace plugins, the Rucio/AMI MCPs, the condor bot’s HTCondor.md.

  Pragmatically too: models & harnesses evolve fast, and we can’t run a cluster and chase them at the same time. Facility context is the tractable scope.

Self-hosting option for open-weight models: NRP-hosted LLMs (nrp.ai/llms).

The core insight · the flip side

No context? It should stay quiet.

  What actually happened

Our cluster agents confidently recommended Lustre/MDS tuning for a filesystem that is actually Ceph. They had zero Ceph context. Plausible, fluent, and wrong.

“If we’re supposed to rely on the agents, they need to be accurate — otherwise they can’t be trusted.” Judith, #analysis-facility
“If the agent has no Ceph context, it shouldn’t make recommendations or ping people — we’d just chase irrelevant suggestions.” Farnaz, same thread

The rule we landed on: if it isn’t grounded in real facility context, it stays silent rather than speculates. Context isn’t just what makes the agent useful — it’s what makes it safe to trust.
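
The rule can be sketched as a lookup that returns nothing when no context covers the topic. The context dictionary and function name are hypothetical; a real agent would ground against its CLAUDE.md, skills, and runbooks rather than a two-entry dict.

```python
from typing import Optional

# Hypothetical facility context: topic -> grounded note.
FACILITY_CONTEXT = {
    "htcondor": "Held jobs: inspect with condor_q -hold.",
    "storage": "/home is 100 GB and backed up; /data is 5 TB and not backed up.",
}


def recommend(topic: str, context: dict[str, str] = FACILITY_CONTEXT) -> Optional[str]:
    """Return a grounded recommendation, or None when no context covers the topic."""
    note = context.get(topic.lower())
    if note is None:
        return None  # stay silent rather than speculate
    return f"[grounded in facility context: {topic.lower()}] {note}"
```

In this toy version, `recommend("ceph")` returns `None`: no Ceph context, so no recommendation and no ping.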

The architecture · abstraction

The agent speaks intent, not implementation

What the physicist & agent say

(meta)data_tool: “get me this dataset + its cross-section”

transform_tool: “turn DAODs into histograms”

batch_tool: “run this across the cluster”

inference_tool: “fit and get a limit”

What the facility wires underneath (swappable)

rucio-mcp · ami-mcp · Open Data

coffea · uproot · FastFrames · ServiceX · Athena

HTCondor · Dask · Slurm · PanDA · REANA · lxbatch · graphed-org

pyhf · cabinetry · TRExFitter · optimistix · SBI

The agent never names “coffea” or “condor.” It asks for an outcome; the facility decides the backend. That indirection is what makes the same agent portable across facilities.

The architecture · the “Elwood” vocabulary

Reasoning engine & playbook

  Reasoning engine

The framework: orchestration, tool routing, execution, guardrails. Model- and experiment-agnostic.

shared · written once

  Playbook

Everything facility-/experiment-specific: which MCPs & backends, the system prompt / CLAUDE.md, the knowledge corpus & examples.

swappable

  Think of it as a sports team

A shared glossary keeps the team’s language consistent. (“Elwood” is our internal name, nothing public yet.)

Sports-team analogy: coach=reasoning engine, playbook=playbook, athletes=agents, equipment=services, stadium=gateway

Image: Google Gemini 3.1 Pro

The architecture · portability, concretely

Same engine, different playbook

Only the playbook changes per facility/experiment. The reasoning engine (the “coach”) is untouched. The headline difference is usually data_tool:

playbook/atlas @ UChicago AF

data_tool:   rucio-mcp     # grid
transform:   coffea, FastFrames
batch_tool:  htcondor
inference:   pyhf, TRExFitter

playbook/us-hfcc

data_tool:   eos + https   # browse/dl
transform:   uproot
batch_tool:  local / slurm
inference:   pyhf

The per-tool skills are facility-independent (written once, reused everywhere), so each experiment/facility only writes its thin playbook, not a whole stack. The goal: stop re-developing the same agentic tooling in parallel.
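
A minimal sketch of "same engine, different playbook": the two playbooks above transcribed as dictionaries, with one engine function that resolves intents to backends. The dictionary keys for the playbook names and the first-listed tie-break are assumptions made for this example.

```python
# The two playbooks above, transcribed as dicts (intent keys as on the slide).
PLAYBOOKS = {
    "atlas@uchicago-af": {
        "data_tool": ["rucio-mcp"],
        "transform": ["coffea", "FastFrames"],
        "batch_tool": ["htcondor"],
        "inference": ["pyhf", "TRExFitter"],
    },
    "us-hfcc": {
        "data_tool": ["eos + https"],
        "transform": ["uproot"],
        "batch_tool": ["local", "slurm"],
        "inference": ["pyhf"],
    },
}


def resolve(playbook: str, intent: str) -> str:
    """Map an intent to the playbook's first-listed backend (arbitrary tie-break)."""
    return PLAYBOOKS[playbook][intent][0]


def plan(playbook: str, intents: list[str]) -> list[tuple[str, str]]:
    """The engine's view: the same intents in, facility-specific backends out."""
    return [(intent, resolve(playbook, intent)) for intent in intents]
```

Calling `plan` with the same list of intents and a different playbook name changes only the backends, which is the portability claim in miniature.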

You’ve already met the ATLAS playbook in pieces: the managed CLAUDE.md, the marketplace plugins, the agent’s HTCondor.md.

  Still open: how to link & reuse others’ skills without copying them in: usatlas/marketplace#60 (options exist; no best practice yet).

The architecture · where it’s heading

One conversation, the whole analysis

The target: the analyzer describes the physics; the agent runs the loop: find data, generate code, submit jobs, recover from errors, make histograms, and fit, re-running as needed.

Every box here is a tool we’ve shown today. Stitching them into one supervised loop is the work ahead.

  “Elwood” is our internal project name, nothing public yet.

Elwood end-to-end: chat-driven analysis across AMI/Rucio, code generation, HTCondor, monitoring, error recovery, histograms, and statistics

03

The portable future

Toward HL-LHC
analysis facilities

The portable future · MCP granularity

One MCP, or many? Three topologies

  Live today: rucio-mcp.af.uchicago.edu/site/{atlas·escape·cms·dune}

what we run

One server, many sites

you

rucio-mcp/site/atlas·escape…

  one deploy; add rucio-atlas, rucio-escape separately in your harness

  per-VO auth inside one process

alt

One server per site

rucio-atlas

rucio-escape

  clean isolation; per-site auth & scaling

  N servers to run & maintain

alt

Single gateway

mcp.af…→ sub-MCPs

  one endpoint & identity

  token sprawl / choke point; FastMCP = 1 auth provider/server

Same agent, same tools across sites; only the per-site auth/playbook differs. Granularity vs identity is a genuine open trade-off → discuss.

Let’s discuss · the open questions

What should we decide together?

  Granularity & identity

One MCP per service, or a single gateway delegating identity? (FastMCP allows one auth provider/server; PandaMCP delegates to PanDA.)

  Where do knowledge & skills live?

Facility knowledge: shared marketplace vs with the facility (Puppet). And tool skills: a TopCPToolkit skill in the marketplace, or with the framework? Today: scattered.

  Who maintains the playbook?

Each facility writes its own context, but who reviews it and keeps it true over time?

  What may an agent do, and how sandboxed?

Write/submit/email need a human gate. We isolate with k8s pods; no-k8s sites → bubblewrap?

  Who pays for inference?

At HL-LHC scale: hosted frontier models vs self-hosted open weights on facility GPUs?

  Across AFs & experiments?

A shared commons across AFs. And one facility may serve many experiments w/ heterogeneous frameworks. Keep facility vs experiment context modular: where’s the line?

My bet: the model and harness are the easy part. Context, identity, and trust are the facility’s homework.

  What’s bubblewrap?

The sandboxing tech behind Flatpak: it isolates a process from the rest of the Linux system, with no access to your files, network, or hardware unless you explicitly grant it. A lightweight way to fence in an agent’s tool execution on facilities without Kubernetes.
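
A hedged sketch of what fencing in a tool call with bubblewrap might look like: the function only builds a `bwrap` command line and does not run it. The flags used are standard bwrap options; the choice of what to expose (a read-only /usr, one writable work directory, symlinks that assume a merged-/usr distribution) is an illustrative assumption, not a tested facility profile.

```python
def bwrap_argv(command, workdir, read_only=("/usr",), allow_network=False):
    """Build a bwrap command line: nothing visible unless explicitly granted."""
    argv = ["bwrap", "--unshare-all", "--die-with-parent"]
    if allow_network:
        argv.append("--share-net")  # re-enable networking after --unshare-all
    for path in read_only:
        argv += ["--ro-bind", path, path]  # explicitly granted, read-only
    argv += [
        # Symlinks assume a merged-/usr layout (an assumption of this sketch).
        "--symlink", "usr/bin", "/bin",
        "--symlink", "usr/lib", "/lib",
        "--symlink", "usr/lib64", "/lib64",
        "--proc", "/proc",
        "--dev", "/dev",
        "--tmpfs", "/tmp",
        "--bind", workdir, workdir,  # the only writable host path
        "--chdir", workdir,
    ]
    return argv + list(command)
```

The resulting list could be handed to `subprocess.run` on a host with bubblewrap installed; a facility would likely add more read-only paths (for example /cvmfs) to `read_only`.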

Thank you

Generic LLM → facility-aware collaborator

Swap the model and the harness freely. The facility context is what makes it useful here, and it ports to the next AF.

A team effort at the UChicago AF

Ilija Vukotic — OpenClaw/Shannon, ES MCP

Fengping Hu — Kubernetes, Keycloak, Jupyter AI

Rob Gardner — vision, marketplace, Genesis

Judith Stephen — HTCondor expertise, runbooks

Aidan Rosberg — RP1 (Infra-as-Config), core dev/maintainer

David Jordan — hardware ops: networking & Kubernetes

Farnaz Golnaraghi — hardware ops: storage

  What’s next: port these lessons (incl. agentic AI) to the Open Data Facility (ODF) and RP1.

Try it / read more

Now, let’s discuss.

Giordon Stark · kratsg · USATLAS F2F @ HTC26 · 2026‑06‑09. In the spirit of Gordon Watts’ CHEP 2026 ecosystem framing.