Most Runs Should Be Quiet

Daniel Campos

We run a system that deploys AI agents to monitor the web on a schedule. A user describes what they want to watch — "track AMD data center adoption by cloud providers," "monitor competitor pricing changes across vector database vendors" — and the system turns that into a persistent patrol: recurring searches, targeted crawls, structured extraction, all running on a cadence. Each run compares what it found to what it found last time and decides whether the change is worth telling someone about.

Over a recent 30-day window, the system completed 9,062 patrol runs. The median signal score was 7 out of 100. Seventy-five percent of notifications were suppressed.

Most runs were quiet. That is not a bug. That is the core design constraint, and it turns out to drive most of the interesting engineering work.

This post walks through how we got here — the naive approaches that failed, the two-pass scoring architecture we landed on, the bugs that taught us to distrust LLM-generated scores, and the self-healing loop that repairs monitors when they drift. All numbers are from production.

Where We Started

A year ago, the system was much simpler. A user's monitoring intent was turned into a set of search queries wrapped around Brave Search. The results were passed to an LLM that wrote a summary. The summary was always delivered. There was no scoring, no suppression, no concept of "quiet."

The problems were immediate:

  1. Every run produced a notification. A monitor checking daily generated 30 emails per month. Most said some version of "here are the latest results for your query." Users stopped reading them within a week.

  2. The LLM always found something to say. Given a set of search results and a monitoring intent, the model would reliably produce a multi-paragraph summary, even when nothing had changed. It would rephrase last week's news, surface tangentially related content, or describe background context as if it were new. The model was doing its job — generating coherent text from inputs — but that job was not the same as deciding whether anything mattered.

  3. There was no memory. Each run was independent. The system had no record of what it had found on the previous run, so it could not compute a diff. Every result was "new" because there was no baseline.

  4. We had no way to distinguish noise from signal. A monitor tracking semiconductor supply chains would surface the same Reuters article three runs in a row because the search results were stable. The user had to manually decide whether each notification contained something worth reading.

We needed something between "send everything" and "send nothing." We needed a scoring system.

The Problem: Power Laws and the Noise Floor

Meaningful change on the web follows a power law. Most page edits, new search results, and content updates are cosmetic, repeated, or irrelevant to any specific monitoring intent. A tiny fraction carry real informational weight — and you cannot predict in advance which fraction that will be.

Figure: Signal distribution across web changes, shown as a power-law curve with rare high-significance changes and a long tail of low-significance noise. The vast majority of updates carry no new information; a monitoring system that treats all changes equally will drown in noise.

This creates a tension that is hard to resolve with simple threshold rules:

  • Broad coverage is necessary. You have to look at a lot of sources because the important change might appear anywhere.
  • Broad coverage produces noise. Most of what you find on any given run will be routine.
  • Users lose trust in both directions. Too many alerts and the system becomes an annoyance. Too few and "all clear" starts to feel like "maybe broken."

The system we needed had to convert a noisy stream of web changes into a calibrated signal — quiet when nothing moved, loud when something did, and transparent enough that silence feels like evidence, not absence.

Architecture: What Happens After a Run Completes

When a monitor executes, it runs a pipeline of searches and crawls defined by a workflow spec. The raw output is a manifest: a structured record of every search query issued, every page crawled, every extraction attempted. That manifest is the input to the postprocessing pipeline, which is where scoring happens.

Figure: The 10-stage postprocessing pipeline, from manifest parsing through dual signal scoring to the notify-or-suppress decision. Every completed patrol run passes through it. Stages 1-4 and 7-9 are deterministic; stages 5-6 and 10 involve LLM calls. The final signal score is resolved from whichever scoring path (V1 or V2) is available.

The pipeline runs ten stages. The first four are deterministic: parse the manifest, load context from the database (monitoring intent, entity lists, prior run history), compute URL-level changes via set algebra against the previous run, and detect content-level changes by comparing page hashes on retained URLs.

Stage 5 is the first LLM call — generating a structured summary that classifies each finding. Stage 6 cross-checks the findings against the source content. Stages 7 and 8 compute net information gain and evaluate stop conditions.

Stages 9 and 10 are the two scoring paths: a deterministic heuristic (V1) and an LLM-as-judge evaluation with deterministic guards (V2). A resolution function selects the final score that consumers (notifications, digest, UI) use. V2 takes priority when available; V1 is the fallback for older runs or LLM failures.

This dual-path design was not the original plan. We built V1 first, found cases it could not handle, added V2, found that V2 could not be trusted without constraints, and landed on the current architecture. The next sections explain each layer and the failures that motivated them.

Layer 1: The Deterministic Heuristic

The V1 score is a fast, deterministic algorithm that produces a 0-100 signal from manifest evidence alone. No LLM calls. It runs on every execution and resolves in milliseconds.

How It Works

The score is computed additively from a set of positive and negative factors:

| Factor | Range | Condition |
| --- | --- | --- |
| Base: changes detected | +20 | Any net-new or dropped URLs vs. previous run |
| Base: stop condition met | +50 | Workflow's stop condition evaluated to true |
| Activity bonus | 0 to +20 | +4 per prior execution, capped at 20. Rewards monitors with history. |
| Change rate | 0 to +15 | change_rate × 0.15, capped. Higher URL churn → more likely something moved. |
| Recency | 0 to +15 | +15 if results < 1h old, +10 if < 6h, +5 if < 24h |
| Alert highlights | +15 | Findings flagged as alert-type by the LLM summary |
| First-run baseline | 0 to +30 | Scaled by finding count: 3+ findings = +30, 1-2 = +20, URLs only = +10, empty = 0 |
| Content changes | +15 | Retained URLs whose page content hash changed since last run |
| No-change penalty | -20 to -40 | LLM summary says "no new findings." Reduced to -20 if content hashes prove pages changed. |
| Churn penalty | -15 to -25 | URL rotation without new findings. Escalates to -25 after 10+ executions. |
| Empty findings | -10 | Non-first run produced zero classified findings |

Final score is clamped to 0–100 and mapped to a signal level: urgent (70+), notable (40–69), routine (20–39), or noise (0–19).
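
To make the arithmetic concrete, here is a condensed sketch of the V1 computation in Python. The weights mirror the table above; the names (RunEvidence, its fields, v1_signal_score) are illustrative rather than the production code, and the first-run baseline and empty-findings penalty are omitted for brevity.

```python
from dataclasses import dataclass

@dataclass
class RunEvidence:
    # Illustrative fields, not the production schema.
    new_or_dropped_urls: int      # URL-level set-algebra diff vs. previous run
    stop_condition_met: bool
    prior_executions: int
    change_rate: float            # URL churn rate, 0-100
    result_age_hours: float
    alert_highlights: int         # findings flagged as alert-type by the LLM summary
    content_hash_changes: int     # retained URLs whose content hash diverged
    no_change_summary: bool       # summary says "no new findings"
    churn_without_findings: bool  # URLs rotated but nothing new was classified

def v1_signal_score(e: RunEvidence) -> tuple[int, str]:
    score = 0.0
    if e.new_or_dropped_urls:
        score += 20                                  # base: changes detected
    if e.stop_condition_met:
        score += 50                                  # base: stop condition met
    score += min(e.prior_executions * 4, 20)         # activity bonus
    score += min(e.change_rate * 0.15, 15)           # change rate
    if e.result_age_hours < 1:                       # recency
        score += 15
    elif e.result_age_hours < 6:
        score += 10
    elif e.result_age_hours < 24:
        score += 5
    if e.alert_highlights:
        score += 15                                  # alert highlights
    if e.content_hash_changes:
        score += 15                                  # in-place content changes
    if e.no_change_summary:
        # No-change penalty, softened when hashes prove pages actually changed.
        score -= 20 if e.content_hash_changes else 40
    if e.churn_without_findings:
        # Escalating churn penalty for chronic noise generators.
        score -= 25 if e.prior_executions >= 10 else 15

    final = max(0, min(100, round(score)))           # clamp to 0-100
    level = ("urgent" if final >= 70 else
             "notable" if final >= 40 else
             "routine" if final >= 20 else "noise")
    return final, level
```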

The Churn Problem That Drove the Penalty System

The churn penalty did not exist in the first version of V1. The original heuristic was straightforward: +20 for any changes, +15 for recency, done. This worked until monitors started tracking domains with high URL turnover — job boards, news aggregators, social media feeds. These sources rotate their result sets constantly. Every run had "new" URLs, so every run scored 35+ and triggered a notification.

The user experience was terrible. A monitor checking HackerNews daily would email the user every single day with "new results found" even though the actual information landscape had not changed. The URLs were different, but the substance was the same.

Our first fix was simple: if the summary text contains phrases like "no new findings" or "nothing has changed," apply a -40 penalty. This worked for obvious cases but introduced a new failure mode — the LLM's summary language was non-deterministic. The same no-change run might say "No notable developments were identified" on one execution and "The competitive landscape remains stable with continued focus on..." on the next. The first phrasing triggered the penalty. The second evaded it. Identical runs were scoring 5 one day and 52 the next.

We fixed that with a two-pronged approach: check both the headline (which is more formulaic and stable) and the body text for no-change patterns, and also infer the finding count from structured classification fields instead of relying on text pattern matching. That stabilized the scores.

Then we added the escalating churn penalty. The insight was simple: a source that has rotated URLs without producing new findings for 10 consecutive runs is not going to start producing findings on run 11. The -15 penalty was not aggressive enough for chronic noise generators, so after 10 executions of the same pattern, it escalates to -25.

Content Hash Detection: The Silent Update Problem

URL-level set algebra catches obvious changes but misses an important case: a retained URL whose page content changed.

If a company quietly updates its pricing page, the URL stays the same. A naive diff would say "nothing changed." Our system hashes the crawled content of every page and compares hashes across runs for retained URLs. When hashes diverge, the run gets a +15 content-change bonus and the delta summary prompt receives the before/after content as evidence.
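
The change detection itself is plain set algebra plus hashing. A minimal sketch, assuming each run stores a mapping of URL to crawled content (the function and field names are illustrative):

```python
import hashlib

def diff_runs(prev: dict[str, str], curr: dict[str, str]) -> dict:
    """Compare two runs, each a mapping of URL -> crawled page content."""
    prev_urls, curr_urls = set(prev), set(curr)
    new_urls = curr_urls - prev_urls          # appeared since the last run
    dropped_urls = prev_urls - curr_urls      # disappeared since the last run
    retained = curr_urls & prev_urls

    def content_hash(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    # Retained URLs whose page content hash changed: the "silent update" case.
    content_changed = {u for u in retained
                       if content_hash(prev[u]) != content_hash(curr[u])}

    return {
        "new_urls": sorted(new_urls),
        "dropped_urls": sorted(dropped_urls),
        "content_changed": sorted(content_changed),
    }
```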

We added this in April 2026 after noticing that compliance filing monitors were missing in-place updates. A pharmaceutical company had updated a clinical trial status page — same URL, new data — and the monitor scored it as a no-change run because the URL set was identical. Content hashing closed that class of problem.

Where V1 Falls Short

The heuristic is fast and deterministic, but it has blind spots:

  • It cannot distinguish a genuinely important new URL from a noise URL. Both get +20.
  • It cannot tell whether a 500-word article is a novel announcement or a rehash of last month's story.
  • It has no concept of monitoring intent. The same result set scores the same regardless of what the user is actually watching for.

We needed a second layer that understood meaning, not just change.

Layer 2: LLM-as-Judge Signal Scoring

The V2 signal judge is an LLM call that evaluates how much a completed run matters relative to the user's monitoring intent. It receives the intent, the run's findings, the URL-level changes, and the content-level changes, and returns a structured judgment.

Three Bounded Subscores

The LLM returns three bounded subscores that together total at most 100:

| Subscore | Range | Question It Answers |
| --- | --- | --- |
| intent_materiality | 0–60 | How much did this run change the user's knowledge relative to their exact monitoring intent? |
| global_attention | 0–30 | If this finding is real, how much attention should a human give it? (urgency, financial impact, time sensitivity) |
| confidence | 0–10 | Is this a real signal or noise / broken workflow? |

The asymmetric weighting is deliberate. Intent materiality dominates because the same finding can be urgent for one monitor and irrelevant for another. A GPU pricing change is noise for someone monitoring semiconductor R&D but critical for someone monitoring cloud infrastructure costs. We tried equal weighting (33/33/33) initially and found that the global_attention score was too volatile — the LLM would rate any business news as high-attention regardless of whether the user cared about it.

Five Signal Postures

The LLM also selects a signal posture from five non-overlapping bands:

| Posture | Score Band | Meaning |
| --- | --- | --- |
| major_update | 75–100 | Intent satisfied, high-attention findings, time-critical |
| meaningful_update | 45–74 | Intent partially satisfied, useful but not urgent |
| all_clear | 25–44 | Patrol completed, no relevant change detected |
| workflow_gap | 10–24 | Workflow couldn't observe target surface (blocked crawls, coverage gaps) |
| noise | 0–9 | Irrelevant churn, off-topic results, empty output |

The all_clear posture is a positive outcome: the system checked and confirmed that nothing relevant moved. "We looked and found nothing" is fundamentally different from "we didn't look."
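
Concretely, the judge is asked to return a structured object rather than free text. A sketch of the expected shape; the field names are hypothetical, not the production prompt contract:

```python
from typing import Literal, TypedDict

class SignalJudgment(TypedDict):
    intent_materiality: int   # 0-60: knowledge gained relative to the monitoring intent
    global_attention: int     # 0-30: urgency, financial impact, time sensitivity
    confidence: int           # 0-10: real signal vs. noise or broken workflow
    posture: Literal["major_update", "meaningful_update",
                     "all_clear", "workflow_gap", "noise"]
    rationale: str            # short justification kept for auditing

def raw_total(judgment: SignalJudgment) -> int:
    # Naive sum of the three subscores; the deterministic guards described
    # in the next section clamp and cap this before anything consumes it.
    return (judgment["intent_materiality"]
            + judgment["global_attention"]
            + judgment["confidence"])
```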

The Guards: What Went Wrong Without Them

The first version of V2 shipped without deterministic guards. We trusted the LLM to produce well-calibrated scores within the declared ranges. This was a mistake.

Within the first week of production, we found three categories of failure:

Failure 1: Score inflation on no-change runs. A monitor tracking ASML lithography equipment had not seen any changes for three weeks. The LLM consistently scored these runs between 38 and 44 — technically within the all_clear band — because it found the monitoring intent "relevant" and the existing baseline "comprehensive." A user with a custom suppression threshold of 30 was getting emails every run that said "No new findings detected." They were not happy.

The fix was the no-change ceiling: if a run produced zero NEW or UPDATE findings and the stop condition was not met, cap the score. The ceiling started at 44 (the top of the all_clear band) and was lowered to 25 (the band's floor) after this first incident. It turned out to be the single most impactful guard.

Figure: The no-change ceiling bug, shown as a side-by-side comparison of the same no-change run. Before the fix, the LLM could score it at 42, exceeding a user's custom threshold of 30 and triggering a spurious email. After lowering the ceiling to 25, no-change runs stay below common thresholds and are suppressed.

Failure 2: Subscore range violations. The LLM would occasionally return intent_materiality: 75 (max is 60) or confidence: 15 (max is 10). These out-of-range values inflated total scores unpredictably. The fix was subscore clamping: each value is clamped to its declared range before summing.

Failure 3: Posture-score inconsistency. The LLM would select meaningful_update as its posture but produce subscores summing to 82 (which is major_update territory). Or it would say noise but score 35. The fix was posture band enforcement: the total score is clamped to the band declared by the LLM's own posture selection. If it says meaningful_update, the score stays in 45–74.

We also added a trust cap: a workflow the LLM marks as untrustworthy (fail) cannot score above 30, and one marked watch cannot exceed 74. A broken workflow producing garbage results should not trigger an urgent notification.

After guards, the posture is re-derived from the final score so the database record is always consistent. No more rows where the posture says meaningful_update but the score says 22.
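
A sketch of the guard chain, assuming a judgment object like the one above. The thresholds and bands come from the text; the ordering and names are illustrative:

```python
POSTURE_BANDS = {
    "major_update": (75, 100),
    "meaningful_update": (45, 74),
    "all_clear": (25, 44),
    "workflow_gap": (10, 24),
    "noise": (0, 9),
}

SUBSCORE_MAX = {"intent_materiality": 60, "global_attention": 30, "confidence": 10}

def guarded_score(judgment: dict, *, has_new_or_update: bool,
                  stop_condition_met: bool, trust: str) -> tuple[int, str]:
    # 1. Clamp each subscore to its declared range before summing.
    total = sum(min(max(judgment[k], 0), hi) for k, hi in SUBSCORE_MAX.items())

    # 2. Posture band enforcement: clamp the total to the band the LLM declared.
    lo, hi = POSTURE_BANDS[judgment["posture"]]
    total = min(max(total, lo), hi)

    # 3. No-change ceiling: zero NEW/UPDATE findings and no stop condition -> cap at 25.
    if not has_new_or_update and not stop_condition_met:
        total = min(total, 25)

    # 4. Trust cap: broken or suspect workflows cannot escalate.
    if trust == "fail":
        total = min(total, 30)
    elif trust == "watch":
        total = min(total, 74)

    # 5. Re-derive the posture from the final score so the stored record is consistent.
    posture = next(p for p, (low, high) in POSTURE_BANDS.items() if low <= total <= high)
    return total, posture
```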

Scoring a Real Run: The AMD DigitalOcean Signal

Here is what both scoring paths produce for the same run — the moment a monitor tracking AMD data center adoption caught DigitalOcean's GPU announcement:

Figure: Two parallel scoring panels for the same AMD run: the V1 heuristic scores it 69 (capped from 74) and the V2 LLM judge scores it 46 (posture meaningful_update). V1 sees URL-level changes and recency; V2 evaluates semantic meaning relative to the monitoring intent. The resolution function prefers V2 when available.

V1 scores it at 69 (capped from a raw 74 to stay in the notable band). V2 scores it at 46 — lower, because intent_materiality (32/60) reflects that this is a meaningful but not earth-shattering data center deal. Both agree it should be delivered. The resolution function picks V2 (46).

The V2 score is more useful here because it distinguishes between "a cloud provider adopted AMD GPUs" (materiality: moderate) and "AMD lost a major hyperscaler contract" (materiality: high). V1 cannot make that distinction — both would score similarly based on URL changes and recency.
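
The resolution step itself is small. A sketch under the same assumptions (hypothetical names), encoding the rule that V2 wins when it exists and V1 is the fallback:

```python
def resolve_signal(v1_score: int | None, v2_score: int | None) -> tuple[int, str]:
    """Pick the score that consumers (notifications, digest, UI) will use."""
    if v2_score is not None:
        return v2_score, "v2"   # LLM judge with deterministic guards
    if v1_score is not None:
        return v1_score, "v1"   # deterministic heuristic fallback
    return 0, "none"            # no scoring path available; treat as noise

# For the AMD run above: resolve_signal(69, 46) returns (46, "v2").
```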

The Length Bias Fix

In May 2026, we discovered a pervasive length bias in our LLM-as-judge evaluations (correlation coefficient rho = +0.55 between summary length and score). Verbose summaries — regardless of information density — were scoring higher than concise ones. A run that produced a 2,000-word summary with 3 findings scored higher than a run with a 400-word summary covering the same 3 findings.

The fix was adding explicit de-biasing instructions to both the signal judge and quality evaluation prompts: "Do NOT reward length, verbosity, or quantity. Equal-accuracy summaries score identically regardless of word count." We also rewrote the coverage_novelty scoring bands to distinguish independent perspectives from raw volume.
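
The bias was visible in the data before it was fixed in the prompts. A minimal sketch of the check, assuming rho here denotes a Spearman rank correlation and that historical summaries and scores are available (names and data loading are illustrative):

```python
from scipy.stats import spearmanr

def length_bias(summaries: list[str], scores: list[float]) -> float:
    """Rank correlation between summary word count and signal score.
    A strongly positive value suggests the judge is rewarding verbosity."""
    word_counts = [len(s.split()) for s in summaries]
    rho, _p_value = spearmanr(word_counts, scores)
    return rho
```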

Finding Classification: What Changed and Why It Matters

Raw URL-level diffs tell you that something changed. Finding classification tells you what kind of change it is. The LLM summary prompt classifies each finding:

| Classification | Meaning | Example |
| --- | --- | --- |
| NEW | Recent and relevant to monitoring intent | "DigitalOcean announced AMD Instinct GPU adoption" |
| UPDATE | Change to a previously reported finding | "AMD-Rackspace partnership expanded to 7 regions" |
| CONTEXT | Background info or off-intent content | "General semiconductor industry overview" |

This classification system evolved through a significant failure. Initially, results that did not match the monitoring intent were classified as IRRELEVANT and removed from the summary entirely. The problem: the summary would say "nothing happened" even when real web changes existed — they just were not relevant to the intent. Users saw an empty summary, lost trust, and could not inspect what the system had actually found.

We changed the system to downgrade off-intent findings to CONTEXT instead of deleting them. The signal score still caps at 25 (via the no-change ceiling, since there are no NEW or UPDATE findings), so the user is not spammed. But the historical record is preserved, and users can see what was checked and why it was classified as off-topic. Each filtered item includes a per-item reason in a diagnostic object (intent_relevance_filter).
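
A sketch of the downgrade-not-delete behavior. The intent_relevance_filter key follows the text; the finding structure and the is_relevant callback are illustrative:

```python
def apply_intent_filter(findings: list[dict], is_relevant) -> tuple[list[dict], dict]:
    """Downgrade off-intent findings to CONTEXT instead of deleting them,
    recording a per-item reason in a diagnostic object."""
    diagnostics = {"intent_relevance_filter": []}
    for finding in findings:
        relevant, reason = is_relevant(finding)
        if not relevant and finding["classification"] in ("NEW", "UPDATE"):
            finding["classification"] = "CONTEXT"   # preserved, but no longer counts as signal
            diagnostics["intent_relevance_filter"].append(
                {"finding": finding["title"], "reason": reason})
    return findings, diagnostics
```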

The Quiet-Run Amnesia Bug

One of the subtler bugs we hit: when run N-1 was a quiet run (signal score 7-8), the delta prompt for run N lost all memory of findings from run N-2 and earlier. This caused stale findings to be re-reported as NEW.

The root cause was that the prior-run context loader was feeding the delta prompt the immediately previous run's summary, which for a quiet run was essentially empty. The system had all the infrastructure for loading multiple prior run summaries — it just was not wired up.

The fix: load recent completed runs, filter out quiet ones (signal ≤ 15), and pass up to 3 substantive prior summaries to the delta prompt. This keeps the LLM's memory proportional to information density rather than temporal proximity. A monitor with 50 completed runs might feed only 8 into the delta prompt.
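
The fix is a small filter over run history. A sketch, assuming prior runs are ordered newest-first and carry a signal score and summary (names are illustrative):

```python
def substantive_prior_summaries(prior_runs: list[dict],
                                quiet_threshold: int = 15,
                                max_summaries: int = 3) -> list[str]:
    """Give the delta prompt memory proportional to information density,
    not temporal proximity: skip quiet runs, keep up to three substantive ones."""
    picked = []
    for run in prior_runs:                          # newest first
        if run["signal_score"] <= quiet_threshold:
            continue                                # quiet run: nothing worth remembering
        picked.append(run["summary"])
        if len(picked) == max_summaries:
            break
    return picked
```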

Self-Healing: When Monitors Drift

A scoring system tells you whether a run was worth escalating. But what happens when the monitor itself is broken — searching for the wrong things, crawling blocked pages, producing chronic noise?

We built an auto-healing system that closes this loop. After postprocessing scores a run, the result is published to a fan-out topic. The auto-healing worker consumes each scored run and evaluates whether the monitor's workflow spec needs repair.

Figure: The auto-healing feedback loop, shown for a monitor that produced helpfulness 35 for three consecutive runs. A low-quality run triggers diagnosis, the LLM composer proposes a spec change, the AI edit worker applies it atomically as a new immutable spec version, the next run uses the healed spec, and effectiveness is tracked for future decisions.

How It Decides to Act

The auto-healer uses a conservative trigger strategy. A single low-signal run is normal — most runs should be quiet. The system only intervenes when evidence accumulates:

  • Helpfulness below 40: Single-run quality failure. Trigger.
  • Trustworthiness = "fail": Broken workflow. Trigger immediately.
  • 3+ consecutive runs with signal < 10: Chronic low signal; something is systematically wrong. Trigger.
  • Quality ≥ 70 AND signal ≥ 45: Working fine. Skip.
  • Signal < 10 AND quality ≥ 50 AND no chronic streak: Acceptably quiet. Skip.

The chronic streak detection matters most. It distinguishes "this monitor is quiet because nothing happened" (normal) from "this monitor is quiet because it is looking in the wrong places" (broken).
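
A sketch of the trigger logic as a single decision function. The thresholds are the ones listed above; the signature and return values are illustrative:

```python
def healing_action(helpfulness: int, trustworthiness: str, signal: int,
                   quality: int, low_signal_streak: int) -> str:
    """Conservative auto-healing trigger: intervene only when evidence accumulates."""
    if trustworthiness == "fail":
        return "trigger"        # broken workflow: intervene immediately
    if helpfulness < 40:
        return "trigger"        # single-run quality failure
    if low_signal_streak >= 3:
        return "trigger"        # chronic low signal: looking in the wrong places
    if quality >= 70 and signal >= 45:
        return "skip"           # working fine
    if signal < 10 and quality >= 50:
        return "skip"           # acceptably quiet
    return "skip"               # default: most runs should be quiet
```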

What the Healer Tried — and What We Learned

The auto-healer originally used OpenRouter's Hunter Alpha model (a fine-tuned open model) for generating spec repairs. The quality was poor — it would propose vague changes like "improve the search queries" without concrete modifications. We switched to Claude Haiku 4.5 at temperature 0.2, which produces specific, actionable edits ("Replace query: 'Oxylabs product updates new features 2026' with type: crawl, urls: ['https://oxylabs.io/blog/product-updates']").

The most important design decision was effectiveness tracking. Every automatic edit records the helpfulness score of the run before the edit and the first run after. This creates a concrete feedback signal:

Edit: "Replace broad news queries with targeted crawls of official announcements"
Helpfulness before: 35
Helpfulness after:  62
Verdict: improved

When the healer considers a new edit, it sees the history of prior edits and their outcomes. If a previous edit regressed helpfulness, the healer is instructed to try a fundamentally different approach or do nothing. This prevents the system from oscillating between two bad strategies.
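
Effectiveness tracking reduces to a before/after comparison attached to each edit. A sketch; the verdict labels beyond "improved" and the dead-band margin are assumptions, not the production scheme:

```python
def edit_verdict(helpfulness_before: int, helpfulness_after: int,
                 margin: int = 5) -> str:
    """Classify an automatic spec edit by its effect on the next run's helpfulness.
    The margin is an assumed dead band to avoid reacting to small fluctuations."""
    delta = helpfulness_after - helpfulness_before
    if delta > margin:
        return "improved"
    if delta < -margin:
        return "regressed"
    return "neutral"

# The example above: edit_verdict(35, 62) returns "improved".
```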

Figure: Signal score over time for the AMD Data Center monitor, with configuration edits marked on the timeline. Real production data: scores of 2-11 across versions 5-9 while queries were too broad, climbing to 46-67 after spec edits at v11 and v13. The flat low stretch is not wasted; it is the evidence that motivated the repair.

The version history in the chart is real. The AMD monitor went through 15 spec versions. Versions 5-9 scored 2-11 because the search queries were too broad, pulling general semiconductor news instead of specific data center adoption signals. The auto-healer identified weak_source_authority as a chronic issue, replaced search queries with targeted crawls of vendor announcement pages, and the scores improved to 46-67 within two versions.

Figure: A before-and-after diff of the monitor's workflow spec, showing a concrete edit in which the auto-healer replaced a broad search query with a direct crawl of the official product updates blog. Version history shows 6 adaptations across 15 versions for this single monitoring branch.

The Evaluation Surface

Persistent monitoring creates something that one-shot search does not: a trail of comparable decisions.

Every run is an observation: what sources were checked, what changed, what was classified as NEW vs. CONTEXT, what score was assigned, whether the notification was delivered. Over hundreds of runs across thousands of monitors, these observations accumulate into a dataset of monitoring judgments:

  • Which source types reliably produce signal vs. chronic churn?
  • Which intent phrasings lead to high-quality coverage vs. noisy results?
  • Which query patterns generate the best information gain per credit spent?
  • Does the escalating churn penalty (-15 to -25) decay fast enough?
  • Is the no-change ceiling of 25 too aggressive or too permissive?

The auto-healer already exploits this structure in a narrow way — tracking before/after helpfulness across edits. But the broader opportunity is to use the accumulated (intent, observation, judgment, outcome) tuples to improve the scoring system itself. The signal judge's posture definitions, the heuristic's penalty weights, the suppression threshold defaults — all of these are parameters that could be tuned against the growing body of production decisions.

The repeated-patrol architecture means every monitor is generating its own evaluation corpus by just running. We do not have to construct synthetic benchmarks or hand-label datasets. The system's own operational history is the evaluation surface.

What We Ship vs. What We Started With

A year ago: Brave Search wrapper → LLM summary → email every run.

Today: 10-stage postprocessing pipeline, dual-path scoring with deterministic guards, content hash detection, finding classification with four categories, escalating churn penalties, prior-run memory filtering, auto-healing with effectiveness tracking, and configurable per-monitor suppression thresholds.

The core insight did not change: most runs should be quiet. What changed was our understanding of how many things have to work correctly for that silence to be trustworthy.

The silence is not the absence of a feature. It is the output of a system that checked, compared, classified, scored, applied guards, evaluated intent, and decided — with evidence — that nothing was worth your attention.

Most runs should be quiet. The engineering is in making that silence earned.

Daniel Campos

Building persistent web monitoring for AI agents. Previously at Microsoft Bing, Snowflake, Neeva, Neural Magic, and Walmart.
