AI Research

Why Agents Need Perception, Not Just Memory

Daniel Campos · 13 min read

Tags: agents, infrastructure, monitoring, perception, zipf

Most of the AI infrastructure conversation right now is about memory. Better retrieval. Better embeddings. Better ways to get the right document into a context window at the right moment. RAG pipelines, vector databases, knowledge graphs, hybrid search — these are real engineering improvements, and if your data is mostly static, they work well.

We spent the last year building the other half: perception. A system that does not wait for questions. It watches on a schedule, remembers what it found last time, compares against a baseline, scores whether the change matters, and stays quiet when it does not. Over a recent 30-day window, the system completed 9,062 patrol runs across 1,349 active monitors and suppressed 75% of notifications. Most runs were quiet. That is the point.

This post explains what makes perception architecturally different from retrieval, what we tried, and what we learned. If you want the full implementation walkthrough — the state machine, the execution engine, the self-healing loop — that is in What a Monitor Actually Does. This post is the why.

What Memory Infrastructure Actually Does

A typical retrieval pipeline for AI agents works like this (sketched in code after the list):

  1. A question arrives.
  2. The system embeds the question into a vector.
  3. It searches an index — a vector database, a full-text search engine, or both — for documents that match.
  4. The top-K results are injected into a prompt.
  5. An LLM generates an answer.
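In code, the whole loop is compact. The sketch below is illustrative rather than any framework's actual API; `embed`, `search`, and `generate` are hypothetical stand-ins for the embedding model, the index, and the LLM:

```python
from typing import Callable, Protocol

class Doc(Protocol):
    text: str

def answer(
    question: str,                                    # 1. a question arrives
    embed: Callable[[str], list[float]],              # 2. question -> vector
    search: Callable[[list[float], int], list[Doc]],  # 3. index lookup
    generate: Callable[[str], str],                   # 5. LLM call
    k: int = 5,
) -> str:
    """Stateless retrieval: every call starts from scratch."""
    query_vec = embed(question)
    docs = search(query_vec, k)
    context = "\n\n".join(d.text for d in docs)       # 4. top-K into the prompt
    return generate(f"Context:\n{context}\n\nQuestion: {question}")
```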

This is a good architecture for a specific class of problem: answering questions about a known corpus. The documents were ingested at some point in the past. The index was built from that snapshot. When a question arrives, the system finds the most relevant slice of that snapshot and produces an answer.

The key structural property is that retrieval is reactive and stateless. Each query starts from scratch. There is no concept of "what the system found last time." There is no comparison against a prior run. The system does not know whether the answer it just gave was also true yesterday, or whether it stopped being true an hour ago.

This is not a criticism. It is a description of the architecture. Retrieval systems are optimized for latency (fast answers) and relevance (right documents). They are not designed to detect change.

Where Memory Fails

The failure mode is subtle: a retrieval system can confidently return an answer that used to be true.

The corpus was indexed at time T. The world changed at time T+1. A question arrives at time T+2. The system searches the index (still reflecting time T), finds a relevant document, and produces a confident answer based on stale data. The user has no way to know the answer is outdated because the system itself does not know.

Concrete examples:

  • A company updates its pricing page. The crawl that originally indexed the page ran last week. The vector database still contains last week's pricing. A user asks "what is Company X's pricing?" and gets a precise, well-cited, wrong answer.

  • A key executive leaves. The information appears first on LinkedIn, then in a press release, then in structured databases. A retrieval system searching the structured database will not find it until days later.

  • A regulatory filing lands after market close. The filing is not yet in any indexed corpus. No amount of retrieval engineering will surface a document that has not been ingested.

The standard response is: "Just re-crawl more frequently." This helps, but it introduces a new set of problems. How frequently? For which sources? You cannot re-crawl the entire web every hour. Even if you could, you would still face the question: which of these re-crawled pages actually changed in a way that matters?

Re-crawling more frequently is a brute-force approach to a problem that requires judgment. The volume of web changes is enormous. Most of them are irrelevant. A system that re-indexes everything and re-answers every question is doing orders of magnitude more work than necessary to catch the small fraction of changes that matter.

What Perception Requires

Perception is architecturally different from retrieval. It is not retrieval with a scheduler bolted on. It requires five capabilities that retrieval systems do not have.

[Figure: side-by-side diagram contrasting memory pipelines with perception pipelines.]
Memory answers from stored data after a question is asked. Perception keeps returning to live sources, compares against a prior baseline, and feeds corrections back into the system.

1. State Across Runs

A retrieval system is stateless per query. A perception system must remember what it found on the previous run so it can compute a diff.

We store the output of every patrol run — every URL found, every page crawled, every content hash — as a manifest. The next run loads the previous manifest and computes a diff: which URLs are new, which dropped, which were retained. For retained URLs, we compare SHA-256 content hashes to detect in-place page modifications — a pricing page that kept its URL but changed its data will have a different hash. The implementation details include URL normalization, baseline establishment on first runs, and finding classification. The key architectural point is that this is not a feature of retrieval — it is a feature of remembering what was true last time.
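A minimal sketch of that diff, assuming a manifest is a mapping from normalized URL to SHA-256 content hash (the production version also handles URL normalization, first-run baselines, and finding classification):

```python
import hashlib

def content_hash(page: bytes) -> str:
    """Hash each crawled page; the manifest stores url -> hash."""
    return hashlib.sha256(page).hexdigest()

def diff_manifests(prev: dict[str, str], curr: dict[str, str]) -> dict[str, set[str]]:
    """Set algebra over two runs' manifests (url -> content hash)."""
    prev_urls, curr_urls = set(prev), set(curr)
    retained = prev_urls & curr_urls
    return {
        "new": curr_urls - prev_urls,
        "dropped": prev_urls - curr_urls,
        # Same URL, different hash: an in-place modification, e.g. a
        # pricing page that kept its URL but changed its numbers.
        "modified": {u for u in retained if prev[u] != curr[u]},
        "unchanged": {u for u in retained if prev[u] == curr[u]},
    }
```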

2. Scheduling

Retrieval is demand-driven: a question arrives, the system answers. Perception is supply-driven: the system checks on a cadence whether the world changed, regardless of whether anyone asked.

This is a different operational model. It requires scheduling infrastructure (we use EventBridge Scheduler), SQS-backed execution queues, and idempotent execution semantics (if a scheduled run fires twice due to SQS redelivery, the second invocation detects the first and exits).
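A sketch of the idempotency check, with a hypothetical RunClaims store standing in for what would in practice be an atomic conditional write keyed on the run ID:

```python
import threading
from typing import Callable

class RunClaims:
    """Hypothetical claim store. Production would use a conditional
    write against durable storage, not process-local memory."""
    def __init__(self) -> None:
        self._claimed: set[str] = set()
        self._lock = threading.Lock()

    def claim(self, run_id: str) -> bool:
        """Atomically claim a run; False if it was already claimed."""
        with self._lock:
            if run_id in self._claimed:
                return False
            self._claimed.add(run_id)
            return True

def handle_message(run_id: str, claims: RunClaims, execute: Callable[[str], None]) -> None:
    # SQS is at-least-once: the same scheduled run can arrive twice.
    if not claims.claim(run_id):
        return  # duplicate delivery; the first invocation already ran
    execute(run_id)
```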

3. Change Detection

A retrieval system returns results ranked by relevance. A perception system must return results ranked by change — specifically, by how much the world moved since the last check.

We compute this at two levels. URL-level change detection uses set algebra to find new, dropped, and retained URLs. Content-level change detection uses SHA-256 hashes on retained URLs to find pages that were modified in place. Both feed into a structured finding classification system that labels each result as NEW (relevant change), UPDATE (modification to a previously reported finding), or CONTEXT (background information).
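An illustrative mapping from diff buckets to those labels (the real classifier weighs more signals than this, but the shape is the same):

```python
from enum import Enum

class Finding(Enum):
    NEW = "new"          # relevant change
    UPDATE = "update"    # modification to a previously reported finding
    CONTEXT = "context"  # background information

def classify(url: str, diff: dict[str, set[str]], previously_reported: set[str]) -> Finding:
    """Label one URL using the diff buckets from the earlier sketch."""
    if url in diff["new"]:
        return Finding.NEW
    if url in diff["modified"] and url in previously_reported:
        return Finding.UPDATE
    return Finding.CONTEXT
```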

4. Judgment

This is the hardest part, and the part that makes perception more than "search on a timer."

The web changes constantly. Most changes do not matter. A perception system must decide which changes deserve attention and suppress the rest. This requires judgment that retrieval systems do not need — because a retrieval system only activates when a human asks, so the human provides the judgment.

[Figure: power-law curve of signal distribution across web changes.]
Meaningful changes follow a power law — rare, unpredictable, and buried in a long tail of irrelevant noise. Broad coverage is necessary, but broad coverage without judgment creates spam.

We call this the Zipfian paradox: the fewer things that matter, the wider you have to look. You cannot predict which tiny fraction of web changes will be important. So you need broad coverage. But broad coverage without scoring produces alert spam.

Our solution is a two-pass signal scoring system — a deterministic heuristic that scores change signals in milliseconds, and an LLM-as-judge that evaluates semantic meaning relative to the user's monitoring intent. The LLM returns three subscores: intent materiality (how relevant is this change to what the user cares about?), global attention (how important is this finding objectively?), and confidence (is this signal or noise?). Deterministic guards clamp the scores to prevent inflation — we found through production experience that the LLM would consistently score no-change runs between 38 and 44 unless we imposed a ceiling.
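A sketch of the guard layer. The three subscore names come straight from the pipeline above; the combination weights and the exact ceiling value are assumptions for illustration:

```python
from dataclasses import dataclass

# Hypothetical ceiling for runs where the diff found nothing; the post
# reports the ungated LLM scoring such runs at 38-44.
NO_CHANGE_CEILING = 20

@dataclass
class Subscores:
    intent_materiality: int  # relevance to what the user cares about, 0-100
    global_attention: int    # objective importance, 0-100
    confidence: int          # signal vs. noise, 0-100

def clamp(x: int, lo: int = 0, hi: int = 100) -> int:
    return max(lo, min(hi, x))

def guarded_score(raw: Subscores, had_changes: bool) -> int:
    """Deterministic guards over LLM-as-judge output: enforce subscore
    ranges, combine, and cap no-change runs to prevent inflation."""
    m = clamp(raw.intent_materiality)
    a = clamp(raw.global_attention)
    c = clamp(raw.confidence)
    # Illustrative combination: weight intent over global attention,
    # attenuated by confidence.
    score = round((0.6 * m + 0.4 * a) * (c / 100))
    if not had_changes:
        score = min(score, NO_CHANGE_CEILING)  # the ceiling guard
    return score
```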

The details of this scoring system, including the specific failures that motivated each guard, are in Most Runs Should Be Quiet.

5. Delivery

A retrieval system returns answers when asked. A perception system must deliver signals proactively — to Slack, email, CRM systems, webhooks, or downstream AI agents — because the value of a signal degrades with latency. A pricing change noticed three hours late is still useful. A pricing change noticed three days late may not be.

Delivery includes suppression logic. If the signal score falls below a per-monitor threshold, the notification is suppressed. The patrol still ran. The results are still stored. The user can inspect them. But the system decided — based on the scoring pipeline — that nothing crossed the bar for active notification. Over a recent 30-day window, 75% of notifications were suppressed by this logic.
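The decision itself is small once scoring is done; a sketch with hypothetical `store` and `notify` callables:

```python
from typing import Callable

def deliver_or_suppress(
    score: int,
    threshold: int,  # per-monitor suppression threshold
    findings: list[dict],
    store: Callable[[list[dict]], None],
    notify: Callable[[list[dict]], None],
) -> bool:
    """Results are always stored; notification fires only if the
    signal score clears the monitor's threshold."""
    store(findings)          # the patrol ran; evidence is kept either way
    if score < threshold:
        return False         # suppressed: a quiet run
    notify(findings)         # Slack, email, webhook, downstream agent...
    return True
```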

The Capability Gap

| Capability | Retrieval (Memory) | Monitoring (Perception) |
|---|---|---|
| Trigger | User asks a question | Schedule fires |
| State | None across queries | Prior-run manifest, content hashes |
| Unit of work | Query → ranked results | Patrol → scored signal |
| Change awareness | None (snapshot at ingestion) | URL diff + content hash + finding classification |
| Judgment | Relevance to query | Materiality of change vs. monitoring intent |
| Output | Answer on demand | Delivered signal with evidence (or suppression) |
| Failure mode | Stale answer served confidently | Missed signal or noisy alert |
| Value | Know faster | Know first |

Most agent systems today have sophisticated memory and no perception at all. They answer questions about a frozen snapshot of the world. The more confidently they answer, the more dangerous the staleness becomes, because the confidence masks the fact that the underlying data may have changed.

What We Tried and What Broke

We did not start with the architecture described above. We started with something much simpler, and the failures taught us why perception is its own engineering problem.

Version 1: Search wrapper with email delivery. The original system took a monitoring intent, ran a set of search queries, passed the results to an LLM for a summary, and emailed the summary to the user. Every run produced a notification. There was no scoring, no suppression, no concept of "quiet." The LLM always found something to say — given any set of search results, it would reliably produce a multi-paragraph summary even when nothing had changed. Users stopped reading the emails within a week.

Version 2: Added change detection. We introduced URL-level set algebra so the system could compute what was new, dropped, and retained. This helped — the system could now distinguish "same results as last time" from "genuinely new URLs appeared." But URL-level diffs miss in-place content changes (same URL, different page content), and they cannot distinguish a meaningful new URL from a noise URL. The system was better but still too noisy.

Version 3: Added signal scoring. We built the deterministic heuristic — a fast 0-100 score based on URL changes, recency, content hash divergence, and penalty factors. This allowed suppression: runs scoring below a threshold did not generate notifications. The noise problem improved dramatically. But the heuristic had blind spots. It could not distinguish a genuinely important article from a repost. It had no concept of the user's monitoring intent. Two monitors tracking different things would score the same result set identically.
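A sketch of what such a heuristic can look like. The inputs (URL changes, recency, hash divergence, penalties) are from the text; the weights are invented for illustration:

```python
def heuristic_score(diff: dict[str, set[str]], newest_result_age_hours: float) -> int:
    """Fast deterministic 0-100 score over the diff buckets."""
    score = 0
    score += 15 * min(len(diff["new"]), 4)       # new URLs dominate
    score += 10 * min(len(diff["modified"]), 3)  # in-place content changes
    score += 5 * min(len(diff["dropped"]), 2)    # dropped URLs matter less
    if newest_result_age_hours <= 24:
        score += 10                              # recency bonus for fresh results
    if not diff["new"] and not diff["modified"]:
        score -= 20                              # penalty: nothing moved
    return max(0, min(100, score))
```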

Version 4: Added LLM-as-judge scoring. We added a second scoring path where an LLM evaluates the run's findings relative to the monitoring intent and returns structured subscores. This gave us intent-aware scoring. But the LLM could not be trusted to produce well-calibrated numbers. Within the first week, we found it inflating scores on no-change runs (the "everything is interesting" failure), violating subscore range constraints, and producing scores inconsistent with its own posture declarations. We added deterministic guards — ceiling caps, range clamping, posture band enforcement — and the calibration stabilized.

Version 5: Added self-healing. Even with good scoring, monitors drift. A vendor blocks crawl requests. A page migrates to a new URL. A search query that produced targeted results starts returning generic news. We built an auto-healing loop: after each run is scored, the system evaluates whether the monitor's workflow spec needs repair. If quality is chronically low, an LLM composer proposes a concrete fix (replace this search query with a direct crawl of the official blog), a separate worker applies the fix as a new immutable spec version, and effectiveness is tracked via before/after helpfulness scores.

Version 6: Added Chronicle. When a monitor has been running for weeks, individual run summaries are less useful than the longitudinal arc. We built Chronicle — a system that compresses dozens of patrol executions into a single narrative via deterministic entity tracking and LLM synthesis. This is where monitoring stops being a series of disconnected pings and starts being a system of knowledge about a market, a company, or a thesis.

Each version addressed a failure in the previous one. The architecture we have today is the accumulation of those failures and their fixes.

Why This Is Not a Feature on Top of Retrieval

It is tempting to think of monitoring as "retrieval, but automated." Run the same search every day. Diff the results. Email the user. Done.

We tried that. It was Version 1. The problems were fundamental, not incidental:

  • Without state, you cannot compute a meaningful diff. You need to store and compare manifests across runs.
  • Without judgment, every diff produces a notification. You need a scoring system that understands both change magnitude and intent relevance.
  • Without self-repair, monitors degrade silently. You need a feedback loop that detects and fixes drift.
  • Without longitudinal memory, each run is isolated. You need a system that accumulates understanding over time.

These are not incremental improvements to a retrieval pipeline. They are different capabilities that require different data structures (manifests, baselines, version histories), different operational models (scheduled execution, SQS fan-out, durable job tracking), and different evaluation criteria (signal calibration, suppression rates, self-healing effectiveness).

Perception is not retrieval on a timer. It is its own problem.

The Systems That Will Matter

We do not think retrieval is unimportant. Vector databases are useful. RAG pipelines are useful. Search APIs are useful. Memory is real progress.

But memory is only half the stack.

The agent systems that matter over the next few years will need both:

  • Memory — so agents can recall what they know and answer questions from a known corpus.
  • Perception — so agents can tell when what they know stopped being true, without waiting for someone to ask.

Memory tells an agent what it knew. Perception tells an agent whether it is still true.

The gap today is clear. Most agent infrastructure is heavily invested in memory and has almost no perception at all. The Zipfian paradox — the fewer things that matter, the wider you have to look — means that closing this gap is not a matter of adding a cron job to an existing retrieval system. It is a different category of infrastructure, with its own failure modes, its own evaluation surfaces, and its own hard problems.

The best systems will have both.

Daniel Campos

Building persistent web monitoring for AI agents. Previously at Microsoft Bing, Snowflake, Neeva, Neural Magic, and Walmart.
