
How I Built a Real-Time Geopolitical Event Pipeline with Embeddings and LLMs

I run a side project called Global Crisis Monitor — a real-time geopolitical intelligence dashboard that turns 100+ news feeds into structured, impact-scored events on a map. The first post covered the why. This one covers the how: every pipeline stage, every threshold, and every tradeoff I made along the way.

The system runs continuously on a Hetzner server orchestrated by Coolify, processes around 1,000 articles per day, and produces structured events that appear on the dashboard typically within 10–15 minutes of an article being published. At this volume, LLM cost for synthesis is on the order of a few dollars per day; embeddings are free (Ollama). If you’re building anything that turns many documents into deduplicated, structured insights — news aggregation, support-ticket triage, or research synthesis — the architecture in this post transfers directly.

TL;DR — A 7-stage pipeline: Fetch (RSS, incremental, watermarked) → Embed (Ollama, 768-dim) → Cluster (pgvector, 0.72 similarity, 20-min hold) → Synthesize (3-pass LLM: triage, extraction, assessment with calibration anchors) → Conflict Link (keyword + embedding, multi-match) → Enrich (GDELT, ReliefWeb, UCDP, Wikipedia) → Publish (Telegram, IndexNow, social). Key lesson: embed and cluster before you send anything to an LLM, or you get duplicates and drift.

The Pipeline at a Glance

The core runs as a 7-stage sequential pipeline. Each stage reads from and writes to a shared PostgreSQL database on the same server, so stages are independently re-runnable without triggering the full pipeline.

1. Fetch → 2. Embed → 3. Cluster → 4. Synthesize → 5. Conflict Link → 6. Enrich → 7. Publish

The ingester is a Hono microservice running on Bun. The dashboard is a Next.js app (App Router, React 19). Both share the same Postgres through Prisma.

Schedule and re-runs. Each stage runs on its own cron; they don’t wait for each other. Fetch runs every few minutes, Embed and Cluster more frequently so new articles move through quickly, Synthesize and Enrich when there’s work to do. Because each stage only processes records in the right status (e.g. Fetch only inserts new articles, Embed only picks fetched articles), you can re-run a stage without re-triggering the whole pipeline — useful after a deploy or to backfill. Concurrency is single-worker per stage: one Fetch run at a time, one Synthesize run at a time. FOR UPDATE SKIP LOCKED (see Data Model) ensures that if you ever run multiple workers, they won’t double-process the same rows.

Below is the full architecture, stage by stage.

Pipeline Health — stage status, processing times, article queue (fetched → embedded → clustered → synthesized), database totals

Global Crisis Monitor dashboard — map, event feed, and latest briefing

Stage 1: Fetch — Incremental RSS Ingestion

The first stage pulls from 100+ RSS feeds: Reuters, BBC, Al Jazeera, The Guardian, UN agencies, Chatham House, ICIJ, ProPublica, and dozens more.

Each Feed record in the database stores watermarks: lastItemPubDate and lastItemUrl. On every run, we only fetch articles published after the watermark. Feeds are tiered:

Tier | Type | Examples | Max articles per fetch
1 | Wire/breaking | AP, Reuters, AFP | 15
2 | Standard | All others | 5
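The watermark-plus-tier-cap selection can be sketched as a small pure function. Names like `selectNewItems` and the `Feed` shape are illustrative, not the actual Prisma schema:

```typescript
interface RssItem { url: string; pubDate: Date; title: string }
interface Feed { lastItemPubDate: Date | null; tier: 1 | 2 }

// Per-tier fetch caps: 15 for wire/breaking feeds, 5 for standard feeds.
const TIER_CAPS: Record<1 | 2, number> = { 1: 15, 2: 5 };

function selectNewItems(feed: Feed, items: RssItem[]): RssItem[] {
  const watermark = feed.lastItemPubDate;
  return items
    // Only items published strictly after the watermark are new.
    .filter((it) => watermark === null || it.pubDate.getTime() > watermark.getTime())
    // Newest first, then cap at the tier limit.
    .sort((a, b) => b.pubDate.getTime() - a.pubDate.getTime())
    .slice(0, TIER_CAPS[feed.tier]);
}
```

After a successful run, the feed's `lastItemPubDate` would be advanced to the newest accepted item, so the next fetch skips everything already seen.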

Each feed is assigned a weight between 0.75 and 0.96. Wire services and UN agencies sit near the top (0.95–0.96); opinion outlets and less-established sources land lower. These weights influence confidence scores downstream.

Deduplication starts here with two checks:

Check | How it works
Exact URL match | Article.url has a unique constraint
Content hash | SHA-256 of the normalised title catches the same story syndicated across outlets

Title normalisation (trim, lowercase, collapse whitespace) is applied before the hash and again before embedding so syndicated variants match. We don’t hammer feeds: tier limits (5 or 15 articles per fetch) and staggered cron keep us well under any reasonable rate limit; the main backpressure is “only fetch after watermark.”
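A minimal sketch of the content-hash check, using the normalisation steps described above (function names are mine, not the codebase's):

```typescript
import { createHash } from "node:crypto";

// Normalise the title (trim, lowercase, collapse whitespace), then
// SHA-256 the result so the same story syndicated across outlets
// collides on the hash even when the URLs differ.
function normaliseTitle(title: string): string {
  return title.trim().toLowerCase().replace(/\s+/g, " ");
}

function contentHash(title: string): string {
  return createHash("sha256").update(normaliseTitle(title)).digest("hex");
}
```

With a unique index on the hash column, inserting a syndicated duplicate fails fast and the article is skipped.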

Failed feeds increment a consecutiveErrors counter and are skipped after 5 consecutive failures, so one broken RSS feed can’t stall the pipeline.

Stage 2: Embed — Self-Hosted Vector Embeddings

Every new article gets a 768-dimensional vector embedding via Ollama running nomic-embed-text locally on the server. I switched from OpenAI’s text-embedding-3-small to a self-hosted model to eliminate per-request costs and latency to an external API.

What gets embedded. We embed a single concatenated string: normalised title + first 500 characters of the article body (or summary if present). That keeps the vector focused on the main claim and early context; full-body embedding would dilute signal and blow token limits. The same normalisation as for the content hash (trim, lowercase, collapse whitespace) is applied before sending to Ollama.
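The embedding input can be sketched like this. One assumption to flag: I'm treating the summary as taking precedence over the body when both exist, which is one reading of "body (or summary if present)"; the actual precedence may differ.

```typescript
// Build the single string sent to Ollama: normalised title plus the
// first 500 characters of context, normalised the same way as the
// content hash.
function normalise(s: string): string {
  return s.trim().toLowerCase().replace(/\s+/g, " ");
}

function buildEmbedText(title: string, body?: string, summary?: string): string {
  const context = normalise(summary ?? body ?? "").slice(0, 500);
  return context ? `${normalise(title)}\n${context}` : normalise(title);
}
```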

The embeddings are batched (100 texts per Ollama request) and stored directly on the Article record as a vector(768) column in Postgres, leveraging pgvector.

The key insight: embed before you send anything to an LLM. My first approach was to batch articles and send each batch directly to GPT-4 for synthesis. The result was duplicate events — the same incident in different batches became two separate events. Vector embeddings solve this by grouping similar articles first; the LLM then only sees pre-grouped clusters.

Stage 3: Cluster — HNSW-Accelerated Similarity Grouping

This is where the magic happens. Articles about the same real-world event sit close together in vector space, even when the wording differs completely:

“Kyiv power grid struck by overnight barrage” ↔ “Ukraine capital hit by drone and missile attack”

These two headlines produce embeddings with ~0.87 cosine similarity — well above the CLUSTER_THRESHOLD of 0.72.

The clustering algorithm:

Step | Action
1 | Take all articles with status embedded
2 | For each article, search existing pending clusters using pgvector’s cosine distance (<=> operator) against the cluster centroid
3 | If similarity ≥ 0.72 → assign to that cluster; update centroid: centroid += (new_embedding - centroid) / n (n = cluster size after adding the article)
4 | If no match → create a new single-article cluster
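In production the similarity search runs inside Postgres via pgvector's `<=>` operator; this in-process sketch shows the same math, including the running-mean centroid update from step 3:

```typescript
const CLUSTER_THRESHOLD = 0.72;

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// n is the cluster size *after* the new article is added, so the
// centroid stays the exact mean of all member embeddings.
function updateCentroid(centroid: number[], embedding: number[], n: number): number[] {
  return centroid.map((c, i) => c + (embedding[i] - c) / n);
}

function shouldJoin(embedding: number[], centroid: number[]): boolean {
  return cosineSimilarity(embedding, centroid) >= CLUSTER_THRESHOLD;
}
```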

I use HNSW indexes on the vector columns for approximate nearest-neighbor search, which keeps cluster assignment sub-millisecond even as the article count grows.

One subtlety I learned the hard way: you need a hold time before synthesis. My initial approach synthesised clusters immediately, which meant an article arriving 5 minutes later — about the same event — would create a new cluster instead of joining the existing one. The fix was a 20-minute hold (CLUSTER_HOLD_MINUTES = 20): clusters sit in pending status for at least 20 minutes before they’re eligible for synthesis, giving time for related articles from different feeds to accumulate.

Stage 4: Synthesize — Multi-Pass LLM Processing

This is the most complex stage. Each mature cluster (one that has passed the 20-minute hold) goes through three LLM passes:

Pass 1: Triage (gpt-4o-mini)

A fast relevance filter. The LLM reads the cluster’s articles and decides: is this a geopolitical/conflict/humanitarian event, or is it sports, entertainment, or a product launch? Irrelevant clusters are discarded. Relevant ones get an initial title and summary.

Clusters are batched in groups of 10 for throughput. If an LLM call fails (timeout, rate limit, or API error), we retry with exponential backoff (up to 3 attempts); if it still fails, the cluster is marked failed and we move on so the pipeline doesn’t block. At ~1k articles/day we stay under OpenAI rate limits; batching and the tier-based model choice (gpt-4o only for 3+ article clusters) keep token usage in check.

Pass 2: Extraction (gpt-4o-mini)

Structured metadata extraction from the cluster:

Field | Description
Categories | Controlled vocabulary (or new ones if needed)
Entities | Organisations, governments, armed groups, with their roles
People | Notable figures with titles
Locations | Primary (where the action happened) or secondary (mentioned but not central)
Temporal data | Event date, ongoing-since date, deadlines

Pass 3: Assessment (gpt-4o or gpt-4o-mini)

Here’s where model selection gets interesting. Clusters with 3+ articles or tier-1 sources use gpt-4o for assessment; smaller clusters use gpt-4o-mini. The reasoning: multi-source events are higher signal and justify the cost of a stronger model.

The assessment produces:

Output | Values / meaning
Sentiment direction | Escalating / De-escalating / Stable / Unknown
Sentiment intensity | 0–100 (severity of current state)
Impact scores (0–100 each) | Humanitarian, Economic, Security, Political, Overall
Analysis notes | 1–2 sentence analyst assessment

Impact dimensions: Humanitarian (casualties, displacement), Economic (market disruption, sanctions), Security (military, terrorism), Political (government changes, diplomacy). Overall reflects the most significant dimension.

The prompt includes calibration anchors to keep impact scores consistent:

Example event | Overall score
Russia–Ukraine invasion (2022) | 95
UN sanctions resolution | 45
Minor border skirmish | 15
Massive earthquake with international response | 75

Without anchors, LLMs drift — they either score everything as high-impact or compress all scores into a narrow band. The anchors keep the scale meaningful.
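A sketch of how anchors end up in the prompt. This is an assumed prompt shape, not the production prompt; only the anchor events and scores come from the table above:

```typescript
// Bake fixed reference points into the assessment system prompt so
// every run scores against the same scale.
const ANCHORS: Array<[string, number]> = [
  ["Russia-Ukraine invasion (2022)", 95],
  ["Massive earthquake with international response", 75],
  ["UN sanctions resolution", 45],
  ["Minor border skirmish", 15],
];

function buildAssessmentPrompt(): string {
  const lines = ANCHORS.map(([event, score]) => `- ${event}: ${score}`);
  return [
    "Score the overall impact of the event from 0 to 100.",
    "Calibrate against these anchor events:",
    ...lines,
    "Stay consistent with the anchors; do not compress all scores into a narrow band.",
  ].join("\n");
}
```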

Confidence Scoring

After synthesis, confidence is scaled by source count:

Sources | Confidence multiplier
3+ | 1.0×
2 | 0.8×
1 | 0.6×

Single-source events are inherently less trustworthy, and the score reflects that.
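As code, the scaling is a one-liner worth having in one place (the function name is mine):

```typescript
// Map source count to the confidence multiplier from the table.
function confidenceMultiplier(sourceCount: number): number {
  if (sourceCount >= 3) return 1.0;
  if (sourceCount === 2) return 0.8;
  return 0.6; // single-source events are least trusted
}
```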

Deduplication Across Runs

After synthesising a new event, we check it against all existing events via embedding similarity with a DEDUP_THRESHOLD of 0.82. Above that threshold, we merge — updating the existing event with new sources and refreshing metadata — instead of inserting a duplicate.

Stage 5: Conflict Link — Mapping Events to Ongoing Crises

Events don’t exist in isolation. The conflict-link stage maps them to ongoing crises (“Russia-Ukraine War”, “Israel-Palestine Conflict”, “Iran Conflict Escalation”) and detects new conflicts emerging from patterns.

Conflict Pulse — conflict pages with status, linked events, and timeline

Linking to Existing Conflicts

Two-pass matching:

Pass | Type | How it works
1 | Keyword match (fast) | Each conflict stores locationKeywords and entityKeywords; if the event text contains any of them, the event is linked. Events can link to multiple conflicts — I learned this when “Iran Conflict Escalation” showed 0 events because the code had a break after the first match.
2 | Embedding similarity (slow) | If no keyword match, compare the event embedding to conflict embeddings; threshold 0.72.
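A sketch of pass 1 with the multi-match fix: collect every matching conflict rather than breaking on the first hit. The `Conflict` shape here is illustrative:

```typescript
interface Conflict {
  id: string;
  locationKeywords: string[];
  entityKeywords: string[];
}

function keywordLinkedConflicts(eventText: string, conflicts: Conflict[]): string[] {
  const text = eventText.toLowerCase();
  return conflicts
    .filter((c) =>
      [...c.locationKeywords, ...c.entityKeywords].some((kw) => text.includes(kw.toLowerCase())),
    )
    .map((c) => c.id); // no early return: an event can link to multiple conflicts
}
```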

Detecting New Conflicts

When 3+ unlinked events cluster together by topic, the LLM proposes a new conflict: name, region, description, and keywords. These proposals are deduplication-checked (0.82 similarity) against existing conflicts to avoid creating “Sudan Civil War” twice.

Conflict Status Reassessment

Every 6 hours, conflicts are reassessed based on recent event patterns:

Status | Meaning
escalating | Intensity increasing
active | Ongoing
de-escalating | Intensity decreasing
dormant | No events in 72 hours → trends toward this

Stage 6: Enrich — External Data Integration

After synthesis and conflict linking, events are enriched from multiple external sources:

Source | What it adds | Impact adjustment
GDELT 2.0 | Event volume and tone | High volume (>500 mentions): +5; strongly negative tone (<-3): +5
ReliefWeb (UN OCHA) | Active disasters in the event’s region | One active disaster: +3; three or more: +5 more
UCDP (Uppsala Conflict Data Program) | Battle-related fatalities | Any fatalities: +5; >100: +5 more
Wikipedia | Entity enrichment | People, locations, organisations → WikiEntity table (extracts, thumbnails, Wikidata IDs)

Impact adjustments from enrichment are clamped to ±15 points to prevent external data from overwhelming the LLM’s assessment. If an external API (GDELT, ReliefWeb, UCDP) is down or times out, we skip enrichment for that event and leave impact as the LLM-assessed value; no retry queue, so we don’t backlog. At current volume we’re well under any public API rate limits.
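The "adjust, don't override" rule can be sketched as: sum the per-source bumps, clamp the total to ±15, and keep the final score inside 0–100 (function name is mine):

```typescript
function applyEnrichment(baseImpact: number, adjustments: number[]): number {
  const total = adjustments.reduce((sum, a) => sum + a, 0);
  // External data can nudge the LLM's score, never overwhelm it.
  const clamped = Math.max(-15, Math.min(15, total));
  return Math.max(0, Math.min(100, baseImpact + clamped));
}
```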

Event Relations

A separate sub-stage builds causal links between events using heuristics:

Factor | Score | Cap
Same location | +0.4 | —
Each shared entity | +0.15 | 0.45
Each shared person | +0.15 | 0.45
Search window | 7 days | —

Relation types: escalation_of, continuation_of, related_to, response_to. The dashboard uses these to show “Related Events” on each event page; embedding-based relations (EventRelation) take priority over the heuristic fallback. In the admin Pipeline view, the Link-relations and Classify-relations steps are the sub-jobs that build and type these EventRelation rows (heuristic scoring first, then optional LLM pass to classify the relation type).
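The heuristic score from the table can be sketched as follows: same location is worth 0.4, each shared entity or person 0.15, with each shared-name contribution capped at 0.45 (the function name is mine):

```typescript
function relationScore(opts: {
  sameLocation: boolean;
  sharedEntities: number;
  sharedPeople: number;
}): number {
  // Cap each shared-name contribution so many weak overlaps can't
  // outweigh a strong location match.
  const entities = Math.min(opts.sharedEntities * 0.15, 0.45);
  const people = Math.min(opts.sharedPeople * 0.15, 0.45);
  return (opts.sameLocation ? 0.4 : 0) + entities + people;
}
```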

Stage 7: Publish — Alerts and Search Indexing

The final stage handles distribution:

Telegram Alerts

Events with impact ≥ 78 trigger a Telegram alert to a monitoring channel. The format is deliberate:

Impact | Emoji
90+ | 🚨
78+ |

Plus title, location, impact score, and a direct link.

IndexNow

All new events (not just high-impact) are submitted to the IndexNow protocol, which notifies Bing, Yandex, Naver, and Seznam about new URLs. It’s a single POST request with up to 10,000 URLs. Google doesn’t support IndexNow, but the others do — and fast indexing matters for news content.
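A sketch of the submission, following the public IndexNow protocol (the key is a placeholder; api.indexnow.org forwards the ping to all participating engines):

```typescript
function buildIndexNowPayload(host: string, key: string, urls: string[]) {
  // The protocol allows at most 10,000 URLs per POST.
  if (urls.length > 10_000) {
    throw new Error("IndexNow accepts at most 10,000 URLs per request");
  }
  return { host, key, urlList: urls };
}

async function submitToIndexNow(host: string, key: string, urls: string[]): Promise<number> {
  const res = await fetch("https://api.indexnow.org/indexnow", {
    method: "POST",
    headers: { "Content-Type": "application/json; charset=utf-8" },
    body: JSON.stringify(buildIndexNowPayload(host, key, urls)),
  });
  return res.status; // 200 or 202 means the batch was accepted
}
```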

Briefings

A separate job (every 20 minutes) builds a briefing from the last 24 hours of events: the LLM gets the event set and produces a short narrative summary plus key takeaways. The result is stored as a Briefing record and rendered at /briefings/[slug]; the same content is then used for social publishing.

Social Publishing

Briefings are auto-published to X/Twitter and Threads. The X publisher does character-count budgeting (280 chars minus t.co URL length minus hashtags) to maximise the summary excerpt that fits.
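The budgeting can be sketched like this. It is a simplified character count: X wraps every link at a fixed t.co length (23 characters at the time of writing), and the real service uses weighted counting for non-ASCII characters, which this sketch ignores:

```typescript
const TCO_URL_LENGTH = 23;
const POST_LIMIT = 280;

function buildPost(summary: string, url: string, hashtags: string[]): string {
  const tags = hashtags.join(" ");
  // Overhead: newline before the URL, the wrapped URL itself, plus a
  // space and the hashtags if present.
  const overhead = 1 + TCO_URL_LENGTH + (tags ? 1 + tags.length : 0);
  const budget = POST_LIMIT - overhead;
  const excerpt = summary.length <= budget ? summary : summary.slice(0, budget - 1) + "…";
  return tags ? `${excerpt}\n${url} ${tags}` : `${excerpt}\n${url}`;
}
```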

The Dashboard: Next.js with Shared Prisma

The dashboard is a Next.js 16 app (App Router) that reads from the same Postgres. Key surfaces:

Route | Purpose
Home | Bento layout: Leaflet map (markers by impact, flash events pulsing), event feed, latest briefing
/news | Searchable archive — TanStack Table + Query, server-side pagination/sort/filter by impact, location, people, entities, date range
/news/[slug] | Event detail — JSON-LD NewsArticle, related events, linked conflicts, source attribution
/briefings/[slug] | AI-generated briefings — JSON-LD AnalysisNewsArticle
/conflicts/[slug] | Conflict pages — JSON-LD LiveBlogPosting, causal narratives, timeline of linked events
Admin | Internal: overview, pipeline health, entities, vocabulary (role-protected)

Entity Index — entity detail with metrics, 30-day activity graph, event timeline, and related conflicts

Admin dashboard

Alongside the public site, the same Next.js app has an admin area for pipeline ops and content curation. It’s the same Prisma + Postgres, with role-based access so only I can reach it. Access is session-based (login); secrets (OpenAI key, Telegram bot token, DB URL) live in environment variables on the server, not in the repo. Four main views:

Overview — High-level counts (articles, events, clusters, conflicts), recent activity, and a quick sense of pipeline throughput.

Admin overview — counts, recent activity, pipeline throughput

Pipeline — Per-stage status, last run times, processing duration, and article queue (fetched → embedded → clustered → synthesized). Auto-refreshes so I can see at a glance if a stage is stuck or failing.

Admin pipeline — stage status, run times, article queue

Entities — Browse and edit the entities (people, organisations, locations) that the pipeline and LLM produce. Useful for merging duplicates, fixing names, or adjusting how entities attach to events.

Admin entities — browse and edit people, organisations, locations

Vocabulary — Manage the controlled vocabularies used in synthesis: categories, conflict types, impact labels. When the LLM invents a new category, it shows up here; I can promote it to the official list or fold it into an existing term so future events stay consistent.

Admin vocabulary — categories, conflict types, impact labels

The admin doesn’t run the pipeline (that’s the Hono ingester on cron); it’s for visibility and for cleaning up the data the pipeline produces. For observability beyond the UI: the ingester logs structured JSON (stage, count, duration, errors) to stdout; Coolify captures logs. I don’t run a separate metrics stack, but the admin Pipeline view plus “no events in the last 2 hours” is enough to notice when something’s stuck. Alerts could be added (e.g. Telegram on repeated stage failure) if the pipeline were critical.

SEO

SEO for a news site is non-trivial; the current setup includes:

Component | What it does
Dynamic sitemap | All events, briefings, conflicts, entities, countries — changeFrequency and priority
Google News sitemap | news-sitemap.xml with <news:news> tags for events in the last 48 hours
RSS feeds | /news/feed.xml, /briefings/feed.xml — auto-discovery via <link rel="alternate">
Structured data | NewsArticle, AnalysisNewsArticle, LiveBlogPosting with full entity markup
llms.txt | Machine-readable site description for LLM crawlers

The Data Model

The Prisma schema has grown to ~20 models. The core ones:

Feed → Article → ArticleCluster → Event → Conflict
                                       → EventRelation
                                       → EventLocation
                                       → EnrichmentData
                                       → WikiEntity

Key design decisions:

Decision | Rationale
M2M Events ↔ Conflicts | An event can belong to multiple conflicts
Vector columns everywhere | Articles, Events, and Conflicts all have 768-dim embedding columns for similarity search
Pipeline status on Article | fetched → embedded → clustered → synthesized → failed tracks progress
FOR UPDATE SKIP LOCKED | Prevents concurrent runs from processing the same articles; safe for parallel invocation
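The SKIP LOCKED claim pattern can be sketched as a single query. Table, column, and status names here are hypothetical stand-ins, not the actual schema; the shape of the query is the standard Postgres work-queue idiom:

```typescript
// Each worker atomically claims a batch of fetched articles; SKIP LOCKED
// makes a second concurrent worker skip rows the first one already holds
// instead of blocking or double-processing them.
const CLAIM_FETCHED_ARTICLES = `
  UPDATE "Article"
  SET status = 'embedding'
  WHERE id IN (
    SELECT id FROM "Article"
    WHERE status = 'fetched'
    ORDER BY "publishedAt"
    LIMIT 100
    FOR UPDATE SKIP LOCKED
  )
  RETURNING id;
`;
```

Run inside a transaction (e.g. via Prisma's `$queryRawUnsafe` or a raw `pg` client), this is what makes "one worker per stage today, more tomorrow" safe without any extra coordination.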

Backfill and schema changes. When we add a stage or change logic, we re-run from the right place: e.g. after an embedding model change, reset embedded articles to fetched and re-run Embed and downstream. Prisma migrations apply to the shared DB; we run them before deploying a new ingester or dashboard version so the schema is always compatible. There’s no separate backfill script — just “fix status and re-run the stage.”

Data retention. We keep all articles and events; no TTL or automatic pruning. For a side project at ~1k articles/day, storage is negligible. If we needed to reclaim space, we’d archive old articles (e.g. older than 1 year) to cold storage and keep events and clusters for the dashboard.

Numbers and Thresholds

Here’s the full reference table of thresholds I’ve tuned over time:

Parameter | Value | What it controls
CLUSTER_THRESHOLD | 0.72 | Min similarity to join a cluster
DEDUP_THRESHOLD | 0.82 | Min similarity to merge with an existing event
CONFLICT_LINK_THRESHOLD | 0.72 | Min similarity to link event → conflict
CONFLICT_DEDUP_THRESHOLD | 0.82 | Min similarity for new conflict proposals
CLUSTER_HOLD_MINUTES | 20 | Wait time before synthesising a cluster
MAX_CLUSTER_SIZE | 25 | Max articles per synthesis call
STRONG_MODEL_THRESHOLD | 3 articles | Clusters with ≥3 articles → gpt-4o
ALERT_IMPACT_THRESHOLD | 78 | Min impact for Telegram alerts
MIN_EVENTS_FOR_NEW_CONFLICT | 3 | Unlinked events needed to propose a conflict

The 0.72 cluster threshold was lowered from 0.78 after switching from OpenAI embeddings to nomic-embed-text. Different embedding models produce different similarity distributions — what was 0.82 similarity in OpenAI space might be 0.75 in nomic space for the same article pair. Tuning these thresholds after an embedding model change is essential.

What I Learned

1. Embed first, LLM second. Sending raw articles to an LLM for grouping produces duplicates, inconsistent clusters, and high costs. Embedding-based clustering gives you deterministic, cheap grouping — the LLM only sees pre-grouped clusters.

2. Hold time matters. Real-world events don’t arrive at the same time across all feeds. Reuters might publish 10 minutes before The Guardian. Without a hold time, you synthesise too early and create duplicate events from late-arriving articles.

3. Calibration anchors prevent score drift. Without concrete examples in the prompt, impact scores drift across runs. One day everything is 80+, the next everything is 40. Anchoring to known events (Russia-Ukraine = 95, minor skirmish = 15) keeps the scale stable.

4. Multi-match, not first-match. Events often relate to multiple ongoing conflicts. A drone strike in the Red Sea might be relevant to both “Yemen-Houthi Conflict” and “Israel-Palestine Conflict”. Linking to only the first match loses context.

5. Enrichment should adjust, not override. External data (GDELT volume, UCDP fatalities) provides valuable calibration signals, but clamping adjustments to ±15 points prevents external noise from overriding the LLM’s assessment.

6. Self-hosted embeddings pay off fast. At ~1,000 articles/day, OpenAI embedding costs were small but nonzero and added latency. Ollama running nomic-embed-text on the same server produces embeddings in milliseconds with zero API cost. The quality difference for clustering purposes is negligible.

What I’d Do Differently Next Time

Area | Current behaviour | Improvement
Cluster centroids | Updated incrementally as articles join | Recompute from all member embeddings periodically for stability
Triage | Some sports/entertainment clusters slip through | More negative examples in the triage prompt
Conflict embeddings | Computed once at creation | Refresh from recent linked events to improve link accuracy

Stack Summary

Runtime

Component | Tech
Ingester | Hono, Bun
Dashboard | Next.js 16, React 19
Hosting | Hetzner, Coolify

Data

Component | Tech
Database | PostgreSQL, pgvector
ORM | Prisma 7

AI

Component | Tech
Embeddings | Ollama (nomic-embed-text, 768-dim)
LLM | OpenAI (gpt-4o, gpt-4o-mini)

Integrations

Component | Tech
External data | GDELT, ReliefWeb, UCDP, Wikipedia
Alerting | Telegram, X, Threads
SEO | Dynamic sitemaps, Google News sitemap, IndexNow, JSON-LD

The pattern — fetch → embed → cluster → synthesise → conflict link → enrich → publish — is general-purpose. If you’re building any system that turns many documents into structured, deduplicated insights — whether that’s news, support tickets, research papers, or internal reports — the architecture transfers directly.

The source of Global Crisis Monitor is private, but the dashboard is live at global-crisis-monitor.com. You can follow the automated feed at @glob_crisis_mtr on X, or subscribe via RSS. If you’re building something similar or adapting this pipeline for another domain, I’d be curious to hear how it goes.