May 9, 2026 | Engineering

Diagrams as Mermaid, Scraping With Permission, and Triage That Knows Your Stack

Magenta filaments flowing forward through three luminous neon gate-rings and converging into a single white-hot verdict point. Visual metaphor for the four-stop research pipeline.

Last week I pasted a URL into the #research channel of my Discord. Thirty seconds later, a verdict came back in the same channel:

Interesting. The article describes a memory system; my brain already has hybrid retrieval, typed knowledge graph, and MCP-driven CRUD. The novel angle is per-tool-call hook granularity, which is worth a closer look.

I read the article. Agreed with the verdict. Closed the tab. Total time invested: about eight minutes.

That short trip from paste to verdict goes through four stops, three of which I haven’t seen written about much. This post walks through them and shares what’s worth lifting.

The setup, briefly: I run a personal AI second brain on an NVIDIA DGX Spark at home. It indexes my notes, my project work, my decisions, my Discord conversations, and now my research reading. When I paste a URL, the system fires four steps in sequence:

  1. Permission check (ToS gate)
  2. Scrape
  3. Visual extraction
  4. Triage against what I already have

Each one does something specific. Let’s walk through.


Stop 1: The permission check

Most tooling that scrapes the web treats Terms of Service as an afterthought. Set a generic User-Agent, ignore robots.txt, hope nobody notices. That works until it doesn’t, and in 2026 with the EU AI Act in force and Cloudflare’s pay-per-crawl rolling out, “ignore the rules” is increasingly a strategic mistake as well as an ethical one.

My pipeline asks the question before anything else fires. Each domain gets one of three labels in a verdict cache: allow, forbid, or ambiguous. Labels are auto-derived on first encounter by:

  • Reading robots.txt
  • Scanning likely ToS paths (/terms, /legal, /tos)
  • Looking for explicit anti-scraping language (“automated access prohibited”, “no scraping permitted”)
  • Looking for explicit permission language (“scraping permitted”, “open data”)

If the host explicitly forbids it, the fetch refuses outright. If the host explicitly permits it, stealth mode fires (Cloudflare bypass, real-browser fingerprint). If it’s ambiguous, plain GET with a clear User-Agent.
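A minimal sketch of how a gate like this could derive a label and pick a fetch mode. It is an illustration under assumptions, not my pipeline's actual code: the `classify` and `fetch_mode` names, the user-agent string, and the choice to treat a robots.txt disallow as a forbid are all hypothetical. Only the three labels and the example phrases come from the description above.

```python
from enum import Enum
from urllib import robotparser


class Verdict(str, Enum):
    ALLOW = "allow"
    FORBID = "forbid"
    AMBIGUOUS = "ambiguous"


# Phrases taken from the post; a real list would be longer.
FORBID_PHRASES = ("automated access prohibited", "no scraping permitted")
ALLOW_PHRASES = ("scraping permitted", "open data")
USER_AGENT = "research-reader/0.1"  # hypothetical user-agent string


def classify(robots_txt: str, tos_text: str, url: str) -> Verdict:
    """Derive a label from already-downloaded robots.txt and ToS text."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    text = tos_text.lower()

    # Forbid is checked first on purpose: "no scraping permitted"
    # contains the substring "scraping permitted".
    if not parser.can_fetch(USER_AGENT, url):
        return Verdict.FORBID  # assumption: a robots.txt disallow means forbid
    if any(phrase in text for phrase in FORBID_PHRASES):
        return Verdict.FORBID
    if any(phrase in text for phrase in ALLOW_PHRASES):
        return Verdict.ALLOW
    return Verdict.AMBIGUOUS


def fetch_mode(verdict: Verdict) -> str:
    """Map a label to the dispatch decision described above."""
    if verdict is Verdict.FORBID:
        raise PermissionError("host forbids automated access")
    return "stealth" if verdict is Verdict.ALLOW else "plain_get"
```

The ordering of the phrase checks is the one design choice worth copying: naive substring matching would otherwise read a prohibition as a permission.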

There’s also a manual override. The first time I pasted a Wikipedia article, the heuristic refused because Wikipedia’s robots.txt carries Disallow rules for generic user-agents (User-agent: *). Wikipedia is not telling me, the human, that I cannot read articles. It is telling search-engine crawlers and AI training scrapers to back off. So I added a manual whitelist for wikipedia.org and a couple of technical blogs I read regularly. Each whitelist entry carries a note (“explicit-paste reads only, low-volume”), so the audit trail shows the choice was deliberate.
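One way an override with a note could sit in front of the cached label, sketched under assumptions: the `WHITELIST` structure and `resolve` function are hypothetical names, and mapping an override to the "allow" label is my illustration here rather than a statement of what the real override does.

```python
# Hypothetical override table: domain -> (label to apply, human-written note).
WHITELIST = {
    "wikipedia.org": ("allow", "explicit-paste reads only, low-volume"),
}


def resolve(host: str, cached_label: str) -> tuple[str, str]:
    """Return (label, reason); a manual override wins over the cached label."""
    host = host.lower().rstrip(".")
    for domain, (label, note) in WHITELIST.items():
        # Match the domain itself and its subdomains, but not lookalikes
        # such as "notwikipedia.org".
        if host == domain or host.endswith("." + domain):
            return label, f"manual override: {note}"
    return cached_label, "auto-derived"
```

Returning the reason next to the label means every fetch can log why it was allowed, which is the audit trail in practice.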

Why this matters: by the end of 2026, “we scraped everything we could grab” is going to age badly. The gate doesn’t slow me down because I’m not crawling at scale, but it makes my scraping legible to anyone who later asks “what did your tools do?” And legibility compounds with time, the same way good logging does.

If you’re building a research pipeline, build the gate first. Retrofitting it later means rewriting the dispatch logic and back-filling the verdict cache. Costly when you could have spent thirty minutes upfront.

Stop 2: Scrapling

Once the gate clears, the actual fetching happens via Scrapling. I came across it last week, replaced my home-rolled urllib + BeautifulSoup setup with it the same day, and now I’d recommend it to anyone scraping for personal research.

Two things make it worth the recommendation:

Stealth fetcher. This isn’t about evading detection for shady reasons. It’s about getting the page a real browser would get. For research articles on Medium, Towards Data Science, or any modern Substack-flavoured platform, the difference between a stealth fetch and a plain HTTP request is often “full article content” versus “first paragraph and a sign-up modal”. The pages a stealth fetch returns look exactly as a reader in a real browser would see them, because they ARE rendered by one.

Clean structured markdown out of the box. Scrapling returns markdown by default, not raw HTML I have to clean up. Headers, links, images, lists, all preserved with their semantics intact. That’s exactly what I want feeding into the next stage.

There’s an MCP server flavour too, which means any MCP-aware client can call it directly without bespoke wrapper code. I run it as a Docker sidecar next to my brain and it just works.

I’d seen plenty of “I built a scraper with Playwright” posts. Scrapling does the assembly for you and exposes the right knobs. Worth the look. Sharing the wealth.

Stop 3: The vision model walks every image tag

This is the stop I’m most excited about, and the one I think other research-pipeline builders will find most useful to lift.

Once Scrapling returns the markdown, the page has a bunch of ![alt](url) references scattered through the prose. Most pipelines either ignore them, caption them with a one-line description after the fact, or skip everything that isn’t text. All three are lossy.

Mine does something different. A vision model (Qwen3-VL-30B running locally on the Spark) gets each image, plus its alt text for context, and produces one of two things based on what the image actually is.

For screenshots, photos, illustrations: a single concrete sentence describing what’s shown.

Image: Chat interface, light grey, blue user bubble “hi” with reply “Hi tomasonjo, how can I help today?”.

For diagrams (flowcharts, sequence diagrams, ER diagrams, state machines, mind maps): a tight caption AND a full reconstruction of the diagram as mermaid code.

**Image:** Unified agentic memory architecture across coding tools.

```mermaid
flowchart TD
    claudeCode["Claude Code"] -->|"settings.json"| sharedHooks["Shared Python Hooks<br/>log_event.py | inject_memory.py"]
    codex["Codex"] -->|"hooks.json"| sharedHooks
    cursor["Cursor"] -->|"hooks.json"| sharedHooks
    sharedHooks -->|"log events"| neo4j[("Neo4j<br/>Sessions | Events | Memories")]
    neo4j -->|"Events to Memories"| dreamPhase["Dream Phase<br/>(offline)"]

```

The description gets injected back into the markdown at the image’s actual position, so when I (or any agent) reads the scraped article later, the prose flows naturally with image-content-as-text right where the image was.
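The injection step can be sketched as a regex substitution over the markdown. This is a minimal illustration, not the pipeline's code: `Description`, `render`, and `inject_descriptions` are hypothetical names, and `describe` stands in for the call to the vision model, which is not shown.

```python
import re
from typing import Callable, NamedTuple, Optional

FENCE = "`" * 3  # built at runtime so this example nests cleanly in markdown

# Matches ![alt](url) and ![alt](url "title").
IMAGE_TAG = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<url>[^)\s]+)[^)]*\)")


class Description(NamedTuple):
    caption: str
    mermaid: Optional[str] = None  # set only when the image is a diagram


def render(desc: Description) -> str:
    """Turn a description into the text that replaces the image tag."""
    if desc.mermaid is None:
        return f"Image: {desc.caption}"
    body = desc.mermaid.strip()
    return f"Image: {desc.caption}\n\n{FENCE}mermaid\n{body}\n{FENCE}"


def inject_descriptions(
    markdown: str, describe: Callable[[str, str], Description]
) -> str:
    """Replace every image tag, in place, with describe(url, alt) as text."""
    def swap(match: re.Match) -> str:
        return render(describe(match["url"], match["alt"]))

    return IMAGE_TAG.sub(swap, markdown)
```

Because the substitution happens at the tag's own position, the surrounding prose never needs to know an image was there.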

Why this is a meaningful step up:

  • Searchable. Mermaid code is text. “Find me articles where the architecture diagram includes Neo4j” becomes a real query, not “manually click through every image”.
  • Renderable. Mermaid renders in Obsidian, GitHub, my dashboard, any markdown-capable surface. When I view the saved scrape later, the diagram still appears visually.
  • Lightweight. A mermaid block for an architecture diagram is maybe 400 bytes. The original PNG might be 200 KB. Fifty of these per article add up.
  • Re-indexable. My brain’s vector store indexes the mermaid block as text. It surfaces alongside the prose during retrieval.
  • Editable. If I want to use the same structure in my own writing, I’m one copy-paste away from a working starting point.
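To make the "searchable" point concrete, here is a small sketch that answers "which saved articles have a diagram mentioning Neo4j?" with plain substring matching over mermaid blocks. It is a stand-in for illustration, not the vector-store retrieval the brain actually uses; the function names and the dict-of-articles input are assumptions.

```python
import re

FENCE = "`" * 3  # built at runtime so this example nests cleanly in markdown
MERMAID_BLOCK = re.compile(FENCE + r"mermaid\n(.*?)" + FENCE, re.DOTALL)


def mermaid_blocks(markdown: str) -> list[str]:
    """Return the body of every fenced mermaid block in a document."""
    return [block.strip() for block in MERMAID_BLOCK.findall(markdown)]


def articles_with_diagram_term(articles: dict[str, str], term: str) -> list[str]:
    """Names of articles whose diagrams (not prose) mention the term."""
    needle = term.lower()
    return sorted(
        name
        for name, markdown in articles.items()
        if any(needle in block.lower() for block in mermaid_blocks(markdown))
    )
```

Restricting the match to diagram bodies is what separates "Neo4j is in the architecture" from "Neo4j is mentioned in passing".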

I think this matters most for technical paper researchers. Papers are often dense with diagrams: system architectures, ablation flows, hyperparameter trees, data pipelines, training loops. Today those diagrams are dead ends as far as your tooling is concerned. Diagrams locked inside PDFs are not searchable, are not summarisable in any useful way, and the only way to reference them is to screenshot them back into your notes. With mermaid extraction, every diagram in every paper becomes a queryable, renderable artefact in your knowledge base.

A small example. The article that prompted this post had six diagrams. The vision model reconstructed all six as mermaid. The smallest was 80 bytes. The largest was 600 bytes. The original PNGs totalled around 1.4 MB. The mermaid versions total 2.3 KB, roughly 600 times smaller, and they survive a re-index, a vault sync, a git commit, and a cold restart of the brain.

The vision model is not perfect. It occasionally invents an edge that wasn’t there, or fluffs a label by one word. But for the diagrams I’ve fed it, the structural reconstruction has been right four times out of five, and the failures are obvious enough to spot at a glance. I’ll take that.

Stop 4: Triage against what I already have

The final stage is where the agent decides whether the article is worth my time.

Most “is this relevant?” agents fail in the same direction: they’re eager to please. Feed them an article about agentic memory, and they tell you “highly relevant, you should implement this!”. The problem is they don’t know what you’ve already built. So an article describing a memory system is, to them, addressing a gap, even if your second brain has already shipped something more comprehensive.

I hit this exactly. The pipeline’s first triage of the article that inspired this post said:

Verdict: highly_relevant
Reason: Polaris Brain currently lacks an externalised memory layer.

Which is plainly false. I’d already built one. The agent didn’t know.

The fix was simple in hindsight: bake the brain’s existing capabilities into the triage system prompt, explicitly. So now the agent sees a snapshot like:

POLARIS BRAIN CAPABILITIES SNAPSHOT:
- Hybrid retrieval pipeline: FTS + vector embeddings + cross-encoder reranker, IDF-weighted summaries, file-type prefix boosts
- Session memory: per-turn trickle capture, session_summariser every 15 minutes, daily reflection, weekly review
- Typed knowledge graph in SQLite: entity + edge extraction (works_at, advises, founded, depends_on)
- Multi-tier LLM routing with cost-aware fallback chains
- Vision-aware retrieval via local Qwen3-VL
- [the full list, 12 lines]

Default to noise OR interesting unless the article shows something this snapshot does NOT include.

And the verdict guidance was tightened: only flag something as highly_relevant if it shows a specific technique you don’t have AND it would meaningfully improve a known weak metric or open a capability you don’t yet support.
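Assembling a grounded system prompt like this takes very little code. A minimal sketch, assuming the capabilities live in a plain list: `build_triage_prompt` is a hypothetical helper, and the final instruction asking for a two-line reply is my assumption about the output format, not something quoted from the real prompt.

```python
VERDICTS = ("noise", "interesting", "highly_relevant")


def build_triage_prompt(capabilities: list[str]) -> str:
    """Render the capability snapshot plus the verdict guidance as one prompt."""
    lines = ["POLARIS BRAIN CAPABILITIES SNAPSHOT:"]
    lines += [f"- {capability}" for capability in capabilities]
    lines += [
        "",
        "Default to noise OR interesting unless the article shows something "
        "this snapshot does NOT include.",
        "Only answer highly_relevant if the article shows a specific technique "
        "the snapshot lacks AND it would meaningfully improve a known weak "
        "metric or open a capability not yet supported.",
        # Assumed output contract, so the reply is easy to parse.
        f"Reply with 'Verdict: <{' | '.join(VERDICTS)}>' then 'Reason: <one paragraph>'.",
    ]
    return "\n".join(lines)
```

Keeping the capabilities in a list rather than a hand-edited prompt string makes it easier to update the snapshot whenever the stack changes.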

Re-running the same article through the upgraded triage produced:

Verdict: interesting
Reason: SQLite-based graph already handles graph queries; MCP-driven CRUD already provides cross-harness portability. The novel angle is per-tool-call hook granularity (not per-session) which Polaris doesn’t have today. Worth a closer look but not blocking.

That’s a useful verdict. Concrete about what’s redundant, concrete about what’s actually novel, modest in its claim. One paragraph long. I can read it, agree or disagree, move on.
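A verdict in this two-line shape is also easy to validate before it goes anywhere. The sketch below assumes the reply always contains a "Verdict:" line followed by a "Reason:" line, and that the three labels seen in this post are the full set; `Triage` and `parse_triage` are hypothetical names.

```python
import re
from typing import NamedTuple

ALLOWED_VERDICTS = {"noise", "interesting", "highly_relevant"}


class Triage(NamedTuple):
    verdict: str
    reason: str


def parse_triage(reply: str) -> Triage:
    """Extract and validate the verdict; reject anything off-script."""
    match = re.search(r"Verdict:\s*(\w+)\s*\n\s*Reason:\s*(.+)", reply, re.DOTALL)
    if match is None:
        raise ValueError("reply is missing a Verdict/Reason pair")
    verdict = match.group(1).lower()
    if verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unknown verdict: {verdict}")
    # Collapse line breaks so the reason stays one paragraph.
    reason = " ".join(match.group(2).split())
    return Triage(verdict, reason)
```

Rejecting unknown labels keeps an over-enthusiastic model from smuggling in a verdict like "must_read" that nothing downstream knows how to handle.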

The lesson generalises beyond research triage: any agent making a “should we do X” recommendation needs to be grounded in what you’ve already done. The agent’s prompt is its memory of your stack. If that prompt doesn’t enumerate your capabilities, the agent will assume you have nothing and recommend everything. That’s where AI hype-noise comes from.

The fix isn’t elaborate prompting tricks. It’s just being explicit. Write the snapshot. Include it in the prompt. Skew the default toward “noise” not “novel”. The result is honest signal in a sea of recommendations.

Closing the loop

The thing that makes this pipeline feel useful rather than just clever is that the verdict comes back to the same Discord channel I pasted the URL into. I paste, I get a 30-second pause while the system fetches, summarises, describes images, triages. Then a message appears in the same place I posted the link. Verdict, reasoning, link to the full triage report, link to any opened ticket.
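The reply itself is just a short message with four parts. A sketch of how it could be assembled, assuming plain-text formatting and a hypothetical `format_reply` helper; actually sending it through a Discord bot or webhook is left out. The one hard constraint I'm relying on is Discord's 2,000-character limit on message content, so the reasoning is trimmed rather than the links.

```python
from typing import Optional

DISCORD_MESSAGE_LIMIT = 2000  # maximum characters in a Discord message


def format_reply(
    verdict: str,
    reasoning: str,
    report_url: str,
    ticket_url: Optional[str] = None,
) -> str:
    """Build the channel reply: verdict, reasoning, report link, ticket link."""
    head = f"Verdict: {verdict}"
    tail = [f"Full triage report: {report_url}"]
    if ticket_url:
        tail.append(f"Ticket: {ticket_url}")

    # Characters used by everything except the reasoning, newlines included.
    fixed = len(head) + sum(len(line) for line in tail) + len(tail) + 1
    budget = DISCORD_MESSAGE_LIMIT - fixed
    if len(reasoning) > budget:
        reasoning = reasoning[: max(budget - 3, 0)] + "..."

    return "\n".join([head, reasoning, *tail])
```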

Most research tools silo input from output. Paste here, check the dashboard there, remember to read your weekly digest. Mine closes the loop in the surface I’m already in. Small thing. Big behavioural difference.

What’s worth lifting

If you’re building a research pipeline of your own, four things from this post are worth taking:

  1. Build the ToS gate first. Retrofitting it is harder than building it right upfront, and the legibility compounds.
  2. Use Scrapling (or equivalent) for fetching. Don’t roll your own when an assembled tool does it better and cheaper.
  3. Reconstruct diagrams as mermaid during scrape, inline with the prose. Especially if you read technical content. The library compounds.
  4. Ground your triage in what you already have. Eager-to-please agents are noise generators. The capability snapshot in the prompt fixes them.

The first three are about input quality. The fourth is about signal quality. Both matter.


The pipeline runs in production against my own research feed: Discord paste, ToS gate, Scrapling fetch, Qwen3-VL-30B vision describe with mermaid reconstruction, then capability-grounded triage against the brain’s existing stack. Hardware: single NVIDIA DGX Spark.