Skill

somnia-agents-llm-parse-website

Deep-dive reference for the LLM Parse Website agent on Somnia — search a domain (or directly scrape a URL) with a real browser and extract structured data using the on-chain LLM. Covers ExtractString and ExtractANumber, search-vs-direct mode, multi-page extraction, and the auto-injected reasoning / confidence fields. Use when the data lives on a webpage rather than a JSON API — sports scores from news sites, awards results, e-commerce prices, content not exposed via an API.

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/somnia-agents-skills:somnia-agents-llm-parse-website


SKILL.md

349 lines · ~3.6k tokens

Stats

Language: TypeScript

Stars: 0

Maintenance: Good

Last Commit: May 8, 2026


LLM Parse Website Agent

The LLM Parse Website agent (llm-parse-website) bridges the gap between Somnia smart contracts and the open web. It either searches a domain or directly scrapes a URL using a real headless browser, converts the page to markdown, and feeds it to the on-chain LLM with a structured-output schema. Use it when the data you need is on a webpage, not behind a JSON API.

Read the master somnia-agents skill first for the request lifecycle, gas model, and callback pattern. This document only covers the agent-specific ABI and quirks.

Identity

| Field | Value |
| --- | --- |
| `agentId` | `12875401142070969085` |
| Per-agent price | `0.10` whole tokens (SOMI on Mainnet, STT on Testnet); the most expensive base agent (LLM + browser session) |
| Default consensus | Majority, because page contents change slowly and the LLM is deterministic |
| Source of truth | `references/agents.json` |

Methods

| Function | Output type | Use case |
| --- | --- | --- |
| `ExtractString(key, description, options, prompt, url, resolveUrl, numPages)` | `string` | Best Picture winner, team name, news headline classification |
| `ExtractANumber(key, description, min, max, prompt, url, resolveUrl, numPages)` | `uint256` | Sports score, count of items on a page, version number |

There are two output flavors today. Both share the same input semantics; only `options` (string) and `min` / `max` (number) differ.

Parameters in detail

| Input | Type | Notes |
| --- | --- | --- |
| `key` | `string` | Field name the LLM sees in the schema (`"best_drama"`, `"senegal_goals"`). Snake_case recommended. |
| `description` | `string` | Field description for the LLM. Explain exactly what value you want. Treated as part of the schema, not the prompt. |
| `options` (string only) | `string[]` | If non-empty, the model must pick one of these. Empty array = unconstrained. |
| `min`, `max` (number only) | `uint256` | Bounds on the extracted number. Set both to `0` to disable. Negative extracted values are clamped to 0 because the output is `uint256`. |
| `prompt` | `string` | Natural-language extraction prompt. Also used as the search query when `resolveUrl = true`, so make it search-engine-friendly. |
| `url` | `string` | Either a base domain (`"goldenglobes.com"`) for search mode, or a direct URL (`"https://example.com/page"`) for direct mode. |
| `resolveUrl` | `bool` | `true` → run a search (uses `prompt` as the query, `url` as a domain filter). `false` → directly scrape `url` (and only `url`). |
| `numPages` | `uint8` | Max pages to fetch in search mode. When `resolveUrl = false`, capped at 1. Higher = more context, more cost. 3 is a sane default. |
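
The constraints in the table can be checked client-side before a deposit is spent. The following is a minimal sketch, not part of the agent: the `ExtractParams` type and `validateParams` helper are hypothetical names, and it assumes a "full URL" always starts with `http://` or `https://`.

```typescript
// Hypothetical client-side mirror of the shared inputs; not an official type.
interface ExtractParams {
  key: string;
  description: string;
  prompt: string;
  url: string;
  resolveUrl: boolean;
  numPages: number;
}

// Returns a list of problems; an empty list means the inputs look sane.
function validateParams(p: ExtractParams): string[] {
  const problems: string[] = [];
  if (p.key.trim() === "") problems.push("key must not be empty");
  if (p.description.trim() === "") problems.push("description must not be empty");
  // numPages is a uint8 on-chain, so it must be an integer in 0..255.
  const isInteger = Math.floor(p.numPages) === p.numPages;
  if (!isInteger || p.numPages < 0 || p.numPages > 255) {
    problems.push("numPages must fit in uint8");
  }
  // Direct mode fetches exactly one page, and url must be the full URL.
  if (!p.resolveUrl) {
    if (!/^https?:\/\//i.test(p.url)) problems.push("direct mode needs a full URL");
    if (p.numPages > 1) problems.push("numPages is capped at 1 in direct mode");
  }
  return problems;
}
```

A base domain such as `"goldenglobes.com"` with `resolveUrl = false` and `numPages = 3` would be flagged twice by this sketch: once for the missing scheme and once for the page cap.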

Search mode (`resolveUrl = true`)

The agent runs a search query like <prompt> site:<url>, fetches up to numPages results, converts each to markdown, and concatenates them as context for the LLM. Use this when you don't know the exact URL — e.g. "the Wikipedia page for the 2026 Africa Cup of Nations final".

ExtractANumber({
  key:          "senegal_goals",
  description:  "Number of goals scored by Senegal in the 18/1/26 AFCON final vs Morocco.",
  min:          0,
  max:          0,                       // bounds disabled
  prompt:       "Africa Cup of Nations final score Senegal Morocco 18 January 2026",
  url:          "espn.com",              // domain filter
  resolveUrl:   true,
  numPages:     3
});
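
To preview what the search step is likely to send, the query can be assembled locally. This sketch assumes the query has exactly the `<prompt> site:<url>` shape described above; `buildSearchQuery` is a hypothetical helper, and stripping a scheme or path from the domain is a convenience added here, not documented agent behavior.

```typescript
// Assumed query shape, taken from the "<prompt> site:<url>" description above.
function buildSearchQuery(prompt: string, domain: string): string {
  // Strip a scheme and any path so only the bare host is used as the filter.
  const host = domain.replace(/^https?:\/\//i, "").split("/")[0];
  return `${prompt.trim()} site:${host}`;
}

const query = buildSearchQuery(
  "Africa Cup of Nations final score Senegal Morocco 18 January 2026",
  "espn.com"
);
console.log(query);
// Africa Cup of Nations final score Senegal Morocco 18 January 2026 site:espn.com
```

Pasting the resulting string into a search engine is a quick way to judge whether the prompt is search-engine-friendly before paying for a request.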

Direct mode (`resolveUrl = false`)

The agent fetches exactly one URL, and the full URL must be in url. Use this when you already know the exact source page and don't want search-engine variance affecting consensus. numPages is capped at 1.

ExtractString({
  key:          "best_drama",
  description:  "Title of the film that won Best Motion Picture - Drama at the 2026 Golden Globes.",
  options:      new string[](0),
  prompt:       "Best Motion Picture - Drama winner at the 2026 Golden Globes",
  url:          "https://www.goldenglobes.com/winners/2026",
  resolveUrl:   false,
  numPages:     1
});

Recommendation for production: prefer direct mode with a stable URL. Search-mode results can drift over time, hurting consensus stability and reproducibility.

Solidity recipe

// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

import {
    IAgentRequester,
    IAgentRequesterHandler,
    Response,
    Request,
    ResponseStatus
} from "./IAgentRequester.sol";

interface IParseWebsiteAgent {
    function ExtractString(
        string  memory key,
        string  memory description,
        string[] calldata options,
        string  memory prompt,
        string  memory url,
        bool   resolveUrl,
        uint8  numPages
    ) external returns (string memory);

    function ExtractANumber(
        string  memory key,
        string  memory description,
        uint256 min,
        uint256 max,
        string  memory prompt,
        string  memory url,
        bool   resolveUrl,
        uint8  numPages
    ) external returns (uint256);
}

contract WebExtractor is IAgentRequesterHandler {
    IAgentRequester public immutable platform;
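    // Agent ID of llm-parse-website (see the Identity table).
    uint256 public constant AGENT_ID = 12875401142070969085;
    // Subcommittee size assumed when computing the deposit.
    uint256 public constant SUBCOMMITTEE_SIZE = 3;
    // 0.10 whole tokens per agent, expressed in wei.
    uint256 public constant PRICE_PER_AGENT = 0.10 ether;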

    string public latestString;
    uint256 public latestNumber;
    mapping(uint256 => bool) public pendingString;
    mapping(uint256 => bool) public pendingNumber;

    constructor(address platform_) { platform = IAgentRequester(platform_); }

    function getBestDrama() external payable returns (uint256 requestId) {
        string[] memory options = new string[](0);
        bytes memory payload = abi.encodeWithSelector(
            IParseWebsiteAgent.ExtractString.selector,
            "best_drama",
            "Title of the film that won Best Motion Picture - Drama.",
            options,
            "Best Picture winners at the 2026 Golden Globe Awards",
            "goldenglobes.com",
            true,
            uint8(3)
        );
        requestId = _create(payload);
        pendingString[requestId] = true;
    }

    function getSenegalGoals() external payable returns (uint256 requestId) {
        bytes memory payload = abi.encodeWithSelector(
            IParseWebsiteAgent.ExtractANumber.selector,
            "senegal_goals",
            "Number of goals scored by Senegal in the 18/1/26 AFCON final vs Morocco.",
            uint256(0), uint256(0),
            "Africa Cup of Nations final score Senegal Morocco 18 January 2026",
            "espn.com",
            true,
            uint8(3)
        );
        requestId = _create(payload);
        pendingNumber[requestId] = true;
    }

    function _create(bytes memory payload) internal returns (uint256) {
        // Platform floor plus the per-agent price for the whole subcommittee.
        uint256 deposit = platform.getRequestDeposit() + PRICE_PER_AGENT * SUBCOMMITTEE_SIZE;
        require(msg.value >= deposit, "Underfunded");
        // Only `deposit` is forwarded; any excess msg.value stays in this contract.
        return platform.createRequest{value: deposit}(
            AGENT_ID, address(this), this.handleResponse.selector, payload
        );
    }

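    function handleResponse(
        uint256 requestId,
        Response[] memory responses,
        ResponseStatus status,
        Request memory /* details */
    ) external override {
        // Only the platform contract may deliver results.
        require(msg.sender == address(platform), "Only platform");

        // Read and clear the pending flags up front so that failed
        // requests do not leave stale entries behind.
        bool isString = pendingString[requestId];
        bool isNumber = pendingNumber[requestId];
        delete pendingString[requestId];
        delete pendingNumber[requestId];

        if (status != ResponseStatus.Success || responses.length == 0) return;

        // Decode according to the method the request was created with.
        if (isString) {
            latestString = abi.decode(responses[0].result, (string));
        } else if (isNumber) {
            latestNumber = abi.decode(responses[0].result, (uint256));
        }
    }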

    receive() external payable {}
}
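
The deposit arithmetic in `_create` can be reproduced off-chain to decide how much value to attach to the transaction. This is a minimal sketch under stated assumptions: `requiredDeposit` is a hypothetical helper, the floor passed in stands for whatever `platform.getRequestDeposit()` returns, and the 0.05-token floor below is an invented number used only to make the arithmetic concrete.

```typescript
// Values mirrored from the Solidity recipe above.
const WEI_PER_TOKEN = 1000000000000000000n;    // 1e18 wei per whole token
const PRICE_PER_AGENT = WEI_PER_TOKEN / 10n;   // 0.10 tokens, same as `0.10 ether`
const SUBCOMMITTEE_SIZE = 3n;

// floorDeposit stands in for the value returned by platform.getRequestDeposit().
function requiredDeposit(floorDeposit: bigint): bigint {
  return floorDeposit + PRICE_PER_AGENT * SUBCOMMITTEE_SIZE;
}

// Hypothetical floor of 0.05 tokens, chosen only for illustration.
const exampleFloor = WEI_PER_TOKEN / 20n;
console.log(requiredDeposit(exampleFloor)); // 350000000000000000n
```

With a subcommittee of 3, the agent fees alone come to 0.30 tokens, so sending only the floor deposit leaves nothing to pay the agents.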

TypeScript encoding

import { encodeFunctionData, parseAbi } from 'viem';

const abi = parseAbi([
  'function ExtractString(string key, string description, string[] options, string prompt, string url, bool resolveUrl, uint8 numPages) returns (string)',
  'function ExtractANumber(string key, string description, uint256 min, uint256 max, string prompt, string url, bool resolveUrl, uint8 numPages) returns (uint256)',
]);

const stringPayload = encodeFunctionData({
  abi,
  functionName: 'ExtractString',
  args: [
    'best_drama',
    'Title of the film that won Best Motion Picture - Drama.',
    [],
    'Best Picture winners at the 2026 Golden Globe Awards',
    'goldenglobes.com',
    true,
    3,
  ],
});

const numberPayload = encodeFunctionData({
  abi,
  functionName: 'ExtractANumber',
  args: [
    'senegal_goals',
    'Number of goals scored by Senegal in the 18/1/26 AFCON final vs Morocco.',
    0n, 0n,
    'Africa Cup of Nations final score Senegal Morocco 18 January 2026',
    'espn.com',
    true,
    3,
  ],
});

Auto-injected schema fields

The agent internally builds a JSON schema for the LLM and adds three auxiliary fields that don't appear in the ABI output but are logged in the receipt:

| Field | Type | Meaning |
| --- | --- | --- |
| `reasoning` | string | LLM's chain-of-thought for the extraction |
| `answerable` | bool | Whether the model believes the page actually contains the answer |
| `confidence_score` | int (0–100) | Model's self-reported confidence |

You can't read these on-chain, but they are invaluable for debugging via the receipt API. The agent also runs a map-reduce extraction across multiple chunks and pages, then merges the per-chunk results by highest confidence. If answerable = false is the consensus across validators, the request typically returns an empty or zero output (or Failed if the model refuses outright).
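
The merge-by-highest-confidence step can be pictured as a simple reduce over per-chunk results. The sketch below is an illustration only: the agent's real merge logic is internal, `ChunkResult` and `mergeByConfidence` are hypothetical names, and skipping chunks with `answerable = false` is an assumption.

```typescript
// Hypothetical shape of one per-chunk extraction, using the auxiliary field names above.
interface ChunkResult {
  value: string;
  reasoning: string;
  answerable: boolean;
  confidence_score: number; // 0-100
}

// Reduce step: keep the answerable result with the highest confidence.
// Returns null when no chunk claims to contain the answer.
function mergeByConfidence(results: ChunkResult[]): ChunkResult | null {
  let best: ChunkResult | null = null;
  for (const r of results) {
    if (!r.answerable) continue; // assumption: unanswerable chunks never win
    if (best === null || r.confidence_score > best.confidence_score) best = r;
  }
  return best;
}
```

When inspecting a receipt, comparing each chunk's `confidence_score` in this way helps explain why a particular page's answer was selected.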

Use case patterns

Awards / events with stable archive pages

ExtractString({
  key:         "world_cup_winner",
  description: "Country that won the 2026 FIFA World Cup.",
  options:     new string[](0),            // unconstrained
  prompt:      "FIFA World Cup 2026 winner final result",
  url:         "https://en.wikipedia.org/wiki/2026_FIFA_World_Cup",
  resolveUrl:  false,
  numPages:    1
});

Direct mode + Wikipedia is the most reliable combination for one-shot retrieval — the page is stable, well-indexed, and reachable.

Prediction-market-style binary outcome

ExtractString({
  key:         "outcome",
  description: "Did Senegal win the 18/1/26 AFCON final against Morocco?",
  options:     ["yes", "no"],              // pseudocode for a two-element string[]
  prompt:      "AFCON 2026 final Senegal Morocco result",
  url:         "espn.com",                 // domain filter
  resolveUrl:  true,
  numPages:    3
});

Constrained options + multi-page search → robust against any single source being missing or inconsistent.
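
On the consuming side, the constrained string still has to be mapped to a boolean. The following is a minimal sketch with a hypothetical helper, `parseBinaryOutcome`; it treats anything other than the two configured options (for example an empty result when no answer was found) as unresolved rather than guessing.

```typescript
// Maps the decoded string result back to a boolean for a "yes"/"no" market.
// Returns null for anything else, e.g. an empty string when no answer was found.
function parseBinaryOutcome(result: string): boolean | null {
  const normalized = result.trim().toLowerCase();
  if (normalized === "yes") return true;
  if (normalized === "no") return false;
  return null;
}
```

Keeping a distinct unresolved state avoids settling a market as "no" merely because the extraction came back empty.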

Numeric extraction with bounds

ExtractANumber({
  key:         "marathon_time_minutes",
  description: "Winning time in minutes (rounded down) of the 2026 Boston Marathon men's race.",
  min:         100,
  max:         300,
  prompt:      "2026 Boston Marathon men's winner finish time",
  url:         "boston.com",               // domain filter
  resolveUrl:  true,
  numPages:    3
});

Bounds prevent obvious mis-parses: here max = 300 filters out cases where the model picks a year such as 2026, and min = 100 filters out a time reported in hours instead of minutes.
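
The bounds semantics can be sketched as a small predicate. This is an illustration under assumptions: `withinBounds` is a hypothetical helper, the bounds are treated as inclusive, and what the agent does with an out-of-range value is not specified here.

```typescript
// Interprets min/max as in the parameter table: setting both to 0 disables the check.
// Assumption: bounds are inclusive on both ends.
function withinBounds(value: number, min: number, max: number): boolean {
  if (min === 0 && max === 0) return true; // bounds disabled
  return value >= min && value <= max;
}

console.log(withinBounds(125, 100, 300));  // true: a plausible time in minutes
console.log(withinBounds(2026, 100, 300)); // false: a year is rejected by max
console.log(withinBounds(2, 100, 300));    // false: hours instead of minutes
console.log(withinBounds(2026, 0, 0));     // true: bounds disabled
```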

Pitfalls specific to llm-parse-website

Search variance breaks consensus

Search engines rank results based on time, geography, and load. Two validators searching simultaneously can see different result sets. The agent has retry / replacement logic to mitigate this, but for production-critical extractions prefer direct mode with a stable URL — Wikipedia, official archives, the source's own permanent page.
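
To see why divergent result sets hurt, consider a toy model of Majority consensus. The sketch assumes a strict-majority rule over the validators' decoded outputs; the real rule is defined by the platform (see the master somnia-agents skill), and `majority` is a hypothetical helper.

```typescript
// Returns the value reported by a strict majority of validators, or null if none exists.
function majority(values: string[]): string | null {
  const counts: { [value: string]: number } = {};
  for (const v of values) counts[v] = (counts[v] || 0) + 1;
  for (const v of Object.keys(counts)) {
    if (counts[v] * 2 > values.length) return v;
  }
  return null;
}

console.log(majority(["2", "2", "1"])); // 2
console.log(majority(["2", "1", "0"])); // null
```

In the second call each validator read a different page and extracted a different number, so no value reaches a majority. A stable direct-mode URL makes the first case far more likely.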

`numPages` cost / quality trade-off

More pages = more chunks = more LLM calls = higher executionCost. The agent does map-reduce extraction across chunks and short-circuits at 90% confidence — so 3 pages is rarely much worse than 5, and often better than 1.
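
The cost effect of the short-circuit can be illustrated with a loop that counts LLM calls. This is a sketch, not the agent's implementation: it assumes chunks are processed sequentially and that processing stops at the first answer whose confidence reaches the threshold. `ChunkAnswer`, `ExtractFn`, and `extractWithShortCircuit` are hypothetical names.

```typescript
interface ChunkAnswer {
  value: string;
  confidence_score: number; // 0-100
}

// Hypothetical stand-in for one LLM call over one chunk of page markdown.
type ExtractFn = (chunk: string) => ChunkAnswer;

function extractWithShortCircuit(
  chunks: string[],
  extract: ExtractFn,
  threshold: number
): { best: ChunkAnswer | null; llmCalls: number } {
  let best: ChunkAnswer | null = null;
  let llmCalls = 0;
  for (const chunk of chunks) {
    const answer = extract(chunk);
    llmCalls++;
    // Track the highest-confidence answer seen so far.
    if (best === null || answer.confidence_score > best.confidence_score) best = answer;
    // Stop as soon as one answer is confident enough; later chunks cost nothing.
    if (answer.confidence_score >= threshold) break;
  }
  return { best, llmCalls };
}
```

Under these assumptions, extra pages add cost only when the earlier chunks fail to produce a confident answer, which is consistent with 3 pages rarely being much worse than 5.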

JavaScript-heavy SPAs

The agent uses a real browser and waits for content to render. Content that lazy-loads on scroll or appears only after a click won't be captured. If the data appears only after interaction, a JSON API Request against the underlying XHR endpoint is usually a better fit.

PDFs

The agent partitions URLs into HTML and PDF; PDFs are downloaded and parsed directly. Long PDFs get chunked aggressively, which can degrade extraction quality compared with the equivalent HTML.

Date / time-sensitive content

If the page changes between validator executions (live scoreboards, real-time prices), Majority consensus can fail. Wait for results to be stable before invoking, or use Threshold consensus and aggregate post-hoc.

`key` and `description` matter

The model's structured-output schema uses key as the JSON field name and description as the per-field doc. Vague keys ("x") and prompts that contradict the description produce worse extractions than concrete, narrow phrasing.
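
To make the role of `key` and `description` concrete, here is a hypothetical reconstruction of the kind of JSON schema the model might be given for `ExtractString`. The actual schema is internal to the agent: the exact layout, the use of `enum` for `options`, and the helper name `buildStringSchema` are all assumptions. Only the three auxiliary field names come from the section on auto-injected schema fields.

```typescript
interface FieldSchema {
  type: string;
  description?: string;
  enum?: string[];
  minimum?: number;
  maximum?: number;
}

// Hypothetical reconstruction; the real schema is built inside the agent.
function buildStringSchema(key: string, description: string, options: string[]) {
  // key becomes the JSON field name, description its per-field documentation.
  const field: FieldSchema = { type: "string", description };
  // Assumption: a non-empty options array is expressed as an enum constraint.
  if (options.length > 0) field.enum = options;

  const properties: { [name: string]: FieldSchema } = {
    reasoning: { type: "string" },
    answerable: { type: "boolean" },
    confidence_score: { type: "integer", minimum: 0, maximum: 100 },
  };
  properties[key] = field;
  return { type: "object", properties, required: Object.keys(properties) };
}

console.log(JSON.stringify(buildStringSchema("outcome", "Did Senegal win?", ["yes", "no"]), null, 2));
```

Seen this way, a key like `"x"` gives the model a field name with no meaning, and the `description` is the only place to state precisely which value is wanted.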

Negative numbers

ExtractANumber returns uint256. Pages reporting negative values (temperatures, deltas) get clamped to 0. Use the LLM Inference agent's inferNumber (which returns int256) when signed values matter.

Why is my response `Failed`?

In order of likelihood:

1. Search returned nothing useful: answerable = false across validators. Switch to direct mode or refine the prompt.
2. Page is gated (paywall, login, geo-block, anti-bot): the agent gets a useless intermediate response. Inspect the fetch step in the receipt.
3. Schema mismatch: the model couldn't satisfy the schema (e.g. options didn't include the actual answer). Loosen options or drop them.
4. Floor-only deposit: the runners skipped the request. Send getRequestDeposit() + 0.10 × subSize.

Cross-references

somnia-agents — request lifecycle, deposit math, callback pattern
somnia-agents-invoke — interactive CLI to fire one-off ExtractString / ExtractANumber calls without writing a contract
somnia-agents-json-fetch — when the data is in a JSON API instead of HTML
somnia-agents-llm-inference — when you want free-form inference rather than structured page extraction
