From somnia-agents-skills
Deep-dive reference for the LLM Parse Website agent on Somnia — search a domain (or directly scrape a URL) with a real browser and extract structured data using the on-chain LLM. Covers ExtractString and ExtractANumber, search-vs-direct mode, multi-page extraction, and the auto-injected reasoning / confidence fields. Use when the data lives on a webpage rather than a JSON API — sports scores from news sites, awards results, e-commerce prices, content not exposed via an API.
Slash command: `/somnia-agents-skills:somnia-agents-llm-parse-website`
The LLM Parse Website agent (`llm-parse-website`) bridges the gap between Somnia smart contracts and the open web. It either searches a domain or directly scrapes a URL using a real headless browser, converts the page to markdown, and feeds it to the on-chain LLM with a structured-output schema. Use it when the data you need is **on a webpage**, not behind a JSON API.
Read the master `somnia-agents` skill first for the request lifecycle, gas model, and callback pattern. This document only covers the agent-specific ABI and quirks.
| Field | Value |
|---|---|
| `agentId` | `12875401142070969085` |
| Per-agent price | 0.10 (whole tokens — SOMI on Mainnet, STT on Testnet) — most expensive base agent (LLM + browser session) |
| Default consensus | Majority — page contents change slowly and the LLM is deterministic |
| Source of truth | references/agents.json |
| Function | Output type | Use case |
|---|---|---|
| `ExtractString(key, description, options, prompt, url, resolveUrl, numPages)` | `string` | Best Picture winner, team name, news headline classification |
| `ExtractANumber(key, description, min, max, prompt, url, resolveUrl, numPages)` | `uint256` | Sports score, count of items on a page, version number |
Two output flavors today. Both share the same input semantics.
| Input | Type | Notes |
|---|---|---|
| `key` | `string` | Field name the LLM sees in the schema (`"best_drama"`, `"senegal_goals"`). Snake_case recommended. |
| `description` | `string` | Field description for the LLM: explain exactly what value you want. Treated as part of the schema, not the prompt. |
| `options` (string only) | `string[]` | If non-empty, the model must pick one of these. Empty array = unconstrained. |
| `min`, `max` (number only) | `uint256` | Bounds on the extracted number. Set both to 0 to disable. Negative values are clamped to 0 (output is `uint256`). |
| `prompt` | `string` | Natural-language extraction prompt. Also used as the search query when `resolveUrl = true`. Make it search-engine-friendly. |
| `url` | `string` | Either a base domain (`"goldenglobes.com"`) for search-mode, or a direct URL (`"https://example.com/page"`) for scrape-mode. |
| `resolveUrl` | `bool` | `true` → run a search (uses `prompt` as the query, `url` as a domain filter). `false` → directly scrape `url` (and only `url`). |
| `numPages` | `uint8` | Max pages to fetch in search-mode. When `resolveUrl = false`, capped at 1. Higher = more context, more cost. 3 is a sane default. |
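
The input rules in the table can be mirrored in a small client-side check before encoding a payload. This is a minimal sketch, not part of the agent's API: the names `ExtractCommonArgs` and `normalizeArgs` are hypothetical, and treating "full URL" as "starts with `http://` or `https://`" is an assumption.

```typescript
// Hypothetical client-side shape for the inputs shared by both functions.
interface ExtractCommonArgs {
  key: string;
  description: string;
  prompt: string;
  url: string;
  resolveUrl: boolean;
  numPages: number; // uint8 on-chain
}

// Applies the documented rules: numPages must fit in a uint8, and direct mode
// (resolveUrl = false) fetches at most one page.
function normalizeArgs(args: ExtractCommonArgs): ExtractCommonArgs {
  if (!Number.isInteger(args.numPages) || args.numPages < 0 || args.numPages > 255) {
    throw new RangeError("numPages must fit in a uint8 (0-255)");
  }
  // Assumption: a "full URL" is one that carries an http(s) scheme.
  if (!args.resolveUrl && !/^https?:\/\//.test(args.url)) {
    throw new Error("direct mode expects a full URL in `url`");
  }
  const numPages = args.resolveUrl ? args.numPages : Math.min(args.numPages, 1);
  return { ...args, numPages };
}
```

Running such a check off-chain catches a bare domain passed in direct mode before any deposit is spent.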
### Search mode (`resolveUrl = true`)

The agent runs a search query like `<prompt> site:<url>`, fetches up to `numPages` results, converts each to markdown, and concatenates them as context for the LLM. Use this when you don't know the exact URL, e.g. "the Wikipedia page for the 2026 Africa Cup of Nations final".
```solidity
ExtractANumber({
    key:         "senegal_goals",
    description: "Number of goals scored by Senegal in the 18/1/26 AFCON final vs Morocco.",
    min:         0,
    max:         0, // bounds disabled
    prompt:      "Africa Cup of Nations final score Senegal Morocco 18 January 2026",
    url:         "espn.com", // domain filter
    resolveUrl:  true,
    numPages:    3
});
```
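
To preview what a search-mode request is likely to look for, the `<prompt> site:<url>` shape can be reproduced locally. This is only a sketch: `buildSearchQuery` is a hypothetical helper, and stripping a scheme or path from the domain is an assumption, not documented agent behavior.

```typescript
// Assumed query shape, taken from the description: "<prompt> site:<url>".
function buildSearchQuery(prompt: string, domain: string): string {
  // Strip a scheme and any path so the site: filter receives a bare host.
  const host = domain.replace(/^https?:\/\//, "").split("/")[0];
  return `${prompt.trim()} site:${host}`;
}

console.log(
  buildSearchQuery("Africa Cup of Nations final score Senegal Morocco 18 January 2026", "espn.com"),
);
// prints "Africa Cup of Nations final score Senegal Morocco 18 January 2026 site:espn.com"
```

Pasting the resulting string into a search engine is a quick way to check whether the prompt is search-engine-friendly before paying for a request.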
### Direct mode (`resolveUrl = false`)

The agent fetches exactly one URL; the full URL must be in `url`. Use this when you control the source page and don't want search-engine variance affecting consensus. `numPages` is capped at 1.
```solidity
ExtractString({
    key:         "best_drama",
    description: "Title of the film that won Best Motion Picture - Drama at the 2026 Golden Globes.",
    options:     new string[](0),
    prompt:      "Best Picture winner",
    url:         "https://www.goldenglobes.com/winners/2026",
    resolveUrl:  false,
    numPages:    1
});
```
Recommendation for production: prefer direct mode with a stable URL. Search-mode results can drift over time, hurting consensus stability and reproducibility.
```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

import {
    IAgentRequester,
    IAgentRequesterHandler,
    Response,
    Request,
    ResponseStatus
} from "./IAgentRequester.sol";

// ABI of the agent; used here only to derive selectors for payload encoding.
interface IParseWebsiteAgent {
    function ExtractString(
        string memory key,
        string memory description,
        string[] calldata options,
        string memory prompt,
        string memory url,
        bool resolveUrl,
        uint8 numPages
    ) external returns (string memory);

    function ExtractANumber(
        string memory key,
        string memory description,
        uint256 min,
        uint256 max,
        string memory prompt,
        string memory url,
        bool resolveUrl,
        uint8 numPages
    ) external returns (uint256);
}

contract WebExtractor is IAgentRequesterHandler {
    IAgentRequester public immutable platform;

    uint256 public constant AGENT_ID = 12875401142070969085;
    uint256 public constant SUBCOMMITTEE_SIZE = 3;
    uint256 public constant PRICE_PER_AGENT = 0.10 ether;

    string public latestString;
    uint256 public latestNumber;

    // Track which output type each pending request expects.
    mapping(uint256 => bool) public pendingString;
    mapping(uint256 => bool) public pendingNumber;

    constructor(address platform_) { platform = IAgentRequester(platform_); }

    // Search mode: string extraction with unconstrained options.
    function getBestDrama() external payable returns (uint256 requestId) {
        string[] memory options = new string[](0);
        bytes memory payload = abi.encodeWithSelector(
            IParseWebsiteAgent.ExtractString.selector,
            "best_drama",
            "Title of the film that won Best Motion Picture - Drama.",
            options,
            "Best Picture winners at the 2026 Golden Globe Awards",
            "goldenglobes.com",
            true,
            uint8(3)
        );
        requestId = _create(payload);
        pendingString[requestId] = true;
    }

    // Search mode: number extraction with bounds disabled (min = max = 0).
    function getSenegalGoals() external payable returns (uint256 requestId) {
        bytes memory payload = abi.encodeWithSelector(
            IParseWebsiteAgent.ExtractANumber.selector,
            "senegal_goals",
            "Number of goals scored by Senegal in the 18/1/26 AFCON final vs Morocco.",
            uint256(0), uint256(0),
            "Africa Cup of Nations final score Senegal Morocco 18 January 2026",
            "espn.com",
            true,
            uint8(3)
        );
        requestId = _create(payload);
        pendingNumber[requestId] = true;
    }

    // Deposit = platform request deposit + per-agent price for each subcommittee member.
    function _create(bytes memory payload) internal returns (uint256) {
        uint256 deposit = platform.getRequestDeposit() + PRICE_PER_AGENT * SUBCOMMITTEE_SIZE;
        require(msg.value >= deposit, "Underfunded");
        return platform.createRequest{value: deposit}(
            AGENT_ID, address(this), this.handleResponse.selector, payload
        );
    }

    function handleResponse(
        uint256 requestId,
        Response[] memory responses,
        ResponseStatus status,
        Request memory /* details */
    ) external override {
        require(msg.sender == address(platform), "Only platform");
        if (status != ResponseStatus.Success || responses.length == 0) return;

        // Decode according to the output type recorded when the request was created.
        if (pendingString[requestId]) {
            delete pendingString[requestId];
            latestString = abi.decode(responses[0].result, (string));
        } else if (pendingNumber[requestId]) {
            delete pendingNumber[requestId];
            latestNumber = abi.decode(responses[0].result, (uint256));
        }
    }

    receive() external payable {}
}
```
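
The deposit arithmetic in `_create` can be checked off-chain before sending a transaction. The sketch below assumes 18-decimal native tokens (as implied by `0.10 ether`); `requiredDeposit` is a hypothetical helper, and the base deposit in the example is a made-up value, since the real one must be read from `getRequestDeposit()` on the platform contract.

```typescript
const WEI_PER_TOKEN = 10n ** 18n;
// 0.10 tokens per agent, in wei (matches `0.10 ether` in the contract).
const PRICE_PER_AGENT_WEI = WEI_PER_TOKEN / 10n;

// Mirrors: getRequestDeposit() + PRICE_PER_AGENT * SUBCOMMITTEE_SIZE
function requiredDeposit(baseDepositWei: bigint, subcommitteeSize: number): bigint {
  if (!Number.isInteger(subcommitteeSize) || subcommitteeSize <= 0) {
    throw new RangeError("subcommitteeSize must be a positive integer");
  }
  return baseDepositWei + PRICE_PER_AGENT_WEI * BigInt(subcommitteeSize);
}

// Example with a made-up base deposit of 0.01 tokens and a subcommittee of 3.
const exampleBaseDeposit = WEI_PER_TOKEN / 100n;
console.log(requiredDeposit(exampleBaseDeposit, 3)); // 310000000000000000n (0.31 tokens)
```

Sending at least this amount as `msg.value` avoids the `"Underfunded"` revert in the contract above.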
```typescript
import { encodeFunctionData, parseAbi } from 'viem';

const abi = parseAbi([
  'function ExtractString(string key, string description, string[] options, string prompt, string url, bool resolveUrl, uint8 numPages) returns (string)',
  'function ExtractANumber(string key, string description, uint256 min, uint256 max, string prompt, string url, bool resolveUrl, uint8 numPages) returns (uint256)',
]);

const stringPayload = encodeFunctionData({
  abi,
  functionName: 'ExtractString',
  args: [
    'best_drama',
    'Title of the film that won Best Motion Picture - Drama.',
    [],
    'Best Picture winners at the 2026 Golden Globe Awards',
    'goldenglobes.com',
    true,
    3,
  ],
});

const numberPayload = encodeFunctionData({
  abi,
  functionName: 'ExtractANumber',
  args: [
    'senegal_goals',
    'Number of goals scored by Senegal in the 18/1/26 AFCON final vs Morocco.',
    0n, 0n,
    'Africa Cup of Nations final score: number of goals for Senegal',
    'espn.com',
    true,
    3,
  ],
});
```
The agent internally builds a JSON schema for the LLM and adds three auxiliary fields that don't appear in the ABI output but are logged in the receipt:
| Field | Type | Meaning |
|---|---|---|
| `reasoning` | string | LLM's chain-of-thought for the extraction |
| `answerable` | bool | Whether the model believes the page actually contains the answer |
| `confidence_score` | int (0–100) | Model's self-reported confidence |
You can't read these on-chain — but they're invaluable for debugging via the receipt API. The agent also runs a map-reduce extraction across multiple chunks / pages and merges by highest confidence. If answerable = false is the consensus across validators, the request typically returns an empty / zero output (or Failed if the model refuses outright).
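
The "merge by highest confidence" step can be pictured as a simple reduce over per-chunk results. This is an illustrative sketch only: the `ChunkResult` shape reuses the auxiliary field names from the table, while `mergeByConfidence`, the handling of unanswerable chunks, and the tie-breaking rule (first seen wins) are assumptions rather than the agent's actual implementation.

```typescript
// Per-chunk extraction, using the auxiliary fields described in the table.
interface ChunkResult<T> {
  value: T;
  reasoning: string;
  answerable: boolean;
  confidence_score: number; // 0-100
}

// Reduce step: keep the answerable result with the highest confidence.
// Returns null when no chunk claims to contain the answer.
function mergeByConfidence<T>(results: ChunkResult<T>[]): ChunkResult<T> | null {
  let best: ChunkResult<T> | null = null;
  for (const r of results) {
    if (!r.answerable) continue; // assumption: unanswerable chunks never win
    if (best === null || r.confidence_score > best.confidence_score) best = r;
  }
  return best;
}
```

When inspecting a receipt, comparing `confidence_score` across chunks in this way helps explain why one page's answer was chosen over another's.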
```
ExtractString(
  key:         "world_cup_winner",
  description: "Country that won the 2026 FIFA World Cup.",
  options:     [],
  prompt:      "FIFA World Cup 2026 winner final result",
  url:         "https://en.wikipedia.org/wiki/2026_FIFA_World_Cup",
  resolveUrl:  false,
  numPages:    1
)
```
Direct mode + Wikipedia is the most reliable combination for one-shot retrieval — the page is stable, well-indexed, and reachable.
```
ExtractString(
  key:         "outcome",
  description: "Did Senegal win the 18/1/26 AFCON final against Morocco?",
  options:     ["yes","no"],
  prompt:      "AFCON 2026 final Senegal Morocco result",
  url:         "espn.com",
  resolveUrl:  true,
  numPages:    3
)
```
Constrained options + multi-page search → robust against any single source being missing or inconsistent.
```
ExtractANumber(
  key:         "marathon_time_minutes",
  description: "Winning time in minutes (rounded down) of the 2026 Boston Marathon men's race.",
  min: 100, max: 300,
  prompt:      "2026 Boston Marathon men's winner finish time",
  url:         "boston.com",
  resolveUrl:  true,
  numPages:    3
)
```
Bounds prevent obvious mis-parses: with `max: 300`, a year such as 2026 falls out of range, and `min: 100` rejects implausibly small values.
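
The same bounds can be re-checked on the client once a result arrives. This is a minimal sketch under two assumptions: that the bounds are inclusive, and that `min = max = 0` means "disabled" as stated in the inputs table. `withinBounds` is a hypothetical helper, not an agent function.

```typescript
// Returns true when value satisfies the (assumed inclusive) bounds.
// min = max = 0 disables the check, matching the documented convention.
function withinBounds(value: bigint, min: bigint, max: bigint): boolean {
  if (min === 0n && max === 0n) return true;
  return value >= min && value <= max;
}

console.log(withinBounds(129n, 100n, 300n));  // true: a plausible marathon time in minutes
console.log(withinBounds(2026n, 100n, 300n)); // false: the model picked the year
console.log(withinBounds(2026n, 0n, 0n));     // true: bounds disabled
```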
Search engines rank results based on time, geography, and load. Two validators searching simultaneously can see different result sets. The agent has retry / replacement logic to mitigate this, but for production-critical extractions prefer direct mode with a stable URL — Wikipedia, official archives, the source's own permanent page.
### `numPages` cost / quality trade-off

More pages = more chunks = more LLM calls = higher `executionCost`. The agent does map-reduce extraction across chunks and short-circuits at 90% confidence, so 3 pages is rarely much worse than 5, and often better than 1.
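
The effect of the short-circuit on cost can be illustrated with a loop that stops calling the model once a result is confident enough. This is a sketch, not the agent's code: `extractWithShortCircuit` and the `extractFromChunk` callback are hypothetical stand-ins for one LLM call per chunk, and treating the threshold as "greater than or equal to 90" is an assumption.

```typescript
interface Extraction {
  value: string;
  confidence: number; // 0-100
}

const SHORT_CIRCUIT_CONFIDENCE = 90;

// Processes chunks in order and stops as soon as the best result so far
// reaches the threshold, so later chunks cost nothing.
function extractWithShortCircuit(
  chunks: string[],
  extractFromChunk: (chunk: string) => Extraction, // stand-in for one LLM call
): { best: Extraction | null; llmCalls: number } {
  let best: Extraction | null = null;
  let llmCalls = 0;
  for (const chunk of chunks) {
    const result = extractFromChunk(chunk);
    llmCalls++;
    const candidate: Extraction =
      best === null || result.confidence > best.confidence ? result : best;
    best = candidate;
    if (candidate.confidence >= SHORT_CIRCUIT_CONFIDENCE) break;
  }
  return { best, llmCalls };
}
```

With this shape, raising `numPages` only adds LLM calls when the earlier pages fail to produce a confident answer, which is why extra pages are often cheaper than the worst case suggests.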
The agent uses a real browser and waits for content to render. Pages that lazy-load on scroll or behind a click won't be captured. If the data appears only after interaction, JSON API Request against the underlying XHR endpoint is usually a better fit.
The agent partitions URLs into HTML and PDF; PDFs are downloaded and parsed directly. Long PDFs get chunked aggressively, which can degrade extraction quality compared with the equivalent HTML.
If the page changes between validator executions (live scoreboards, real-time prices), Majority consensus can fail. Wait for results to be stable before invoking, or use Threshold consensus and aggregate post-hoc.
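
To see why a changing page breaks Majority consensus, consider a simplified model in which a result is accepted only if more than half of the validators report the same value. The platform's exact rule is not specified in this document, so the strict-majority threshold and the `majority` helper below are assumptions for illustration.

```typescript
// Returns the value reported by a strict majority of validators, or null if
// no value is shared by more than half of them.
function majority(values: string[]): string | null {
  const counts = new Map<string, number>();
  for (const v of values) {
    counts.set(v, (counts.get(v) ?? 0) + 1);
  }
  let winner: string | null = null;
  counts.forEach((count, value) => {
    if (count * 2 > values.length) winner = value;
  });
  return winner;
}

console.log(majority(["2", "2", "2"])); // "2": page was stable across validators
console.log(majority(["1", "2", "3"])); // null: live score changed between executions
```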
### `key` and `description` matter

The model's structured-output schema uses `key` as the JSON field name and `description` as the per-field doc. Vague keys (`"x"`) and prompts that contradict the description produce worse extractions than concrete, narrow phrasing.
ExtractANumber returns uint256. Pages reporting negative values (temperatures, deltas) get clamped to 0. Use the LLM Inference agent's inferNumber (which returns int256) when signed values matter.
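
The practical consequence of clamping is that a negative reading becomes indistinguishable from a genuine zero. The sketch below models only the documented behavior (negative values become 0); `clampToUint` is a hypothetical name, not an agent function.

```typescript
// Models the documented clamp: negative values become 0.
function clampToUint(value: bigint): bigint {
  return value < 0n ? 0n : value;
}

console.log(clampToUint(-5n)); // 0n: a -5 degree reading is lost
console.log(clampToUint(0n));  // 0n: indistinguishable from the line above
console.log(clampToUint(12n)); // 12n
```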
### `Failed`?

In order of likelihood:

1. […] `answerable = false` across validators. Switch to direct mode or refine the prompt.
2. […] `fetch` step in the receipt.
3. […] `options` didn't include the actual answer). Loosen `options` or drop them.
4. […] `getRequestDeposit() + 0.10 × subSize`.

Related skills:

- `somnia-agents`: request lifecycle, deposit math, callback pattern
- `somnia-agents-invoke`: interactive CLI to fire one-off `ExtractString` / `ExtractANumber` calls without writing a contract
- `somnia-agents-json-fetch`: when the data is in a JSON API instead of HTML
- `somnia-agents-llm-inference`: when you want free-form inference rather than structured page extraction
npx claudepluginhub emrestay/somnia-agents-skills --plugin somnia-agents-skills