Web & HTTP

Web Scrape

Fetch a web page and extract clean structured content.

web_scrape

Overview

Visits a web page and pulls out just the readable stuff — the headline, the article, and any tables — while ignoring the menus, ads, and footers. Think of it as handing the agent the clean version of the page.

How it works

Performs a plain HTTP GET on the URL, parses the returned HTML, and extracts the title, body text, headings, tables, and (optionally) the link list. A CSS selector can be supplied to target a specific region of the page. It does not execute JavaScript; for pages that require it, use the 'browse' tool instead.
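
As a rough illustration of the extraction step, here is a minimal, self-contained sketch. It uses regex-level parsing as a stand-in for a real HTML parser, and the class and method names are illustrative only, not the tool's actual internals:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractionSketch {

    // Pull the contents of the <title> element, if present.
    static String title(String html) {
        Matcher m = Pattern.compile("(?is)<title>(.*?)</title>").matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    // Drop script/style blocks, strip remaining tags, collapse whitespace.
    // A real implementation would also discard boilerplate regions
    // (nav, ads, footers) using readability-style heuristics.
    static String bodyText(String html) {
        String s = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        s = s.replaceAll("(?s)<[^>]+>", " ");
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Post</title></head>"
            + "<body><nav>Menu</nav><article>Hello world</article></body></html>";
        System.out.println(title(html));    // Post
        System.out.println(bodyText(html)); // Post Menu Hello world
    }
}
```

A production scraper would swap the regexes for a proper HTML parser (jsoup is the usual choice on the JVM), which is also what makes CSS-selector targeting possible.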

Example

When a user asks:

Pull the article text from this blog post.

the agent calls the tool:

web_scrape(url="https://example.com/blog/post")

and gets back: the article's title, body text, and any tables — menus, ads, and footer stripped out.
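
The exact payload shape is implementation-specific, but the structured result might look roughly like this (field names here are illustrative, not the tool's actual schema):

```
{
  "title": "How We Scaled Our Pipeline",
  "text": "When we first launched, the ingest job took four hours...",
  "headings": ["Background", "The Bottleneck", "Results"],
  "tables": [
    [["Stage", "Before", "After"], ["Ingest", "4h", "12m"]]
  ]
}
```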

Use it in a workflow

Wire this tool into a SwarmAI crew. Use the YAML DSL for declarative workflows, or the Java builder API when you want full programmatic control.

YAML DSL

# content-harvest.yaml
name: content-harvest-crew
process: SEQUENTIAL

agents:
  - id: extractor
    role: Content Extractor
    goal: Pull clean article text from public web pages
    tools:
      - web_scrape

tasks:
  - id: content-harvest-task
    agent: extractor
    description: Visit the supplied URL and return just the body text, title, and any tables.

Java

import ai.intelliswarm.swarmai.agent.Agent;
import ai.intelliswarm.swarmai.task.Task;
import ai.intelliswarm.swarmai.swarm.Swarm;
import ai.intelliswarm.swarmai.swarm.SwarmOutput;
import ai.intelliswarm.swarmai.process.ProcessType;
import ai.intelliswarm.swarmai.tool.common.WebScrapeTool;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class ContentHarvestCrew {

    @Autowired ChatClient chatClient;
    @Autowired WebScrapeTool webScrapeTool;

    public SwarmOutput harvest() {
        // Single agent that owns the web_scrape tool
        Agent extractor = Agent.builder()
            .role("Content Extractor")
            .goal("Pull clean article text from public web pages")
            .chatClient(chatClient)
            .tool(webScrapeTool)
            .build();

        Task extractorTask = Task.builder()
            .description("Visit the supplied URL and return just the body text, title, and any tables.")
            .agent(extractor)
            .build();

        // Assemble the crew and run it sequentially
        return Swarm.builder()
            .agent(extractor)
            .task(extractorTask)
            .process(ProcessType.SEQUENTIAL)
            .build()
            .kickoff();
    }
}

What it's good for

Real scenarios where agents put this tool to work.

News article and blog extraction
Pulling structured tables off static pages
CSS-selector targeted scraping in research workflows
Lightweight alternative to 'browse' for non-JS pages
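
For the selector-targeted case, the call can narrow extraction to one region of the page. The parameter name below is an assumption based on the CSS-selector support described above, not a confirmed part of the tool's signature:

```
web_scrape(url="https://example.com/blog/post", selector="article.post-body")
```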

Source

Implementation lives at swarmai-tools/src/main/java/ai/intelliswarm/swarmai/tool/common/WebScrapeTool.java in the swarm-ai repository.

Open web_scrape on GitHub →