How to add browser automation to your AI agent without writing Selenium

The moment your AI agent needs to interact with a website — log in, extract data, fill a form, click a button — you face a choice: build browser infrastructure yourself or find the right abstraction. Most teams start by reaching for Selenium or Playwright directly. Most teams end up regretting it.

Why Selenium is the wrong starting point for agents

Selenium was designed for deterministic test automation. You write scripts that navigate to known pages, interact with known elements, and assert known outcomes. The paths are fixed. The error handling is manual. The session management is your problem.

AI agents are different. They don't follow fixed paths — they reason about what to do next based on what they see. They encounter unexpected page states. They need to handle authentication dynamically. They need to know not just what happened, but why it happened, so the model can adjust.
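That observe-reason-act cycle can be sketched as a small loop. This is an illustrative pattern, not any particular framework's API; `observe`, `decide`, and `act` are hypothetical callables standing in for the page snapshot, the model call, and the browser action:

```python
# A minimal sketch of the loop an agent runs: no fixed script,
# the next action is chosen from the current observation.
def run_agent(observe, decide, act, max_steps=10):
    """Loop until the policy signals it is done or steps run out."""
    history = []
    for _ in range(max_steps):
        obs = observe()
        action = decide(obs, history)  # the model reasons over what it sees
        if action is None:             # the model decides the task is complete
            break
        history.append((obs, action))
        act(action)
    return history
```

The point of the sketch is the contrast with a test script: there is no predetermined sequence of steps, so the browser layer underneath has to tolerate whatever page state the loop encounters.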

Wiring Selenium or raw Playwright into an agent loop produces brittle code. Every new page layout, every bot detection challenge, every session expiry becomes a failure mode you have to anticipate and handle. You end up spending more time maintaining the browser plumbing than improving the agent's actual capabilities.

What browser automation looks like in an agent context

The right abstraction for agent browser automation has a few key properties:

- High-level intent: the agent states what it wants done, not which elements to click.
- Structured results: pages come back as data the model can reason over, not raw HTML.
- Observable outcomes: the agent learns not just what happened but why, so it can adjust.
- Managed sessions: authentication, cookies, and expiry are the runtime's problem, not the agent's.

The infrastructure you don't want to build

If you do decide to build browser infrastructure for your agent yourself, here's what you're signing up for:

- Browser provisioning: launching, pooling, and recycling headless browsers at scale.
- Session management: keeping logins and cookies alive across steps and recovering from expiry.
- Bot detection: handling CAPTCHAs and fingerprinting challenges that change as sites evolve.
- Content extraction: turning rendered pages into structured data the model can actually use.
- Error handling: classifying failures so the agent knows why something broke, not just that it did.
- Logging: recording every action so you can reconstruct what the agent did and debug it.

None of this is impossible. All of it takes longer than you expect. And none of it is what makes your agent valuable.

The right approach

The right approach is to treat browser actions as a primitive in your agent's tool set — not a DIY infrastructure project. Your agent calls browse(url, task). The execution layer handles everything else: spinning up the browser, managing the session, extracting structured content, returning observable results, logging everything.

result = await legs.browse(
    url="https://example.com/product/123",
    extract="product name, price, availability"
)
# Returns: {"name": "Widget Pro", "price": "$49", "available": True}

This is what a browser action looks like from the agent's perspective: a high-level intent with a structured result. The agent can reason over the result. It doesn't need to know how the browser was provisioned or how the content was extracted.
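Because the result is structured, downstream agent logic can branch on fields directly. A small sketch, reusing the field names from the example above (`plan_next_step` is a hypothetical helper, not part of any API):

```python
def plan_next_step(result):
    """Decide a follow-up action from a structured browse result."""
    # The agent reads fields, never selectors or raw HTML.
    if result.get("available"):
        return f"add {result['name']} to cart at {result['price']}"
    return "search for an alternative supplier"

result = {"name": "Widget Pro", "price": "$49", "available": True}
next_step = plan_next_step(result)
# next_step == "add Widget Pro to cart at $49"
```

Note that nothing in this logic would change if the underlying extraction mechanism changed; that decoupling is the whole point of the abstraction.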

When you need more control

Sometimes you need more than just extraction — you need interaction: clicking buttons, filling forms, navigating multi-step flows. The abstraction still holds. The execution layer handles the click, the form fill, the navigation. The agent describes what it wants to accomplish. The runtime handles the mechanics.
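From the agent's side, a multi-step interaction looks the same as extraction: a goal in, a structured result out. The sketch below is an assumption about the call shape, modeled on the earlier `browse(url, task)` example; `StubBrowser` stands in for the real runtime so the snippet runs anywhere:

```python
import asyncio

class StubBrowser:
    """Stand-in for the real browser runtime; it just echoes the request."""
    async def browse(self, url, task):
        # A real runtime would click, fill forms, and navigate here.
        return {"url": url, "task": task, "status": "ok"}

async def main():
    legs = StubBrowser()
    # The agent states the goal; the runtime owns the mechanics.
    return await legs.browse(
        url="https://example.com/login",
        task="sign in, then open the most recent order",
    )

result = asyncio.run(main())
```

The agent code never mentions a button, a form field, or a selector; the whole multi-step flow is one intent.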

The key principle: your agent code should read like intent, not like browser manipulation. The moment your agent code starts talking about selectors and element handles, you've leaked the wrong abstraction into the wrong place.

Agent Legs includes a browser action type that handles full page rendering, session management, structured extraction, and action logging. No Selenium. No Playwright config. One import. Get early access.

Your agent has a brain.
Give it legs.

Free for 1,000 actions/month. No credit card required.

Get early access