
Building an open-source Browser Agent on Fireworks AI
By Fireworks AI |5/16/2025
Imagine an AI that doesn't just respond to your questions but can actively navigate the web for you - clicking buttons, filling forms, extracting information, and making decisions just like you would. That's the promise of AI agents with browser control capabilities, and it's becoming a reality with tools like Fireworks AI BrowserUse.
In this technical deep dive, we'll explore how large language models (LLMs) can be given the ability to "see" web content and take actions in real-time. We'll examine the architecture that makes this possible and show why Fireworks AI's inference capabilities are particularly well-suited for this challenging task.
Despite the push toward structured APIs, browsers remain the most universal interface to the web's vast information and services. Here's why building agents that can control browsers matters:
This makes browser automation the most robust approach to web interaction, despite being technically more complex than API integration. The challenge has been making it intelligent enough to handle the unpredictability of the modern web.
Creating an agent that can browse the web effectively requires solving several technical challenges simultaneously. Our architecture tackles these by implementing three core capabilities:
For an AI agent to understand a webpage, it needs to "see" the content. Our solution combines multiple techniques:
Here's a simplified illustration of how we capture browser state:
import base64

async def get_browser_state(self):
    """Get the current browser state including DOM and screenshot."""
    # Capture a screenshot (Playwright returns raw bytes) and encode it as base64
    screenshot = base64.b64encode(await self.page.screenshot()).decode("ascii")
    # Get DOM elements with indices for interaction
    elements = await self._parse_dom_for_llm(self.page)
    # Track viewport position and size
    scroll_position = await self.page.evaluate("window.scrollY")
    viewport_height = await self.page.evaluate("window.innerHeight")
    page_height = await self.page.evaluate("document.body.scrollHeight")
    return {
        "url": self.page.url,
        "title": await self.page.title(),
        "screenshot": screenshot,
        "elements": elements,
        "pixels_above": scroll_position,
        # Clamp at zero so rounding never reports negative remaining content
        "pixels_below": max(0, page_height - (scroll_position + viewport_height)),
    }
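This method gives the LLM the context it needs to understand the current page: the URL and title, a screenshot, the indexed interactive elements, and how much content lies above and below the viewport.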
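The scroll fields in that state are simple arithmetic, and a small standalone sketch may make them easier to follow. The function names `scroll_context` and `describe_scroll` are illustrative assumptions, not part of the project:

```python
def scroll_context(scroll_y: int, viewport_height: int, page_height: int) -> dict:
    """Return how many pixels of the page lie above and below the viewport."""
    pixels_above = max(0, scroll_y)
    # Content below = total height minus everything scrolled past or on screen
    pixels_below = max(0, page_height - (scroll_y + viewport_height))
    return {"pixels_above": pixels_above, "pixels_below": pixels_below}


def describe_scroll(ctx: dict) -> str:
    """Render the scroll context as a short hint for the model (hypothetical wording)."""
    return (
        f"{ctx['pixels_above']} pixels above, "
        f"{ctx['pixels_below']} pixels below the visible area"
    )
```

For a 2000-pixel page viewed through an 800-pixel viewport at the top of the page, this reports 0 pixels above and 1200 below, which tells the model that scrolling down may reveal more elements.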
Once the agent can see the page, it needs to determine what to do next. This is where Fireworks AI's advanced reasoning capabilities come into play:
Our implementation uses structured JSON outputs that force the LLM to maintain this reasoning framework, preventing it from defaulting to vague or unactionable responses.
Finally, the agent needs to translate decisions into actual browser manipulations. Our BrowserUseTool handles this by providing a comprehensive set of atomic actions that can be combined to perform complex tasks:
Navigation Actions: Direct URL navigation, back/forward, refresh, and search capabilities.
Element Interaction: Clicking, text input, scrolling, and keyboard input tools that can manipulate any interactive element.
Content Extraction: Specialized functions for extracting structured data from pages based on semantic goals.
Multi-tab Management: The ability to open, close, and switch between multiple tabs for complex workflows.
This action system provides a clean interface that abstracts away the complexity of browser automation, letting the LLM focus on high-level decision making rather than implementation details.
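One way to picture such an action interface is a registry that maps action names to async handlers. The following is a minimal sketch under stated assumptions: `ActionRegistry`, the action names `go_to_url` and `click_element`, and the stub handlers are hypothetical and do not reflect the actual BrowserUseTool code.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

ActionHandler = Callable[..., Awaitable[str]]


class ActionRegistry:
    """Maps action names chosen by the LLM to async handler functions."""

    def __init__(self) -> None:
        self._handlers: Dict[str, ActionHandler] = {}

    def register(self, name: str) -> Callable[[ActionHandler], ActionHandler]:
        def decorator(fn: ActionHandler) -> ActionHandler:
            self._handlers[name] = fn
            return fn
        return decorator

    async def execute(self, name: str, **params: Any) -> str:
        handler = self._handlers.get(name)
        if handler is None:
            # Report the problem as text so the LLM can correct itself next step
            return f"error: unknown action '{name}'"
        try:
            return await handler(**params)
        except TypeError as exc:
            # Raised when the model supplies wrong or missing parameters
            return f"error: bad parameters for '{name}': {exc}"


registry = ActionRegistry()


@registry.register("go_to_url")
async def go_to_url(url: str) -> str:
    # A real handler would drive the browser; this stub only reports intent
    return f"navigated to {url}"


@registry.register("click_element")
async def click_element(index: int) -> str:
    return f"clicked element {index}"
```

Returning errors as strings rather than raising keeps the loop running and gives the model a chance to pick a different action.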
The core of our browser agent is the continuous loop of:
This loop continues until the task is complete or requires human intervention. What makes this approach powerful is that the agent can adapt to unexpected situations - if a website changes its layout, introduces a new step, or behaves differently than expected, the agent can still navigate it successfully because it's responding to what it actually sees rather than following a pre-programmed script.
One of the key technical challenges in building effective browser agents is the quality and speed of the underlying LLM. This is where Fireworks AI provides significant advantages.
Browser automation requires multiple back-and-forth interactions between the agent and the browser. Because every step waits on at least one LLM call, slow inference compounds across a task and makes the agent feel sluggish and unresponsive.
Fireworks AI models are optimized for inference speed, dramatically reducing the time between observation and action. In our testing, this resulted in:
This speed improvement isn't just about user experience - it fundamentally changes what the agent can accomplish by enabling it to keep up with modern, highly interactive websites.
Browser automation demands precision. Vague instructions or formatting errors can cause actions to fail. Fireworks AI's JSON mode produces strictly formatted, valid JSON responses that are more reliable for parsing and executing actions.
Our system leverages this capability to enforce a structured thinking pattern where the LLM must generate valid JSON for every decision, with specific fields for evaluation, memory, goals, and actions. The response_format parameter puts the model in JSON mode so that every response is a valid JSON object:
response = await llm_client.chat.completions.create(
    model=model_name,
    messages=messages,
    tools=tools,
    response_format={"type": "json_object"},
    temperature=0.2
)
This structured approach forces the model to be explicit about its reasoning and intended actions, reducing errors caused by ambiguous instructions.
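Even with JSON mode, it can help to check that a response carries the expected fields before acting on it. Here is a minimal sketch of such a check. The exact key names (`evaluation`, `memory`, `goal`, `actions`) and the `InvalidAgentResponse` exception are assumptions chosen to mirror the fields named above, not the project's actual schema.

```python
import json

# Hypothetical field names and types, mirroring the four fields described above
REQUIRED_FIELDS = {"evaluation": str, "memory": str, "goal": str, "actions": list}


class InvalidAgentResponse(ValueError):
    """Raised when the model's reply cannot be used to drive the browser."""


def parse_agent_response(raw: str) -> dict:
    """Parse the model's reply and verify it has every required field."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidAgentResponse(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InvalidAgentResponse("top-level value must be a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise InvalidAgentResponse(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise InvalidAgentResponse(
                f"field {field} must be of type {expected_type.__name__}"
            )
    return data
```

A caller could catch `InvalidAgentResponse` and send the error message back to the model as feedback rather than executing a malformed decision.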
Webpage content can be incredibly verbose. A single page might contain thousands of DOM elements, and screenshots add substantial token usage. Fireworks models handle long contexts efficiently, allowing our agent to process more information without hitting token limits.
This means the agent can:
Modern websites are highly visual. Text alone often doesn't capture the full context of what's on a page. Fireworks AI's multimodal capabilities enable our agent to process screenshots alongside DOM data, giving it a more complete understanding of the page.
This visual understanding helps with:
Our configuration allows for separate model selection for text reasoning and visual processing:
# Global LLM configuration
[llm]
model = "accounts/fireworks/models/deepseek-v3"
# Vision model configuration
[llm.vision]
model = "accounts/fireworks/models/firellava-13b"
This flexibility lets us optimize for each aspect of the agent's operation, using specialized models where they excel.
Perhaps the most technically nuanced aspect of building effective browser agents is designing prompts that guide the LLM effectively. Our system prompt is a carefully crafted set of instructions that shapes how the agent interprets what it sees and decides what to do.
The prompt structures the agent's input and output in specific ways:
This strict structure guides the LLM's reasoning process, preventing common failure modes like hallucination, forgetting context, or generating vague commands.
Building browser agents presents numerous technical challenges. Here are the most significant ones we've encountered and how we've addressed them:
Problem: LLMs often hallucinate element indices or try to interact with non-existent elements, particularly when a desired element isn't visible in the current viewport.
Solution: We implemented several strategies to address this, including a precise DOM parsing system that creates unambiguous element references. Our parser traverses the DOM, identifies interactive elements, and assigns them unique indices, formatting each one as:
[15]<button>Submit Form</button>
[16]<input placeholder="Search..."></input>
This approach makes it crystal clear which elements can be interacted with and how to reference them, dramatically reducing hallucination issues.
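The indexed format shown above can be produced and checked with a couple of small helpers. This sketch assumes each interactive element has already been extracted into a dictionary with `tag`, `text`, and `attrs` keys; that input shape and the function names are illustrative assumptions, not the project's parser.

```python
def format_elements(elements: list, start_index: int = 0) -> tuple:
    """Render elements as '[index]<tag attrs>text</tag>' lines plus an index map."""
    lines = []
    index_map = {}
    for offset, element in enumerate(elements):
        index = start_index + offset
        attrs = "".join(
            f' {name}="{value}"' for name, value in element.get("attrs", {}).items()
        )
        tag = element["tag"]
        text = element.get("text", "")
        lines.append(f"[{index}]<{tag}{attrs}>{text}</{tag}>")
        index_map[index] = element
    return "\n".join(lines), index_map


def resolve_index(index_map: dict, index: int) -> dict:
    """Look up an element by index, rejecting indices the model made up."""
    if index not in index_map:
        valid = sorted(index_map)
        raise KeyError(f"no element with index {index}; valid indices: {valid}")
    return index_map[index]
```

Keeping the index map alongside the rendered text means a hallucinated index is caught before any browser call is made, and the error message lists the indices that do exist.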
Problem: Modern websites load content dynamically, making it challenging to know when a page is ready for interaction.
Solution: We implemented a sophisticated waiting system with multiple strategies:
Our system can adapt its waiting strategy based on the specific website and expected content, significantly improving reliability.
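The specific waiting strategies are not spelled out here, but most of them reduce to polling a readiness condition until it holds or a timeout expires. The helper below is a generic sketch of that idea. The name `wait_until` and its parameters are assumptions, and in a real agent the condition would query the page rather than a counter.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def wait_until(
    condition: Callable[[], Awaitable[bool]],
    timeout: float = 5.0,
    interval: float = 0.1,
) -> bool:
    """Poll an async condition until it is true or the timeout expires.

    Returns True if the condition became true, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if await condition():
            return True
        if time.monotonic() >= deadline:
            return False
        await asyncio.sleep(interval)
```

Returning False instead of raising lets the caller decide whether a timeout is fatal or whether the agent should simply re-observe the page and continue.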
Problem: As agents navigate multiple pages and perform complex tasks, maintaining context becomes challenging.
Solution: We developed a structured memory system that tracks:
This comprehensive memory enables the agent to maintain context across complex workflows, remembering what it has already learned and what remains to be done.
Problem: Web interactions frequently fail in unpredictable ways - elements disappear, pages redirect unexpectedly, forms reset, etc.
Solution: We implemented a multi-layered recovery system:
This robust error handling allows the agent to recover from many common failures without human intervention.
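As one illustration of a recovery layer, a failed action can be retried with exponential backoff, with a hook that runs between attempts (for example, to re-capture the page state). This is a sketch of a common pattern, not the project's recovery system. `with_retries` and `on_failure` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def with_retries(
    action: Callable[[], Awaitable[str]],
    attempts: int = 3,
    base_delay: float = 0.5,
    on_failure: Optional[Callable[[Exception], Awaitable[None]]] = None,
) -> str:
    """Run an async action, retrying with exponential backoff on failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return await action()
        except Exception as exc:  # broad on purpose: web failures vary widely
            last_error = exc
            if on_failure is not None:
                await on_failure(exc)  # e.g. refresh the browser state
            if attempt < attempts - 1:
                # Delay doubles each time: base, 2x base, 4x base, ...
                await asyncio.sleep(base_delay * 2 ** attempt)
    assert last_error is not None
    raise last_error
```

When all attempts fail, the last exception is re-raised so that a higher layer, or a human, can take over.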
This is just the beginning of what's possible with browser agents. Our research suggests several promising directions for future development:
Complex web tasks often involve multiple specialized skills. Future implementations could use multiple agents working together:
Currently, our agents rely primarily on in-context reasoning. Future versions could incorporate:
As browser agents gain capabilities, privacy and security become increasingly important:
These controls would make browser agents suitable for more sensitive use cases.
Browser agents represent a significant evolution in how we interact with the web. By combining the reasoning capabilities of LLMs with browser automation and enhancing them with Fireworks AI's efficient inference, we've created a system that can navigate the web with a level of understanding and adaptability approaching that of a human user.
The key innovations in our approach include:
These capabilities open up new possibilities for automation, research, and accessibility. Browser agents can handle repetitive tasks like form filling, comparison shopping, or data collection. They can also assist users with disabilities by navigating complex interfaces on their behalf.
Remember: the goal isn't to replace human interaction with the web but to augment it, freeing us from repetitive tasks while maintaining the flexibility to handle the rich complexity of the modern web.
This project is open source and builds on technologies like Playwright, browser-use, and Fireworks AI. We welcome contributions and feedback from the community.
💡 GitHub Repository: https://github.com/shubcodes/fireworksai-browseruse