Features
- Scrape: Extract clean content from single URLs or batches of URLs
- Crawl: Comprehensively crawl entire websites and extract content from all pages
- Search: Perform web searches and scrape content from search results
- Map URLs: Fast URL discovery without full content extraction
- Extract: AI-powered structured data extraction using schemas or natural language prompts
Installation
Install the required dependencies:
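The only Firecrawl-specific dependency is the firecrawl-py package referenced under Troubleshooting (install edsl as well if you have not already):

```bash
pip install firecrawl-py
```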
Setup
Get your Firecrawl API key from https://firecrawl.dev and set it as an environment variable:
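A minimal sketch of the supported options (environment variable or a .env file, as noted under Troubleshooting); the key value is a placeholder:

```python
# Option 1: export the key in your shell before running Python:
#   export FIRECRAWL_API_KEY="fc-your-key-here"
#
# Option 2: put the key in a .env file in your project root:
#   FIRECRAWL_API_KEY=fc-your-key-here
#
# For a quick test (e.g. in a notebook) you can also set it in-process:
import os
os.environ["FIRECRAWL_API_KEY"] = "fc-your-key-here"  # placeholder value
```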
Basic Usage
Quick Start Examples
Scraping a single URL:
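A minimal sketch using the scrape_url convenience function (URLs are illustrative); a single URL returns a Scenario, a list of URLs returns a ScenarioList:

```python
from edsl.scenarios.firecrawl_scenario import scrape_url

# Single URL -> Scenario with the page content (markdown by default)
scenario = scrape_url("https://example.com")
print(scenario["title"], scenario["content"][:200])

# Multiple URLs -> ScenarioList, scraped concurrently
scenarios = scrape_url(["https://example.com", "https://example.org"])
```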
Class-Based API
For more control, use the class-based API:
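A sketch using the FirecrawlScenario class directly; the api_key argument is optional when FIRECRAWL_API_KEY is set in the environment, and the URL and query are illustrative:

```python
from edsl.scenarios.firecrawl_scenario import FirecrawlScenario

fc = FirecrawlScenario()  # or FirecrawlScenario(api_key="fc-your-key-here")

# Crawl a site into a ScenarioList (one Scenario per page)
pages = fc.crawl("https://example.com", limit=10, max_depth=2)

# Search the web and scrape content from the results
results = fc.search("latest developments in AI", limit=5)
```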
FirecrawlRequest Class
For distributed processing or when you need to separate request creation from execution:
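A sketch of FirecrawlRequest's two modes, as described in the API Reference below (the URL is illustrative):

```python
from edsl.scenarios.firecrawl_scenario import FirecrawlRequest

# With an API key available (here via the FIRECRAWL_API_KEY environment variable),
# method calls execute immediately and return Scenario/ScenarioList objects.
fc = FirecrawlRequest()
page = fc.scrape("https://example.com")

# Without any API key available, the object can still be constructed and passed
# around as a placeholder; calling a method then raises ValueError.
```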
Detailed Method Documentation
Scraping Methods
scrape_url(url_or_urls, max_concurrent=10, **kwargs)
Scrape content from one or more URLs.
Parameters:
- url_or_urls: Single URL string or list of URLs
- max_concurrent: Maximum concurrent requests for batch processing (default: 10)
- formats: List of output formats (default: ["markdown"])
- only_main_content: Extract only main content, skip navigation/ads (default: True)
- include_tags: HTML tags to specifically include
- exclude_tags: HTML tags to exclude
- headers: Custom HTTP headers as dictionary
- wait_for: Time to wait before scraping (milliseconds)
- timeout: Request timeout (milliseconds)
- actions: Browser actions to perform before scraping
Returns:
- Single URL: Scenario object with scraped content
- Multiple URLs: ScenarioList with Scenario objects
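A short sketch of batch scraping with a few of these options (URLs and values are illustrative):

```python
from edsl.scenarios.firecrawl_scenario import scrape_url

scenarios = scrape_url(
    ["https://example.com/a", "https://example.com/b"],
    max_concurrent=5,
    formats=["markdown", "html"],
    only_main_content=True,
    timeout=30000,  # milliseconds
)
```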
crawl_website(url, **kwargs)
Crawl an entire website and extract content from all pages.
Parameters:
- url: Base URL to start crawling from
- limit: Maximum number of pages to crawl
- max_depth: Maximum crawl depth from starting URL
- include_paths: URL path patterns to include (supports wildcards)
- exclude_paths: URL path patterns to exclude
- formats: Output formats for each page (default: ["markdown"])
- only_main_content: Extract only main content (default: True)
Returns:
- ScenarioList with a Scenario for each crawled page
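A sketch of a constrained crawl (the site and path patterns are illustrative):

```python
from edsl.scenarios.firecrawl_scenario import crawl_website

pages = crawl_website(
    "https://docs.example.com",
    limit=50,
    max_depth=2,
    include_paths=["/docs/*"],
    exclude_paths=["/docs/archive/*"],
)
```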
Search Methods
search_web(query_or_queries, max_concurrent=5, **kwargs)
Search the web and extract content from results.
Parameters:
- query_or_queries: Single search query or list of queries
- max_concurrent: Maximum concurrent requests for batch processing (default: 5)
- limit: Maximum number of search results per query
- sources: Sources to search (e.g., ["web", "news", "images"])
- location: Geographic location for localized results
- formats: Output formats for scraped content from results
Returns:
- ScenarioList with a Scenario for each search result
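A sketch of a multi-query search (queries and options are illustrative):

```python
from edsl.scenarios.firecrawl_scenario import search_web

results = search_web(
    ["python web scraping", "firecrawl tutorial"],
    limit=5,
    sources=["web", "news"],
)
```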
URL Mapping Methods
map_website_urls(url, **kwargs)
Discover and map all URLs from a website without scraping full content.
Parameters:
- url: Base URL to discover links from
- Additional mapping parameters via kwargs
Returns:
- ScenarioList with a Scenario for each discovered URL
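A sketch of fast URL discovery (the site is illustrative):

```python
from edsl.scenarios.firecrawl_scenario import map_website_urls

url_map = map_website_urls("https://example.com")
for s in url_map:
    print(s["discovered_url"])
```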
Extraction Methods
extract_data(url_or_urls, schema=None, prompt=None, **kwargs)
Extract structured data from web pages using AI-powered analysis.
Parameters:
- url_or_urls: Single URL string or list of URLs
- schema: JSON schema defining data structure to extract
- prompt: Natural language description of what to extract
- max_concurrent: Maximum concurrent requests for batch processing (default: 5)
- formats: Output formats for scraped content before extraction
Returns:
- Single URL: Scenario object with extracted structured data
- Multiple URLs: ScenarioList with Scenario objects
Note: Either schema or prompt should be provided. Schema takes precedence if both are given.
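A sketch of prompt-based extraction (the URL and prompt are illustrative; a schema-based variant appears under Advanced Usage):

```python
from edsl.scenarios.firecrawl_scenario import extract_data

product = extract_data(
    "https://example.com/product/123",
    prompt="Extract the product name, price, and availability",
)
print(product["extracted_data"])
```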
Working with Results
Scenario Fields
All methods return Scenario or ScenarioList objects with standardized fields.
Common fields for all methods:
- url: The scraped/crawled/searched URL
- title: Page title (when available)
- description: Page description (when available)
- content: Primary content (usually markdown format)
- status_code: HTTP status code (when available)
Scraping-specific fields:
- scrape_status: "success" or "error"
- markdown: Markdown content
- html: HTML content (if requested)
- links: Extracted links (if requested)
- screenshot: Screenshot data (if requested)
- metadata: Full page metadata
Search-specific fields:
- search_query: The original search query
- search_status: "success" or "error"
- result_type: "web", "news", or "image"
- position: Result position in search results
Extraction-specific fields:
- extract_status: "success" or "error"
- extracted_data: The structured data extracted by AI
- extraction_prompt: The prompt used (if any)
- extraction_schema: The schema used (if any)
Crawling-specific fields:
- crawl_status: "success" or "error"
URL mapping-specific fields:
- discovered_url: The discovered URL
- source_url: The URL it was discovered from
- map_status: "success" or "error"
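A sketch of reading these fields from results; field access assumes the dict-style interface of EDSL Scenario objects, and the URLs are illustrative:

```python
from edsl.scenarios.firecrawl_scenario import scrape_url

results = scrape_url(["https://example.com", "https://example.org"])
for scenario in results:
    if scenario["scrape_status"] == "success":
        print(scenario["url"], scenario["title"])
    else:
        print("failed:", scenario["url"])
```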
Advanced Usage
Concurrent Processing
All batch methods support concurrent processing to speed up large operations:
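A sketch of tuning the max_concurrent parameter (URLs, queries, and values are illustrative):

```python
from edsl.scenarios.firecrawl_scenario import scrape_url, search_web

urls = [f"https://example.com/page/{i}" for i in range(100)]

# Higher concurrency for scraping many small pages
scenarios = scrape_url(urls, max_concurrent=20)

# Lower concurrency for search queries to stay within rate limits
results = search_web(["query one", "query two", "query three"], max_concurrent=3)
```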
Custom Formats and Options
Specify different output formats and extraction options:
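A sketch combining several scraping options (the URL and values are illustrative):

```python
from edsl.scenarios.firecrawl_scenario import scrape_url

scenario = scrape_url(
    "https://example.com/article",
    formats=["markdown", "html", "links"],
    only_main_content=True,
    include_tags=["article", "main"],
    exclude_tags=["nav", "footer"],
    wait_for=2000,  # milliseconds
)
```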
Complex Crawling Scenarios
Advanced crawling with path filtering:
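A sketch of a path-filtered crawl (the site and patterns are illustrative):

```python
from edsl.scenarios.firecrawl_scenario import crawl_website

docs = crawl_website(
    "https://example.com",
    limit=200,
    max_depth=3,
    include_paths=["/blog/*", "/docs/*"],
    exclude_paths=["/blog/drafts/*", "/docs/v1/*"],
    formats=["markdown"],
)
```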
Schema-Based vs Prompt-Based Extraction
Using JSON schemas for structured extraction:
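A sketch contrasting the two modes; the schema shape and prompt text are illustrative (schema takes precedence when both are given):

```python
from edsl.scenarios.firecrawl_scenario import extract_data

# Schema-based: define the exact structure you want back
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "published_date": {"type": "string"},
    },
}
article = extract_data("https://example.com/article", schema=schema)

# Prompt-based: describe the data in natural language instead
article = extract_data(
    "https://example.com/article",
    prompt="Extract the article title, author, and publication date",
)
```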
Error Handling
The integration handles errors gracefully by returning Scenario objects with error information:
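A sketch of checking the per-result status fields rather than relying on exceptions; field access assumes the dict-style Scenario interface, and the URLs are illustrative:

```python
from edsl.scenarios.firecrawl_scenario import scrape_url

results = scrape_url(["https://example.com", "https://this-domain-does-not-exist.invalid"])

failed = [s for s in results if s["scrape_status"] == "error"]
succeeded = [s for s in results if s["scrape_status"] == "success"]
print(f"{len(succeeded)} succeeded, {len(failed)} failed")
```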
Integration with EDSL Workflows
The firecrawl integration works seamlessly with other EDSL components.
Using with Surveys
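A sketch of feeding scraped content into an EDSL survey; it assumes the usual EDSL pattern of importing QuestionFreeText and Survey from edsl and parameterizing question text with scenario fields (the exact placeholder syntax, e.g. {{ content }} vs. {{ scenario.content }}, depends on your EDSL version):

```python
from edsl import QuestionFreeText, Survey
from edsl.scenarios.firecrawl_scenario import scrape_url

# Scrape a few pages into a ScenarioList
pages = scrape_url(["https://example.com/a", "https://example.com/b"])

# Ask a question about each page's scraped content
q = QuestionFreeText(
    question_name="summary",
    question_text="Summarize the following page in two sentences: {{ content }}",
)
results = Survey(questions=[q]).by(pages).run()
```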
Content Analysis Pipeline
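A sketch of a search-then-analyze pipeline under the same assumptions as above; the query, question, and use of Results.select are illustrative:

```python
from edsl import QuestionMultipleChoice, Survey
from edsl.scenarios.firecrawl_scenario import search_web

# 1. Gather content via web search
articles = search_web("electric vehicle battery recycling", limit=10)

# 2. Classify each result with an EDSL question
q = QuestionMultipleChoice(
    question_name="stance",
    question_text="What is the overall stance of this article? {{ content }}",
    question_options=["positive", "negative", "neutral"],
)
results = Survey(questions=[q]).by(articles).run()

# 3. Inspect the results
results.select("stance").print()
```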
Distributed Processing
For large-scale operations, use the request/response pattern for distributed processing:
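A sketch of the serialize, ship, and execute flow using the request helpers documented in the API Reference; the coordinator/worker split is illustrative:

```python
from edsl.scenarios.firecrawl_scenario import (
    create_crawl_request,
    create_scrape_request,
    execute_request,
)

# Coordinator: build plain-dict requests without executing them
requests = [
    create_scrape_request("https://example.com/a"),
    create_scrape_request("https://example.com/b"),
    create_crawl_request("https://docs.example.com", limit=25),
]

# Worker(s): reconstitute and execute each request, returning Scenario/ScenarioList objects
results = [execute_request(r) for r in requests]
```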
Best Practices
Performance Optimization
- Use appropriate concurrency limits - Start with defaults and adjust based on your needs and rate limits
- Filter early - Use include/exclude paths in crawling to avoid unnecessary requests
- Choose optimal formats - Only request formats you actually need
- Batch operations - Process multiple URLs/queries together when possible
Rate Limiting
- Respect rate limits - Firecrawl has rate limits based on your plan
- Adjust concurrency - Lower max_concurrent values if you hit rate limits
- Monitor costs - Each request consumes Firecrawl credits
Content Quality
- Use only_main_content=True - Filters out navigation, ads, and other noise
- Specify include/exclude tags - Fine-tune content extraction
- Choose appropriate formats - Markdown for text analysis, HTML for detailed parsing
Error Resilience
- Check status fields - Always check scrape_status, search_status, etc.
- Handle partial failures - Some URLs in a batch may fail while others succeed
- Implement retries - For critical operations, implement retry logic for failed requests
Cost Management
- Use URL mapping first - Discover URLs with map_website_urls before full crawling
- Set reasonable limits - Use limit and max_depth parameters to control crawl scope
- Cache results - Store results locally to avoid re-scraping the same content
Troubleshooting
Common Issues
"FIRECRAWL_API_KEY not found" error:
- Ensure your API key is set as an environment variable or in a .env file
- Verify the key is valid and active at https://firecrawl.dev
Rate limit errors:
- Reduce the max_concurrent parameter
- Check your Firecrawl plan limits
- Implement delays between requests if needed
Missing or incomplete content:
- Try only_main_content=False for more content
- Adjust include_tags/exclude_tags parameters
- Some sites may have anti-scraping measures
Slow performance:
- Increase max_concurrent for batch operations
- Use URL mapping to discover URLs faster than full crawling
- Consider using search instead of crawling for content discovery
Memory issues with large crawls:
- Use the limit parameter to control crawl size
- Process results in batches rather than storing everything in memory
- Consider using the request/response pattern for distributed processing
Getting Help
- Check the Firecrawl documentation for API-specific issues
- Review EDSL documentation for Scenario and ScenarioList usage
- Ensure your firecrawl-py package is up to date:
pip install --upgrade firecrawl-py
API Reference
Convenience Functions
edsl.scenarios.firecrawl_scenario.scrape_url(url_or_urls: str | List[str], **kwargs)
Convenience function to scrape a single URL or multiple URLs.
edsl.scenarios.firecrawl_scenario.crawl_website(url: str, **kwargs)
Convenience function to crawl a website.
edsl.scenarios.firecrawl_scenario.search_web(query_or_queries: str | List[str], **kwargs)
Convenience function to search the web with a single query or multiple queries.
edsl.scenarios.firecrawl_scenario.map_website_urls(url: str, **kwargs)
Convenience function to map website URLs.
edsl.scenarios.firecrawl_scenario.extract_data(url_or_urls: str | List[str], **kwargs)
Convenience function to extract structured data from a single URL or multiple URLs.
Classes
class edsl.scenarios.firecrawl_scenario.FirecrawlScenario(api_key: str | None = None)
EDSL integration for Firecrawl web scraping and data extraction. This class provides methods to use all Firecrawl features and return results as EDSL Scenario and ScenarioList objects.
crawl(url: str, limit: int | None = 10, max_depth: int | None = 3, include_paths: List[str] | None = None, exclude_paths: List[str] | None = None, formats: List[str] | None = None, only_main_content: bool = True, scrape_options: Dict[str, Any] | None = None, return_credits: bool = False, **kwargs)
Crawl a website and return a ScenarioList with all pages.
Args:
url: Base URL to crawl
limit: Maximum number of pages to crawl
max_depth: Maximum crawl depth (now max_discovery_depth)
include_paths: URL patterns to include
exclude_paths: URL patterns to exclude
formats: List of formats to return for each page
only_main_content: Whether to extract only main content
**kwargs: Additional parameters passed to Firecrawl
Returns:
ScenarioList containing scenarios for each crawled page
extract(url_or_urls: str | List[str], schema: Dict[str, Any] | None = None, prompt: str | None = None, max_concurrent: int = 5, limit: int | None = None, scrape_options: Dict[str, Any] | None = None, return_credits: bool = False, **kwargs)
Smart extract method that handles both single URLs and batches.
Args:
url_or_urls: Single URL string or list of URLs
schema: JSON schema for structured extraction
prompt: Natural language prompt for extraction
max_concurrent: Maximum concurrent requests for batch processing
**kwargs: Additional parameters passed to extract method
Returns:
Scenario object for single URL, ScenarioList for multiple URLs, or (object, credits) if return_credits=True
classmethod from_request(request_dict: Dict[str, Any])
Execute a serialized Firecrawl request and return results. This method allows for distributed processing where requests are serialized, sent to an API, reconstituted here, executed, and results returned.
Args:
request_dict: Dictionary containing the serialized request
Returns:
Scenario or ScenarioList depending on the method and input
map_urls(url: str, limit: int | None = None, return_credits: bool = False, **kwargs)
Get all URLs from a website (fast URL discovery).
Args:
url: Website URL to map
limit: Maximum number of URLs to discover and map. If None, discovers all available linked URLs. Defaults to None.
**kwargs: Additional parameters passed to Firecrawl
Returns:
ScenarioList containing scenarios for each discovered URL
search(query_or_queries: str | List[str], max_concurrent: int = 5, limit: int | None = None, sources: List[str] | None = None, location: str | None = None, scrape_options: Dict[str, Any] | None = None, return_credits: bool = False, **kwargs)
Smart search method that handles both single queries and batches.
Args:
query_or_queries: Single query string or list of queries
max_concurrent: Maximum concurrent requests for batch processing
**kwargs: Additional parameters passed to search method
Returns:
ScenarioList for both single and multiple queries, or (ScenarioList, credits) if return_credits=True
class edsl.scenarios.firecrawl_scenario.FirecrawlRequest(api_key: str | None = None)
Firecrawl request class that can operate in two modes:
- With API key present: Automatically executes requests via FirecrawlScenario and returns results
- Without API key: Can be used as a descriptor/placeholder, raises exception when methods are called
crawl(url: str, limit: int | None = 10, max_depth: int | None = 3, include_paths: List[str] | None = None, exclude_paths: List[str] | None = None, formats: List[str] | None = None, only_main_content: bool = True, **kwargs) → Dict[str, Any]
Crawl an entire website and extract content from all discovered pages. This method performs comprehensive website crawling, discovering and scraping content from multiple pages within a website. When an API key is present, it executes the crawl and returns a ScenarioList with all pages. Without an API key, it raises an exception.
Args:
url: Base URL to start crawling from. Should be a valid HTTP/HTTPS URL. The crawler will discover and follow links from this starting point.
limit: Maximum number of pages to crawl. If None, crawls all discoverable pages (subject to other constraints). Use this to control crawl scope.
max_depth: Maximum crawl depth from the starting URL. Depth 0 is just the starting page, depth 1 includes pages directly linked from the start, etc. If None, no depth limit is imposed.
include_paths: List of URL path patterns to include in the crawl. Only URLs matching these patterns will be crawled. Supports wildcard patterns.
exclude_paths: List of URL path patterns to exclude from the crawl. URLs matching these patterns will be skipped. Applied after include_paths.
formats: List of output formats for each crawled page (e.g., ["markdown", "html"]). Defaults to ["markdown"] if not specified.
only_main_content: Whether to extract only main content from each page, skipping navigation, ads, footers, etc. Defaults to True.
**kwargs: Additional crawling parameters passed to the Firecrawl API.
Returns:
When API key is present: ScenarioList containing Scenario objects for each crawled page. Each scenario includes the page content, URL, title, and metadata. When API key is missing: Raises ValueError
Raises:
ValueError: If FIRECRAWL_API_KEY is not found in environment, parameters, or .env file.
Examples:
Basic website crawl, limited crawl with constraints, and full-content crawl with multiple formats:
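A hedged sketch covering these three cases (URLs and values are illustrative):

```python
from edsl.scenarios.firecrawl_scenario import FirecrawlRequest

fc = FirecrawlRequest()  # uses FIRECRAWL_API_KEY from the environment

# Basic website crawl
pages = fc.crawl("https://example.com")

# Limited crawl with constraints
docs = fc.crawl(
    "https://docs.example.com",
    limit=25,
    max_depth=2,
    include_paths=["/docs/*"],
)

# Full content crawl with multiple formats
full = fc.crawl("https://example.com", formats=["markdown", "html"], only_main_content=False)
```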
extract(url_or_urls: str | List[str], schema: Dict[str, Any] | None = None, prompt: str | None = None, max_concurrent: int = 5, limit: int | None = None, **kwargs) → Dict[str, Any]
Extract structured data from web pages using AI-powered analysis. This method uses AI to extract specific information from web pages based on either a JSON schema or natural language prompt. When an API key is present, it executes the extraction and returns structured data. Without an API key, it raises an exception.
Args:
url_or_urls: Single URL string or list of URLs to extract data from. Each URL should be a valid HTTP/HTTPS URL.
schema: JSON schema defining the structure of data to extract. Should be a dictionary with field names and their types/descriptions. Takes precedence over prompt if both are provided.
prompt: Natural language description of what data to extract. Used when schema is not provided. Should be clear and specific.
max_concurrent: Maximum number of concurrent requests when processing multiple URLs. Only applies when url_or_urls is a list. Defaults to 5.
limit: Maximum number of URLs to extract data from when url_or_urls is a list. If None, extracts from all provided URLs. Defaults to None.
**kwargs: Additional extraction parameters: formats: List of output formats for the scraped content before extraction (e.g., ["markdown", "html"]).
Returns:
When API key is present:
- For single URL: Scenario object containing extracted structured data
- For multiple URLs: ScenarioList containing Scenario objects
Raises:
ValueError: If FIRECRAWL_API_KEY is not found in environment, parameters, or .env file.
Examples:
Schema-based extraction, prompt-based extraction, and multiple URLs with a schema:
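A hedged sketch covering these cases (URLs, schema, and prompt are illustrative):

```python
from edsl.scenarios.firecrawl_scenario import FirecrawlRequest

fc = FirecrawlRequest()
schema = {"type": "object", "properties": {"name": {"type": "string"}, "price": {"type": "number"}}}

# Schema-based extraction
item = fc.extract("https://example.com/product/1", schema=schema)

# Prompt-based extraction
item = fc.extract("https://example.com/product/1", prompt="Extract the product name and price")

# Multiple URLs with a schema
items = fc.extract(
    ["https://example.com/product/1", "https://example.com/product/2"],
    schema=schema,
    max_concurrent=2,
)
```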
map_urls(url: str, limit: int | None = None, **kwargs) → Dict[str, Any]
Discover and map all URLs from a website without scraping content. This method performs fast URL discovery to map the structure of a website without downloading and processing the full content of each page. When an API key is present, it executes the mapping and returns discovered URLs. Without an API key, it raises an exception.
Args:
url: Base URL to discover links from. Should be a valid HTTP/HTTPS URL. The mapper will analyze this page and discover all linked URLs.
limit: Maximum number of URLs to discover and map. If None, discovers all available linked URLs. Defaults to None.
**kwargs: Additional mapping parameters passed to the Firecrawl API. Common options may include depth limits or filtering criteria.
Returns:
When API key is present: ScenarioList containing Scenario objects for each discovered URL. Each scenario includes the discovered URL, source URL, and any available metadata (title, description) without full content. When API key is missing: Raises ValueError
Raises:
ValueError: If FIRECRAWL_API_KEY is not found in environment, parameters, or .env file.
Examples:
Basic URL mapping, website structure analysis, and link discovery for targeted crawling:
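A hedged sketch covering these cases (URLs and path filters are illustrative; field access assumes the documented discovered_url field):

```python
from edsl.scenarios.firecrawl_scenario import FirecrawlRequest

fc = FirecrawlRequest()

# Basic URL mapping
url_map = fc.map_urls("https://example.com")

# Website structure analysis: see which sections the discovered URLs fall under
discovered = [s["discovered_url"] for s in url_map]
blog_urls = [u for u in discovered if "/blog/" in u]
doc_urls = [u for u in discovered if "/docs/" in u]

# Link discovery for targeted crawling: scrape only the pages you actually need
pages = fc.scrape(doc_urls[:20], max_concurrent=10)
```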
scrape(url_or_urls: str | List[str], max_concurrent: int = 10, limit: int | None = None, **kwargs) → Dict[str, Any]
Scrape content from one or more URLs using Firecrawl. This method extracts clean, structured content from web pages. When an API key is present, it automatically executes the request and returns Scenario objects. Without an API key, it raises an exception when called.
Args:
url_or_urls: Single URL string or list of URLs to scrape. Each URL should be a valid HTTP/HTTPS URL.
max_concurrent: Maximum number of concurrent requests when scraping multiple URLs. Only applies when url_or_urls is a list. Defaults to 10.
limit: Maximum number of URLs to scrape when url_or_urls is a list. If None, scrapes all provided URLs. Defaults to None.
**kwargs: Additional scraping parameters:
formats: List of output formats (e.g., ["markdown", "html", "links"]). Defaults to ["markdown"].
only_main_content: Whether to extract only main content, skipping navigation, ads, etc. Defaults to True.
include_tags: List of HTML tags to specifically include in extraction.
exclude_tags: List of HTML tags to exclude from extraction.
headers: Custom HTTP headers as a dictionary.
wait_for: Time to wait before scraping in milliseconds.
timeout: Request timeout in milliseconds.
actions: List of browser actions to perform before scraping (e.g., clicking, scrolling).
Returns:
When API key is present:
- For single URL: Scenario object containing scraped content
- For multiple URLs: ScenarioList containing Scenario objects
Raises:
ValueError: If FIRECRAWL_API_KEY is not found in environment, parameters, or .env file.
Examples:
Basic scraping, multiple URLs with custom options, and the descriptor pattern (no API key):
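A hedged sketch covering these cases (URLs and options are illustrative):

```python
from edsl.scenarios.firecrawl_scenario import FirecrawlRequest

# Basic scraping (API key read from the environment)
fc = FirecrawlRequest()
page = fc.scrape("https://example.com")

# Multiple URLs with custom options
pages = fc.scrape(
    ["https://example.com/a", "https://example.com/b"],
    max_concurrent=5,
    formats=["markdown", "links"],
    exclude_tags=["nav", "footer"],
)

# Descriptor pattern: with no API key available anywhere, the object can still be
# constructed and passed around; calling a method then raises ValueError.
placeholder = FirecrawlRequest()
```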
search(query_or_queries: str | List[str], max_concurrent: int = 5, limit: int | None = None, **kwargs) → Dict[str, Any]
Search the web and extract content from results using Firecrawl. This method performs web searches and automatically scrapes content from the search results. When an API key is present, it executes the search and returns ScenarioList objects with the results. Without an API key, it raises an exception.
Args:
query_or_queries: Single search query string or list of queries to search for. Each query should be a natural language search term.
max_concurrent: Maximum number of concurrent requests when processing multiple queries. Only applies when query_or_queries is a list. Defaults to 5.
limit: Maximum number of search results to return per query. If None, returns all available results. Defaults to None.
**kwargs: Additional search parameters:
sources: List of sources to search (e.g., ["web", "news", "images"]). Defaults to ["web"].
location: Geographic location for localized search results.
formats: List of output formats for scraped content from results (e.g., ["markdown", "html"]).
Returns:
When API key is present: ScenarioList containing Scenario objects for each search result. Each scenario includes search metadata (query, position, result type) and scraped content from the result URL. When API key is missing: Raises ValueError
Raises:
ValueError: If FIRECRAWL_API_KEY is not found in environment, parameters, or .env file.
Examples:
Basic web search:
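A hedged sketch (the query and field access are illustrative, using the documented result fields):

```python
from edsl.scenarios.firecrawl_scenario import FirecrawlRequest

fc = FirecrawlRequest()
results = fc.search("open source web scraping tools", limit=5, sources=["web"])
for s in results:
    print(s["position"], s["url"], s["title"])
```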
Request Creation Functions
edsl.scenarios.firecrawl_scenario.create_scrape_request(url_or_urls: str | List[str], api_key: str | None = None, **kwargs) → Dict[str, Any]
Create a serializable scrape request.
edsl.scenarios.firecrawl_scenario.create_search_request(query_or_queries: str | List[str], api_key: str | None = None, **kwargs) → Dict[str, Any]
Create a serializable search request.
edsl.scenarios.firecrawl_scenario.create_extract_request(url_or_urls: str | List[str], api_key: str | None = None, **kwargs) → Dict[str, Any]
Create a serializable extract request.
edsl.scenarios.firecrawl_scenario.create_crawl_request(url: str, api_key: str | None = None, **kwargs) → Dict[str, Any]
Create a serializable crawl request.
edsl.scenarios.firecrawl_scenario.create_map_request(url: str, api_key: str | None = None, **kwargs) → Dict[str, Any]
Create a serializable map_urls request.
edsl.scenarios.firecrawl_scenario.execute_request(request_dict: Dict[str, Any])
Execute a serialized Firecrawl request and return results.