FAQ - Scraping Browser

Q: What is Scraping Browser and how can I best use it for collecting data?

A: Scraping Browser is one of our proxy-unlocking solutions, designed to let you focus on browser-based data collection while we take care of the full proxy and unblocking infrastructure for you.

It is a multi-stage unlocking product, where you can navigate target websites via libraries such as puppeteer, playwright, and selenium and interact with the site's HTML code to extract the data you need.

Check out our Getting Started guide to see how simple it is to create a Scraping Browser, integrate it into your code, and then explore some common browser functions and examples that will help with your precise data collection needs.

Q: Which coding languages does Scraping Browser support?

A: Bright Data's Scraping Browser supports a wide range of programming languages and libraries. We currently have full native support for Node.js and Python using puppeteer, playwright, and selenium. Other languages can be integrated as well using the libraries below, giving you the flexibility to fit Scraping Browser right into your current tech stack.

| Language / Platform | puppeteer | playwright | selenium |
| --- | --- | --- | --- |
| Python | pyppeteer | playwright-python | Selenium WebDriver |
| JS / Node | Native support | Native support | WebDriverJS |
| Ruby | Puppeteer-Ruby | playwright-ruby-client | Selenium WebDriver for Ruby |
| C# (.NET) | Puppeteer Sharp | Playwright for .NET | Selenium WebDriver for .NET |
| Java | Puppeteer Java | Playwright for Java | Native support |
| Go | chromedp | playwright-go | Selenium WebDriver for Go |
Learn more about getting started with Bright Data's Scraping Browser and check out some real integration examples in Node.js and Python.

Q: How can I debug what's happening behind the scenes during my Scraping Browser session?

A: We understand the importance of having visibility into the inner workings of your Scraping Browser sessions. To assist you in this process, we've created the Scraping Browser Debugger, which is a powerful built-in debugger that seamlessly integrates with Chrome Dev Tools and gives you visibility into your live browser sessions.

The Scraping Browser Debugger allows you to thoroughly inspect, analyze, and optimize your code, giving you better control, visibility, and efficiency within your Scraping Browser sessions. To learn more about accessing and using the debugger, please refer to our comprehensive debugger guide.

Q: How can I see a visual of what's happening in the browser?

A1: Triggering a screenshot

You can easily trigger a screenshot of the browser at any time by adding the following to your code:

// node.js puppeteer - Taking screenshot to file screenshot.png 
await page.screenshot({ path: 'screenshot.png', fullPage: true });

To take screenshots in Python and C#, see here.

A2: Automatically opening devtools to view your live browser session

See our full section on opening devtools automatically.

Q: Why does the initial navigation for certain pages take longer than others?

A: There is a lot of “behind the scenes” work that goes into unlocking your targeted site. Some sites take just a few seconds to navigate, while others may take up to a minute or two, as they require more complex unlocking procedures. We therefore recommend setting your navigation timeout to 2 minutes to give the navigation enough time to succeed if needed.

You can set your navigation timeout to 2 minutes by adding the following line in your script before your “page.goto” call.

// node.js puppeteer - Navigate to site with 2 min timeout
await page.goto('https://example.com', { timeout: 2*60*1000 });
# python playwright - Navigate to site with 2 min timeout 
page.goto('https://example.com', timeout=2*60*1000)
// C# PuppeteerSharp - Navigate to site with 2 min timeout 
await page.GoToAsync("https://example.com", new NavigationOptions()
{
    Timeout = 2*60*1000,
});

Q: Why does it seem that sometimes the bandwidth billed for my scraping session is more than if I scraped the page myself? For instance, I fetched the page myself and it came to a total of 200 KB - why was I charged for 500 KB on the same page?

A: Our scraping browsers download a lot of resources (JS/CSS/images/etc.) during page load and run a number of processes behind the scenes in order to unlock the pages you navigate to. The traffic billed by Scraping Browser is the total traffic it takes to unlock your page at the given time.
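As an illustration of the arithmetic, the figures below are invented to mirror the question: the HTML document alone is 200 KB, but the total traffic needed to load and unlock the page is larger:

```javascript
// Illustrative only: hypothetical per-resource transfer sizes (in KB) for one page load.
// The HTML document itself may be small, but unlocking the page pulls in far more traffic.
const resources = [
  ['document', 200],   // the page you "see" when saving it yourself
  ['script', 150],
  ['stylesheet', 60],
  ['image', 70],
  ['xhr', 20],
];

// Sum transferred bytes per resource type
const totals = {};
for (const [type, kb] of resources) {
  totals[type] = (totals[type] || 0) + kb;
}

const totalKb = Object.values(totals).reduce((a, b) => a + b, 0);
console.log(totalKb); // 500 — the billed traffic, versus the 200 KB document alone
```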

Q: I came across an error code while using Scraping Browser. Can you list the error codes and the meanings behind them?

| Error code | Meaning | What can you do about it? |
| --- | --- | --- |
| Unexpected server response: 407 | An issue with the remote browser's port | Check your remote browser's port; the correct port for Scraping Browser is 9222. |
| Unexpected server response: 403 | Authentication error | Check your authentication credentials (username, password) and check that you are using the correct "Browser API" zone from the Bright Data control panel. |
| Unexpected server response: 503 | Service unavailable | We are likely scaling browsers right now to meet demand. Try to reconnect in 1 minute. |

Q: I can’t seem to connect, do I have a connection issue?

A: You can check your connection with the following curl command:

curl -v -u USER:PASS https://brd.superproxy.io:9222/json/protocol

For any other issues please see our Troubleshooting guide or contact support.

Q: What are some tips for reducing bandwidth while scraping? 

A: When optimizing your web scraping projects, conserving bandwidth is key. Explore the tips and guidelines below on effective bandwidth-saving techniques you can use within your script to ensure efficient, resource-friendly scraping.

  1. Avoid unnecessary media content during scraping

A typical inefficiency in browser-based scraping is unnecessarily downloading media content, such as images and videos, from your targeted domains. Learn below how to easily avoid this by excluding them right from within your script.

Given that anti-bot systems expect specific resources to load on particular domains, approach resource blocking cautiously, as it can directly impact Scraping Browser's ability to successfully load your target domains. If you encounter issues after applying resource blocks, first revert your blocking logic and confirm that the issue still persists before contacting our support team.

Puppeteer:

  • Block All Images:
  const page = await browser.newPage();

  // Enable request interception
  await page.setRequestInterception(true);

  // Listen for requests
  page.on('request', (request) => {
    if (request.resourceType() === 'image') {
      // If the request is for an image, block it
      request.abort();
    } else {
      // If it's not an image request, allow it to continue
      request.continue();
    }
  });
  • Block Specific Image Formats:
  const page = await browser.newPage();

  // Enable request interception
  await page.setRequestInterception(true);

  // Listen for requests
  page.on('request', (interceptedRequest) => {

    // Check if the request URL ends with '.png' or '.jpg'
    if (
      interceptedRequest.url().endsWith('.png') ||
      interceptedRequest.url().endsWith('.jpg')
    ) {

      // If the request is for a PNG or JPG image, block it
      interceptedRequest.abort();
    } else {
      // If it's not a PNG or JPG image request, allow it to continue
      interceptedRequest.continue();
    }
  });
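Note that `endsWith` checks like the ones above miss image URLs that carry a query string or fragment (e.g. the hypothetical `photo.png?v=2`). One way to make the match more robust is to check the URL's path rather than the full URL; a minimal sketch:

```javascript
// Illustrative helper: decide whether a URL points at a blocked image format,
// even when the URL carries a query string or fragment.
const BLOCKED_EXTENSIONS = ['.png', '.jpg', '.jpeg', '.gif'];

function isBlockedImageUrl(url) {
  // Strip query string and fragment by checking only the path component
  const path = new URL(url).pathname.toLowerCase();
  return BLOCKED_EXTENSIONS.some((ext) => path.endsWith(ext));
}

console.log(isBlockedImageUrl('https://example.com/photo.png?v=2')); // true
console.log(isBlockedImageUrl('https://example.com/page.html'));     // false
```

Inside the request handler, this becomes `if (isBlockedImageUrl(interceptedRequest.url())) interceptedRequest.abort(); else interceptedRequest.continue();`.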

Playwright:

  • Block specific resource types such as images and fonts:
  const context = await browser.newContext();
  const page = await context.newPage();

  // Route all requests and abort those for images and fonts
  await page.route('**/*', (route) => {
    const type = route.request().resourceType();
    if (type === 'image' || type === 'font') {
      route.abort();
    } else {
      route.continue();
    }
  });

  // Navigate to a webpage
  await page.goto('https://example.com');

Selenium:

  • Use browser preferences to disable images and other media content. NOTE: launch-time preferences like these may not be applied on remote Scraping Browser sessions; in that case, use the CDP workaround below.
# Set the preference to not load images
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_options.add_experimental_option("prefs", prefs)

# Create a new Chrome browser instance with the defined options
driver = webdriver.Chrome(options=chrome_options)

driver.get('https://example.com')

Selenium (workaround):

  • Use CDP command to block image URLs (requires inspection, trial and error):
with Remote(sbr_connection, options=ChromeOptions()) as driver:
    driver.execute('executeCdpCommand', {
        'cmd': 'Network.setBlockedURLs',
        'params': {
            'urls': ['*.jpg*', '*.jpeg*', '*.png*', '*.gif*', '*data:image*'],
        },
    })
    driver.get('https://example.com')
  • Example in Java:
var exec = new RemoteExecuteMethod(driver);
exec.execute("executeCdpCommand", Map.of(
    "cmd", "Network.setBlockedURLs",
    "params", Map.of(
        "urls", new String[] {"*.jpg*", "*.jpeg*", "*.png*", "*.gif*"}
    )
));
driver.navigate().to(url);

  2. Intercepting API requests

In many cases, API requests are only accessible when initiated from the browser.

With Scraping Browser, it is possible to intercept API requests and save their responses to a file.

See an example of this below using Puppeteer:

const fs = require('fs');

const page = await browser.newPage();

// Add the 'page.on' event listener for responses
page.on('response', async (response) => {
  const request = response.request();
  // Skip CORS preflight responses (OPTIONS requests answered with 204)
  const isPreflight = request.method() === 'OPTIONS' && response.status() === 204;

  if (!isPreflight && request.url().includes('mtop.global.detail.web.getdetailinfo')) {
    try {
      const text = await response.text();
      fs.writeFileSync('response.txt', text); // Write response to a file
    } catch (error) {
      console.error('Error accessing response body:', error);
    }
  }
});
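The preflight check in the listener above (an OPTIONS request answered with 204 No Content) can be factored into a small predicate and exercised on its own; an illustrative sketch, where the URLs are made up:

```javascript
// A CORS preflight is an OPTIONS request answered with 204 No Content.
// Skipping these avoids writing empty preflight responses to the output file.
function isPreflight(method, status) {
  return method === 'OPTIONS' && status === 204;
}

function shouldSave(method, status, url, targetFragment) {
  // Save only real (non-preflight) responses whose URL matches the target API
  return !isPreflight(method, status) && url.includes(targetFragment);
}

const api = 'https://api.example.com/mtop.global.detail.web.getdetailinfo/1.0/';
console.log(shouldSave('GET', 200, api, 'mtop.global.detail.web.getdetailinfo'));     // true
console.log(shouldSave('OPTIONS', 204, api, 'mtop.global.detail.web.getdetailinfo')); // false
```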

 

  3. Effectively using cached pages

One common inefficiency in scraping jobs is the repeated downloading of the same page during a single session. 

Leveraging cached pages - a version of a previously scraped page - can significantly increase your scraping efficiency, as it can be used to avoid repeated network requests to the same domain. Not only does it save on bandwidth by avoiding redundant fetches, but it also ensures faster and more responsive interactions with the preloaded content.

Please note: A single Scraping Browser session can persist for up to 20 minutes. This duration allows you ample opportunity to revisit and re-navigate the page as needed within the same session, eliminating the need for redundant sessions on identical pages during your scraping job.

Let’s see an example:
In a multi-step web scraping workflow, you often gather links from a page and then dive into each link for more detailed data extraction. You’ll often need to revisit the initial page for cross-referencing or validation. By leveraging caching, these revisits don't trigger new network requests, as the data is simply loaded from the cache.

See an example of this below using puppeteer:

const puppeteer = require('puppeteer-core');
const AUTH = 'USER:PASS';
const SBR_WS_ENDPOINT = `wss://${AUTH}@brd.superproxy.io:9222`;

async function main() {
    console.log('Connecting to Scraping Browser...');
    const browser = await puppeteer.connect({
        browserWSEndpoint: SBR_WS_ENDPOINT,
    });
    try {
        console.log('Connected! Navigating...');
        const page = await browser.newPage();
        await page.goto('https://example.com', { timeout: 2 * 60 * 1000 });

        // Extract product links from the listing page
        const productLinks = await page.$$eval('.product-link', links => links.map(link => link.href));
        const productDetails = [];

        // Navigate to each individual product page
        for (const link of productLinks) {
            await page.goto(link);

            // Extract the product's name
            const productName = await page.$eval('.product-name', el => el.textContent);

            // Apply a coupon (assuming it doesn't navigate away)
            await page.click('.apply-coupon-button');

            // Extract the discounted price from the cached product detail page
            const productPrice = await page.$eval('.product-price', el => el.textContent);

            // Store product details
            productDetails.push({ productName, productPrice });
        }
    } finally {
        await browser.close();
    }
}

main().catch(err => {
    console.error(err.stack || err);
    process.exit(1);
});

  4. Other general strategies to minimize bandwidth and ensure efficient scraping

  • Limit Your Requests: Only scrape what you need, rather than downloading entire webpages or sites.
  • Concurrency Control: Limit the number of concurrent pages or browsers you open. Too many parallel processes can exhaust resources.
  • Session Management: Ensure you properly manage and close sessions after scraping. This prevents resource and memory leaks.
  • Opt for APIs: If the target website offers an API, use it instead of direct scraping. APIs are typically more efficient and less bandwidth-intensive than scraping full web pages.
  • Fetch Incremental Data: If scraping periodically, try to fetch only new or updated data rather than re-fetching everything.
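The incremental-fetching idea in the last bullet can be as simple as keeping a record of item IDs you have already scraped and visiting only new ones; a minimal sketch with made-up IDs:

```javascript
// Illustrative sketch: remember item IDs fetched in earlier runs
// (in practice, persist this set to a file or database between runs)
const previouslySeen = new Set(['item-1', 'item-2']);

// Pretend this list came from a fresh scrape of the listing page
const currentListing = ['item-1', 'item-2', 'item-3', 'item-4'];

// Only new items need a full (bandwidth-costing) page visit
const toFetch = currentListing.filter((id) => !previouslySeen.has(id));
console.log(toFetch); // ['item-3', 'item-4']

// After fetching, record the new IDs so the next run skips them too
toFetch.forEach((id) => previouslySeen.add(id));
```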
