How to Build a Website Scraper with Puppeteer and Firebase Functions
This tutorial explains how to create a web scraper with Puppeteer and deploy it on the web with Firebase Functions.
Let's build a simple website scraper that downloads a web page and extracts content from it. For this example, we will use the New York Times website as the source. The scraper will extract the top 10 news headlines on the page and display them on a web page. The scraping is done with the Puppeteer headless browser, and the application is deployed as a Firebase function.
1. Initialize a Firebase Function
Assuming that you have already created a Firebase project, you can initialize the Firebase functions in a local environment by running the following command:
mkdir scraper
cd scraper
npx firebase init functions
cd functions
npm install puppeteer
Follow the prompts to initialize the project. We also install the Puppeteer package from npm, which downloads the headless Chromium browser used for scraping.
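When npm install puppeteer finishes, it downloads a Chromium build compatible with the installed Puppeteer version. If you want to confirm that the browser launches on your machine before writing the scraper, a quick sanity check like the one below works (the verify.js file name is only for illustration):
// verify.js - quick check that Puppeteer's bundled Chromium launches
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  console.log(await browser.version()); // prints the Chromium version string
  await browser.close();
})();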
2. Create a Node.js Application
Create a new pptr.js file in the functions folder that will contain the application code for scraping the page. The script downloads only the HTML content of the page and blocks all scripts, stylesheets, images, media and fonts to reduce the time it takes to load.
We use a CSS selector to pick out the headlines, which are wrapped in h3 tags inside section.story-wrapper elements. You can use Chrome Dev Tools to find the right selector for the headlines.
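To preview what the selector will match before running the scraper, you can paste a snippet like this into the Chrome Dev Tools console on nytimes.com (the site's markup changes over time, so adjust the selector if it returns an empty list):
// Run in the Dev Tools console to preview the headlines the selector matches
Array.from(document.querySelectorAll('section.story-wrapper h3'))
  .slice(0, 10)
  .map((h3) => h3.innerText);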
const puppeteer = require('puppeteer');

const scrapeWebsite = async () => {
  let stories = [];
  const browser = await puppeteer.launch({
    headless: true,
    timeout: 20000,
    ignoreHTTPSErrors: true,
    slowMo: 0,
    args: [
      '--disable-gpu',
      '--disable-dev-shm-usage',
      '--disable-setuid-sandbox',
      '--no-first-run',
      '--no-sandbox',
      '--no-zygote',
      '--window-size=1280,720',
    ],
  });

  try {
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 720 });

    // Block scripts, stylesheets, images, media and fonts from downloading
    await page.setRequestInterception(true);
    page.on('request', (interceptedRequest) => {
      const blockResources = ['script', 'stylesheet', 'image', 'media', 'font'];
      if (blockResources.includes(interceptedRequest.resourceType())) {
        interceptedRequest.abort();
      } else {
        interceptedRequest.continue();
      }
    });

    // Change the user agent of the scraper
    await page.setUserAgent(
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'
    );

    await page.goto('https://www.nytimes.com/', {
      waitUntil: 'domcontentloaded',
    });

    // Grab the headline text; only keep the top 10 headlines
    const storySelector = 'section.story-wrapper h3';
    stories = await page.$$eval(storySelector, (headlines) =>
      headlines.slice(0, 10).map((h3, index) => `${index + 1}. ${h3.innerText}`)
    );
  } catch (error) {
    console.log(error);
  } finally {
    if (browser) {
      await browser.close();
    }
  }
  return stories;
};

module.exports = scrapeWebsite;
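Before wiring this into Firebase, you can run the scraper on its own with a throwaway script like the one below, executed from the functions folder (the test.js file name is only for illustration):
// test.js - run the scraper standalone: node test.js
const scrapeWebsite = require('./pptr');

scrapeWebsite().then((stories) => {
  console.log(stories.join('\n'));
});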
3. Write the Firebase Function
Inside the index.js file, import the scraper function and export it as an HTTP-triggered Firebase function. We also write a scheduled function that runs every day and calls the scraper.
It is important to increase the function's memory and timeout limits, as Chrome with Puppeteer is resource-intensive.
// index.js
const functions = require('firebase-functions');
const scrapeWebsite = require('./pptr');

exports.scrape = functions
  .runWith({
    timeoutSeconds: 120,
    memory: '2GB', // Puppeteer needs more than the default 256MB
  })
  .region('us-central1')
  .https.onRequest(async (req, res) => {
    const stories = await scrapeWebsite();
    res.type('html').send(stories.join('<br>'));
  });

exports.scrapingSchedule = functions.pubsub
  .schedule('every day 09:00')
  .timeZone('America/New_York')
  .onRun(async (context) => {
    const stories = await scrapeWebsite();
    console.log('The NYT headlines are scraped every day at 9 AM EST', stories);
    return null;
  });
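Cloud Scheduler also understands unix-cron syntax, so if you prefer cron, a hypothetical equivalent of the schedule above looks like this (same imports as index.js; the scrapingScheduleCron name is only for illustration):
// Alternative: the same daily schedule expressed in unix-cron syntax
exports.scrapingScheduleCron = functions.pubsub
  .schedule('0 9 * * *') // minute 0, hour 9, every day
  .timeZone('America/New_York')
  .onRun(async () => {
    const stories = await scrapeWebsite();
    console.log(stories);
    return null;
  });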
4. Deploy the Function
If you wish to test the function locally, run the npm run serve command and navigate to the function's endpoint on localhost. When you are ready to deploy the function to the cloud, run npm run deploy.
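The emulator prints the local URL when it starts; it typically follows the pattern below, with your own Firebase project ID in the path:
npm run serve
# then, in another terminal (replace <project-id> with your Firebase project ID)
curl http://localhost:5001/<project-id>/us-central1/scrape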
5. Test the Scheduled Function
If you would like to test the scheduled function locally, run npm run shell to open an interactive shell where you can invoke functions manually with test data. Type the function name scrapingSchedule() and hit Enter to get the function's output.
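The session looks roughly like this; the shell executes the function and prints whatever it logs:
npm run shell
firebase > scrapingSchedule()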