Web Scraping with NodeJS and Puppeteer

In this article we will explore web scraping with NodeJS and Puppeteer, automating a browser with JavaScript to scrape a web page.

Let’s assume we want to extract some information from the web. For a small amount we can simply copy it from the website by hand. But what if we want to extract a large amount of data as quickly as possible? In that case we need automation, and this is where web scraping comes into play. Unlike the tedious manual task of copying and pasting, it can collect thousands or even millions of records in far less time.

What is Web Scraping?

Web scraping is the process of extracting data from a website. The technique mostly focuses on transforming unstructured data on the web (HTML) into structured data (a database or spreadsheet). Many large websites provide an API that gives us access to their data. For sites that don’t have an API, we can use web scraping instead.
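
To make "structured data" a little more concrete, here is a minimal sketch that turns a list of scraped records into CSV text that a spreadsheet can open. The sample records and the helper names csvCell and toCsv are hypothetical and only serve as an illustration.

```javascript
// Hypothetical records, shaped like the data a scraper might collect.
const rows = [
  { coinName: "Bitcoin", coinPrice: "$ 43,000.10" },
  { coinName: "Ethereum", coinPrice: "$ 2,300.55" },
];

// Quote a value for CSV: wrap it in double quotes and double any embedded quotes.
const csvCell = (value) => `"${String(value).replace(/"/g, '""')}"`;

// Build a CSV string: a header line from the column names, then one line per record.
function toCsv(records, columns) {
  const header = columns.map(csvCell).join(",");
  const lines = records.map((record) =>
    columns.map((key) => csvCell(record[key])).join(",")
  );
  return [header, ...lines].join("\n");
}

console.log(toCsv(rows, ["coinName", "coinPrice"]));
```

Quoting every cell keeps values that contain commas, such as the prices above, inside a single column.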

Web scraping involves two parts, the crawler and the scraper. The crawler is a program that browses the web by following links across the internet to find the pages that hold the required data. The scraper, on the other hand, is a tool built to extract the data from those pages.
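
The split between the two roles can be sketched without touching the network. In the example below the site object is a hypothetical in-memory stand-in for a website, crawl discovers pages by following links breadth-first, and scrape pulls one field out of each page. A real crawler would fetch and parse pages instead of reading from an object.

```javascript
// Hypothetical in-memory "website": each URL maps to a title and outgoing links.
const site = {
  "/": { title: "Home", links: ["/coins", "/about"] },
  "/coins": { title: "Coins", links: ["/", "/coins/btc"] },
  "/coins/btc": { title: "Bitcoin", links: [] },
  "/about": { title: "About", links: ["/"] },
};

// Scraper: extract the piece of data we care about from one page.
const scrape = (url) => ({ url, title: site[url].title });

// Crawler: discover pages breadth-first by following links, visiting each once.
function crawl(startUrl) {
  const visited = new Set([startUrl]);
  const queue = [startUrl];
  const results = [];
  while (queue.length > 0) {
    const url = queue.shift();
    results.push(scrape(url));
    for (const link of site[url].links) {
      if (!visited.has(link)) {
        visited.add(link);
        queue.push(link);
      }
    }
  }
  return results;
}

console.log(crawl("/"));
```

The visited set is what stops the crawler from looping forever when pages link back to each other.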

What is Puppeteer?

Puppeteer is a Node library built by Google that provides a high-level API to control Chrome or Chromium, which it runs headless by default. Headless Chrome is a way to run the Chrome browser without opening a visible browser window.

A headless browser simply means there is no GUI for working with the browser. Instead of using the mouse and keyboard, we drive page elements such as text fields, forms and buttons from code and automate them. Because Puppeteer spins up a new Chrome instance when it is launched, it might not be the most performant option. It is a very precise way to automate testing with Chrome though, since it uses the actual browser under the hood.

Setting up the Project

The only prerequisite before we start is to have NodeJS installed on your machine, along with Yarn, which the commands below use. After that, open up your terminal and run the following commands.

$ mkdir web_scrapping_app
$ cd web_scrapping_app
$ yarn init
$ code .

These commands create a directory called web_scrapping_app in your current location, move into it, initialize the package.json file and open the project in Visual Studio Code (code is VS Code’s command-line launcher; open the folder in any other editor if you prefer).

The next step is to install Puppeteer into our project. We can do that with the command

$ yarn add puppeteer

After installing Puppeteer we need to import it into our code. We will create a new file named app.js and, inside it, import the library we just installed.

Work with Puppeteer

const puppeteer = require('puppeteer');

Then we will use the launch() method to create a browser instance:

puppeteer.launch().then(async browser => {
  //...
})

We can also pass an options object to puppeteer.launch(). For example, headless: false opens a visible browser window so we can watch what happens:

puppeteer.launch({ headless: false })

Next we will use the await operator, which waits for a Promise to settle. It can only be used inside an async function (or at the top level of an ES module). So we will wrap our calls in an async function and use the newPage() method on the browser object to get a page object.
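
If await is new to you, the following sketch shows the idea without Puppeteer. fakeLoad is a hypothetical stand-in for a slow browser operation. It returns a Promise, and await pauses the surrounding async function until that Promise resolves.

```javascript
// Hypothetical stand-in for a slow browser operation: resolves after `ms` milliseconds.
const fakeLoad = (url, ms) =>
  new Promise((resolve) => setTimeout(() => resolve(`loaded ${url}`), ms));

async function run() {
  // await pauses run() until the Promise settles, then yields its value.
  const first = await fakeLoad("page-1", 20);
  // This line only starts after the first load has finished.
  const second = await fakeLoad("page-2", 10);
  return [first, second];
}

run().then((results) => console.log(results));
```

Even though the second load is faster, the results come back in the order the calls were written, because each await finishes before the next line runs.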

const puppeteer = require('puppeteer');

const SCRAPE_URL = `https://coincodex.com/`;

(async () => {
  // Launch a visible (non-headless) browser and open a new tab
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  // Load the page, save a screenshot, then shut the browser down
  await page.goto(SCRAPE_URL);
  await page.screenshot({ path: "screenShot.png" });
  await browser.close();
})();

In the snippet above we declared a constant named SCRAPE_URL and assigned the URL to it as a template literal (a string written in backticks). We then called the goto() method on the page object to load the URL passed in as a parameter, took a screenshot of the current page and finally used the close() method to close the running Chromium browser.

After we run this snippet with node app.js, we will get a file named screenShot.png in the project’s root folder with a screenshot of the website. It looks something like this.

Screenshot generated with Puppeteer

Isn’t this getting interesting? We just took a screenshot of the website by running just a few lines of code. Pretty dope, huh!

Extracting data

Let’s dig deeper and extract some of the data we need from the site. We can get at the data by selecting DOM elements. We could also use a library called cheerio for this, but I will stick with ❤️️ plain JavaScript to select the elements.

// Place these calls inside the async function, after page.goto()
const coinName = await page.evaluate(() =>
  document.querySelector(".full-name").textContent.trim()
);
const coinCurrency = await page.evaluate(() =>
  document.querySelector(".currency").textContent.trim()
);
console.log(coinName);
console.log(coinCurrency);

The snippet above runs inside the page through page.evaluate() and uses CSS selectors for two classes that are distinctive enough not to match unrelated elements. querySelector() returns only the first match, so we get one coin’s name and price, which we then print to our console.

Note: Don’t forget to call the close() method at the end, once we are done with the page.
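
One way to make sure close() always runs, even when a selector fails and the code throws, is a try/finally block. The sketch below is only an illustration: withBrowser is a hypothetical helper, and fakeBrowser stands in for a real browser object so the example runs without Puppeteer installed. With the real library you could pass () => puppeteer.launch() as the first argument.

```javascript
// Run `task` with a browser and always close it, even if the task throws.
// `launch` is any function returning a Promise of an object with close().
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    // finally runs on success and on failure alike
    await browser.close();
  }
}

// Hypothetical fake browser so the sketch runs without Puppeteer.
const fakeBrowser = {
  closed: false,
  async close() {
    this.closed = true;
  },
};

withBrowser(
  async () => fakeBrowser,
  async () => {
    throw new Error("scrape failed");
  }
).catch((err) => console.log(err.message, "| closed:", fakeBrowser.closed));
```

The error still reaches the caller, but the browser is closed first, so no stray Chromium process is left running.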

Extracting all data

Now that this is done, we can run our code and get the coin name and the coin price from the website. But we still have a problem: we only get one coin, and that’s annoying. So let’s tweak our code to iterate over every element that has the selected class name.

const coinName = await page.evaluate(() =>
  Array.from(document.querySelectorAll(".full-name")).map((coin) =>
    coin.innerText.trim()
  )
);
const coinCurrency = await page.evaluate(() =>
  Array.from(document.querySelectorAll(".currency")).map((currency) =>
    currency.innerText.trim()
  )
);

In this snippet, querySelectorAll() returns a NodeList of every matching element. The static Array.from() method creates a new, shallow-copied Array instance from an array-like or iterable object, so we can map over the elements and read the text of each one, trimming any surrounding whitespace.
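
To see what Array.from() does outside the browser, here is a small sketch. nodeListLike is a hypothetical plain object with the same array-like shape as a NodeList: indexed entries plus a length, but no map() method.

```javascript
// Hypothetical stand-in for a NodeList: indexed entries and a length, no map().
const nodeListLike = {
  length: 3,
  0: { innerText: "  Bitcoin " },
  1: { innerText: "Ethereum\n" },
  2: { innerText: " Tether" },
};

// Array.from() copies the entries into a real array, so map() becomes available.
const names = Array.from(nodeListLike).map((el) => el.innerText.trim());

console.log(names); // [ 'Bitcoin', 'Ethereum', 'Tether' ]
```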

Refactoring code

Let’s do a quick refactoring of our code to make it more reusable by rewriting the two lookups as a single function.

const coining = await page.evaluate(() =>
  // One object per table row, holding that row's name and price
  Array.from(document.querySelectorAll("table tr.coin.ng-star-inserted")).map(
    (table) => ({
      coinName: table.querySelector(".full-name").innerText,
      coinPrice: table.querySelector(".currency").innerText,
    })
  )
);
console.log(coining);

Here we used a single function that iterates over the table rows and reads both values from each row. By executing the code we now receive every coin and its current price from the website. Below is the final code that we have.

Gist Link
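
One thing the row-based extraction can trip over: querySelector() returns null when a row has no matching child, and reading innerText from null throws. The following is a minimal sketch of a null-safe lookup. textOf and fakeRow are hypothetical names, and fakeRow only imitates a table row element so the example runs outside a browser.

```javascript
// Read trimmed text from the first match inside `row`, or null when nothing matches.
const textOf = (row, selector) =>
  row.querySelector(selector)?.innerText.trim() ?? null;

// Hypothetical stand-in for a table row element.
const fakeRow = (cells) => ({
  querySelector: (selector) =>
    selector in cells ? { innerText: cells[selector] } : null,
});

const rows = [
  fakeRow({ ".full-name": " Bitcoin ", ".currency": "$ 43,000" }),
  fakeRow({ ".full-name": "Header row" }), // no price cell
];

const coins = rows.map((row) => ({
  coinName: textOf(row, ".full-name"),
  coinPrice: textOf(row, ".currency"),
}));

console.log(coins);
```

With Puppeteer, a helper like textOf would have to be defined inside the function passed to page.evaluate(), because that function runs in the page and cannot see variables from the Node script.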

Conclusion

In this Puppeteer tutorial we have demonstrated some basic functionality by extracting the information we needed from a site. However, Puppeteer has much wider use cases, including headless browser testing, PDF generation and performance monitoring, among many others. So we can put headless browsers to good use for the automation we need.
