Scrape images from a search engine with Node.js and Puppeteer
Introduction
In the previous post of this series, we discovered how to use Node.js and Puppeteer for scraping and searching content on web pages. I recommend reading it first if you have never used Puppeteer or need to set up the project.
In this article, we will fetch full-resolution images from a search engine. Our goal this time is to get a picture of every dog breed.
Script to get the images links
You should have Node.js and Puppeteer installed with npm or yarn.
We will use the same methods as in the first part.
Our list of dog breeds comes from a simple JSON file that can be found here: dog breeds dataset
As for the search engine, we will scrape DuckDuckGo because it makes it easy to get the images at full resolution, which can be trickier on Google Images.
const puppeteer = require('puppeteer')
const data = require('./dog-breeds.json')

const script = async () => {
  // headless: false opens a visible Chromium window, which is useful to see
  // what is going on and to test things before finalizing the script
  const browser = await puppeteer.launch({ headless: false, slowMo: 100 })
  const page = await browser.newPage()

  // loop on every breed
  for (const dogBreed of data) {
    console.log('Start for breed:', dogBreed)
    const url = `https://duckduckgo.com/?q=${dogBreed.replaceAll(' ', '+')}&va=b&t=hc&iar=images&iax=images&ia=images`

    // in case we encounter a page without images or an error
    try {
      // page.goto already waits for the navigation to finish, so an extra
      // page.waitForNavigation() here would only wait until it times out
      await page.goto(url)
      // make sure the page contains our targeted element
      await page.waitForSelector('.tile--img__media')
      await page.evaluate(() => {
        const firstImage = document.querySelector('.tile--img__media')
        // we open the panel that contains the image info
        firstImage.click()
      })

      // get the link of the image from the panel
      await page.waitForSelector('.detail__pane a')
      const link = await page.evaluate(() => {
        const links = document.querySelectorAll('.detail__pane a')
        // 'fichier' is French for 'file': adapt this text to the language
        // in which the page is displayed in your browser
        const linkImage = Array.from(links).find((item) => item.innerText.includes('fichier'))
        return linkImage?.getAttribute('href')
      })
      console.log('link successfully retrieved:', link)
      console.log('=====')
    } catch (e) {
      console.log(e)
    }
  }

  // close Chromium so the Node.js process can exit
  await browser.close()
}

script()
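The script builds the search URL by replacing spaces with `+`, which works for simple names but not for a breed name containing characters such as `&` or `/`. Here is a minimal sketch of a more defensive variant. The helper name `buildSearchUrl` is hypothetical (it is not part of the original script), and the query parameters are simply copied from the URL used above.

```javascript
// Hypothetical helper: builds the DuckDuckGo image-search URL for one breed.
// The query parameters are copied as-is from the script above.
function buildSearchUrl(dogBreed) {
  // encodeURIComponent escapes characters that would break the query string,
  // then encoded spaces (%20) are turned back into '+' like in the original URL
  const query = encodeURIComponent(dogBreed).replaceAll('%20', '+')
  return `https://duckduckgo.com/?q=${query}&va=b&t=hc&iar=images&iax=images&ia=images`
}

console.log(buildSearchUrl('Bernese Mountain Dog'))
// https://duckduckgo.com/?q=Bernese+Mountain+Dog&va=b&t=hc&iar=images&iax=images&ia=images
```

Inside the loop, `const url = buildSearchUrl(dogBreed)` could then replace the inline template string.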
After running the script with node scrapeImages.js, you should see, for each breed, the "Start for breed:" line followed by the link that was retrieved.
Download and optimize the images
We now have the links to all the images, but some of them are quite heavy (>1 MB). Fortunately, we can use another Node.js library to reduce their size with minimal loss of quality: sharp
It is a widely used library (2M+ weekly downloads) to convert, resize and optimize images.
You can add this inside the loop, right after the link is retrieved, to fill a folder with the optimized images:
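// at the top of the script
const fs = require('fs')
const https = require('https')
const sharp = require('sharp')

// https.get is callback-based, so we wrap it in a Promise to make sure
// the download is finished before sharp reads the file
await new Promise((resolve, reject) => {
  const stream = fs.createWriteStream(dogBreed + '.jpg')
  https.get(link, (response) => {
    response.pipe(stream)
    stream.on('finish', () => {
      stream.close()
      console.log('Download Completed')
      resolve()
    })
  }).on('error', reject)
})
//resize to a maximum width or height of 1000px, keeping the aspect ratio
await sharp(`./${dogBreed}.jpg`).resize(1000, 1000, { fit: 'inside' }).toFile(`./${dogBreed}-small.jpg`)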
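The download step uses the breed name directly as the file name, which can fail or produce awkward paths if a name contains spaces, slashes or other special characters. As a rough sketch, a small helper could normalize the name first. `toFileName` is a hypothetical name introduced here for illustration, and this version simply drops any character outside a-z and 0-9, including accented letters.

```javascript
// Hypothetical helper: turns a breed name into a safe, lowercase file name.
function toFileName(dogBreed) {
  return dogBreed
    .toLowerCase()
    // replace every run of characters other than a-z and 0-9 with a single dash
    .replace(/[^a-z0-9]+/g, '-')
    // remove dashes left at the start or at the end
    .replace(/^-+|-+$/g, '')
}

console.log(toFileName('Bernese Mountain Dog')) // bernese-mountain-dog
```

The write stream could then be created with `toFileName(dogBreed) + '.jpg'` instead of `dogBreed + '.jpg'`, with the same change applied to the two paths given to sharp.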
Conclusion
You can adapt this script to get pretty much anything. You also don't have to limit yourself to the first image for each query: you can get every image. As for myself, I used this script to get the initial images for a tool I'm working on: https://dreamclimate.city