<h2>Introduction</h2>
<p>In the previous post of this series, we saw how to use Node.js and Puppeteer to scrape and search content on web pages. I recommend reading it first if you have never used Puppeteer or need to set up the project.</p>
<p>In this article, we will fetch full-resolution images from a search engine. Our goal this time is to get a picture of every dog breed.</p>
<h2>Script to get the image links</h2>
<p>You should have Node.js and Puppeteer installed with <code>npm</code> or <code>yarn</code>.
We will use the same methods as in the first part.
Our list of dog breeds is a simple JSON file that can be found here: <a href="https://raw.githubusercontent.com/dariusk/corpora/master/data/animals/dogs.json">dog breeds dataset</a>. The script below expects it to be saved next to the script as <code>dog-breeds.json</code>.</p>
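<p>The script loops directly over the imported JSON, so it needs a plain array of breed names. Here is a minimal sketch of a loader that tolerates both shapes, assuming the dataset may wrap the list in an object under a <code>dogs</code> key (check your copy of the file). <code>toBreedList</code> is a hypothetical helper name, not part of the original script.</p>

```javascript
// Hypothetical helper: returns a clean array of breed names.
// Assumption: the JSON is either a plain array of names, or an object
// that wraps the array under a "dogs" key.
function toBreedList(data) {
  const list = Array.isArray(data) ? data : data && data.dogs
  if (!Array.isArray(list)) {
    throw new Error('Unexpected dog breeds format')
  }
  // trim the names and drop empty entries
  return list.map((name) => String(name).trim()).filter(Boolean)
}

// Inline sample data, standing in for require('./dog-breeds.json')
const breeds = toBreedList({ description: 'Dog breeds', dogs: ['Akita', ' Beagle '] })
console.log(breeds)
```

<p>With this in place, the loop would iterate over <code>toBreedList(data)</code> instead of <code>data</code>.</p>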
<p>As for the search engine, we will scrape DuckDuckGo because it lets us easily get the images at full resolution, which can be trickier on Google Images.</p>
<pre><code>const puppeteer = require('puppeteer')
const data = require('./dog-breeds.json')

const script = async () => {
  //headless: false opens a visible Chromium window and slowMo slows every action down,
  //which helps to see what is going on while testing the script
  const browser = await puppeteer.launch({ headless: false, slowMo: 100 })
  const page = await browser.newPage()

  //loop over every breed
  for (const dogBreed of data) {
    console.log('Start for breed:', dogBreed)
    const url = `https://duckduckgo.com/?q=${dogBreed.replaceAll(
      ' ',
      '+',
    )}&#x26;va=b&#x26;t=hc&#x26;iar=images&#x26;iax=images&#x26;ia=images`

    //in case we encounter a page without images or an error
    try {
      await page.goto(url)

      //make sure the page is loaded and contains our targeted element
      //(no waitForNavigation here: goto already resolves once the page has loaded)
      await page.waitForSelector('.tile--img__media')

      //open the panel that contains the image info by clicking the first result
      await page.evaluate(() => {
        const firstImage = document.querySelector('.tile--img__media')
        firstImage.click()
      })

      //get the link of the image from the panel
      await page.waitForSelector('.detail__pane a')
      const link = await page.evaluate(() => {
        const links = document.querySelectorAll('.detail__pane a')
        //'fichier' is French for 'file': adapt this text to the language of your browser
        const linkImage = Array.from(links).find((item) => item.innerText.includes('fichier'))
        return linkImage?.getAttribute('href')
      })
      console.log('link successfully retrieved:', link)
      console.log('=====')
    } catch (e) {
      console.log(e)
    }
  }

  //close the browser once every breed has been processed
  await browser.close()
}

script()
</code></pre>
<p>After running the script with <code>node scrapeImages.js</code> you should get something like this:</p>
<p><img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5d4di6htb7rycojwbt0m.gif" alt="Gif scraping puppeteer"></p>
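<p>The search URL in the script only replaces spaces, so a breed name containing a character such as <code>&amp;</code> would break the query string. As a hedged alternative, the URL could be built with the standard <code>URLSearchParams</code> class, which encodes every parameter. <code>buildImageSearchUrl</code> is a hypothetical helper name, and the parameters are simply the ones used in the script above.</p>

```javascript
// Hypothetical helper, not part of the original script.
// Builds the DuckDuckGo image search URL with the same parameters as the script.
function buildImageSearchUrl(query) {
  const params = new URLSearchParams({
    q: query,
    va: 'b',
    t: 'hc',
    iar: 'images',
    iax: 'images',
    ia: 'images',
  })
  // URLSearchParams encodes spaces as "+" and escapes reserved characters
  return `https://duckduckgo.com/?${params.toString()}`
}

console.log(buildImageSearchUrl('German Shepherd'))
// prints "https://duckduckgo.com/?q=German+Shepherd&va=b&t=hc&iar=images&iax=images&ia=images"
```

<p>A side benefit is that this does not rely on <code>String.prototype.replaceAll</code>, which is only available from Node.js 15 onwards.</p>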
<h2>Download and optimize the images</h2>
<p>We now have the link of every image, but some of them are quite heavy (over 1 MB).
Fortunately, we can use another Node.js library to reduce their size with minimal loss of quality: <a href="https://www.npmjs.com/package/sharp">sharp</a></p>
<p>It is a very widely used library (2M+ weekly downloads) to convert, resize and optimize images.</p>
<p>You can add this inside the loop, right after the link is retrieved, to end up with a folder filled with the optimized images:</p>
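<pre><code>//these requires go at the top of the file (sharp is installed with npm or yarn)
const fs = require('fs')
const https = require('https')
const sharp = require('sharp')

//https.get does not return a promise, so we wrap the download in one to really wait for it
await new Promise((resolve, reject) => {
  const stream = fs.createWriteStream(dogBreed + '.jpg')
  https
    .get(link, (response) => {
      response.pipe(stream)
      stream.on('finish', () => {
        stream.close()
        console.log('Download Completed')
        resolve()
      })
    })
    .on('error', reject)
})

//resize to a maximum width or height of 1000px while keeping the aspect ratio
await sharp(`./${dogBreed}.jpg`).resize(1000, 1000, { fit: 'inside' }).toFile(`./${dogBreed}-small.jpg`)
</code></pre>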
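<p>The files above are named after the raw breed name, which may contain spaces, accents or other characters that are awkward in file names. Here is a small sketch of a way to derive safer names. <code>toFileName</code> is a hypothetical helper, and the naming scheme is only an assumption about what you might want.</p>

```javascript
// Hypothetical helper: turns a breed name into a lowercase, dash-separated file name.
function toFileName(breed, suffix = '') {
  const slug = breed
    .normalize('NFKD') // split accented letters into base letter + combining mark
    .replace(/[\u0300-\u036f]/g, '') // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // any other run of characters becomes a single dash
    .replace(/^-+|-+$/g, '') // trim leading and trailing dashes
  return `${slug}${suffix}.jpg`
}

console.log(toFileName('German Shepherd')) // prints "german-shepherd.jpg"
console.log(toFileName('Löwchen', '-small')) // prints "lowchen-small.jpg"
```

<p>The download could then write to <code>toFileName(dogBreed)</code> and the resized copy to <code>toFileName(dogBreed, '-small')</code>.</p>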
<h2>Conclusion</h2>
<p>You can adapt this script to scrape pretty much anything. You are also not limited to the first image of each query: you could collect every result instead. As for myself, I used this script to get the initial images for a tool I'm working on: <a href="https://dreamclimate.city">https://dreamclimate.city</a></p>
<p><img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5pbya0u5wy7zf5xgxb3z.jpeg" alt="screenshot dream climate city personal project"></p>
<hr>
<h3>😄 Thanks for reading! If you found this article useful, it's part of a series. To get notified about the next article, follow me on <a href="https://twitter.com/AntoineMesnil">Twitter</a>. I also share tips on development and design, as well as my journey to create my own startup studio.</h3>