Scraper (Puppeteer) does not map over my array - JavaScript / React

Asked: 2018-12-03 16:02:06

Tags: javascript node.js web-scraping puppeteer

I wrote a web scraper with Puppeteer. It scrapes job listings from a job portal. I can already display the title, position, and image.

The array created by my scraper looks like this:

[{
    "id": "2018-12-03T14:12:03Z",
    "position": "Frontend Entwickler React (w/m)",
    "company": "Muster AG",
    "image": "https://www.stepstone.de/upload_de/logo/blabla.gif",
    "date": "2018-12-03T14:12:03Z",
    "href": "https://www.stepstone.de/stellenangebote--Frontend-Entwickler"
  }] 

Here is the code of my scraper.js:

const fs = require('fs')
const path = require('path')
const puppeteer = require('puppeteer')

;(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto(
    'https://www.stepstone.de/5/ergebnisliste.html?stf=freeText&ns=1&qs=%5B%7B%22id%22%3A%22231794%22%2C%22description%22%3A%22Frontend-Entwickler%2Fin%22%2C%22type%22%3A%22jd%22%7D%2C%7B%22id%22%3A%22300000115%22%2C%22description%22%3A%22Deutschland%22%2C%22type%22%3A%22geocity%22%7D%5D&companyID=0&cityID=300000115&sourceOfTheSearchField=homepagemex%3Ageneral&searchOrigin=Homepage_top-search&ke=Frontend-Entwickler%2Fin&ws=Deutschland&ra=30'
  )

  const stepstone = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('.job-element'), card => {
      const id = card.querySelector('time').getAttribute('datetime')
      const href = card
        .querySelector('.job-element__body > a')
        .getAttribute('href')
      const position = card
        .querySelector('.job-element__body__title')
        .textContent.trim()
        .replace(/^(.{45}[^\s]*).*/, '$1')
      const company = card
        .querySelector('.job-element__body__company')
        .textContent.trim()
        .replace(/^(.{20}[^\s]*).*/, '$1')
      const image_element = card.querySelector('.job-element__logo img')
      const image = image_element.dataset.src
        ? `https://www.stepstone.de${image_element.dataset.src}`
        : image_element.src
      const date = card.querySelector('time').getAttribute('datetime')

      return {
        id,
        position,
        company,
        image,
        date,
        href
      }
    })
  })

  fs.writeFile(
    path.join(__dirname, 'src/stepstone.json'),
    JSON.stringify(stepstone),
    err => {
      if (err) {
        console.error(err)
      } else {
        console.log('Great, it worked!')
      }
    }
  )

  await browser.close()
})()

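How I want it to work: after scraping the title, position, and so on, I also want to include the job details. So I tell my scraper to visit the href link of every job item in the array that stores this information.

From that page it should grab the job details by class name, just like above. So I tried to map over the array above and tell the scraper to scrape the item from each href link, like this: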

stepstone.map(async stone => {
  const page = await browser.newPage()
  await page.goto(stone.href)
  const details = await page.evaluate(() => {
    return document.querySelector('card__body')
  })
  return {
    ...stone,
    details
  }
})

My problem: the JSON file is not updated with the "details" key (which should contain the information from 'card__body').
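
One likely cause can be shown with plain values instead of Puppeteer: `Array.prototype.map` does not modify the original array, and with an async callback it returns an array of pending Promises rather than finished objects. If the result of `stepstone.map(...)` is never assigned and awaited, `stepstone` is still the old array when it is written to disk. The `jobs` array and the `enrich` helper below are made up for illustration only.

```javascript
// Hypothetical stand-ins for the scraped array and the per-job work.
const jobs = [{ id: 1 }, { id: 2 }]
const enrich = async job => ({ ...job, details: `details for ${job.id}` })

// map with an async callback returns Promises and leaves `jobs` untouched.
const pending = jobs.map(enrich)
console.log(pending[0] instanceof Promise) // true
console.log('details' in jobs[0]) // false

// Waiting for all Promises yields the enriched objects in a new array.
const demo = Promise.all(pending).then(enriched => {
  console.log(enriched[0].details) // details for 1
  return enriched
})
```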

Any suggestions? Thanks!
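
A minimal sketch of one way this might be approached, not a confirmed fix. It assumes the detail markup matches the class selector `.card__body` (with the leading dot that the snippet above lacks) and that the text content is what should be stored. `addDetails` is a hypothetical helper name. Note that `page.evaluate` can only hand serializable values back to Node.js, so the sketch returns a string rather than the DOM element itself.

```javascript
// Sketch: enrich each scraped job with the text of its detail page.
// Assumptions (not confirmed by the post): the detail markup matches
// '.card__body', and `browser` is the Puppeteer browser from scraper.js.
async function addDetails(browser, jobs, selector = '.card__body') {
  const enriched = []
  // for...of awaits each page in turn, so the result holds finished
  // objects instead of pending Promises.
  for (const job of jobs) {
    const page = await browser.newPage()
    try {
      await page.goto(job.href)
      // Return a string, not the DOM node: only serializable values
      // survive the trip from the page back to Node.js.
      const details = await page.evaluate(sel => {
        const node = document.querySelector(sel)
        return node ? node.textContent.trim() : null
      }, selector)
      enriched.push({ ...job, details })
    } finally {
      // Close the tab even if navigation or evaluation fails.
      await page.close()
    }
  }
  return enriched
}
```

In scraper.js this could be called as `const withDetails = await addDetails(browser, stepstone)` before `browser.close()`, with `withDetails` passed to `JSON.stringify` instead of `stepstone`.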

0 answers:

No answers