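I have a small web scraper for crawling websites that captures information such as the output shown below.
I'm now working in AWS Lambda and trying to extract individual elements, but I'm struggling to parse the data correctly.
To get the data, I do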
const parsedData = JSON.parse(data);
If I then do
console.log(parsedData)
I get a valid JSON object (shortened for readability)
{
"Scrape": {
"PageContent": [
{
"tag": "title",
"text": "Stack Overflow - Where Developers Learn, Share, & Build Careers"
},
{
"tag": "span",
"text": "Stack Overflow"
},
{
"tag": "span",
"text": "new"
},
{
"tag": "h4",
"text": "Try Stack Overflow for Business"
},
{
"tag": "span",
"text": "rev 2019.5.3.33574"
}
],
"PageTech": [
{
"name": "DoubleClick for Publishers (DFP)",
"confidence": "100",
"version": null,
"icon": "DoubleClick.svg",
"website": "http://www.google.com/dfp",
"categories": [
{
"36": "Advertising Networks"
}
]
},
{
"name": "Elementor",
"confidence": "100",
"version": null,
"icon": "Elementor.png",
"website": "https://elementor.com",
"categories": [
{
"51": "Landing Page Builders"
}
]
},
{
"name": "MySQL",
"confidence": "0",
"version": null,
"icon": "MySQL.svg",
"website": "http://mysql.com",
"categories": [
{
"34": "Databases"
}
]
}
],
"PageHeaders": {
"status": [
"200"
],
"cache-control": [
"private"
],
"content-type": [
"text/html; charset=utf-8"
],
"content-encoding": [
"gzip"
],
"x-frame-options": [
"SAMEORIGIN"
],
"x-request-guid": [
"f7b0f406-a047-49b0-a822-18549cec5510"
],
"strict-transport-security": [
"max-age=15552000"
],
"content-security-policy": [
"upgrade-insecure-requests"
],
"accept-ranges": [
"bytes"
],
"date": [
"Sat, 04 May 2019 15:31:45 GMT"
],
"via": [
"1.1 varnish"
],
"x-served-by": [
"cache-dca17748-DCA"
],
"x-cache": [
"MISS"
],
"x-cache-hits": [
"0"
],
"x-timer": [
"S1556983905.306505,VS0,VE17"
],
"vary": [
"Accept-Encoding,Fastly-SSL"
],
"x-dns-prefetch-control": [
"off"
],
"set-cookie": [
"prov=c352d0ff-77bc-1152-f587-c60c5b156354; domain=.stackoverflow.com; expires=Fri, 01-Jan-2055 00:00:00 GMT; path=/; HttpOnly"
],
"content-length": [
"52808"
]
},
"PageCookies": [
{
"name": "__qca",
"value": "P0-513591828-1556983906778",
"domain": ".stackoverflow.com",
"path": "/",
"expires": 1590852706,
"size": 31,
"httpOnly": false,
"secure": false,
"session": false
},
{
"name": "prov",
"value": "c352d0ff-77bc-1152-f587-c60c5b156354",
"domain": ".stackoverflow.com",
"path": "/",
"expires": 2682374400.32448,
"size": 40,
"httpOnly": true,
"secure": false,
"session": false
},
{
"name": "notice-so4",
"value": "!1",
"domain": "stackoverflow.com",
"path": "/",
"expires": 1558620000,
"size": 12,
"httpOnly": false,
"secure": false,
"session": false
}
],
"PageRequests": [
{},
"request url:",
"https://stackoverflow.com/",
"request url:",
"https://csi.gstatic.com/csi?s=ampad&ctx=2&puid=1~1556983907538&qqid=CNPSx4WZguICFREahgodlIgJLA&rt=a4a.link.i.43.8.2.1s.0.1ncx.1mpg~aa.script.j.4c.b.8.0.0.toq.tl3~simg.img.z.1w.1.5.0.0.bob.bjn~vu.img.z.29.2.h.0.0.87.0&met.a4a=dcl.0~ol.423~nvs.1556983907063~ini.1556983907539",
"request url:",
"https://pagead2.googlesyndication.com/pcs/activeview?xai=AKAOjssfcs88N7qDU9Qt-wDm3tVGQ1IHYMTJIMVe7GXXRffOfT5bDoX5Q2J-38ZpgGqsphazKd2DBC22PTJF-bHEAW9artqsaPzETpUg8026EaA&sig=Cg0ArKJSzNKjZvqCXbUqEAE&id=ampim&o=1268,478&d=300,250&ss=800,600&bs=1920,1080&mcvt=1019&mtos=0,0,1019,1019,1019&tos=0,0,1019,0,0&tfs=136&tls=1155&g=100&h=100&pt=423&tt=1168&rpt=423&rst=1556983907063&r=v&adk=2451320170&avms=ampa"
]
}
}
When I try to extract information from the JSON, I keep getting
undefined
back as the value.
Examples of what I have tried:
console.log(parsedData.Scrape.PageContent)
console.log(parsedData.Scrape.PageContent[0].text)
console.log(parsedData[1][1])
console.log(parsedData.Scrape)
console.log(parsedData[1]['Scrape'])
The desired output is to be able to get at individual elements so that I can write them to a database.
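To make the goal concrete, here is a minimal sketch of the kind of access I am after. It assumes the payload might be JSON-encoded twice, which is only one possible reason a parsed value can print like JSON yet return undefined on property access; I have not confirmed that this is what happens in my Lambda. The helper parseUntilObject and the shortened sample object are hypothetical and only mirror the shape of the output above.

```javascript
// Hypothetical helper: keep calling JSON.parse while the value is still a string.
// Assumption: the payload may have been stringified more than once upstream,
// and the innermost value is an object (a plain non-JSON string would throw).
function parseUntilObject(raw, maxDepth = 3) {
  let value = raw;
  for (let i = 0; i < maxDepth && typeof value === "string"; i++) {
    value = JSON.parse(value);
  }
  return value;
}

// Shortened sample with the same shape as the scraper output above.
const sample = {
  Scrape: {
    PageContent: [
      { tag: "title", text: "Stack Overflow - Where Developers Learn, Share, & Build Careers" },
      { tag: "span", text: "Stack Overflow" }
    ],
    PageTech: [
      { name: "MySQL", confidence: "0", categories: [{ "34": "Databases" }] }
    ],
    PageHeaders: { "content-type": ["text/html; charset=utf-8"] }
  }
};

// Simulate double encoding: a JSON string wrapped in another JSON string.
const data = JSON.stringify(JSON.stringify(sample));

// A single parse only unwraps the outer layer, so the result is still a string.
const onceParsed = JSON.parse(data);
console.log(typeof onceParsed); // "string"
console.log(onceParsed.Scrape); // undefined

// Parse until an object comes back, then read individual elements.
const parsedData = parseUntilObject(data);
const { PageContent, PageTech, PageHeaders } = parsedData.Scrape;

const firstText = PageContent[0].text;
const techNames = PageTech.map((tech) => tech.name);
const contentType = PageHeaders["content-type"][0];
// Each category is an object holding a single id -> label pair.
const categories = PageTech.flatMap((tech) =>
  tech.categories.flatMap((category) => Object.values(category))
);

console.log(firstText, techNames, contentType, categories);
```

If typeof parsedData is already "object" after the first JSON.parse, the double-encoding assumption does not apply and the property paths themselves would be the next thing to check.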