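The assignment is a command-line Node application that scrapes some data from a specific site and saves it to a CSV file.
I'm using scrape-it to scrape the data and have successfully retrieved everything I need, but I'm struggling to work out how to add each URL (stored in urls) to its corresponding shirt object in shirts, which is an array of objects.
Here's what I have so far.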
const scrapeIt = require("scrape-it");
const mainURL = "http://shirts4mike.com/";

scrapeIt(`${mainURL}shirts.php`, {
  pages: {
    listItem: ".products li",
    name: "pages",
    data: {
      url: {
        selector: "a",
        attr: "href"
      }
    }
  }
})
  .then(({ data }) => {
    const urls = data.pages.map(page => `${mainURL}${page.url}`);
    console.log(urls);
    const shirtCalls = urls.map(url =>
      scrapeIt(url, {
        name: {
          selector: ".shirt-picture img",
          attr: "alt"
        },
        image: {
          selector: ".shirt-picture img",
          attr: "src"
        },
        price: {
          selector: "span.price"
        }
      })
    );
    return Promise.all(shirtCalls);
  })
  .then(shirtResults => {
    const shirts = shirtResults.map(shirtResult => shirtResult.data);
    console.log(shirts);
  });
The output I get for shirts is:
[ { name: 'Logo Shirt, Red',
image: 'img/shirts/shirt-101.jpg',
price: '$18' },
{ name: 'Mike the Frog Shirt, Black',
image: 'img/shirts/shirt-102.jpg',
price: '$20' },
{ name: 'Mike the Frog Shirt, Blue',
image: 'img/shirts/shirt-103.jpg',
price: '$20' },
{ name: 'Logo Shirt, Green',
image: 'img/shirts/shirt-104.jpg',
price: '$18' },
{ name: 'Mike the Frog Shirt, Yellow',
image: 'img/shirts/shirt-105.jpg',
price: '$25' },
{ name: 'Logo Shirt, Gray',
image: 'img/shirts/shirt-106.jpg',
price: '$20' },
{ name: 'Logo Shirt, Teal',
image: 'img/shirts/shirt-107.jpg',
price: '$20' },
{ name: 'Mike the Frog Shirt, Orange',
image: 'img/shirts/shirt-108.jpg',
price: '$25' } ]
But the end result I'm trying to get is:
[ { name: 'Logo Shirt, Red',
image: 'img/shirts/shirt-101.jpg',
price: '$18',
url: 'http://shirts4mike.com/shirt.php?id=101' //which is at urls[0]
},
{ name: 'Mike the Frog Shirt, Black',
image: 'img/shirts/shirt-102.jpg',
price: '$20',
url: 'http://shirts4mike.com/shirt.php?id=102' //urls[1]
}, //...etc etc
I hope that all makes sense. I'm still very new to Promises (and Node), so I feel a bit out of my depth. Thanks in advance!
Answer 0 (score: 1)
Try something like this:
const scrapeIt = require("scrape-it");
const mainURL = "http://shirts4mike.com/";

scrapeIt(`${mainURL}shirts.php`, {
  pages: {
    listItem: ".products li",
    name: "pages",
    data: {
      url: {
        selector: "a",
        attr: "href"
      }
    }
  }
})
  .then(({ data }) => {
    const urls = data.pages.map(page => `${mainURL}${page.url}`);
    console.log(urls);
    // Each async callback returns a promise, so wrap the array in
    // Promise.all; otherwise the next .then() receives pending promises.
    return Promise.all(
      urls.map(async url => {
        const urlObj = await scrapeIt(url, {
          name: {
            selector: ".shirt-picture img",
            attr: "alt"
          },
          image: {
            selector: ".shirt-picture img",
            attr: "src"
          },
          price: {
            selector: "span.price"
          }
        });
        // Merge the scraped fields with the URL they came from.
        return { ...urlObj.data, url };
      })
    );
  })
  .then(shirtResults => {
    console.log(shirtResults);
  });
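A variation that stays closer to the question's original code is also possible, because Promise.all resolves to an array in the same order as the promises passed in, so the result at index i belongs to urls[i]. The following is only a minimal sketch: fakeScrape and scrapeWithUrls are hypothetical names, and fakeScrape stands in for scrapeIt so the example runs without network access.

```javascript
// Hypothetical stand-in for scrapeIt(url, opts) so the sketch runs offline.
// Like scrape-it in the code above, it resolves to an object with `data`.
const fakeScrape = url =>
  Promise.resolve({ data: { name: `Shirt ${url.slice(-3)}`, price: "$18" } });

// Hypothetical helper: scrape every URL, then attach each URL to the
// result at the same index. Promise.all preserves input order, so
// results[i] corresponds to urls[i].
function scrapeWithUrls(urls, scrape) {
  return Promise.all(urls.map(url => scrape(url))).then(results =>
    results.map((result, i) => ({ ...result.data, url: urls[i] }))
  );
}

const urls = [
  "http://shirts4mike.com/shirt.php?id=101",
  "http://shirts4mike.com/shirt.php?id=102"
];

scrapeWithUrls(urls, fakeScrape).then(shirts => console.log(shirts));
```

In the real script, the second argument would be a function that calls scrapeIt with the shirt selectors shown above.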
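Answer 1 (score: 1)
So, thanks to a suggestion from another user, I actually managed to get it working (though I think they deleted their comment?). In the final .then(), I map over shirts, get the pageID from the image property, then interpolate mainURL, the path, and finally the pageID into a template literal, and add that key/value pair to each object. I also took the opportunity to store the full image URL in the image property.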
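.then(shirtResults => {
  const shirts = shirtResults.map(shirtResult => shirtResult.data);
  // forEach rather than map, since the loop only mutates each shirt in place.
  shirts.forEach(shirt => {
    // Strip every non-digit from e.g. "img/shirts/shirt-101.jpg" to get "101".
    const pageID = shirt.image.replace(/\D/g, "");
    shirt.url = `${mainURL}shirt.php?id=${pageID}`;
    // Prefix the relative image path to store the full image URL.
    shirt.image = `${mainURL}${shirt.image}`;
  });
  console.log(shirts);
});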
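One caveat: stripping every non-digit only works while the image path contains no digits other than the ID. A slightly stricter version could capture the digits that follow "shirt-" instead. This is a rough sketch, assuming image paths always look like "img/shirts/shirt-<id>.jpg"; withUrlFromImage is a hypothetical helper name, and it returns a new object rather than mutating the shirt.

```javascript
// Hypothetical helper: derive the page URL from the image path.
// Assumes image paths look like "img/shirts/shirt-<id>.jpg".
function withUrlFromImage(shirt, baseURL) {
  const match = /shirt-(\d+)\./.exec(shirt.image);
  if (!match) {
    throw new Error(`No shirt ID found in image path: ${shirt.image}`);
  }
  return {
    ...shirt,
    url: `${baseURL}shirt.php?id=${match[1]}`,
    image: `${baseURL}${shirt.image}` // store the full image URL
  };
}

const example = withUrlFromImage(
  { name: "Logo Shirt, Red", image: "img/shirts/shirt-101.jpg", price: "$18" },
  "http://shirts4mike.com/"
);
console.log(example.url); // prints "http://shirts4mike.com/shirt.php?id=101"
```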
Thanks for the help!