Question

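After getting my scraper working on one website, I'm learning how to build another scraper for a different site, Reverb.com. However, Reverb is more challenging to extract information from, and the model of my old scraper doesn't work there. I did some research, and it seems that requests_html, rather than requests, is the option most often used for JavaScript-driven sites like Reverb.com.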

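I'm essentially trying to scrape the text of the title and price information, and either paginate through the different pages or loop through a list of URLs to get all the content. I'm nearly there, but I've hit a roadblock. Below are the 2 versions of the code I've been working with.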

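The first version below prints everything from what seems to be only 3 pages of content, but it prints all the instrument names and prices with their markup. In the CSV, however, all of these items are written together on just 3 rows, rather than one item/price pair per row.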

from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent


session = HTMLSession()
r = session.get("https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.raw_html, "html.parser")

#content scrape
b = soup.findAll("h4", class_="grid-card__title") #title
for i in b:
    print(i)


p = soup.findAll("div", class_="grid-card__price") #price
for i in p:
    print(i)
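
One way to get one item/price pair per row is to pair the two lists element by element and write each pair as its own CSV row. The sketch below assumes the titles and prices are already plain strings in matching order (for example, taken from `i.get_text(strip=True)` on each tag). `write_pairs` and the sample values are hypothetical names and made-up data used only for illustration.

```python
import csv
import io


def write_pairs(titles, prices, out_file):
    """Write one (title, price) row per listing; return the number of data rows."""
    writer = csv.writer(out_file)
    writer.writerow(["bass_name", "bass_price"])
    count = 0
    # zip stops at the shorter list, so an unmatched title or price is dropped
    for title, price in zip(titles, prices):
        writer.writerow([title.strip(), price.strip()])
        count += 1
    return count


# Made-up sample values standing in for text extracted from the page
sample_titles = ["Fender Precision Bass 1978 ", " Rickenbacker 4003"]
sample_prices = ["$1,500", " $2,100 "]

buffer = io.StringIO()
rows_written = write_pairs(sample_titles, sample_prices, buffer)
print(buffer.getvalue())
```

Passing a real file opened with `newline=''` instead of the `io.StringIO` buffer would write the same rows to disk. Pairing by position only works if every card on the page has both a title and a price, which is an assumption about Reverb's markup.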

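In contrast, this version writes only 3 rows to the CSV, but the names and prices are stripped of all markup. That only happens when I change findAll to find, though. I read that `for html in r.html` is a way to loop through pages without having to create a list of URLs.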

from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent


#make csv file
csv_file = open("rvscrape.csv", "w", newline='') #added the newline thing on 5.17.20 to try to stop blank lines from writing
csv_writer = csv.writer(csv_file)
csv_writer.writerow(["bass_name","bass_price"])

session = HTMLSession()
r = session.get("https://reverb.com/marketplace/bass-guitars?year_min=1900&year_max=2022")
r.html.render(sleep=5)
soup = BeautifulSoup(r.html.raw_html, "html.parser")

for html in r.html:
    #content scrape
    bass_name = []
    b = soup.find("h4", class_="grid-card__title").text.strip() #title
    #for i in b:
    #    bass_name.append(i)
    #    for i in bass_name:
    #        print(i)

    price = []
    p = soup.find("div", class_="grid-card__price").text.strip() #price
    #for i in p:
    #    print(i)

    csv_writer.writerow([b, p])
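
A possible explanation for the 3 identical-looking rows: `soup` is built once before the loop, and `find` returns only the first match, so each pass of the loop reads the same first title and price. The sketch below shows the general shape of parsing each page inside the loop and keeping every match. It is a sketch under assumptions, not a tested fix: `collect_rows`, `fake_extract`, and the demo data are hypothetical, and the stand-in extractor replaces the BeautifulSoup parsing so the example runs without a network connection.

```python
def collect_rows(pages, extract):
    """Return one (title, price) row per listing across all pages.

    pages:   iterable of raw page documents, one per results page
    extract: callable mapping one page to (titles, prices) lists
    """
    rows = []
    for page in pages:
        # Extract inside the loop so each page is parsed on its own,
        # instead of reusing results taken from the first page.
        titles, prices = extract(page)
        rows.extend(zip(titles, prices))
    return rows


def fake_extract(page):
    """Stand-in extractor for the demo: 'title|price' pairs, one per line."""
    pairs = [line.split("|") for line in page.splitlines() if line]
    return [t for t, _ in pairs], [p for _, p in pairs]


# Made-up pages standing in for rendered HTML documents
demo_pages = ["Bass A|$100\nBass B|$200", "Bass C|$300"]
all_rows = collect_rows(demo_pages, fake_extract)
print(all_rows)
```

With the code above, `extract` could build `BeautifulSoup(page, "html.parser")` and return the stripped text of every `find_all("h4", class_="grid-card__title")` and `find_all("div", class_="grid-card__price")` result, while `pages` could be `html.raw_html` for each `html` in `r.html`. Whether pages after the first also need their own `render()` call, and whether Reverb exposes a next-page link that requests_html can follow, are untested assumptions.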

Scraping a JavaScript website with BeautifulSoup 4 and Requests_HTML

0 answers: