Web scraping does not fill the file with all the information requested from every page

Date: 2019-08-19 09:34:57

Tags: python-3.x beautifulsoup

I am a Python beginner and I need to scrape the restaurant name, the restaurant's socio-economic category, the reviewer's name, the review date, the review title and the review text for a single restaurant (Python 3.7 and Beautiful Soup), for pages 10 to 40 only. However, when I open the CSV file I only get the information for the first reviewer. Here is my code:

csv_file = open("lebouclard.csv", "w", encoding="utf-8")
csv_writer = csv.writer(csv_file, delimiter = ";")
csv_writer.writerow(["inf_rest_name", "rest_eclf", "name_client", "date_rev_cl", "titre_rev_cl", "opinion_cl"])
for i in range(10,40):
    url = requests.get("https://www.tripadvisor.fr/Restaurant_Review-g187147-d947475-Reviews-or10-Le_Bouclard-Paris_Ile_de_France.html".format(i)).text
    page_soup = soup(url, "html.parser")
    gen_rest = page_soup.find_all("div", {"class":"page"})
    for rest in gen_rest:
        rname= rest.find("h1",{"class":"ui_header h1"})
        inf_rest_name = rname.text
        print("inf_rest_name: " + inf_rest_name)
        econ_class_food = rest.find("div", {"class":"header_links"})
        rest_eclf = econ_class_food.text.strip()
        print("rest_eclf: " + rest_eclf)
    for clients in gen_rest:
        client_info = clients.find_all("div", {"class":"info_text"})
        name_client = client_info[0].text
        print("name_client: " + name_client)
        date_review = clients.find_all("span", {"class":"ratingDate"})
        date_rev_cl = date_review[0].text.strip()
        print("date_rev_cl: " + date_rev_cl)
        titre_review = clients.find_all("span", {"class":"noQuotes"})
        titre_rev_cl = titre_review[0].text.strip()
        print("titre_rev_cl: " + titre_rev_cl)
        opinion = clients.find_all("p", {"class":"partial_entry"})
        opinion_cl = opinion[0].text.replace("\n","")
        print("opinion_cl: " + opinion_cl)
        csv_writer.writerow([inf_rest_name, rest_eclf, name_client, date_rev_cl, titre_rev_cl, opinion_cl])
csv_file.close()

I tried removing the "for clients in gen_rest" loop and putting this inside the first loop instead:

client_info = rest.find_all("div", {"class":"info_text"})
name_client = client_info[0].text
print("name_client: " + name_client)
date_review = rest.find_all("span", {"class":"ratingDate"})
date_rev_cl = date_review[0].text.strip()
print("date_rev_cl: " + date_rev_cl)
titre_review = rest.find_all("span", {"class":"noQuotes"})
titre_rev_cl = titre_review[0].text.strip()
print("titre_rev_cl: " + titre_rev_cl)
opinion = rest.find_all("p", {"class":"partial_entry"})
opinion_cl = opinion[0].text.replace("\n","")
print("opinion_cl: " + opinion_cl)

But it still writes the same information to the CSV file. After that I tried removing find_all and the [0] index, and the result was identical. What am I missing? ... I have already read other questions about this problem, but I have not been able to find my mistake.

1 answer:

Answer 0: (score: 0)

Try the following, which uses an f-string so that the offset for the next set of reviews is substituted into the URL string on each pass of the loop:

import requests, csv
from bs4 import BeautifulSoup as bs

with open("lebouclard.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter = ";", quoting=csv.QUOTE_MINIMAL)
    w.writerow(["inf_rest_name", "rest_eclf", "name_client", "date_rev_cl", "titre_rev_cl", "opinion_cl"])

    with requests.Session() as s:
        for offset in range(0,40,10):
            url = f'https://www.tripadvisor.fr/Restaurant_Review-g187147-d947475-Reviews-or{offset}-Le_Bouclard-Paris_Ile_de_France.html'
            r = s.get(url)
            soup = bs(r.content, 'lxml')
            if not offset:
                inf_rest_name = soup.select_one('.heading').text.replace("\n","").strip()
                rest_eclf = soup.select_one('.header_links a').text.strip()

            for review in soup.select('.reviewSelector'):
                name_client = review.select_one('.info_text > div:first-child').text.strip()
                date_rev_cl = review.select_one('.ratingDate')['title'].strip()
                titre_rev_cl = review.select_one('.noQuotes').text.strip()
                opinion_cl = review.select_one('.partial_entry').text.replace("\n","").strip()
                row = [f"{inf_rest_name}", f"{rest_eclf}", f"{name_client}", f"{date_rev_cl}" , f"{titre_rev_cl}", f"{opinion_cl}"]
                w.writerow(row)
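
As a side note (not part of the answer's code, just a minimal illustration): the question's loop always downloaded the same page, because the hard-coded URL contains "or10" and no "{}" placeholder, so .format(i) has nothing to substitute. The f-string above is what actually injects the changing offset:

base = "https://www.tripadvisor.fr/Restaurant_Review-g187147-d947475-Reviews-or10-Le_Bouclard-Paris_Ile_de_France.html"
# No "{}" placeholder, so .format() silently ignores its argument and returns the URL unchanged
print(base.format(10) == base.format(30))   # True: both requests would fetch the same page

# With an f-string the offset is interpolated into the URL, as in the answer above
for offset in range(0, 40, 10):
    print(f"...Reviews-or{offset}-Le_Bouclard-Paris_Ile_de_France.html")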

On my setup, to get this to work correctly I had to set the delimiter to "," instead of ";".
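
If you need that same adjustment, only the delimiter argument changes; a one-line sketch of it (everything else stays as in the code above):

w = csv.writer(csv_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)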

Sample of the results:

[screenshot: sample rows of the resulting CSV]