Question

I'm trying to extract information from Volkswagen's page on kununu, for example the "Pro" entries.

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.kununu.com/de/volkswagen/kommentare'
page = requests.get(url)

soup = bs(page.text, 'html.parser')
divs = soup.find_all(class_="col-xs-12 col-lg-12")

for h2 in soup.find_all('h2', class_='h3', text=['Pro']):
    print(h2.find_next_sibling('p').get_text())

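But as output I only get the first 10 "Pro" entries. It looks like the page shows only the first 10 reviews by default, while all the hidden reviews sit under the "col-xs-12 col-lg-12" class... or maybe I'm missing something. Could you help me extract all the data, not just the first 10?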

Answer 1

You can load the remaining comments by mimicking the XHR requests that the browser sends to load more comments dynamically.

Working code (note: it uses f-strings, so Python 3.6+ is required; on an older Python version, use .format() instead):

from bs4 import BeautifulSoup
import requests


comments = []
with requests.Session() as session:
    session.headers = {
        'x-requested-with': 'XMLHttpRequest'
    }

    page = 1
    while True:
        print(f"Processing page {page}..")

        url = f'https://www.kununu.com/de/volkswagen/kommentare/{page}'
        response = session.get(url)

        soup = BeautifulSoup(response.text, 'html.parser')
        new_comments = [
            pro.find_next_sibling('p').get_text()
            for pro in soup.find_all('h2', text='Pro')
        ]
        if not new_comments:
            print(f"No more comments. Page: {page}")
            break

        comments += new_comments

        # just to see current progress so far
        print(comments)
        print(len(comments))

        page += 1

print(comments)

Note how we instantiate and use a requests.Session() object, which provides performance benefits when sending multiple requests to the same host.
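
The "keep requesting pages until one comes back empty" loop can also be separated from the networking, which makes it easy to try out offline. The following is only a minimal sketch: `build_url`, `collect_all`, `fetch_page` and `max_pages` are hypothetical names introduced here for illustration, and the fake page data stands in for the real kununu responses.

```python
from typing import Callable, List


def build_url(page: int) -> str:
    # str.format() equivalent of the f-string URL, for Python older than 3.6
    return 'https://www.kununu.com/de/volkswagen/kommentare/{}'.format(page)


def collect_all(fetch_page: Callable[[int], List[str]],
                max_pages: int = 1000) -> List[str]:
    """Call fetch_page(1), fetch_page(2), ... until a page comes back empty.

    max_pages is an assumed safety cap, in case a site never returns
    an empty page.
    """
    items = []
    for page in range(1, max_pages + 1):
        new_items = fetch_page(page)
        if not new_items:
            break
        items.extend(new_items)
    return items


# Stand-in for the real network call: three pages of fake data, then nothing.
fake_pages = {1: ['a', 'b'], 2: ['c'], 3: ['d', 'e']}
result = collect_all(lambda page: fake_pages.get(page, []))
print(result)
```

In real use, `fetch_page` would wrap the `session.get(...)` call and the BeautifulSoup parsing from the code above, returning the list of "Pro" texts found on that page.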
