Question

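I am trying to parse an Amazon search results page. I want to access the data contained in the <li> tags with id=result_0, id=result_1, id=result_2, and so on. The find_all('li') call returns only 4 results (up to result_3), which seems odd, because I see 12 results when I view the page in a browser.

When I print parsed_html, I can see that it contains everything up to result_23. Why doesn't find_all return all 24 objects? My code snippet is below.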

import requests

try: 
    from BeautifulSoup import bsoup
except ImportError:
    from bs4 import BeautifulSoup as bsoup

search_url = ('https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Dstripbooks'
              '&field-keywords=data+analytics')
response = requests.get(search_url, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"})
parsed_html = bsoup(response.text)
results_tags = parsed_html.find_all('div',attrs={'id':'atfResults'})
results_html = bsoup(str(results_tags[0]))
results_html.find_all('li')

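For what it's worth, the results_tags object also contains only 4 results. That is why I think the problem lies in the find_all step rather than in the BeautifulSoup object.

If anyone can help me figure out what is happening here and how to access all of the search results on this page, I would really appreciate it!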
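
One way to narrow this down is to count the result ids at three levels: in the raw response text, in the whole parsed tree, and inside the single container div. The sketch below does that on a small made-up page; the markup and the `otherResults` id are assumptions for illustration only, not Amazon's actual HTML.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page source: two containers holding three results in total.
raw_html = (
    '<div id="atfResults"><ul>'
    '<li id="result_0">A</li><li id="result_1">B</li>'
    '</ul></div>'
    '<div id="otherResults"><ul>'
    '<li id="result_2">C</li>'
    '</ul></div>'
)

result_id = re.compile(r"^result_\d+$")

# 1. How many result ids appear in the raw text, before any parsing?
in_raw_text = len(re.findall(r'id="result_\d+"', raw_html))

# 2. How many survive parsing, searching the whole tree?
soup = BeautifulSoup(raw_html, "html.parser")
in_whole_tree = len(soup.find_all("li", id=result_id))

# 3. How many sit inside the one container the question searches?
container = soup.find("div", id="atfResults")
in_container = len(container.find_all("li", id=result_id))

print(in_raw_text, in_whole_tree, in_container)
```

If the first two numbers agree and only the third is smaller, the results were parsed correctly and are simply outside the container being searched. If the second number is already smaller than the first, the parser is the more likely suspect.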

Answer 1

You can access the li elements directly by class instead of by id. The following prints the text of each li element.

results_tags = parsed_html.find_all('li',attrs={'class':'s-result-item'})
for r in results_tags:
    print(r.text)
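
To see the class-based lookup in isolation, here is a minimal sketch that runs against an inline string instead of the live page. The markup is invented for the example (only the `s-result-item` class name comes from the answer above); it shows that `class_` matches an element even when it carries other classes as well, and that list items without the class are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a search results page.
SAMPLE_HTML = """
<ul>
  <li id="result_0" class="s-result-item celwidget">First book</li>
  <li id="result_1" class="s-result-item celwidget">Second book</li>
  <li id="result_2" class="s-result-item">Third book</li>
  <li class="pagination">Next page</li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ matches any element whose class list contains "s-result-item".
items = soup.find_all("li", class_="s-result-item")

result_ids = [li["id"] for li in items]
titles = [li.get_text(strip=True) for li in items]

print(result_ids)
print(titles)
```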

Answer 2

import requests

try: 
    from BeautifulSoup import bsoup
except ImportError:
    from bs4 import BeautifulSoup as bsoup

search_url = 'https://www.amazon.com/s/?url=search-alias%3Dstripbooks&field-keywords=data+analytics'  # removed the irrelevant part of the URL
response = requests.get(search_url, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"})  # added the 'Accept' header
parsed_html = bsoup(response.text, 'lxml')
lis = parsed_html.find_all('li', class_='s-result-item' ) # use class to find li tag
len(lis)
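
Compared with the question, this code changes two things: it sends an Accept header and it names the parser ('lxml') explicitly. Different parsers can build different trees from the same markup, so it may be worth checking how many li tags each installed parser finds. The sketch below does that on a deliberately sloppy snippet; `count_list_items` and the sample markup are hypothetical helpers for illustration, and parsers that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately sloppy markup: the <li> tags are never closed.
SLOPPY_HTML = '<ul><li id="result_0">A<li id="result_1">B</ul>'


def count_list_items(markup, parser):
    """Parse markup with the named parser and count the <li> tags found."""
    soup = BeautifulSoup(markup, parser)
    return len(soup.find_all("li"))


counts = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        counts[parser] = count_list_items(SLOPPY_HTML, parser)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        continue

print(counts)
```

Running the same comparison on response.text instead of the sample string would show whether the parser choice affects the number of results found on the real page.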

Out:
