BeautifulSoup is not returning the full HTML of the page

Date: 2018-06-14 12:34:11

Tags: python beautifulsoup request

I want to scrape a few fields from Amazon search pages, such as the title, URL, and ASIN, but I've run into a problem: the script only parses 15 products even though the page displays around 50. I printed the entire HTML to the console and saw that it ends after 15 products, and the script raises no errors. Here is the relevant part of my script:

keyword = "men jeans".replace(' ', '+')

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b3) Gecko/20090305 Firefox/3.1b3 GTB5'}
url = "https://www.amazon.com/s/field-keywords={}".format(keyword)

request = requests.session()
req = request.get(url, headers = headers)
sleep(3)
soup = BeautifulSoup(req.content, 'html.parser')
print(soup)
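
For reference, a quick way to confirm the cutoff is to count the result containers in the response itself (this assumes the "result_" id prefix that Amazon's result markup used at the time):

# continuing from the snippet above: count result containers in the raw HTML
print(len(soup.select("[id^='result_']")))  # expected to print 15 here rather than ~50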

2 Answers:

Answer 0 (score: 2):

That's because some of the items are generated dynamically by JavaScript, so they never appear in the static HTML that requests downloads. There may be a better solution than using Selenium, but as a workaround you can try the approach below.

from selenium import webdriver
from bs4 import BeautifulSoup

def fetch_item(driver, keyword):
    # load the fully rendered search page, then parse it with BeautifulSoup
    driver.get(url.format(keyword.replace(" ", "+")))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # each search result sits in a container whose id starts with "result_"
    for items in soup.select("[id^='result_']"):
        try:
            name = items.select_one("h2").text
        except AttributeError:
            name = ""
        print(name)

if __name__ == '__main__':
    url = "https://www.amazon.com/s/field-keywords={}"
    driver = webdriver.Chrome()
    try:
        fetch_item(driver, "men jeans")
    finally:
        driver.quit()

After running the script above, you should get 56 names or so.
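Because the missing items are injected by JavaScript, it can also help to wait explicitly for the result containers before reading page_source, rather than relying on timing alone. A minimal sketch using Selenium's WebDriverWait, which would slot into fetch_item just before the BeautifulSoup call (the 10-second timeout is an arbitrary choice):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block until at least one result container is present, up to 10 seconds
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[id^='result_']"))
)
soup = BeautifulSoup(driver.page_source, 'html.parser')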

Answer 1 (score: 0):

import requests
from bs4 import BeautifulSoup

keyword = "red car".replace(' ', '+')
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b3) Gecko/20090305 Firefox/3.1b3 GTB5'}
session = requests.session()

for page in range(1, 21):
    url = "https://www.amazon.com/s/field-keywords=" + keyword + "?page=" + str(page)
    req = session.get(url, headers=headers)
    soup = BeautifulSoup(req.content, 'html.parser')
    results = soup.find_all("li", {"class": "s-result-item"})

    for i in results:
        try:
            # title, with the sponsored tag stripped out
            print(i.find("h2", {"class": "s-access-title"}).text.replace('[SPONSORED]', ''))
            # price, flattened onto one line
            print(i.find("span", {"class": "sx-price-large"}).text.replace("\n", ' '))
            print('*' * 20)
        except AttributeError:
            # some result tiles lack a title or price; skip them
            pass

Amazon's search results top out at 20 pages, so this scrapes them page by page.
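
If you'd rather not hardcode the page count, a variant is to keep requesting pages until one comes back empty. A sketch under the same URL scheme as above (the 2-second pause is an arbitrary politeness delay, and the User-Agent is a placeholder):

import requests
from time import sleep
from bs4 import BeautifulSoup

keyword = "red car".replace(' ', '+')
headers = {'User-Agent': 'Mozilla/5.0'}  # substitute a real desktop User-Agent string
session = requests.session()

page = 1
while True:
    url = "https://www.amazon.com/s/field-keywords=" + keyword + "?page=" + str(page)
    soup = BeautifulSoup(session.get(url, headers=headers).content, 'html.parser')
    results = soup.find_all("li", {"class": "s-result-item"})
    if not results:
        break  # no results on this page: we've run past the last page
    print(page, len(results))
    sleep(2)  # pause between requests to avoid hammering the site
    page += 1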