Error when parsing a web page with BeautifulSoup4

Time: 2019-02-14 16:39:29

Tags: python python-3.x beautifulsoup

I am trying to parse a web page and print the item links (href). Can you help me figure out where I am going wrong?

import requests
from bs4 import BeautifulSoup

link = "https://www.amazon.in/Power-Banks/b/ref=nav_shopall_sbc_mobcomp_powerbank?ie=UTF8&node=6612025031"

def amazon(url):
    sourcecode = requests.get(url)
    sourcecode_text = sourcecode.text
    soup = BeautifulSoup(sourcecode_text)

    for link in soup.findALL('a', {'class': 'a-link-normal aok-block a-text-normal'}):
        href = link.get('href')
        print(href)

amazon(link)

Output:

C:\Users\TIMAH\AppData\Local\Programs\Python\Python37\python.exe "C:/Users/TIMAH/OneDrive/Study Materials/Python_Test_Scripts/Self Basic/Class_Test.py"
Traceback (most recent call last):
  File "C:/Users/TIMAH/OneDrive/Study Materials/Python_Test_Scripts/Self Basic/Class_Test.py", line 15, in <module>
    amazon(link)
  File "C:/Users/TIMAH/OneDrive/Study Materials/Python_Test_Scripts/Self Basic/Class_Test.py", line 9, in amazon
    soup = BeautifulSoup(sourcecode_text, 'features="html.parser"')
  File "C:\Users\TIMAH\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\__init__.py", line 196, in __init__
    % ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: features="html.parser". Do you need to install a parser library?

Process finished with exit code 1

3 answers:

Answer 0: (score: 1)

You can add headers, though. Then, when you call find_all('a'), keep only the tags that have an href:

import requests
from bs4 import BeautifulSoup

link = "https://www.amazon.in/Power-Banks/b/ref=nav_shopall_sbc_mobcomp_powerbank?ie=UTF8&node=6612025031"

def amazon(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

    sourcecode = requests.get(url, headers=headers)
    sourcecode_text = sourcecode.text
    soup = BeautifulSoup(sourcecode_text, 'html.parser')

    for link in soup.find_all('a', href=True):
        href = link.get('href')
        print(href)

amazon(link)
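
To see what href=True changes without touching the network, here is a minimal sketch on an inline HTML snippet. The markup and the variable names are made up for illustration; only BeautifulSoup and find_all come from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
html = """
<a class="a-link-normal" href="/dp/B000000001">Power bank A</a>
<a name="top">Anchor without a link</a>
<a href="https://example.com/deals">Deals</a>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```

The second anchor is skipped because it has no href, so the loop never prints None for it.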

Answer 1: (score: 0)

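The problem in your code is the wrong method name, findALL. The soup object has no findALL method; BeautifulSoup treats an unknown attribute as a tag lookup, so soup.findALL evaluates to None, and calling it fails with TypeError: 'NoneType' object is not callable. To fix it, use find_all in new code; findAll (with a lowercase double l) should also work. I hope that makes it clear.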

import requests
from bs4 import BeautifulSoup

link = "https://www.amazon.in/Power-Banks/b/ref=nav_shopall_sbc_mobcomp_powerbank?ie=UTF8&node=6612025031"


def amazon(url):
    sourcecode = requests.get(url)
    sourcecode_text = sourcecode.text
    soup = BeautifulSoup(sourcecode_text, "html.parser")
    # Pass "html.parser" as the second argument so you do not get a warning.
    # Use soup.find_all in new code; soup.findAll should also work.
    for link in soup.find_all('a', {'class': 'a-link-normal aok-block a-text-normal'}):
        href = link.get('href')
        print(href)

amazon(link)
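
Both failure modes from this thread can be reproduced offline. The sketch below uses a made-up HTML snippet. It shows that soup.findALL is resolved as a lookup for a findALL tag and so evaluates to None, and that passing the whole text 'features="html.parser"' as a positional string, as in the question's traceback, raises bs4.FeatureNotFound.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical markup; no network access needed.
html = '<a class="a-link-normal aok-block a-text-normal" href="/dp/B000000001">Item</a>'
soup = BeautifulSoup(html, "html.parser")

# Unknown attributes on a soup are treated as tag lookups, so this is
# the same as soup.find("findALL") and evaluates to None.
print(soup.findALL)

# Calling None is what fails in a loop written with findALL.
try:
    soup.findALL("a")
except TypeError as exc:
    print("TypeError:", exc)

# The correctly spelled method returns the matching tags.
links = [a.get("href") for a in soup.find_all("a")]
print(links)

# The whole text 'features="html.parser"' is not a parser name.
try:
    BeautifulSoup(html, 'features="html.parser"')
    parser_error = None
except FeatureNotFound as exc:
    parser_error = exc
print(type(parser_error).__name__)
```

Passing "html.parser" on its own, either positionally or as features="html.parser", avoids the second error.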

Answer 2: (score: 0)

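If you try to scrape Amazon with requests now, you will not get anything useful back, because Amazon detects that the request comes from a script, and adding headers does not help (as far as I know).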

Instead, they respond with the following:

To discuss automated access to Amazon data please contact api-services-support@amazon.com.
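
One way to notice this case in a script is to look for that sentence in the response body before parsing it. This is a minimal sketch: the helper name looks_blocked and the sample strings are made up for illustration, and the check covers only this one message.

```python
# Marker taken from the message quoted above.
BLOCK_MARKER = "To discuss automated access to Amazon data"


def looks_blocked(body: str) -> bool:
    """Return True if the response body contains the automated-access notice."""
    return BLOCK_MARKER in body


# Made-up response bodies for illustration.
blocked_body = (
    "<!-- To discuss automated access to Amazon data "
    "please contact api-services-support@amazon.com. -->"
)
normal_body = "<html><body><a href='/dp/B000000001'>Item</a></body></html>"

print(looks_blocked(blocked_body))
print(looks_blocked(normal_body))
```

In a real script the argument would be the text of the response, for example sourcecode.text in the code from the question.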

You can scrape Amazon by rendering the page with requests-html or selenium.

A simple requests-html example that scrapes titles (the results will be similar if you open the same link in an incognito tab):

from requests_html import HTMLSession

session = HTMLSession()
url = 'https://www.amazon.com/s?k=apple+watch+series+6+band'
r = session.get(url)
r.html.render(sleep=1, keep_page=True, scrolldown=1)

for container in r.html.find('.a-size-medium'):
    title = container.text
    print(f"Title: {title}")

Output:

Title: New Apple Watch Series 6 (GPS, 40mm) - (Product) RED - Aluminum Case with (Product) RED - Sport Band
Title: SUPCASE [Unicorn Beetle Pro] Designed for Apple Watch Series 6/SE/5/4 [44mm], Rugged Protective Case with Strap Bands(Black)
Title: Spigen Rugged Armor Pro Designed for Apple Watch Band with Case for 44mm Series 6/SE/5/4 - Charcoal Gray
Title: Highly rated and well-priced products
Title: Fitlink Stainless Steel Metal Band for Apple Watch 38/40/42/44mm Replacement Link Bracelet Band Compatible with Apple Watch Series 6 Apple Watch Series 5 Apple Watch Series 1/2/3/4 (Grey,42/44mm)
Title: TalkWorks Compatible for Apple Watch Band 42mm / 44mm Comfort Fit Mesh Loop Stainless Steel Adjustable Magnetic Strap for iWatch Series 6, 5, 4, 3, 2, 1, SE - Rose Gold
Title: COOYA Compatible for Apple Watch Band 44mm 42mm Women Men iWatch Wristband with Protective Rugged Case Sport Strap Adjustable Replacement Band Compatible with Apple Watch Series 6 SE 5 4 3 2, Clear
Title: Stainless Steel Metal Bands Compatible with Apple Watch Band 42mm 44mm, Gold Replacement Strap with Adapter+Case Cover Compatible with iWatch Series 6 5 4 3 2 1 SE Sport
Title: elago W2 Charger Stand Compatible with Apple Watch Series 6/SE/5/4/3/2/1 (44mm, 42mm, 40mm, 38mm), Durable Silicone, Compatible with Nightstand Mode (Black)
Title: Element Case Black Ops Watch Band for Apple Watch Series 4/5/6/SE, 44mm - Black (EMT-522-244A-01)
...