Python web scraping from eBay

Time: 2021-05-06 18:46:24

Tags: python html web-scraping

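I'm trying to write a program that scrapes the title of the first item in the laptop product listing on ebay.com. I suspect the last two lines of code are not matching the right tags and attributes. Please tell me why the code can't find the information and what you would suggest. Thanks for reading.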

import requests
import re
from bs4 import BeautifulSoup

url = "https://www.ebay.com/sch/i.html?_from=R40&_nkw=laptop&_sacat=0&_pgn=1"
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"}
res = requests.get(url, headers=headers)
res.raise_for_status()
soup = BeautifulSoup(res.text, "lxml")

# print(res.text)
items = soup.find_all("div", attrs={"class":re.compile("^sg-col-inner")}) 
print(items[0].find("span", attrs={"class":"a-size-medium a-color-base a-text-normal"}).get_text()) # Error
# IndexError: list index out of range

1 answer:

Answer 0 (score: 2)

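The first tag with class="s-item" does not contain an <h3> tag (you can see this by inspecting the page's HTML structure). Also, the class names used in the question (sg-col-inner, a-size-medium a-color-base a-text-normal) look like Amazon's markup rather than eBay's, so find_all() returns an empty list and items[0] raises the IndexError. You can use this example to print the titles of all the search results: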

import requests
from bs4 import BeautifulSoup

url = "https://www.ebay.com/sch/i.html?_from=R40&_nkw=laptop&_sacat=0&_pgn=1"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
}
res = requests.get(url, headers=headers)
res.raise_for_status()
soup = BeautifulSoup(res.text, "lxml")

# Print the title (the <h3> text) of each result inside the #srp-river-results container
for item in soup.select("#srp-river-results li.s-item"):
    print(item.h3.text)

Prints:

Lenovo ThinkPad T400 2 Duo P8400 3GB 160GB HDD 1280x800 WiFi DVD Windows 10 Pro
HP ProBook 655 G1 15.6" Laptop AMD CPU 2.5GHz 4GB 250GB Windows 10
Lenovo ThinkPad Yoga 370 Intel i5 8GB DDR4 512GB SSD 1920x1080 IPS Windows 10 Pro
Portatil acer extensa ex215-52-37y7 Core i3-1005g1 8gb ddr4 ssd 256gb FHD I

...
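
Since some matched items may lack an <h3>, and a selector may match nothing at all, a more defensive variant of the loop could look like the sketch below. The helper names `titles_from_items` and `first_title` are hypothetical and not part of BeautifulSoup; they only assume that each item offers `find()` and that the found tag offers `get_text()`, as bs4 `Tag` objects do.

```python
def titles_from_items(items):
    """Collect the <h3> text of each item, skipping items without an <h3>."""
    titles = []
    for item in items:
        # find() returns None when no matching tag exists, so check before use.
        heading = item.find("h3")
        if heading is not None:
            titles.append(heading.get_text(strip=True))
    return titles


def first_title(items):
    """Return the first available title, or None if nothing matched."""
    titles = titles_from_items(items)
    return titles[0] if titles else None
```

With the `soup` object from the answer, `first_title(soup.select("#srp-river-results li.s-item"))` would give the first title, or `None` rather than an IndexError when the selector matches nothing.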