Problem parsing and extracting ID values from HTML with Python and BeautifulSoup

Posted: 2019-10-13 13:01:08

Tags: python beautifulsoup python-requests

I want to extract all of the product URLs on this page (https://www.example.com/products), then visit each URL, parse it, get the ID value, and print it. I have written some code, but so far it hasn't worked. [This is the product list the URLs are extracted from][1]

[This is the size list the ID values are extracted from][2]

Here is the code I have written so far.

import requests
from bs4 import BeautifulSoup

# The same headers are sent with every request, so build the dict once
headers = {'authority': 'www.example.com', 'method': 'GET', 'path': '/products/', 'scheme': 'https', 'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3', 'accept-encoding': 'gzip, deflate, br', 'accept-language': 'en-US,en;q=0.9', 'cache-control': 'max-age=0', 'referer': 'https://www.example.com/products', 'sec-fetch-mode': 'navigate', 'sec-fetch-site': 'same-origin', 'sec-fetch-user': '?1', 'upgrade-insecure-requests': '1', 'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}

webpage_response = requests.get('https://www.example.com/products/', headers=headers)

webpage = webpage_response.content
soup = BeautifulSoup(webpage, 'lxml')

for li in soup.find_all("li", {"class": "productData"}):
    for link in li.select("a.fn"):
        url = link.get('href')
        webpage2_response = requests.get(url, headers=headers)
        webpage2 = webpage2_response.content  # was webpage_response2: undefined name
        soup2 = BeautifulSoup(webpage2, "lxml")  # parse the product page, not the listing again
        for div in soup2.find_all("div", {"class": "size"}):  # "size ", with a trailing space, never matches
            for pid in div.select("a.selectSize"):
                print(pid['id'])  # print pid['id'] is Python 2 syntax


  [1]: https://i.stack.imgur.com/sMl7s.png
  [2]: https://i.stack.imgur.com/EX6T9.png

1 answer:

Answer 0 (score: 0)

I haven't used BeautifulSoup much, but you could try the re module (link). This will find all of the href URLs, which means you either have to convert webpage_response.content from bytes to utf-8, or simply use webpage_response.text:
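A minimal sketch of that href extraction, run against an inline HTML snippet (the snippet and its URLs are stand-ins for the real webpage_response.text, which isn't shown in the question):

```python
import re

# Hypothetical HTML standing in for webpage_response.text
html = '''
<li class="productData"><a class="fn" href="https://www.example.com/products/1">One</a></li>
<li class="productData"><a class="fn" href="https://www.example.com/products/2">Two</a></li>
'''

# Non-greedy group captures each href value separately.
# re.findall needs a str, so use response.text
# (or response.content.decode('utf-8')) rather than raw bytes.
urls = re.findall(r'href="(.*?)"', html)
print(urls)  # ['https://www.example.com/products/1', 'https://www.example.com/products/2']
```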

Update: finding the ID values with re:

import re

# Regular expression: non-greedy groups so each anchor is matched separately
ssize_pat = r'<a class="selectSize" id="(.*?)".*?</a>'


# re.findall() with the re.S (re.DOTALL) flag so that `.` also
# matches newline characters (\n) in our search

ids = re.findall(ssize_pat, webpage_response.text, re.S)

Sample output for the example in the photos:

In [0]: ids
Out[0]: ['44927']
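For comparison, the same ID extraction can be done without regex, staying in BeautifulSoup as the question intended. This sketch parses a hypothetical size snippet modeled on the screenshot (the markup and the ID 44927 are taken from the question's example; html.parser is used so no lxml install is needed):

```python
from bs4 import BeautifulSoup

# Hypothetical size markup modeled on the question's screenshot
html = '''
<div class="size ">
  <a class="selectSize" id="44927">10.5</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# select() matches individual class tokens, so the trailing
# space in class="size " is not a problem here
ids = [a['id'] for a in soup.select('div.size a.selectSize')]
print(ids)  # ['44927']
```

A CSS selector like this sidesteps the trailing-space pitfall that breaks find_all("div", {"class": "size "}) in the question's code.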