I want to extract all of the product URLs from this page (https://www.example.com/products), then go to each URL, parse it, grab the ID value, and print it. I wrote some code, but so far it hasn't worked. [This is the product list the URLs are extracted from][1]
[This is the size list the ID values are extracted from][2]
Here is the code I have written so far.
import requests
from bs4 import BeautifulSoup

# Browser-style headers shared by both requests (the authority/method/path/scheme
# entries were DevTools pseudo-headers, not real request headers, so they are dropped)
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br', 'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0', 'referer': 'https://www.example.com/products',
    'sec-fetch-mode': 'navigate', 'sec-fetch-site': 'same-origin', 'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36',
}

webpage_response = requests.get('https://www.example.com/products/', headers=headers)
soup = BeautifulSoup(webpage_response.content, 'lxml')

for li in soup.find_all("li", {"class": "productData"}):
    for link in li.select("a.fn"):
        url = link.get('href')
        webpage2_response = requests.get(url, headers=headers)
        # Parse each product page into its own soup object, so the
        # listing soup is not overwritten mid-iteration
        soup2 = BeautifulSoup(webpage2_response.content, "lxml")
        for div in soup2.find_all("div", {"class": "size "}):  # note the trailing space
            for pid in div.select("a.selectSize"):
                print(pid['id'])
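One detail worth checking before the second request (it is not visible in the screenshots): if the href values on the listing page are relative paths, requests.get(url) will fail. A minimal sketch of the fix, assuming the listing page's URL is the correct base:

    from urllib.parse import urljoin

    # Resolve a possibly relative href against the listing page's URL
    url = urljoin('https://www.example.com/products/', link.get('href'))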
[1]: https://i.stack.imgur.com/sMl7s.png
[2]: https://i.stack.imgur.com/EX6T9.png
Answer 0 (score: 0)
I haven't used BeautifulSoup much, but you can try the re
module (link). This will find all of the href URLs, which means you either have to decode webpage_response.content from bytes to UTF-8 or just use webpage_response.text:
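The answer's original href snippet is not reproduced here; a minimal sketch of what it could look like, assuming the product links keep the class="fn" attribute shown in the question (the pattern below is hypothetical, not the answer's exact regex):

    import re

    # Hypothetical pattern: capture the href of every <a class="fn"> product link
    href_pat = r'<a class="fn"[^>]*href="(.*?)"'
    urls = re.findall(href_pat, webpage_response.text)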
Update: using re to find the ID values instead of the URLs:
import re

# Regular expression: capture the id attribute of each <a class="selectSize"> tag
# (non-greedy quantifiers, so one match cannot swallow several tags)
ssize_pat = r'<a class="selectSize" id="(.*?)"\s+?.*?</a>'

# re.findall() with the re.DOTALL (re.S) flag so that '.' also matches
# newline characters (\n) when a tag spans several lines
ids = re.findall(ssize_pat, webpage_response.text, re.S)
Sample output for the example in the screenshots:
In [0]: ids
Out[0]: ['44927']
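For comparison, the same extraction can stay in BeautifulSoup, which the question already uses; a short sketch, assuming every a.selectSize tag carries an id attribute:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(webpage_response.text, 'lxml')
    # Collect the id attribute of every <a class="selectSize"> on the page
    ids = [a['id'] for a in soup.select('a.selectSize')]
    print(ids)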