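I am writing a Python scraper for a website to extract the price, product number, cat number, and description. When I run this script, it only pulls the first item on each page and then moves on to the next URL. I am new to Python and would like to know how to modify it so that it pulls every product on the page. To clarify: the first URL has only one product, but the second and third URLs have many products that are not being pulled.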
import requests
from bs4 import BeautifulSoup
import random
import time
product_urls = [
    'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-precursor-assays/#orderinginformation',
    'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-primer-assay-plate/#orderinginformation',
    'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-primer-assays/#orderinginformation',
]
for URL in product_urls:
    page = requests.get(URL)
    soup = BeautifulSoup(page.text,"lxml")
    timeDelay = random.randrange(5, 25)
    for item in soup.select('.content'):
        cat_name = item.select_one('.title').text.strip()
        cat_discription = item.select_one('.copy').text.strip()
        product_name = (item.find('div',{'class':'headline'}).text.strip())
        product_discription = (item.find('div',{'class': 'copy'}).text.strip())
        product_number = (item.find('td',{'class': 'textLeft paddingTopLess'}).text.strip())
        cat_number = (item.find('td',{'class': 'textRight paddingTopLess2'}).text.strip())
        product_price = (item.find('span',{'class': 'prc'}).text.strip())
        print("Catagory Name: {}\n\nCatagory Discription: {}\n\nProduct Name: {}\n\nProduct Discription: {}\n\nProduct Number: {}\n\nCat No: {}\n\nPrice: {}\n\n".format(cat_name,cat_discription,product_name,product_discription,product_number,cat_number,product_price))
    time.sleep(timeDelay)
Answer 0 (score: 0)
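You can get the table elements from the div with class pane. The table at index 4 holds the main products, and the table at index 5 (when it exists) holds the additional products.
In the example below, I use a list comprehension to build a list of tuples containing the title, description, product number, category number, and price: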
from bs4 import BeautifulSoup
import requests
product_urls = [
    'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-precursor-assays/#orderinginformation',
    'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-primer-assay-plate/#orderinginformation',
    'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-primer-assays/#orderinginformation',
    'https://www.qiagen.com/us/shop/pcr/real-time-pcr-enzymes-and-kits/miscript-target-protectors/#orderinginformation',
    'https://www.qiagen.com/us/shop/pcr/real-time-pcr-enzymes-and-kits/two-step-qrt-pcr/miscript-sybr-green-pcr-kit/#orderinginformation'
]

session = requests.Session()  # reuse connections across requests

for URL in product_urls:
    response = session.get(URL)
    soup = BeautifulSoup(response.content, "html.parser")

    # every table inside the first div with class "pane"
    tables = soup.find_all("div", {"class":"pane"})[0].find_all("table")

    if (len(tables) > 4):
        # multi-table layout: the main products are in tables[4]
        product_list = [
            (
                t[0].find_all("div", {"class":"headline"})[0].text.strip(), #title
                t[0].find_all("div", {"class":"copy"})[0].text.strip(),     #description
                t[1].text.strip(),                                          #product number
                t[2].text.strip(),                                          #category number
                t[3].text.strip()                                           #price
            )
            for t in (tr.find_all('td') for tr in tables[4].find_all('tr'))
            if t  # skip rows that have no <td> cells
        ]
    elif (len(tables) == 1):
        # single-table layout: every field sits in the first cell
        product_list = [
            (
                t[0].find_all("div", {"class":"catNo"})[0].text.strip(),    #catNo
                t[0].find_all("div", {"class":"headline"})[0].text.strip(), #headline
                t[0].find_all("div", {"class":"price"})[0].text.strip(),    #price
                t[0].find_all("div", {"class":"copy"})[0].text.strip()      #description
            )
            for t in (tr.find_all('td') for tr in tables[0].find_all('tr'))
            if t
        ]
    else:
        # keep product_list defined so the print below cannot raise NameError
        product_list = []
        print("could not parse main product")

    print(product_list)

    if len(tables) > 5:
        # additional products, when present, are in tables[5]
        add_product_list = [
            (
                t[0].find_all("div", {"class":"title"})[0].text.strip(),    #title
                t[0].find_all("div", {"class":"copy"})[0].text.strip(),     #description
                t[1].text.strip(),                                          #product number
                t[2].text.strip(),                                          #category number
                t[3].text.strip()                                           #price
            )
            for t in (tr.find_all('td') for tr in tables[5].find_all('tr'))
            if t
        ]
        print(add_product_list)
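The script in the question most likely printed one product per page because find and select_one return only the first matching element, while find_all and select return every match. Here is a minimal sketch of the difference. The HTML is a made-up sample that only loosely resembles a product table, not the real page markup, and the names SAMPLE_HTML, first_only and products are illustrative:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page is more complex.
SAMPLE_HTML = """
<div class="pane">
  <table>
    <tr><th>Product</th><th>Cat No.</th><th>Price</th></tr>
    <tr><td><div class="headline">Assay A</div></td><td>111</td><td>$10.00</td></tr>
    <tr><td><div class="headline">Assay B</div></td><td>222</td><td>$20.00</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
pane = soup.find("div", {"class": "pane"})

# find() stops at the first match, so only one product comes back.
first_only = pane.find("div", {"class": "headline"}).text.strip()

# find_all() returns every match; walking the rows keeps each
# product's fields together in one tuple.
products = []
for row in pane.find_all("tr"):
    cells = row.find_all("td")
    if not cells:  # the header row contains only <th> cells
        continue
    products.append(tuple(cell.text.strip() for cell in cells))

print(first_only)
print(products)
```

Iterating over the rows rather than over a single outer container is what lets each product be collected separately.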
If you want to convert the list of tuples into a separate list for each field, check this answer
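One common way to do that conversion is to transpose the list with zip. This is a sketch rather than necessarily the approach the linked answer takes. The sample data is invented and assumes five-field tuples in the order title, description, product number, category number, price:

```python
# Hypothetical sample data; real values would come from the scraper.
product_list = [
    ("Assay A", "First description", "P-001", "111", "$10.00"),
    ("Assay B", "Second description", "P-002", "222", "$20.00"),
]

# zip(*rows) transposes rows into columns: one tuple per field.
columns = [list(column) for column in zip(*product_list)]

# Unpacking assumes product_list is non-empty and every tuple has 5 fields.
titles, descriptions, product_numbers, cat_numbers, prices = columns

print(titles)
print(prices)
```

If product_list can be empty, check for that before unpacking, because zip then yields no columns at all.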