I am trying to run a scraper for this website. The code works when I use a single URL, but when I add several URLs it produces no output. I need it to go through the different URLs and scrape the information from each one.

import requests
import csv
from bs4 import BeautifulSoup
from html.parser import HTMLParser
from time import sleep
from random import randint
import urllib.request

r=requests.get('https://www.qiagen.com/us/products/a-z-list/#&&s=Ascending&pg=55&q=&l=')
c=r.content
s=BeautifulSoup(c,"html.parser")
product_urls = ['https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-precursor-assays/#orderinginformation',
                'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-primer-assay-plate/#orderinginformation',
                'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-primer-assays/#orderinginformation',
                'https://www.qiagen.com/us/shop/genes-and-pathways/technology-portals/browse-qpcr/mirna-gene-expression/mirna-isolation/miscript-single-cell-qpcr-kit/#orderinginformation']
for url in product_urls:
    page = urllib.request.urlopen(url)
    s = BeautifulSoup(page,"html.parser")
getall = s.find_all("div",{"class":"gene_globe_segment_0_OrderingInfoPane"})
getall
for i in getall:
    product_name = (i.find('div',{'class':'title'}).text.strip())
    product_discription = (i.find('div',{'class': 'copy'}).text.strip())
    product_number = (i.find('td',{'class': 'textLeft paddingTopLess'}).text.strip())
    cat_number = (i.find('td',{'class': 'textRight paddingTopLess'}).text.strip())
    product_price = (i.find('td',{'class': 'textRight paddingTopLess priceSingle'}).text.strip())
for i in getall:
    print(i.find('div',{'class':'title'}).text.strip()) # product name
    print(i.find('div',{'class': 'copy'}).text.strip()) # product description
    print(i.find('td',{'class': 'textLeft paddingTopLess'}).text.strip()) # product number
    print(i.find('td',{'class': 'textRight paddingTopLess'}).text.strip()) # cat number
    print(i.find('td',{'class': 'textRight paddingTopLess priceSingle'}).text.strip()) # product price
    print(' ')
Answer 0 (score: 0)
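Your script has several problems. You defined the wrong class name for the main container. Your script is incorrectly indented. Finally, you need to adjust the selectors so that they work across the different pages. I have cut the printed items down to two so that I can give you a clear demonstration. I tried to clean up your mess a bit. The modified script I pasted below works.
Here you go: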
import requests
from bs4 import BeautifulSoup

product_urls = [
    'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-precursor-assays/#orderinginformation',
    'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-primer-assay-plate/#orderinginformation',
    'https://www.qiagen.com/us/shop/pcr/primer-sets/miscript-primer-assays/#orderinginformation',
]

for URL in product_urls:
    # Download and parse each page inside the loop, so every URL is processed
    page = requests.get(URL)
    soup = BeautifulSoup(page.text,"lxml")
    # Visit every ".content" container on the page
    for item in soup.select(".content"):
        product_name = item.select_one('.title').text.strip()
        product_discription = item.select_one('.copy').text.strip()
        print("Name: {}\n\nDescription: {}\n\n".format(product_name,product_discription))
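One caveat when the same selectors run against pages with different layouts: `select_one` returns `None` when nothing matches, so calling `.text` on the result raises `AttributeError`. Below is a minimal sketch of one way to guard against that. It is not part of the script above: `SAMPLE_HTML`, `text_or_none` and `extract_products` are hypothetical names made up for illustration, and the markup is an invented stand-in for a downloaded page so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded product page.
SAMPLE_HTML = """
<div class="content">
  <div class="title"> Example Kit </div>
  <div class="copy"> For demonstration only </div>
</div>
<div class="content">
  <div class="copy">Block without a title</div>
</div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


def extract_products(html):
    """Collect name/description pairs, skipping blocks that lack either field."""
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.select(".content"):
        name = text_or_none(item, ".title")
        description = text_or_none(item, ".copy")
        if name is None or description is None:
            continue  # this block does not look like a product entry
        products.append({"name": name, "description": description})
    return products


products = extract_products(SAMPLE_HTML)
for product in products:
    print("Name: {}\nDescription: {}\n".format(product["name"], product["description"]))
```

In the real loop you would pass `page.text` to `extract_products` instead of the sample string. Skipping incomplete blocks is only one possible choice; you may prefer to keep them and print a placeholder.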