I'm trying to build a scraping tool that imports a csv file, appends each row of the csv to a URL, and then scrapes that URL for a specific field. So far the tool builds all of the URLs from the data and scrapes, but it only returns data for the first couple and just prints the URLs for the rest:
import urllib
import re
import requests
from numpy import genfromtxt
from time import sleep

my_data = genfromtxt('ASINS.csv', delimiter=',', dtype=None)
for ASIN in my_data[:20]:
    url = "http://www.amazon.com/gp/product/" + ASIN[1:11]
    sleep(1.5)
    website_html = requests.get(url).text
    print len(website_html)
    print url
    ranks = re.findall(r'#.\sin\s.*', website_html)
    for rank in ranks:
        print rank
The output only returns the scraped data for the first example, e.g.:
344781
http://www.amazon.com/gp/product/B00DPE9EQO
#1 in Beauty (<a href="http://www.amazon.com/gp/bestsellers/beauty">See Top 100 in Beauty</a>)
1378
http://www.amazon.com/gp/product/B00CD0H1ZC
327515
http://www.amazon.com/gp/product/B00GP184WO
1378
http://www.amazon.com/gp/product/B00CAZAU62
1378
http://www.amazon.com/gp/product/B00KCFAZTE
1378
http://www.amazon.com/gp/product/B00C7DYBX0
3
and a snippet from the csv:
B00DPE9EQO
B00CD0H1ZC
B00GP184WO
B00CAZAU62
B00KCFAZTE
B00C7DYBX0
B00IS8Y0HK
B00CKFL93K
B00DDT116M
B00GYF65TK
B00JV8L5N8
Can anyone give me any input on why it's doing this?
Answer 0 (score: 1)
A couple of things helped me to get the data you're asking for: a requests.Session(), a User-Agent header, and BeautifulSoup (to extract the Best Sellers Ranks).
Complete code:
from time import sleep

from bs4 import BeautifulSoup
from numpy import genfromtxt
import requests

my_data = genfromtxt('ASINS.csv', delimiter=',', dtype=None)

# initialize a session
session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36'}

for ASIN in my_data[:20]:
    url = "http://www.amazon.com/gp/product/" + ASIN[1:11]
    sleep(1.5)

    response = session.get(url, headers=headers)
    soup = BeautifulSoup(response.content)

    print url

    # get best seller rankings
    for rank in soup.select('ul.zg_hrsr li.zg_hrsr_item'):
        print rank.text

    print "----"
Prints:
http://www.amazon.com/gp/product/B00DPE9EQO
#1
in Health & Personal Care > Vitamins & Dietary Supplements > Vitamins > Vitamin C > C-Complex
#1
in Beauty > Skin Care > Face > Creams & Moisturizers > Fluids & Lotions > Fluids
#1
in Beauty > Skin Care > Face > Oils & Serums
----
http://www.amazon.com/gp/product/B00CD0H1ZC
#1
in Pet Supplies > Dogs > Grooming > Shedding Tools
#1
in Pet Supplies > Cats > Grooming > Shedding Tools
----
http://www.amazon.com/gp/product/B00GP184WO
#1
in Health & Personal Care > Health Care > Sleep & Snoring > Sleeping Masks
----
...
Answer 1 (score: 0)
First, all those results with the same small size, 1378 bytes long, are probably a "404 Not Found" page of some sort. I'd try a one-off test with if len(website_html) == 1378: print website_html and look at the output. If it turns out you're getting a 404 Not Found, or some other error such as "you're retrieving pages too quickly and we think you're a bot, so we won't give you that page", then you'll know how to fix your code (e.g., in the latter case, increase the sleep() time).
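A minimal sketch of that check might look like the following (B00CD0H1ZC is just one of the suspicious ASINs from your output; the HTTP status code, which requests exposes as response.status_code, is also worth printing):

import requests

# fetch one of the URLs that came back 1378 bytes long and inspect it
response = requests.get("http://www.amazon.com/gp/product/B00CD0H1ZC")
print response.status_code      # a 404 or 503 here would explain the short pages
if len(response.text) == 1378:
    print response.text         # dump the page to see what Amazon actually sent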
Second, your regular expression only works if there is exactly one character after the #. If something is ranked #10 or lower (i.e., a numerically higher rank), your regex will fail. Try #\d+ instead of #. and see if that helps.
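For example, against a made-up rank line like the ones in your output (the HTML string below is hypothetical):

import re

html = '#12 in Beauty (<a href="http://www.amazon.com/gp/bestsellers/beauty">See Top 100 in Beauty</a>)'
print re.findall(r'#.\sin\s.*', html)     # [] -- '#.' allows only one character after '#'
print re.findall(r'#\d+\sin\s.*', html)   # matches the two-digit rank '#12 in Beauty ...'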
...Ah. While I was writing this, someone else gave a better answer. Fine. I'll post this anyway, since my suggestions don't duplicate theirs and might also help.