My code is not working when it actually runs online: when I use find, it returns None.
How can I solve this?
Here is my code:
import time
import sys
import urllib
import re
from bs4 import BeautifulSoup, NavigableString
print "Initializing Python Script"
print "The passed arguments are "
urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/", "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/", "https://www.alternate.nl/GIGABYTE/GV-N78TOC-3GD-grafische-kaart/html/product/1115798", "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]
i =0
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)
word = "tweakers"
alternate = "alternate"
while i < len(urls):
    dataraw = urllib.urlopen(urls[i])
    data = dataraw.read()
    soup = BeautifulSoup(data)
    table = soup.find("table", {"class" : "spec-detail"})
    print table
    i += 1
The result is as follows:
Initializing Python Script
The passed arguments are
None
None
None
None
Script finalized
I have tried using findAll and other methods... but I cannot seem to understand why it works on my command line but not on the server itself... Any help?
Edit:
Traceback (most recent call last):
  File "python_script.py", line 35, in <module>
    soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
Answer 0 (score: 0)
I suspect you are running into differences between parsers.
Explicitly specifying the parser works for me:
import urllib2
from bs4 import BeautifulSoup
urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/",
"http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/",
"https://www.alternate.nl/GIGABYTE/GV-N78TOC-3GD-grafische-kaart/html/product/1115798",
"http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]
for url in urls:
    soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
    table = soup.find("table", {"class": "spec-detail"})
    print table
In this case I am using html.parser, but you can experiment and specify lxml or html5lib instead, for example.
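As a quick sketch of what that swap looks like (my example, not part of the original answer; each alternative parser needs its package installed, e.g. pip install lxml or pip install html5lib):

import urllib2
from bs4 import BeautifulSoup

url = "http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/"
# Only the second argument changes; 'html5lib' would work the same way
soup = BeautifulSoup(urllib2.urlopen(url), 'lxml')
print soup.find("table", {"class": "spec-detail"}) is not None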
Note that the third URL does not contain a table with class="spec-detail", so None is printed for it.
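Since find() returns None when nothing matches, it can also help to guard the result before using it. A minimal sketch (my addition, not part of the original answer):

import urllib2
from bs4 import BeautifulSoup

url = "http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/"
soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
table = soup.find("table", {"class": "spec-detail"})

# Guard against the None case instead of printing it blindly
if table is not None:
    print table.get_text()
else:
    print "No spec-detail table found on", url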
I have also introduced an improvement: urllib is replaced with urllib2.
You can also use the requests module and set an appropriate User-Agent header to pretend to be a real browser:
from bs4 import BeautifulSoup
import requests
urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/",
"http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/",
"https://www.alternate.nl/GIGABYTE/GV-N78TOC-3GD-grafische-kaart/html/product/1115798",
"http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36'}
for url in urls:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find("table", {"class": "spec-detail"})
    print table
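Since your traceback shows HTTP Error 403, it may also be worth checking the status code before parsing, so you do not feed an error page to BeautifulSoup. A minimal sketch under the same assumptions as above (requests installed, same User-Agent header):

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36'}
url = "https://www.alternate.nl/GIGABYTE/GV-N78TOC-3GD-grafische-kaart/html/product/1115798"

response = requests.get(url, headers=headers)
# Skip responses the server refused (e.g. 403 Forbidden) instead of parsing them
if response.status_code != 200:
    print "Skipping %s (HTTP %d)" % (url, response.status_code)
else:
    print "OK: %s" % url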
Hope that helps.