Python: HTML table data cannot be found when running on the server

Posted: 2014-10-06 13:35:13

Tags: python web-scraping beautifulsoup

My code works when I run it locally, but it stops working when it actually runs online: find returns None. How can I fix this?

Here is my code:

import time
import sys

import urllib
import re
from bs4 import BeautifulSoup, NavigableString

print "Initializing Python Script"

print "The passed arguments are "
urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/", "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/", "https://www.alternate.nl/GIGABYTE/GV-N78TOC-3GD-grafische-kaart/html/product/1115798", "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]
i =0
regex = '<title>(.+?)</title>'
pattern = re.compile(regex)
word = "tweakers"
alternate = "alternate"
while i<len(urls):

  dataraw = urllib.urlopen(urls[i])
  data = dataraw.read()
  soup = BeautifulSoup(data)
  table = soup.find("table", {"class" : "spec-detail"})
  print table
  i+=1

This is the result:

Initializing Python Script
The passed arguments are 
None
None
None
None


Script finalized

I have tried using findAll and other methods... but I can't seem to understand why it works from my command line but not on the server itself... Any help?

Edit

Traceback (most recent call last):
  File "python_script.py", line 35, in 
soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 406, in open
response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 444, in error
return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

1 answer:

Answer 0 (score: 0)

I suspect you are running into differences between parsers.

Explicitly specifying the parser works for me:

import urllib2
from bs4 import BeautifulSoup

urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/",
        "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/",
        "https://www.alternate.nl/GIGABYTE/GV-N78TOC-3GD-grafische-kaart/html/product/1115798",
        "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]

for url in urls:
    # parse each page with an explicitly specified parser
    soup = BeautifulSoup(urllib2.urlopen(url), 'html.parser')
    table = soup.find("table", {"class": "spec-detail"})
    print table

In this case I am using html.parser, but you can experiment and specify lxml or html5lib, for example.
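
To see how the parsers can disagree on the same markup, here is a small illustration (the output follows the parser comparison in the BeautifulSoup documentation; lxml and html5lib have to be installed separately):

from bs4 import BeautifulSoup

# The same invalid snippet is repaired differently by each parser:
# html.parser simply drops the stray </p>, lxml additionally wraps the
# result in <html><body>, and html5lib rebuilds the full tree the way
# a browser would.
for parser in ('html.parser', 'lxml', 'html5lib'):
    print parser, BeautifulSoup("<a></p>", parser)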

Note that the third URL does not contain a table with class="spec-detail", so None is printed for it.

I have also made a few improvements:

  • removed the unused imports
  • replaced the indexed while loop with a nice for loop
  • removed the irrelevant code
  • replaced urllib with urllib2
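
As for the HTTP 403 from your edit: that error often means the server rejects requests carrying the default Python-urllib User-Agent. If you want to stay with urllib2, here is a minimal sketch of sending a browser-like header (the User-Agent string is only an example):

import urllib2
from bs4 import BeautifulSoup

url = "http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/"

# Build a Request with an explicit User-Agent so the server does not
# turn us away with 403 for looking like a script.
request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urllib2.urlopen(request), 'html.parser')
print soup.find("table", {"class": "spec-detail"})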

You can also use the requests module and set an appropriate User-Agent header, pretending to be a real browser:

from bs4 import BeautifulSoup
import requests

urls = ["http://tweakers.net/pricewatch/355474/gigabyte-gv-n78toc-3g/specificaties/",
        "http://tweakers.net/pricewatch/328943/sapphire-radeon-hd-7950-3gb-gddr5-with-boosts/specificaties/",
        "https://www.alternate.nl/GIGABYTE/GV-N78TOC-3GD-grafische-kaart/html/product/1115798",
        "http://tweakers.net/pricewatch/320116/raspberry-pi-model-b-(512mb)/specificaties/"]

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36'}
for url in urls:
    # send the browser-like User-Agent header with every request
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find("table", {"class": "spec-detail"})
    print table
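
If you want the actual specification values rather than the raw markup, you could extend the loop body along these lines (a rough sketch assuming the usual tr/th/td row layout, which I have not checked for every page):

    # inside the for-url loop, after the find:
    if table is not None:  # some of the pages have no spec table at all
        for row in table.find_all("tr"):
            # collect the text of every header and data cell in the row
            cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
            print cells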

Hope that helps.