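The web page lists a large number of journal titles along with other details. I am trying to scrape the table contents into a DataFrame.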
# http://www.citefactor.org/journal-impact-factor-list-2015.html
import bs4 as bs
import urllib  # Using Python 2.7
import pandas as pd
dfs = pd.read_html('http://www.citefactor.org/journal-impact-factor-list-2015.html/', header=0)
for df in dfs:
    print(df)
    df.to_csv('citefactor_list.csv', header=True)
However, I get the error below. I looked through some previously asked questions but could not fix it.
Error:
Traceback (most recent call last):
  File "scrape_impact_factor.py", line 7, in <module>
    dfs = pd.read_html('http://www.citefactor.org/journal-impact-factor-list-2015.html/', header=0)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 896, in read_html
    keep_default_na=keep_default_na)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 733, in _parse
    raise_with_traceback(retained)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 727, in _parse
    tables = p.parse_tables()
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 196, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 450, in _build_doc
    return BeautifulSoup(self._setup_build_doc(), features='html5lib',
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 443, in _setup_build_doc
    raw_text = _read(self.io)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/html.py", line 130, in _read
    with urlopen(obj) as url:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/common.py", line 60, in urlopen
    with closing(_urlopen(*args, **kwargs)) as f:
  File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 448, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 531, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 500: Internal Server Error
Answer 0 (score: 1)
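A 500 Internal Server Error means something went wrong on the server side, so it is normally outside your control.
In this case, though, the real problem is that you are using the wrong URL.
If you open http://www.citefactor.org/journal-impact-factor-list-2015.html/ in a browser, you get a 404 Not Found error. Remove the trailing slash, i.e. use http://www.citefactor.org/journal-impact-factor-list-2015.html, and it will work.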
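As a rough sketch of that fix: the snippet below strips the trailing slash before handing the URL to pd.read_html, and re-raises any HTTP error with the status code and URL attached so a bad address is easier to spot. The helper names normalize_url and read_tables are hypothetical, not part of pandas. The code is written for Python 3; on the Python 2.7 setup in the question the exception would be urllib2.HTTPError instead. It also assumes the page is still online and that a parser such as lxml or html5lib is installed.

```python
import urllib.error  # Python 3 home of HTTPError (urllib2.HTTPError on Python 2.7)


def normalize_url(url):
    """Drop trailing slashes, so '...2015.html/' becomes '...2015.html'."""
    return url.rstrip("/")


def read_tables(url):
    """Return every HTML table found at `url` as a list of DataFrames."""
    # Imported here so normalize_url stays usable without pandas installed.
    import pandas as pd

    clean_url = normalize_url(url)
    try:
        # header=0 treats the first table row as column names, as in the question.
        return pd.read_html(clean_url, header=0)
    except urllib.error.HTTPError as err:
        # Report which URL failed and with which status code.
        raise RuntimeError("HTTP %d while fetching %s" % (err.code, clean_url))
```

Calling read_tables('http://www.citefactor.org/journal-impact-factor-list-2015.html/') would then request the URL without the slash. Note that if each table is written to the same CSV file inside a loop, as in the question, only the last table survives; giving each table its own file name avoids that.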