The code I have so far works for a single page, but I would like it to work for multiple pages (7 * 29 in a loop), e.g. http://www.oddsportal.com/basketball/usa/nba-2013-2014/results/#/page/1 . I am guessing you have to restart the browser emulation each time, but I am not sure how to do that. Here is the console output when I run the code (Python 3.5):
content-type missing in HTTP POST, defaulting to application/x-www-form- urlencoded. Use QNetworkRequest::setHeader() to fix this problem.
done
QObject::connect: Cannot connect (null)::configurationAdded(QNetworkConfiguration) to QNetworkConfigurationManager::configurationAdded(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationRemoved(QNetworkConfiguration) to QNetworkConfigurationManager::configurationRemoved(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::configurationChanged(QNetworkConfiguration) to QNetworkConfigurationManager::configurationChanged(QNetworkConfiguration)
QObject::connect: Cannot connect (null)::onlineStateChanged(bool) to QNetworkConfigurationManager::onlineStateChanged(bool)
QObject::connect: Cannot connect (null)::configurationUpdateComplete() to QNetworkConfigurationManager::updateCompleted()
I am also not sure what content type is missing, but it works for a single page, so I ignored that warning. To test what I wanted to do, I went ahead and manually changed the URL to the 2014 season, and that worked fine, so I am somewhat lost. The code consists of a generic JavaScript-rendering part that I mostly copy-pasted, and an HTML-interpretation part that I wrote myself. Since I am not sure where the problem is, here is the complete code.
from lxml import html
import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
import pickle

class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

def getHtml(str_url):
    r_html = Render(str_url)
    html = r_html.frame.toHtml()
    return html

def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

def scrape_js(url):
    str_html = getHtml(url)
    result = str(str_html.encode("utf-8"))
    tree = html.fromstring(result)
    content = tree.xpath('//table[@class=" table-main"]//tr[(@class=" deactivate") or (@class="odd deactivate")]//td[position()>1]//text()')
    liste = [[]]
    i = 0
    k = 0
    n = int(len(content))
    while i < n:
        # start a new row once the three preceding cells are all numeric
        if is_number(content[i-1]) and is_number(content[i-2]) and is_number(content[i-3]):
            liste.append([content[i]])
            i += 1
            k += 1
        else:
            liste[k].append(content[i])
            i += 1
    liste = liste[1:]
    for line in liste:
        if is_number(line[2]):
            liste = liste[1:]
    return liste

complete_liste = []
file_name = 'odds_2009'
for page in range(30):
    url = ''.join(['http://www.oddsportal.com/basketball/usa/nba-2008-2009/results/#/page/', str(page)])
    liste = scrape_js(url)
    for line in liste:
        complete_liste.append(line)
    print('done')

fileObject = open(file_name, 'wb')
pickle.dump(complete_liste, fileObject)
fileObject.close()
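For what it's worth, the "Cannot connect (null)" warnings above typically point at constructing a fresh QApplication inside Render.__init__ on every iteration: Qt expects a single application object per process, and tearing one down while the network layer still references it produces exactly these messages. A minimal sketch of reusing one QApplication and one QWebPage for every page instead (build_page_urls and fetch_all are hypothetical helper names, not part of the original code):

```python
def build_page_urls(season, last_page):
    """Build the result-page URLs for one season (pages are 1-based)."""
    base = 'http://www.oddsportal.com/basketball/usa/nba-%s/results/#/page/%d'
    return [base % (season, page) for page in range(1, last_page + 1)]

def fetch_all(urls):
    """Render each URL in turn with a single shared QApplication."""
    import sys
    from PyQt4.QtGui import QApplication
    from PyQt4.QtCore import QUrl
    from PyQt4.QtWebKit import QWebPage

    app = QApplication(sys.argv)   # created once, reused for every page
    page = QWebPage()
    pages_html = []

    for url in urls:
        page.loadFinished.connect(app.quit)
        page.mainFrame().load(QUrl(url))
        app.exec_()                # returns when loadFinished fires
        page.loadFinished.disconnect(app.quit)
        pages_html.append(page.mainFrame().toHtml())
    return pages_html
```

fetch_all blocks on each page until its loadFinished signal fires, so the returned HTML strings are in the same order as the input URLs; whether the oddsportal pages load cleanly this way is untested here.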
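Separately, the row-splitting rule in scrape_js (start a new row whenever the three preceding cells all parse as numbers, i.e. the closing odds of the previous game) can be exercised in isolation; a sketch with toy data, using a hypothetical split_rows helper:

```python
def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

def split_rows(content):
    """Start a new row once the three preceding cells are all numeric."""
    rows = [[]]
    for i, cell in enumerate(content):
        if i >= 3 and all(is_number(content[i - j]) for j in (1, 2, 3)):
            rows.append([cell])
        else:
            rows[-1].append(cell)
    return rows
```

Note one small difference from the original while-loop: at i < 3 the original indexes content[i-1] with a negative index, which in Python silently wraps around to the end of the list, while the sketch skips the check for the first three cells.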