import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *
from lxml import html

# Take this class for granted. Just use the result of rendering.
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://pycoders.com/archive/'
r = Render(url)
result = r.frame.toHtml()
# This step is important. Converting QString to ASCII for lxml to process
archive_links = html.fromstring(str(result.encode('utf-8')))
print(archive_links)
I got this script from this page. I had to change result.toAscii() to result.encode('utf-8').
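A note on that change, assuming the script runs under Python 3 (the later output mentions python3.5): calling str() on the bytes returned by encode() does not decode them. It produces the printed form of the bytes object, including the leading b' marker. The sketch below uses a made-up markup string, not the real page, to show the difference:

```python
# Hypothetical markup standing in for the rendered page source.
markup = "<html><body><p>hi</p></body></html>"

encoded = markup.encode("utf-8")   # bytes
wrapped = str(encoded)             # text that starts with b' and ends with '
decoded = encoded.decode("utf-8")  # the original text again

print(wrapped[:10])
print(decoded == markup)
```

If toHtml() already returns a Python str in this setup, passing it to html.fromstring() directly may be enough. The stray b' prefix could also be why the parsed root shows up as a div wrapper, though that is a guess and not something verified against the real page.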
When I run this script, it returns:
> <Element div at 0x7f98226af458>
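I'm not a Python expert and I don't know exactly what this means. Is it some way of storing information? It is supposed to return the links in the web page.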
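For reference, that line is the default printed form of a parsed element object: the tag name followed by a memory address. It is not an error. lxml models its element API on the standard library's xml.etree.ElementTree, so a minimal sketch with the standard library shows what such an object holds. The snippet is a small, made-up, well-formed fragment, not the real page:

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet standing in for the rendered page.
snippet = '<div id="archive"><a href="/archive/1">Issue 1</a></div>'

root = ET.fromstring(snippet)
print(root)         # printed form: tag name and memory address
print(root.tag)     # the tag name as a string
print(root.attrib)  # the attributes as a dict

# Look up the first <a> child and read its attribute and text.
first_link = root.find("a")
print(first_link.get("href"), first_link.text)
```

The element is a handle to a node in the parsed tree. The information (tag, attributes, text, children) is reached through its attributes and methods, not by printing it.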
After that, I added a loop like this:
for link in archive_links:
    print(link)
It returns:
> <Element p at 0x7f09a14d3408> <Element meta at 0x7f09a14d34a8>
> <Element meta at 0x7f09a14d34f8> <Element meta at 0x7f09a14d3408>
> <Element meta at 0x7f09a14d34a8> <Element meta at 0x7f09a14d34f8>
> <Element link at 0x7f09a14d3408> <Element link at 0x7f09a14d34a8>
> <Element title at 0x7f09a14d34f8> <Element style at 0x7f09a14d3408>
> <!-- Bootstrap core CSS --> <Element link at 0x7f09a14d34f8> <!--
> Custom styles for this template --> <Element link at 0x7f09a14d3408>
> <!-- Fonts from Google Fonts --> <Element link at 0x7f09a14d34a8> <!--
> HTML5 shim and Respond.js IE8 support of HTML5 elements and media
> queries --> <!--[if lt IE 9]>\n\t <script
> src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>\n\t
> <script
> src="https://oss.maxcdn.com/libs/respond.js/1.3.0/respond.min.js"></script>\n\t<![endif]-->
> <!-- Fixed navbar --> <Element div at 0x7f09a14d34a8> <Element div at
> 0x7f09a14d3408> <Element div at 0x7f09a14d34f8> <!-- /container -->
> <!-- Bootstrap core
> JavaScript\n\t================================================== -->
> <!-- Placed at the end of the document so the pages load faster -->
> <Element script at 0x7f09a14d34f8> <Element script at 0x7f09a14d3408>
> <Element script at 0x7f09a14d34a8> <Element script at 0x7f09a14d34f8>
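Looping over an element yields its direct children, which is why the output above lists whatever sits directly under the root and not the links. A minimal sketch of the difference, again using the standard library's xml.etree.ElementTree on a made-up, well-formed page (the structure and URLs are assumptions for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical page; the real archive page will differ.
page = """<html>
  <head><title>Archive</title></head>
  <body>
    <div><a href="/archive/1">Issue 1</a></div>
    <div><p><a href="/archive/2">Issue 2</a></p></div>
  </body>
</html>"""

root = ET.fromstring(page)

# Plain iteration: only the direct children of the root.
direct_children = [child.tag for child in root]

# iter("a"): every <a> element at any depth below the root.
hrefs = [a.get("href") for a in root.iter("a")]

print(direct_children)
print(hrefs)
```

With lxml the same idea should apply to the parsed page, for example archive_links.iter('a') or archive_links.xpath('//a/@href').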
Also, can we easily use PyQt5 instead? I'm open to any suggestions. Thanks.
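Edit: this edit is still about JS scraping, but it concerns another site and a different technique. Before creating this thread, I checked this page. I don't understand what this code does: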
>>> BeautifulSoup(html, 'lxml').find("div",{"id":"cntPos"}).find("table",{"class":"cntTb"}).tbody.find_all("tr")[1].find("td",{"class":"cntBoxGreyLnk"}) is None
> True
It returns True. What I did with this code is:
import urllib.request
from bs4 import BeautifulSoup
html = urllib.request.urlopen('http://oil-price.net').read()
soup = BeautifulSoup(html, 'lxml').find("div",{"id":"cntPos"}).find("table",{"class":"cntTb"}).tbody.find_all("tr")[1].find("td",{"class":"cntBoxGreyLnk"})
print(soup)
But this code did not return the dynamic oil price. It returns:
> /usr/bin/python3.5
> /home/dogus/PycharmProjects/Recipes/bsdoesntfindjs.py <td
> class="cntBoxGreyLnk" rowspan="2" valign="top"> <script
> src="http://www.oil-price.net/COMMODITIES/gen.php?lang=en"
> type="text/javascript"> </script> <noscript> To get live <a
> href="http://www.oil-price.net/dashboard.php?lang=en#COMMODITIES">gold,
> oil and commodity price</a>, please enable Javascript. </noscript>
> <br/> <table cellpadding="0" cellspacing="0" class="b11"> <tbody> <tr>
> <td colspan="3" height="15"> <a
> href="http://feeds.feedburner.com/Oil-pricenet-OilPriceTodayAndTomorrow"
> style="font-size: 16px; font-weight: normal; "><img border="0"
> src="/pics/feed-icon.gif"/> Subscribe to RSS</a><br/> <hr/> <br/>
> <form action="http://www.feedburner.com/fb/a/emailverify"
> method="post"
> onsubmit="window.open('http://www.feedburner.com/fb/a/emailverifySubmit?feedId=1678900',
> 'popupwindow', 'scrollbars=yes,width=550,height=520');return true"
> style="border:0px solid #ccc;padding:8px;text-align:center; font-size:
> 12px; font-weight: normal;" target="popupwindow"> <p style="font-size:
> 14px; font-weight: bold;"><img border="0" height="40"
> src="/index_files/email_40.png" style="vertical-align:text-bottom;"
> widht="40"/> Receive our FREE<br/>Oil Intelligence Newsletter:
> <br/><span style="font-size:11px; font-weight:normal;">(We don't
> spam)</span> </p><p><input name="email" style="width:140px"
> type="text"/></p><input name="url" type="hidden"
> value="http://feeds.feedburner.com/~e?ffid=1678900"/><input
> name="title" type="hidden" value="Oil-price.net - Oil Price, Today and
> Tomorrow"/><input name="loc" type="hidden" value="en_US"/><input
> type="submit" value="Subscribe"/></form> </td> </tr><tr><td
> colspan="3"> <hr/><br/> <div class="b11"> <strong style="white-space:
> nowrap;">oil-price.net is available in</strong> <br/> <br/> </div>
> </td></tr> <tr> <td style="padding-right: 15px;" valign="middle"> <a
> href="/index.php?lang=fr"><img border="0" height="21"
> src="index_files/lng_fr.png" width="54"/><br/> Français </a> </td>
> <td style="padding-right: 15px;" valign="middle"> <a
> href="/index.php?lang=en"><img border="0" height="21"
> src="index_files/lng_en.png" width="54"/><br/> English </a> </td>
> <td style="padding-right: 15px;" valign="middle"> <a
> href="/index.php?lang=zh"><img border="0" height="21"
> src="index_files/lng_zh.png" width="54"/><br/> 中国 </a> </td> </tr>
> <tr> <td colspan="3" height="15"></td> </tr> <tr> <td
> style="padding-right: 15px;" valign="middle"> <a
> href="/index.php?lang=it"><img border="0" height="21"
> src="index_files/lng_it.png" width="54"/><br/> Italiano </a> </td>
> <td style="padding-right: 15px;" valign="middle"> <a
> href="/index.php?lang=th"><img border="0" height="21"
> src="index_files/lng_th.png" width="54"/><br/> ภาษาไทย </a> </td>
> <td style="padding-right: 15px;" valign="middle"> <a
> href="/index.php?lang=ar"><img border="0" height="21"
> src="index_files/lng_ar.png" width="54"/><br/> العربيه </a> </td>
> </tr> <tr> <td colspan="3" height="15"></td> </tr> <tr> <td
> style="padding-right: 15px;" valign="middle"> <a
> href="/index.php?lang=nl"><img border="0" height="21"
> src="index_files/lng_nl.png" width="54"/><br/> Nederland </a> </td>
> <td style="padding-right: 15px;" valign="middle"> <a
> href="/index.php?lang=pt"><img border="0" height="21"
> src="index_files/lng_pt.png" width="54"/><br/> Português </a> </td>
> <td style="padding-right: 15px;" valign="middle"> <a
> href="/index.php?lang=ko"><img border="0" height="21"
> src="index_files/lng_ko.png" width="54"/><br/> 한국어 </a> </td> </tr>
> <tr> <td colspan="3" height="15"></td> </tr> <tr> <td
> style="padding-right: 15px;" valign="middle"> <a
> href="/index.php?lang=ja"><img border="0" height="21"
> src="index_files/lng_ja.png" width="54"/><br/> 日本語 </a> </td> <td
> style="padding-right: 15px;" valign="middle"> <a
> href="/index.php?lang=ru"><img border="0" height="21"
> src="index_files/lng_ru.png" width="54"/><br/> Русскийязык </a>
> </td> <td style="padding-right: 15px;" valign="middle"> <a
> href="/index.php?lang=id"><img border="0" height="21"
> src="index_files/lng_id.png" width="54"/><br/> Bahasa
> <br/>Indonesia </a> </td> </tr> <tr> <td colspan="3" height="15"></td>
> </tr> <tr> <td style="padding-right: 15px;" valign="middle"> <a
> href="/index.php?lang=es"><img border="0" height="21"
> src="index_files/lng_es.png" width="54"/><br/> Espanol </a> </td>
> <td style="padding-right: 15px;" valign="middle"> <a
> href="/index.php?lang=de"><img border="0" height="21"
> src="index_files/lng_de.png" width="54"/><br/> Deutsch </a> </td>
> </tr></tbody></table> </td>
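The output above contains a script tag pointing at gen.php and a noscript fallback, which suggests the prices are filled in by JavaScript in the browser. urllib only downloads the markup and does not run that script. As a hedged sketch, the standard library's html.parser can list the external scripts a fetched fragment depends on. The fragment below is a shortened copy of the output above, and the ScriptSrcCollector class name is made up for this example:

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> tag that has one."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Shortened copy of the fragment printed above.
fragment = """<td class="cntBoxGreyLnk" rowspan="2" valign="top">
<script src="http://www.oil-price.net/COMMODITIES/gen.php?lang=en"
 type="text/javascript"> </script>
<noscript> To get live gold, oil and commodity price,
 please enable Javascript. </noscript>
</td>"""

collector = ScriptSrcCollector()
collector.feed(fragment)
script_sources = collector.sources
print(script_sources)
```

A script URL in that list is a hint that the value of interest is produced at run time, so it needs either a renderer that executes JavaScript or a direct look at what that script returns.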
Another user suggested using PyV8, but it seems to me too old a library to use. Its latest update was in 2012. Should I learn it? Or should I stick with PyQt4? Or PyQt5?