Question

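The website I want to scrape populates its results using JavaScript.

Can I simply call the script somehow and work with its results? (Without the pagination, of course.) I don't want to run the whole thing just to scrape the resulting formatted HTML, but the raw source is blank.

Take a look: http://kozbeszerzes.ceu.hu/searchresults.xhtml?q=1998&page=0

The source of the results page is just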

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/templates/base_template.xsl"?>
<content>
  <head>
    <SCRIPT type="text/javascript" src="/js/searchResultsView.js"></SCRIPT>    
  </head>
    <whitebox>
    <div id = "hits"></div>  
  </whitebox>
</content>

I would prefer simple Python tools.

Answer 1

I downloaded Selenium and ChromeDriver.

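from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://kozbeszerzes.ceu.hu/searchresults.xhtml?q=1998&page=0')

# Each hit is rendered into an element with class "result".
# The find_elements_by_* helpers were removed in Selenium 4; use By locators instead.
for e in driver.find_elements(By.CLASS_NAME, 'result'):
    link = e.find_element(By.TAG_NAME, 'a')
    # Python 3 prints Unicode text directly, so no .encode() is needed.
    print(link.text, link.get_attribute('href'))

driver.quit()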

If you use Chrome, you can press F12 to inspect the page's elements and their attributes, which is very useful here.
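One thing to watch for: because the hits are inserted by JavaScript, a lookup made right after the page loads may come back empty. Selenium ships its own WebDriverWait for this. The sketch below shows the same polling idea in plain Python so it can be tried without a browser. The helper name `wait_for_results` and the fake lookup function are hypothetical, made up for illustration.

```python
import time


def wait_for_results(find, timeout=10.0, interval=0.5):
    """Call find() repeatedly until it returns a non-empty list or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        found = find()
        # Stop as soon as something was found, or give up once the deadline passes.
        if found or time.monotonic() >= deadline:
            return found
        time.sleep(interval)


# With a real driver (Selenium 4 style) the call might look like:
#   from selenium.webdriver.common.by import By
#   hits = wait_for_results(lambda: driver.find_elements(By.CLASS_NAME, 'result'))

if __name__ == "__main__":
    # Stand-in for a page whose results only show up on the third poll.
    calls = {"n": 0}

    def fake_find():
        calls["n"] += 1
        return ["hit"] if calls["n"] >= 3 else []

    print(wait_for_results(fake_find, timeout=2.0, interval=0.01))
```

If the list is still empty after the timeout, the class name is probably wrong or the page needs more time, and the F12 inspector is the place to check.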

Answer 2

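Indeed, you can do this with Python. You need either python-ghost or Selenium. I prefer the latter combined with PhantomJS: it is much lighter, simpler to install, and easy to use.

Install phantomjs using npm (the Node package manager):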

apt-get install nodejs
npm install phantomjs

Install selenium:

pip install selenium

Then fetch the rendered page like this and parse it with BeautifulSoup (or another library) as usual:

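from bs4 import BeautifulSoup as bs  # installed as beautifulsoup4, imported as bs4
from selenium import webdriver

# Note: webdriver.PhantomJS was removed in Selenium 4, so this needs an older Selenium release.
client = webdriver.PhantomJS()
client.get("http://foo")
# page_source holds the DOM after the JavaScript has run; name the parser explicitly.
soup = bs(client.page_source, "html.parser")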
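Once you have the rendered page source, any HTML parser will do. Here is a rough sketch that uses only the standard library's html.parser instead of BeautifulSoup. It assumes the hits are elements with class "result" that contain a link, as in Answer 1; the `ResultLinkParser` class and the sample markup are made up for illustration and are not taken from the real site.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ResultLinkParser(HTMLParser):
    """Collect (text, href) pairs for <a> tags inside elements with class "result"."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._depth = 0      # nesting depth inside a "result" element, 0 means outside
        self._href = None    # href of the link currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self._depth:
            self._depth += 1
        elif "result" in (attrs.get("class") or "").split():
            self._depth = 1
        if self._depth and tag == "a" and attrs.get("href"):
            self._href = attrs["href"]
            self._text = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)


# Hypothetical rendered markup; in practice this would be client.page_source.
SAMPLE = """
<div id="hits">
  <div class="result"><a href="/entity/1">First hit</a></div>
  <div class="result"><span><a href="/entity/2">Second hit</a></span></div>
</div>
<a href="/about">About</a>
"""

if __name__ == "__main__":
    parser = ResultLinkParser()
    parser.feed(SAMPLE)
    for text, href in parser.links:
        print(text, href)
```

The depth counter assumes reasonably well-nested markup. For messy real-world HTML, BeautifulSoup is more forgiving.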

Answer 3

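In short: you can't do this with Python alone.

As you said, the results are populated by JavaScript (jQuery), which adds the content dynamically.

You could try running the script locally with nodejs and dumping the DOM as HTML at some point. Either way, you will need to dig into the JS code.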
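As a starting point for that digging, here is a minimal sketch that pulls the script URLs out of the page source quoted in the question, using only the standard library. The function name `script_urls` is made up for illustration. The sketch only tells you which file to read; it does not run the script.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

PAGE_URL = "http://kozbeszerzes.ceu.hu/searchresults.xhtml?q=1998&page=0"

# The raw source from the question, embedded here so the sketch runs offline.
PAGE_SOURCE = b"""<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/templates/base_template.xsl"?>
<content>
  <head>
    <SCRIPT type="text/javascript" src="/js/searchResultsView.js"></SCRIPT>
  </head>
  <whitebox>
    <div id = "hits"></div>
  </whitebox>
</content>
"""


def script_urls(page_source, base_url):
    """Return absolute URLs of every script element that has a src attribute."""
    root = ET.fromstring(page_source)
    urls = []
    for element in root.iter():
        # XML tag names are case-sensitive and this page uses upper-case SCRIPT.
        if element.tag.lower() == "script" and element.get("src"):
            urls.append(urljoin(base_url, element.get("src")))
    return urls


if __name__ == "__main__":
    for url in script_urls(PAGE_SOURCE, PAGE_URL):
        print(url)
```

Reading the script that turns up should show where it fetches its data from, which is the part you would need to reproduce yourself.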

How to scrape search results (using Python) if they are returned by JavaScript

3 answers: