python - Python: Dynamic web scraping with Ghost

I'm trying to get weather data from http://metservice.com/maps-radar/local-observations/local-3-hourly-observations

I found an example here of how to use Ghost to scrape dynamic content, but I haven't figured out how to handle the result.

Since ghost seems to have problems when run in an interactive shell, I use print(result) and pipe the output to a file:

python getMetObservation.py > proper_result

Here is my Python code:

  from ghost import Ghost
  url = 'http://metservice.com/maps-radar/local-observations/local-3-hourly-observations'
  gh = Ghost(wait_timeout=60)
  page, resources = gh.open(url)
  result, resources = gh.evaluate("document.getElementsByClassName('obs-content');")
  print(result)

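On inspecting the file, it does contain the content I'm after, but it also contains a lot of information I don't need. It's also unclear how to use the variable result returned by evaluate. Looking at ghost.py, this seems to be handled by

self.main_frame.evaluateJavaScript("%s" % script)

in:

def evaluate(self, script):
    """Evaluates script in page frame.

    :param script: The script to evaluate.
    """
    return (
        self.main_frame.evaluateJavaScript("%s" % script),
        self._release_last_resources(),
    )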

When I execute the command:

document.getElementsByClassName('obs-content');

in the Chromium console, I get the correct response.
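
One possible reason the printed result is hard to use: the console displays a live collection of DOM nodes, whereas evaluate has to hand a converted value back to Python. A minimal sketch, assuming plain strings survive that conversion better than DOM nodes, is to let the JavaScript extract the text itself. The helper name build_text_script is hypothetical, and the shape of the returned value has not been checked against this page.

```python
import json


def build_text_script(class_name):
    """Build JavaScript that returns the text of each matching element.

    Hypothetical helper: the goal is to return plain strings rather
    than DOM nodes from the evaluated script.
    """
    # json.dumps yields a safely quoted JavaScript string literal.
    quoted = json.dumps(class_name)
    return (
        "Array.prototype.map.call("
        "document.getElementsByClassName(%s), "
        "function (el) { return el.textContent; });" % quoted
    )


script = build_text_script("obs-content")
print(script)

# Assumption: with the setup from the question, this string would be
# passed in place of the original script, for example:
#   result, resources = gh.evaluate(script)
```

If that works as hoped, result would hold one string per obs-content element, which is easier to post-process than the raw dump.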

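I'm a beginner when it comes to Python, but willing to learn. Also note that I'm running this in a Python virtual environment under Ubuntu, in case that matters.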

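Note: I'm posting this as an answer because my current solution is to use the iMacros extension to save the web page locally, and then use BeautifulSoup to scrape the now-static data.

The original question was how to use Ghost on a dynamic page, but since I never got that far, I found another solution that may be useful to others.

iMacro content (I named it GetWeather.iim):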

VERSION BUILD=8881205 RECORDER=FX
  TAB T=1
  URL GOTO=http://www.metservice.com/maps-radar/local-observations/local-3-hourly-observations
  WAIT SECONDS=5
  SAVEAS TYPE=CPL FOLDER=* FILE=+_{{!NOW:yyyymmdd_hhnnss}}

Shell script called from crontab:

#!/bin/bash
  export DISPLAY=:0.0
  /usr/bin/firefox &
  sleep 5
  /usr/bin/firefox imacros://run/?m=GetWeather.iim
  sleep 10
  wmctrl -c "Mozilla Firefox"
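
Because the macro saves each page under a timestamped name, the scraping script needs some way to pick the latest file. A minimal sketch, assuming the pages land in a single folder and have an .htm or .html extension; the function name, the folder and the pattern are all illustrative, not taken from my setup:

```python
import glob
import os


def newest_saved_page(folder, pattern="*.htm*"):
    """Return the most recently modified file matching pattern, or None."""
    candidates = glob.glob(os.path.join(folder, pattern))
    if not candidates:
        return None
    # The file with the largest modification time is the latest save.
    return max(candidates, key=os.path.getmtime)


if __name__ == "__main__":
    # Hypothetical download folder; adjust to wherever iMacros saves pages.
    print(newest_saved_page(os.path.expanduser("~/Downloads")))
```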

This is used together with the Python script that does the actual web scraping with BeautifulSoup.
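
For illustration, here is a rough sketch of the extraction step on the saved, static HTML. It is not my actual script: it uses only the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages (with BeautifulSoup, the same selection would typically be soup.find_all(class_="obs-content")). The sample markup and the names ClassTextExtractor and extract_class_text are made up.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # greater than 0 while inside a matching element
        self.chunks = []    # text pieces of the current match
        self.results = []   # one string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.class_name in classes:
            self.depth = 1
            self.chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.results.append(" ".join(self.chunks))

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_class_text(html, class_name="obs-content"):
    """Return a list with the text of each element of the given class."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    return parser.results


# Made-up sample markup standing in for the saved page.
SAMPLE_HTML = """
<html><body>
<div class="header">Site navigation</div>
<div class="obs-content"><h3>Station A</h3><p>Temp: <b>12</b> C<br>Wind: calm</p></div>
<div class="obs-content other"><p>Station B</p></div>
</body></html>
"""

if __name__ == "__main__":
    for text in extract_class_text(SAMPLE_HTML):
        print(text)
```

Only the obs-content elements are kept, so the surrounding page content never reaches the output.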

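Firefox is stopped the proper way, so that it does not come back up in safe mode, following the instructions in the first answer of this thread.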
