Question

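I have a website that I want to monitor for changes, specifically inside one DIV in the HTML. I have been using http://www.followthatpage.com/ to monitor the page, but I ran into two problems:

It checks the whole page, not just one DIV
It only checks the site once an hour

Ideally, I would like to write a bash or Python script that diffs two downloaded copies of the page every 15 minutes and emails me any changes. I figure I could download the two files, run the diff command on them, and have cron send an email when something changed, but I still don't know how to narrow the comparison down to one specific DIV.

Is there an easier way than figuring out how to do this myself (an existing script)? If not, what is the best approach?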

Answer 1

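Since the div you want is specific to the website, you will probably need to set up a simple check yourself.

This involves:

Downloading the HTML: urllib.request.urlopen(URL) (urllib.urlopen(URL) in Python 2) or requests.get(URL).
Extracting just the right part (BeautifulSoup, or roll your own).
Performing the comparison (a direct comparison, or difflib).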
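
The comparison step can be sketched with difflib from the standard library. This is only an illustration: the function name describe_change and the sample strings are made up here, and the two inputs are assumed to be the text already extracted from the previous and the current download.

```python
import difflib


def describe_change(old_text: str, new_text: str) -> str:
    """Return a unified diff of two text snapshots, or '' if they are equal."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",  # the input lines carry no newlines, so add none to the headers
    )
    return "\n".join(diff)


if __name__ == "__main__":
    # Hypothetical snapshots of the extracted text from two consecutive runs.
    print(describe_change("count: 582,416", "count: 582,417"))
```

An empty return value means nothing changed, so the caller can simply skip the notification in that case.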

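Working out what to extract and how to extract it will take you the longest. I recommend using the developer tools in Chrome / Firefox to inspect the page.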

Suppose we want to know when the counter on digitalocean.com is updated. The counter's div looks like this:

<div class='inner'>
<span class='count'>5</span>
<span class='count'>8</span>
<span class='count'>2</span>
<span class='count_delimiter'>,</span>
<span class='count'>4</span>
<span class='count'>1</span>
<span class='count'>7</span>
</div>

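Sadly, there is no id, which would have made this really easy with BeautifulSoup4 (e.g. soup.find(id="counter")).

Instead, I would extract all the inner elements that have the class "count".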

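import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.digitalocean.com')
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(resp.text, 'html.parser')
# Collect the text of every element with class "count" (the delimiter span is skipped)
digits = [tag.get_text() for tag in soup.find_all(class_="count")]
# Join the digits into one integer, e.g. 582417 for the markup above
count = int(''.join(digits))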

BeautifulSoup has excellent documentation and lets you parse HTML documents without much worry (depending on the layout of the site you are scraping).
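
To notice when the extracted value changes between runs, the previous value has to be stored somewhere. A minimal sketch, assuming a small state file is acceptable; the helper name value_changed and the path /tmp/counter_state.txt are hypothetical and not part of the original answer.

```python
from pathlib import Path


def value_changed(state_file: Path, new_value: str) -> bool:
    """Store new_value and report whether it differs from the stored one.

    The first run only records a baseline and reports no change.
    """
    old_value = None
    if state_file.exists():
        old_value = state_file.read_text(encoding="utf-8")
    state_file.write_text(new_value, encoding="utf-8")
    return old_value is not None and old_value != new_value


if __name__ == "__main__":
    # Hypothetical state file location; "582417" stands in for str(count).
    if value_changed(Path("/tmp/counter_state.txt"), "582417"):
        print("The counter changed since the last run")
```

Run from cron at whatever interval is needed, this prints something only when the value differs from the previous run.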

Answer 2

If you have access to a Linux terminal, another approach is to add a cron job:

$ crontab -e

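and add the following line (it runs every day at 16:00; use */15 in the first field to run every 15 minutes instead, and give the full path to the script if it is not on cron's PATH):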

0   16   *   *   *   diff_web_page.sh

Contents of diff_web_page.sh:

#!/bin/bash

URL="http://linux.die.net/man/1/bash"
TMP_FILE="/tmp/diff_page.txt"
if [[ ! -f "$TMP_FILE" ]]; then
    # First run: save a baseline snapshot and exit.
    lynx -dump "$URL" &> "$TMP_FILE"
    # To keep only one div, filter the raw HTML instead (-source keeps the tags):
    # lynx -source "$URL" | pcregrep -M "<div>.*</div>" > "$TMP_FILE"
else
    # The baseline exists: grab the new version and compare it.
    lynx -dump "$URL" &> "$TMP_FILE.new"  # use pcregrep here too if required
    # diff prints nothing when the page is unchanged.
    diff -Npaur "$TMP_FILE" "$TMP_FILE.new"
    mv "$TMP_FILE.new" "$TMP_FILE"
fi

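Every time this runs, the diff of the web page is emailed to user@host (on the Linux box where the cron job runs), because cron mails whatever the job prints.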
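
If cron mail is not set up on the machine, the diff could be sent from Python instead. The sketch below is only one way to do it and makes several assumptions: the function names, the example addresses, and an SMTP server accepting mail on localhost are all hypothetical.

```python
import smtplib
from email.message import EmailMessage


def build_change_email(diff_text: str, sender: str, recipient: str, url: str) -> EmailMessage:
    """Wrap a diff in a plain-text email message."""
    msg = EmailMessage()
    msg["Subject"] = f"Change detected on {url}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(diff_text)
    return msg


def send_via_local_smtp(msg: EmailMessage, host: str = "localhost") -> None:
    """Send the message; assumes an SMTP server is listening on the given host."""
    with smtplib.SMTP(host) as server:
        server.send_message(msg)


if __name__ == "__main__":
    # Hypothetical addresses and diff text, for illustration only.
    message = build_change_email(
        "-old line\n+new line",
        "monitor@example.com",
        "me@example.com",
        "http://linux.die.net/man/1/bash",
    )
    print(message["Subject"])
    # send_via_local_smtp(message)  # uncomment once an SMTP server is available
```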

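If you only want a specific div, you can filter the output with a grep-style tool (such as pcregrep, as in the commented-out line of the script) when you dump the web page with lynx.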
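
Pattern matching on HTML tends to break on nested markup, so a parser-based filter may be more robust. The following is a rough sketch using html.parser from the Python standard library, not the method of this answer; the class DivTextExtractor, the function extract_div_text, and the choice to select the div by its class attribute are assumptions made for illustration.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text found inside the first-level div with a given class."""

    def __init__(self, css_class: str):
        super().__init__()
        self.css_class = css_class
        self.depth = 0    # greater than 0 while inside the target div
        self.chunks = []  # non-empty text fragments found inside it

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # nested div inside the target
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1   # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_div_text(html: str, css_class: str) -> str:
    """Return the text inside divs carrying css_class, joined by spaces."""
    parser = DivTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    sample = "<div class='inner'><span class='count'>5</span><span class='count'>8</span></div>"
    print(extract_div_text(sample, "inner"))
```

The returned string can then be saved and diffed in place of the full page dump.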

Is there a simple way to script a web page comparison?
