Question

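I have a small home project where I need to scrape links from a website at regular intervals and save them to a txt file.

The script needs to run on my Synology NAS, so it has to be written in bash or Python without any plugins or external libraries, because I can't install them on the NAS (as far as I know, anyway).

The links look like this: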

<a href="http://www.example.com">Example text</a>

I want to save the following to my text file:

Example text - http://www.example.com

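I thought I could use curl and grep (or a regular expression) to isolate the text. I first looked into using Scrapy or BeautifulSoup, but couldn't find a way to install either on the NAS.

Can any of you help me put a script together?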

Answer 1

You can use urllib2, which ships with Python 2. With it, you can easily fetch the HTML of any URL:

import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
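
On a newer interpreter the urllib2 module no longer exists; in Python 3 its functionality lives in urllib.request. Below is a minimal sketch that runs on either version, assuming the page is UTF-8 encoded. The helper name fetch_html and the 30-second timeout are illustrative choices, not part of the answer above.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def fetch_html(url, timeout=30):
    """Return the body of `url` as text (hypothetical helper name)."""
    response = urlopen(url, timeout=timeout)
    try:
        raw = response.read()
    finally:
        response.close()
    # Assumption: the page is UTF-8. Undecodable bytes are replaced
    # instead of raising an exception.
    return raw.decode("utf-8", "replace")
```

Calling fetch_html('http://python.org/') would then stand in for the three lines above.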

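Now, about parsing the HTML. You can still use BeautifulSoup without installing it. Their site says: "You can also download the tarball and use BeautifulSoup.py in your project directly." So search the internet for that BeautifulSoup.py file. If you can't find it, download this one and save it as a local file inside your project. Then use it like this:

# BeautifulSoup.py must sit in the same directory as this script
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)
for link in soup("a"):
    print link["href"]           # the link target
    print link.renderContents()  # the link text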

Answer 2

I'd suggest using Python's htmlparser library. It parses the page into a hierarchy of objects. You can then find the a href tags.

http://docs.python.org/2/library/htmlparser.html

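There are plenty of examples of using this library to find links, so I won't list all the code, but here is a working example: Extract absolute links from a page using HTMLParser

Edit

As Oday pointed out, htmlparser is an external library and you may not be able to load it. In that case, here are two suggestions for built-in modules that can do what you need:

htmllib is included in Python 2.X.

xml is included in Python 2.X and 3.X.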
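
As a sketch of the built-in route: the parser documented at the docs.python.org link above is itself part of the standard library (module HTMLParser in Python 2, html.parser in Python 3), so this example uses it rather than htmllib or xml. The names LinkCollector and extract_links are made up for illustration, and the output format follows the "text - URL" layout from the question.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> element that has an href."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []    # finished (text, href) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def extract_links(html):
    """Return a list of 'text - href' strings found in `html`."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return ["%s - %s" % (text, href) for text, href in parser.links]
```

Unlike a regular expression, this also copes with anchors that carry extra attributes or nested tags such as <b> inside the link text.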

There is also a good explanation elsewhere on this site of how to do the same thing with wget & grep:
Spider a Website and Return URLs Only

Answer 3

Based on your example, you need something like this:

wget -q -O- 'https://dl.dropboxusercontent.com/s/wm6mt2ew0nnqdu6/links.html?dl=1' | sed -r 's#<a href="([^"]+)">([^<]+)</a>.*$#\2 - \1#' > links.txt

Output of cat links.txt:

1Visit W3Schools - http://www.w3schools.com/
2Visit W3Schools - http://www.w3schools.com/
3Visit W3Schools - http://www.w3schools.com/
4Visit W3Schools - http://www.w3schools.com/
5Visit W3Schools - http://www.w3schools.com/
6Visit W3Schools - http://www.w3schools.com/
7Visit W3Schools - http://www.w3schools.com/
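
If Python is preferred over sed, the same substitution can be sketched with the standard re module. This assumes Python 3 and keeps the pattern as narrow as the sed expression: href has to be the only attribute and the link text must not contain nested tags. Unlike the sed line-by-line replacement, it collects every match in the document and drops any text around the link. The function names and the default links.txt path are illustrative.

```python
import re

# Same narrow pattern as the sed expression above.
LINK_RE = re.compile(r'<a href="([^"]+)">([^<]+)</a>')


def format_links(html):
    """Return one 'text - url' string per matching anchor in `html`."""
    return ["%s - %s" % (text, url) for url, text in LINK_RE.findall(html)]


def save_links(html, path="links.txt"):
    """Write the formatted links to `path`, replacing any old content.

    Returns the number of links written.
    """
    lines = format_links(html)
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
    return len(lines)
```

Opening the file with mode "a" instead of "w" would append on each run rather than overwrite, which may suit a script that is run periodically.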

Scrape links from the web and save them to a txt file (Bash or Python)
