Question

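I've decided to learn Python 2.7 for data analysis, and I've watched a number of tutorials on YouTube to get a better grasp of the basics.

I'm now at the stage where I'd like to build a simple web scraper for educational purposes, just to learn different techniques and get used to writing code.

I'm following a web scraper tutorial, but I'm unsure about a few things. This is what I have so far: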

ERROR: probe overhead exceeded threshold
WARNING: Number of errors: 1, skipped probes: 0
WARNING: There were 67287 transport failures.
WARNING: /usr/bin/staprun exited with status: 1
Pass 5: run failed.  [man error::pass5]
Tip: /usr/share/doc/systemtap/README.Debian should help you get started.

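I can't seem to separate out the href link and display it alongside the text and date information.

I'd like the output to look like this:

Article name
Link to article
Article date

Could someone explain how to do this, and why those steps are needed?

Many thanks!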

Answer 1

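Here is a small piece of code to help you. In bs4, all nodes are connected to each other. In your loop you are reading a "link" node (which is actually a div), and what you want is its child a tag, so link.a is fine.

A node then carries two kinds of values: attributes, accessed with a['href'], and content, accessed with a.text.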

for link in statements:
    print(link.a['href'])

PS: this is the link variable:

<div id="legalert_title"><a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-opposing-the-Fairness-in-Class-Action-Litigation-and-Furthering-Asbestos-Claim-Transparency-Act">Letter to Representatives opposing the "Fairness in Class Action Litigation and Furthering Asbestos Claim Transparency Act"</a></div>

This is link.a:

<a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-opposing-the-Fairness-in-Class-Action-Litigation-and-Furthering-Asbestos-Claim-Transparency-Act">Letter to Representatives opposing the "Fairness in Class Action Litigation and Furthering Asbestos Claim Transparency Act"</a>

This is link.a['href']:

/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-opposing-the-Fairness-in-Class-Action-Litigation-and-Furthering-Asbestos-Claim-Transparency-Act

This is link.a.text:

Letter to Representatives opposing the "Fairness in Class Action Litigation and Furthering Asbestos Claim Transparency Act"

All HTML is structured like this, so it may be worth learning a little HTML.
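To tie the pieces together, here is a minimal sketch that prints the article name and then the link for each div. It assumes the divs can be selected by the id shown in the PS (legalert_title); the original selector behind the statements variable is not shown, so that choice is a guess. The names SAMPLE_HTML and extract_articles are made up for illustration. The date is left out because its markup does not appear anywhere above.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the PS above; a real page would hold many such divs.
SAMPLE_HTML = '''<div id="legalert_title"><a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-opposing-the-Fairness-in-Class-Action-Litigation-and-Furthering-Asbestos-Claim-Transparency-Act">Letter to Representatives opposing the "Fairness in Class Action Litigation and Furthering Asbestos Claim Transparency Act"</a></div>'''


def extract_articles(html):
    """Return a list of (title, href) tuples, one per matching div."""
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    # Assumption: every article div uses the id seen in the sample markup.
    for link in soup.find_all("div", id="legalert_title"):
        a = link.a  # first <a> tag inside the div, or None if there is none
        if a is None or not a.has_attr("href"):
            continue  # skip divs without a usable link
        articles.append((a.text.strip(), a["href"]))
    return articles


for title, href in extract_articles(SAMPLE_HTML):
    print(title)  # article name
    print(href)   # link to article (relative to the site root)
```

Note that the href is a relative path, so it has to be combined with the site's base URL before it can be opened directly.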
