Trying to create a simple Python web crawler

Time: 2016-10-31 01:02:20

Tags: python web-crawler

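I have decided to learn Python 2.7 for data analysis, and I have watched many tutorials on YouTube to get a better grasp of the basics.

I am now at the stage where I want to build a simple web scraper for educational purposes, just to learn different techniques and get some coding practice.

I am following a web scraper tutorial, but I am unsure about a few things. This is what I have so far: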

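I can't seem to separate out the href link and display the text and date information.

I would like it to look like this:

  1. Article name
  2. Link to the article
  3. Article date

Could someone also explain why these steps are taken?

Many thanks!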

1 answer:

Answer 0: (score: 0)

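Here is a little code to help you. In bs4, all nodes are connected to one another. Each `link` node you read is actually a div, and what you want is its child <a> tag, so link.a works fine.

A node then carries two kinds of values: attributes, which you access with a['href'], and content, which you access with a.text.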

for link in statements:
    print(link.a['href'])  # the href attribute of the <a> child
    print(link.a.text)     # the visible text of the link
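To make the loop above runnable on its own, here is a minimal sketch. The question's own code is not available, so the way `statements` is built (a `find_all` on the `legalert_title` id seen in the div below) is an assumption, and the two-entry HTML string with its shortened paths and titles is a hypothetical stand-in for the downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; paths and titles are made up.
html = '''
<div id="legalert_title"><a href="/alerts/first-letter">First letter</a></div>
<div id="legalert_title"><a href="/alerts/second-letter">Second letter</a></div>
'''

soup = BeautifulSoup(html, "html.parser")

# Assumption: each article title sits in a <div id="legalert_title">.
statements = soup.find_all("div", id="legalert_title")


def extract(link):
    """Return (text, href) taken from the <a> child of one div node."""
    anchor = link.a
    return anchor.text.strip(), anchor["href"]


results = [extract(link) for link in statements]

# Print a numbered title followed by its link on the next line.
for number, (title, href) in enumerate(results, start=1):
    print("%d. %s" % (number, title))
    print("   %s" % href)
```

The date is left out because the markup that holds it is not shown here; it would be read the same way once you know which tag contains it.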

PS: this is the link variable:

<div id="legalert_title"><a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-opposing-the-Fairness-in-Class-Action-Litigation-and-Furthering-Asbestos-Claim-Transparency-Act">Letter to Representatives opposing the "Fairness in Class Action Litigation and Furthering Asbestos Claim Transparency Act"</a></div>

This is link.a:

<a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-opposing-the-Fairness-in-Class-Action-Litigation-and-Furthering-Asbestos-Claim-Transparency-Act">Letter to Representatives opposing the "Fairness in Class Action Litigation and Furthering Asbestos Claim Transparency Act"</a>

This is link.a['href']:

/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-opposing-the-Fairness-in-Class-Action-Litigation-and-Furthering-Asbestos-Claim-Transparency-Act

This is link.a.text:

Letter to Representatives opposing the "Fairness in Class Action Litigation and Furthering Asbestos Claim Transparency Act"

All HTML is structured this way, so it may help to learn a little HTML.
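One more note on the href value shown above: it is a relative path, so it needs the site's address in front of it before it works as a clickable link. A minimal sketch with the standard library's `urljoin` follows; `BASE_URL` is a hypothetical placeholder because the real site address does not appear here, and the path is a shortened made-up example.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7

# Hypothetical address of the page that was scraped.
BASE_URL = "http://www.example.com/some/listing-page"

# Shortened, made-up relative path in the same shape as the href above.
href = "/Legislation-and-Politics/Legislative-Alerts/example-letter"

# A path starting with "/" replaces the whole path of the base URL.
full_url = urljoin(BASE_URL, href)
print(full_url)
```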