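I decided to learn Python 2.7 for data analysis, and I have watched a lot of tutorials on YouTube to get a better grasp of the basics.
I'm now at the stage where I want to build simple web scrapers for educational purposes, just to learn different techniques and get some coding practice.
I'm following a web scraper tutorial, but I'm unsure about a few things. This is what I have so far: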
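I can't seem to separate out the href link and display the text and date information.
I would like it to look like this:
Could someone explain why each of these steps is done?
Many thanks!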
Answer 0 (score: 0)
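Here is a little code to help you. In bs4, all nodes are connected to each other. You have already fetched a "link" node (which is actually a div), and what you want is its child a tag, so link.a works.
A node then has two kinds of values: attributes, which you access with a['href'], and content, which you access with a.text.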
    for link in statements:
        # link is the div node; link.a is its first <a> child
        print(link.a['href'])
PS: this is the link variable:
<div id="legalert_title"><a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-opposing-the-Fairness-in-Class-Action-Litigation-and-Furthering-Asbestos-Claim-Transparency-Act">Letter to Representatives opposing the "Fairness in Class Action Litigation and Furthering Asbestos Claim Transparency Act"</a></div>
This is link.a:
<a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-opposing-the-Fairness-in-Class-Action-Litigation-and-Furthering-Asbestos-Claim-Transparency-Act">Letter to Representatives opposing the "Fairness in Class Action Litigation and Furthering Asbestos Claim Transparency Act"</a>
This is link.a['href']:
/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-opposing-the-Fairness-in-Class-Action-Litigation-and-Furthering-Asbestos-Claim-Transparency-Act
This is link.a.text:
Letter to Representatives opposing the "Fairness in Class Action Litigation and Furthering Asbestos Claim Transparency Act"
All HTML is structured like this, so it may help to learn a little HTML.
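To put the pieces above together, here is a minimal self-contained sketch that pulls both the href and the text out of the div shown in the PS. The names HTML and extract_links are made up for illustration. How statements was built in the question is not shown, so selecting the divs with find_all on id="legalert_title" is an assumption based on the printed link variable.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the "link variable" output above.
HTML = (
    '<div id="legalert_title">'
    '<a href="/Legislation-and-Politics/Legislative-Alerts/'
    'Letter-to-Representatives-opposing-the-Fairness-in-Class-Action-'
    'Litigation-and-Furthering-Asbestos-Claim-Transparency-Act">'
    'Letter to Representatives opposing the "Fairness in Class Action Litigation '
    'and Furthering Asbestos Claim Transparency Act"</a></div>'
)


def extract_links(html):
    """Return a list of (href, text) pairs, one per matching div (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # Assumption: each entry of interest is a div with id="legalert_title".
    for link in soup.find_all("div", id="legalert_title"):
        a = link.a  # first <a> child of the div, or None if there is none
        if a is None or not a.has_attr("href"):
            continue  # skip divs that do not wrap a usable link
        results.append((a["href"], a.get_text(strip=True)))
    return results


for href, text in extract_links(HTML):
    print(href)
    print(text)
```

The href that comes back is a relative path, exactly as in the output above. The sketch does not cover the date, since the markup that holds it is not shown here.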