I'm new to anything related to web scraping, and as far as I know, Requests and BeautifulSoup are the way to go. I want to write a program that emails me just one paragraph of a given link every couple of hours (trying out a new way of reading blogs through the day). Say this particular link 'https://fs.blog/mental-models/' has a paragraph each on a number of different models.
from bs4 import BeautifulSoup
import requests
url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
Now, the soup has a few layers of "wall" before the paragraph text begins: <p> this is what I want to read </p>
soup.title.string
works just fine, but I don't know how to move forward from here. Any direction, please?
Thanks!
Answer 0 (score: 2)
Find all the p tags with soup.findAll('p'), then use .text to get their text. Also, since you don't want the footer paragraphs, look only inside the div with class rte:
from bs4 import BeautifulSoup
import requests

url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
# restrict the search to the content div so footer paragraphs are excluded
divTag = soup.find_all("div", {"class": "rte"})
for tag in divTag:
    pTags = tag.find_all('p')
    for p in pTags[:-2]:  # trim the last two irrelevant-looking lines
        print(p.text)
Output:
Mental models are how we understand the world. Not only do they shape what we think and how we understand but they shape the connections and opportunities that we see.
.
.
.
5. Mutually Assured Destruction
Somewhat paradoxically, the stronger two opponents become, the less likely they may be to destroy one another. This process of mutually assured destruction occurs not just in warfare, as with the development of global nuclear warheads, but also in business, as with the avoidance of destructive price wars between competitors. However, in a fat-tailed world, it is also possible that mutually assured destruction scenarios simply make destruction more severe in the event of a mistake (pushing destruction into the “tails” of the distribution).
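Since the stated goal was to have each paragraph emailed, here is a minimal sketch of the sending side using the standard library's smtplib and email.message; the SMTP host, port, addresses, and password below are placeholders, not a particular provider's settings:

import smtplib
from email.message import EmailMessage

def send_paragraph(paragraph):
    msg = EmailMessage()
    msg["Subject"] = "Mental model of the day"
    msg["From"] = "you@example.com"   # placeholder sender
    msg["To"] = "you@example.com"     # placeholder recipient
    msg.set_content(paragraph)
    # placeholder SMTP server and credentials -- replace with your provider's
    with smtplib.SMTP_SSL("smtp.example.com", 465) as server:
        server.login("you@example.com", "app-password")
        server.send_message(msg)

Calling send_paragraph(p.text) inside the loop above would then email each scraped paragraph.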
Answer 1 (score: 1)
If you want the text of all the p tags, you can loop over them using the find_all method:
from bs4 import BeautifulSoup
import requests

url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
# print(soup)  # uncomment to inspect the raw HTML if needed
data = soup.find_all('p')
for p in data:
    text = p.get_text()
    print(text)
Edit:
Here is the code to collect them separately in a list. You can then loop over the result list to remove empty strings and unwanted characters such as \n, etc., as sketched after the code below.
from bs4 import BeautifulSoup
import requests

url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
data = soup.find_all('p')
result = []
for p in data:
    result.append(p.get_text())
print(result)
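For the cleanup mentioned above, a minimal sketch: strip surrounding whitespace and drop empty entries from the result list built in the snippet.

cleaned = [text.strip() for text in result if text.strip()]  # drops '' and bare '\n' entries
print(cleaned)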
Answer 2 (score: 1)
Here is the solution:
from bs4 import BeautifulSoup
import requests
from kivy.clock import Clock  # Kivy's Clock; note it only fires inside a running Kivy app

url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
data = soup.find_all('p')
result = []
for p in data:
    result.append(p.get_text())

# schedule_interval needs a callable (it is passed the elapsed time dt);
# passing print(result) directly would run once immediately and schedule None
Clock.schedule_interval(lambda dt: print(result), 60)
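If you would rather not pull in Kivy just for the timing, a plain standard-library loop also works; a minimal sketch, assuming a 3-hour interval and the hypothetical send_paragraph helper from the email sketch above:

import time

INTERVAL_SECONDS = 3 * 60 * 60  # assumed: one paragraph every 3 hours

for paragraph in result:
    send_paragraph(paragraph)  # hypothetical helper from the earlier email sketch
    time.sleep(INTERVAL_SECONDS)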