Python BeautifulSoup paragraph text only

Time: 2019-03-18 09:13:21

Tags: python beautifulsoup

I'm new to anything related to web scraping, and as far as I know, Requests and BeautifulSoup are the way to go. I want to write a program that emails me just one paragraph of a given link every couple of hours (trying out a new way of reading blogs through the day). Say this particular link 'https://fs.blog/mental-models/' has a paragraph each on different models.

from bs4 import BeautifulSoup
import re
import requests


url = 'https://fs.blog/mental-models/'

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

Now, the soup has a few layers of wall to get through before the paragraph text begins: <p> this is what I want to read </p>

soup.title.string works just fine, but I don't know how to move forward from here. Any direction, please?

Thanks

3 answers:

Answer 0: (score: 2)

Find all the p tags with soup.find_all('p'), then use .text to get the text of each:

Also, since you don't want the footer paragraphs, do everything under the div with class rte.

from bs4 import BeautifulSoup
import requests

url = 'https://fs.blog/mental-models/'    
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

divTags = soup.find_all("div", {"class": "rte"})
for div in divTags:
    pTags = div.find_all('p')
    for p in pTags[:-2]:  # trim the last two irrelevant-looking lines
        print(p.text)

Output

Mental models are how we understand the world. Not only do they shape what we think and how we understand but they shape the connections and opportunities that we see.
.
.
.
5. Mutually Assured Destruction
Somewhat paradoxically, the stronger two opponents become, the less likely they may be to destroy one another. This process of mutually assured destruction occurs not just in warfare, as with the development of global nuclear warheads, but also in business, as with the avoidance of destructive price wars between competitors. However, in a fat-tailed world, it is also possible that mutually assured destruction scenarios simply make destruction more severe in the event of a mistake (pushing destruction into the “tails” of the distribution).

Answer 1: (score: 1)

If you want to get the text of all the p tags, you can loop over them using the find_all method:

from bs4 import BeautifulSoup
import requests


url = 'https://fs.blog/mental-models/'

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

data = soup.find_all('p')
for p in data:
    text = p.get_text()
    print(text)

Edit:

Here is the code to keep them separately in a list. You can apply a loop on the resulting list to remove empty strings and unused characters such as \n, etc.

from bs4 import BeautifulSoup
import requests


url = 'https://fs.blog/mental-models/'

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

data = soup.find_all('p')
result = []
for p in data:
    result.append(p.get_text())

print(result)
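The cleanup step mentioned above can be done with one list comprehension. A minimal sketch, using a hypothetical `raw` list that mimics what `p.get_text()` can return:

```python
# Hypothetical raw output resembling p.get_text() results:
raw = ['First paragraph.\n', '', '\n', '  Second paragraph  ']

# Strip surrounding whitespace/newlines and drop empty strings in one pass:
cleaned = [s.strip() for s in raw if s.strip()]
print(cleaned)  # ['First paragraph.', 'Second paragraph']
```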

Answer 2: (score: 1)

Here is the solution:

from bs4 import BeautifulSoup
import requests
from kivy.clock import Clock  # Clock here is Kivy's scheduler; it only ticks inside a running Kivy app

url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
data = soup.find_all('p')

result = []

for p in data:
    result.append(p.get_text())

# Pass a callback, not a call: print(result) would execute once immediately
# and hand Clock a None instead of a function.
Clock.schedule_interval(lambda dt: print(result), 60)
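Note that the `Clock` above only fires inside a running Kivy event loop, and nothing in this answer actually emails anything. For the asker's original goal (mail one paragraph every few hours), a stdlib-only sketch could look like the following; the SMTP host, port, and addresses are placeholders you would replace with your own:

```python
import smtplib
import time
from email.message import EmailMessage


def build_email(paragraph, sender, recipient):
    # Wrap one scraped paragraph in a plain-text email message.
    msg = EmailMessage()
    msg['Subject'] = 'Your mental-models paragraph'
    msg['From'] = sender
    msg['To'] = recipient
    msg.set_content(paragraph)
    return msg


def send_paragraphs(paragraphs, interval_hours=3):
    # Send one paragraph per interval until the list is exhausted.
    for p in paragraphs:
        msg = build_email(p, 'me@example.com', 'me@example.com')
        # Placeholder SMTP server; a real setup needs host, port, and login.
        with smtplib.SMTP('smtp.example.com', 587) as server:
            server.send_message(msg)
        time.sleep(interval_hours * 3600)
```

A long-running script like this is fragile; on a real machine, a cron job that scrapes, mails one paragraph, and exits is usually more robust than a process that sleeps for hours.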