Question

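I am trying to extract the text content of the HTML below as one complete sentence, but I can't get it to work. I tried both BeautifulSoup.prettify() and BeautifulSoup.get_text(), but these give me 3 sentences. I want to read the HTML below as a single sentence like this:

Recognized by Microsoft & Google, Inc., offices.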

<li>Recognized by   
                                    <em>Microsoft</em> &amp; 
                                    <em>Google, Inc.</em>, offices.</li>

Answer 1

You can use an HTML parser such as BeautifulSoup to extract the text without the tags (soup.text), then collapse the repeated spaces and newlines in that text:

input_str = '''
<li>Recognized by   
                                    <em>Microsoft</em> &amp; 
                                    <em>Google, Inc.</em>, offices.</li>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(input_str,"html.parser")
text = " ".join(soup.text.split())
print(text)

Output:

Recognized by Microsoft & Google, Inc., offices.
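As a side note on why the whitespace collapse above is used instead of get_text() with a separator: the sketch below runs both on the same snippet. It assumes bs4 is installed; the variable names (html, collapsed, with_separator) are only illustrative. With a separator, a space is inserted at every tag boundary, so it also lands before the comma that follows the closing em tag.

```python
from bs4 import BeautifulSoup

html = """
<li>Recognized by
                                    <em>Microsoft</em> &amp;
                                    <em>Google, Inc.</em>, offices.</li>
"""

soup = BeautifulSoup(html, "html.parser")

# Approach from the answer: take the raw text, then collapse whitespace runs.
collapsed = " ".join(soup.get_text().split())

# Alternative: strip each text fragment and join the fragments with a space.
# The separator is added at every tag boundary, including before ", offices."
with_separator = soup.get_text(" ", strip=True)

print(collapsed)       # Recognized by Microsoft & Google, Inc., offices.
print(with_separator)  # Recognized by Microsoft & Google, Inc. , offices.
```

If the stray space before the comma matters, the split/join version is probably the safer choice for inline tags like em.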

Edit: Based on your comment, to get a list of strings as output (one per li tag), you can do this:

input_str = '''<ul> <li>This is sentence one in a order</li> <li>This is sentence two in a order</li> <li>This is sentence <em>Three</em> in a order </li> <li>This is sentence <em>four</em> in a order </li> </ul>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(input_str,"html.parser")

result = []
for li in soup.find_all('li'):
    text = " ".join(li.text.split())
    result.append(text)

print(result)

Output:

['This is sentence one in a order', 'This is sentence two in a order', 'This is sentence Three in a order', 'This is sentence four in a order']
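If adding a dependency is not an option, roughly the same per-li result can be sketched with the standard library's html.parser instead of BeautifulSoup. This is a different technique from the answer above, not a drop-in equivalent: the class name ListItemText is made up for illustration, and the sketch assumes flat (non-nested) li elements.

```python
from html.parser import HTMLParser


class ListItemText(HTMLParser):
    """Collect the whitespace-normalized text of each <li> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.items = []     # finished <li> texts
        self._parts = None  # fragments of the <li> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <li>.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._parts is not None:
            # Join fragments as-is, then collapse whitespace runs.
            self.items.append(" ".join("".join(self._parts).split()))
            self._parts = None


parser = ListItemText()
parser.feed(
    "<ul> <li>This is sentence <em>Three</em> in a order </li> "
    "<li>Recognized by <em>Microsoft</em> &amp; "
    "<em>Google, Inc.</em>, offices.</li> </ul>"
)
parser.close()
items = parser.items
print(items)
```

The fragments are concatenated without a separator before the whitespace collapse, which keeps punctuation attached to the preceding inline tag.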

Answer 2

I don't really understand what you need, but this will help you extract content from a website's URL:

import requests
from bs4 import BeautifulSoup

# URL from which the data will be extracted
url = "https://www.pythonforbeginners.com/files/reading-and-writing-files-in-python"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Text file where the content will be written
with open("test.txt", "w", encoding="utf-8") as file:
    # Extract the text of every <p> tag; change the tag name to suit your needs
    for paragraph in soup.find_all('p'):
        file.write(paragraph.get_text())

Extracting text across HTML tags as a single string
