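I am trying to extract the text content of the HTML below as one complete sentence, but I cannot get it to work. I tried both BeautifulSoup.prettify() and BeautifulSoup.get_text(), but these give me three sentences. I want the HTML below to come out as:
Recognized by Microsoft & Google, Inc., offices.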
<li>Recognized by
<em>Microsoft</em> &
<em>Google, Inc.</em>, offices.</li>
Answer 0 (score: 0)
You can use an HTML parser such as BeautifulSoup to extract the text without the tags (soup.text), then collapse the repeated spaces and newlines in that text:
input_str = '''
<li>Recognized by
<em>Microsoft</em> &
<em>Google, Inc.</em>, offices.</li>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(input_str,"html.parser")
text = " ".join(soup.text.split())
print(text)
Output:
Recognized by Microsoft & Google, Inc., offices.
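The whitespace step is plain Python and can be looked at on its own. The sketch below assumes a string shaped like the tag-free text of the snippet (the variable `raw` is a hypothetical stand-in, not something BeautifulSoup produced here) and shows that `str.split()` with no argument splits on any run of whitespace, including newlines, and drops the leading and trailing whitespace:

```python
# Hypothetical stand-in for the tag-free text of the snippet above
raw = "\nRecognized by\nMicrosoft &\nGoogle, Inc., offices.\n"

# With no argument, split() treats any run of spaces, tabs, or newlines
# as one separator and ignores leading/trailing whitespace.
words = raw.split()

# Rejoin the words with exactly one space between them.
sentence = " ".join(words)

print(words)
print(sentence)
```

Because the punctuation stays attached to its word, the comma after "Inc." keeps its place and no stray space is introduced.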
Edit: following your comment, to get a list of strings as output (one per li tag), you can do this:
input_str = '''<ul> <li>This is sentence one in a order</li> <li>This is sentence two in a order</li> <li>This is sentence <em>Three</em> in a order </li> <li>This is sentence <em>four</em> in a order </li> </ul>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(input_str,"html.parser")
result = []
for li in soup.find_all('li'):
    text = " ".join(li.text.split())
    result.append(text)
print(result)
Output:
['This is sentence one in a order', 'This is sentence two in a order', 'This is sentence Three in a order', 'This is sentence four in a order']
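If this is needed in more than one place, the loop can be folded into a small function. This is only a sketch of the same approach: the function name `li_sentences` and the sample markup are made up for illustration, and it assumes bs4 is installed.

```python
from bs4 import BeautifulSoup


def li_sentences(html):
    """Return one whitespace-normalized string per <li> tag in html."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text() drops nested tags such as <em>; split/join collapses
    # the newlines and repeated spaces left behind.
    return [" ".join(li.get_text().split()) for li in soup.find_all("li")]


# Hypothetical sample input, not taken from the question
sample = """
<ul>
  <li>First <em>item</em>
      on two lines</li>
  <li>  Second item  </li>
</ul>
"""

sentences = li_sentences(sample)
print(sentences)
```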
Answer 1 (score: 0)
I don't fully understand what you need, but this will help you extract content from a website URL:
import requests
from bs4 import BeautifulSoup

# URL the data will be extracted from
url = "https://www.pythonforbeginners.com/files/reading-and-writing-files-in-python"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Text file the content will be written to; "with" closes it automatically
with open("test.txt", "w", encoding="utf-8") as file:
    # Extract the content of every <p> tag on the page;
    # swap in whichever tag you need
    for paragraph in soup.find_all('p'):
        file.write(paragraph.get_text())
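Note that the script above writes the paragraphs back to back with no separator. A possible variant, sketched here under the assumption that one cleaned paragraph per line is wanted, keeps the parsing in its own function so it can be tried on an inline string without any network access. The name `paragraph_lines` and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup


def paragraph_lines(html):
    """Return the text of each non-empty <p> tag, whitespace-normalized."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for p in soup.find_all("p"):
        text = " ".join(p.get_text().split())
        if text:  # skip paragraphs that contain no text
            lines.append(text)
    return lines


# Hypothetical sample input standing in for a downloaded page
sample_html = "<div><p>First\n  paragraph.</p><p></p><p>Second <b>one</b>.</p></div>"

lines = paragraph_lines(sample_html)
print("\n".join(lines))
```

With a real page, the same function could be given the body of the response from requests.get, and the joined result written to the file in one call.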