Extract text across HTML tags as a single string

Time: 2019-05-03 10:36:12

Tags: html python-3.x web-scraping

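I'm trying to extract the text content of the HTML below as one complete sentence, but I can't get it to work. I tried both BeautifulSoup.prettify() and BeautifulSoup.get_text(), but these give me three sentences. I want to read the HTML below as a single sentence like this:

Recognized by Microsoft & Google, Inc., offices.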

<li>Recognized by   
                                    <em>Microsoft</em> &amp; 
                                    <em>Google, Inc.</em>, offices.</li>

2 Answers:

Answer 0 (score: 0)

You can use an HTML parser such as BeautifulSoup to extract the text without the tags (soup.text), then collapse the repeated spaces and newlines into single spaces:

input_str = '''
<li>Recognized by   
                                    <em>Microsoft</em> &amp; 
                                    <em>Google, Inc.</em>, offices.</li>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(input_str,"html.parser")
text = " ".join(soup.text.split())
print(text)

Output:

Recognized by Microsoft & Google, Inc., offices.
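
One caveat before reaching for a shortcut: get_text() also accepts a separator and a strip flag, but that variant strips each text fragment on its own and then joins the fragments with the separator, so it leaves a space between "Inc." and the comma that follows it. A minimal sketch comparing the two on the HTML from the question (the variable names are illustrative, not from the original answer):

```python
from bs4 import BeautifulSoup

html = """
<li>Recognized by
                                    <em>Microsoft</em> &amp;
                                    <em>Google, Inc.</em>, offices.</li>
"""

soup = BeautifulSoup(html, "html.parser")

# Take the full text first, then collapse every run of whitespace to one space.
collapsed = " ".join(soup.get_text().split())

# Strip each text fragment separately, then join the fragments with a space.
per_fragment = soup.get_text(" ", strip=True)

print(collapsed)
print(per_fragment)
```

The split/join form works because str.split() with no arguments splits on any run of whitespace and discards leading and trailing whitespace, so the punctuation stays attached to the word before it.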

Edit: Based on your comment, to get a list of strings as output (one per li tag), you can do this:

input_str = '''<ul> <li>This is sentence one in a order</li> <li>This is sentence two in a order</li> <li>This is sentence <em>Three</em> in a order </li> <li>This is sentence <em>four</em> in a order </li> </ul>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(input_str,"html.parser")

result = []
for li in soup.find_all('li'):
    text = " ".join(li.text.split())
    result.append(text)

print(result)

Output:

['This is sentence one in a order', 'This is sentence two in a order', 'This is sentence Three in a order', 'This is sentence four in a order']
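
The same loop can be wrapped in a small helper so it is easy to reuse on other pages. A sketch, assuming the hypothetical name li_texts (it is not part of BeautifulSoup) and a made-up sample string:

```python
from bs4 import BeautifulSoup


def li_texts(html):
    """Return the whitespace-normalized text of each <li> element in html."""
    soup = BeautifulSoup(html, "html.parser")
    return [" ".join(li.get_text().split()) for li in soup.find_all("li")]


sample = "<ul> <li>one <em>two</em>\n   three </li> <li>four</li> </ul>"
print(li_texts(sample))
```

Note that find_all('li') also returns nested li elements, and the text of an outer li includes the text of any li inside it, so nested lists would need extra handling.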

Answer 1 (score: 0)

I don't really understand what you need, but this will help you extract the content from a website URL:

import requests
from bs4 import BeautifulSoup

# URL the data will be extracted from
url = "https://www.pythonforbeginners.com/files/reading-and-writing-files-in-python"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# Text file where the content will be written; the with block closes it automatically
with open("test.txt", "w", encoding="utf-8") as file:
    # Extract the text of every <p> tag; use whichever tag you need instead
    for paragraph in soup.find_all('p'):
        file.write(paragraph.get_text())
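
To tie this back to the question, the paragraph text can be whitespace-normalized before it is written, so each p tag ends up as one line in the file. A sketch under assumptions: write_paragraphs is a hypothetical helper, and it takes HTML that has already been downloaded (for example page.content from the snippet above), so it can be tried without network access:

```python
from bs4 import BeautifulSoup


def write_paragraphs(html, path):
    """Write each <p> element's text to path, one paragraph per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Collapse runs of spaces and newlines inside each paragraph.
    lines = [" ".join(p.get_text().split()) for p in soup.find_all("p")]
    with open(path, "w", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")
    return len(lines)


count = write_paragraphs("<p>First\n   <b>bold</b> line.</p><p>Second.</p>", "test.txt")
print(count)
```

Keeping the download separate from the parsing also makes the parsing step easy to check against a small hand-written HTML string.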