How do I merge a multi-line paragraph in HTML into a single line?

Time: 2021-03-18 12:47:24

Tags: python html beautifulsoup

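I have an HTML file that contains the headings and paragraphs of a PDF file. In this file, however, every line of a paragraph is treated as a separate paragraph, so I end up with many <p>-tagged lines and cannot build a single paragraph out of the multiple lines. Can anyone suggest a way to solve this?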

This is what I get:

["<p>Forti provides access to a diverse array of Forti solutions through a single sign-on ",
  "<p>including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti ",
  "<p>cloud-based management and services. Forti accounts are free which require a license for ",
  "<p>each solution. "]

This is what I want:

['Forti provides access to a diverse array of Forti solutions through a single sign-on including Forti Cloud, FortiWeb Cloud, Forti, Forti, Forti and other Forti cloud-based management and services. Forti accounts are free which require a license for each solution. ']

This is what I have done so far:

import json

from bs4 import BeautifulSoup

paragraphs_1 = []
local_path = "file.json"
# file.json holds the list of "<p>..." strings shown above
with open(local_path) as f:
    data = json.load(f)
for x in data:
    soup = BeautifulSoup(x, 'html.parser')
    for paragraph in soup.find_all("p"):
        paragraphs_1.append(paragraph.get_text())

3 Answers:

Answer 0: (score: 1)

You can use the replace function to get rid of all the <p> tags, like this:

yourtext.replace("<p>", "") 
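The one-liner above works on a single string. A minimal sketch of applying it to the whole list from the question and merging the result might look like the following; the variable names lines, merged and result are only illustrative, and the sample data is copied from the question.

```python
# Sample input copied from the question: each line of the paragraph
# arrives as its own "<p>..." string.
lines = [
    "<p>Forti provides access to a diverse array of Forti solutions through a single sign-on ",
    "<p>including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti ",
    "<p>cloud-based management and services. Forti accounts are free which require a license for ",
    "<p>each solution. ",
]

# Strip the opening tag from every line, then concatenate.
# Each line already ends with a space, so no extra separator is needed.
merged = "".join(line.replace("<p>", "") for line in lines)

# Wrap the merged paragraph in a list, as requested in the question.
result = [merged]
```

This only handles a bare <p> tag; tags with attributes or closing </p> tags would need a different approach.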

Answer 1: (score: 0)

Try this code:

new_list = []
for text in my_list_of_text:
    # first, remove the <p> tag from each line
    new_list.append(text.replace('<p>', ''))
# next, join the lines into one long string
listToStr = ' '.join(str(elem) for elem in new_list)
# collapse the double spaces left where a line already ended with a space
final_text = listToStr.replace('  ', ' ')

There are more sophisticated ways, for example using simplenlg. But for your problem, this code should be enough.
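One caveat: replacing a double space with a single space handles exactly two consecutive spaces, so a longer run would still leave extra spaces behind. A possible variation, sketched here with a hypothetical helper named merge_lines, is to split on whitespace and re-join:

```python
def merge_lines(lines):
    """Hypothetical helper: drop '<p>' tags and join lines with single spaces."""
    stripped = (line.replace("<p>", "") for line in lines)
    # str.split() with no arguments splits on any run of whitespace,
    # so re-joining with " " collapses repeated spaces and trims both ends.
    return " ".join(" ".join(stripped).split())


final_text = merge_lines(["<p>each line ends  with a space ", "<p>and   so on "])
```

Note that this also removes the trailing space that appears in the desired output in the question, which may or may not matter for your use case.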

Answer 2: (score: 0)

The following function helps strip the tags from raw_html:

import re

def cleanhtml(raw_html):
  cleanr = re.compile('<.*?>')
  cleantext = re.sub(cleanr, '', raw_html)
  return cleantext

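If you want to combine multiple elements of a list and return them as a single paragraph, you can try .join() (para here is presumably the list of <p> strings from the question):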

paragraph = cleanhtml(str(''.join(para)))

Output:

'Forti provides access to a diverse array of Forti solutions through a single sign-on including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti cloud-based management and services. Forti accounts are free which require a license for each solution. '

To return it as a list:

paragraph = [cleanhtml(str(''.join(para)))]

Output:

['Forti provides access to a diverse array of Forti solutions through a single sign-on including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti cloud-based management and services. Forti accounts are free which require a license for each solution. ']