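I have an HTML file that contains the headings and paragraphs of a PDF file. In this file, however, every line of a paragraph is treated as a separate paragraph, which is why it produces many <p> tag lines, so I can't build a single multi-line paragraph. Can anyone suggest a way to solve this?
This is what I get: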
["<p>Forti provides access to a diverse array of Forti solutions through a single sign-on ",
"<p>including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti ",
"<p>cloud-based management and services. Forti accounts are free which require a license for ",
"<p>each solution. "]
What I want:
['Forti provides access to a diverse array of Forti solutions through a single sign-on including Forti Cloud, FortiWeb Cloud, Forti, Forti, Forti and other Forti cloud-based management and services. Forti accounts are free which require a license for each solution. ']
This is what I have done so far:
import json
from bs4 import BeautifulSoup

paragraphs_1 = []
local_path = "file.json"
data = json.loads(open(local_path).read())
for x in data:
    soup = BeautifulSoup(x, 'html.parser')
    for paragraphs in soup.find_all("p"):
        paragraphs_1.append(paragraphs.get_text())
Answer 0 (score: 1)
You can use the replace function to get rid of all the <p> tags, like:
yourtext.replace("<p>", "")
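A minimal sketch of how that replace call could be applied to the whole list and the pieces merged into one paragraph. The name `fragments` is hypothetical, and the sample data is shortened from the question. Because each fragment already ends with a space, an empty-string join is assumed to be enough here.

```python
# Hypothetical sample: two of the fragments from the question
fragments = [
    "<p>cloud-based management and services. Forti accounts are free which require a license for ",
    "<p>each solution. ",
]

# Strip the opening tag from every fragment, then glue them together.
# Each fragment ends with a space, so no extra separator is needed.
merged = "".join(fragment.replace("<p>", "") for fragment in fragments)

# Wrap in a list to match the shape asked for in the question
print([merged])
```

This only removes the literal `<p>` string; any other tags in the input would be left untouched.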
Answer 1 (score: 0)
Try this code:
new_list = []
for text in my_list_of_text:
    # first remove <p>
    new_list.append(text.replace('<p>', ''))
# next step: create one long text using a list comprehension
listToStr = ' '.join([str(elem) for elem in new_list])
# remove possible double spaces
final_text = listToStr.replace('  ', ' ')
There are more sophisticated ways, for example using simplenlg. But for your problem, this code should be enough.
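One caveat with a single replace call for doubled spaces: a run of three or more spaces is only shortened, not reduced to one. A possible alternative, sketched here with a hypothetical helper name `collapse_whitespace`, is to split on whitespace and rejoin. Note that this also trims leading and trailing spaces, which may or may not be wanted.

```python
def collapse_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace
    # and drops empty pieces, so rejoining leaves single spaces only.
    return " ".join(text.split())


# Hypothetical input with runs of two and three spaces
joined = "each solution.  Forti   accounts are free "
print(collapse_whitespace(joined))
```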
Answer 2 (score: 0)
The following function helps strip the tags from raw_html:
import re

def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext
If you want to combine multiple elements of a list and return them as a single paragraph, you can try .join():
paragraph = cleanhtml(str(''.join(para)))
Output:
'Forti provides access to a diverse array of Forti solutions through a single sign-on including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti cloud-based management and services. Forti accounts are free which require a license for each solution. '
Or, to return it as a list:
paragraph = [cleanhtml(str(''.join(para)))]
Output:
['Forti provides access to a diverse array of Forti solutions through a single sign-on including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti cloud-based management and services. Forti accounts are free which require a license for each solution. ']
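The snippets in this answer never define `para`. As a self-contained sketch, assuming `para` is simply the list of fragments shown in the question, the pieces fit together roughly like this:

```python
import re


def cleanhtml(raw_html):
    # Remove anything that looks like an HTML tag (non-greedy match)
    return re.sub(r'<.*?>', '', raw_html)


# Assumption: `para` holds the fragments exactly as given in the question
para = [
    "<p>Forti provides access to a diverse array of Forti solutions through a single sign-on ",
    "<p>including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti ",
    "<p>cloud-based management and services. Forti accounts are free which require a license for ",
    "<p>each solution. ",
]

# Join first, then strip the tags from the combined string
paragraph = cleanhtml(''.join(para))
print(paragraph)

# Same result wrapped in a one-element list
paragraph_list = [paragraph]
```

The regex treats any `<...>` span as a tag, so text containing literal angle brackets would also be stripped; for input like that, an HTML parser is probably the safer choice.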