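I want to extract the text that follows a specific sentence in a file.
Answer 0 (score: 1)
Do you specifically need BeautifulSoup? If not, use the following.
To split the text immediately after a specific sentence, try this. Since I'm not sure exactly what you want to extract after the sentence, I'll assume you mean everything that follows it.
For example, if I have a file file.txt: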
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus congue mattis risus, sit amet elementum lorem gravida eu. Vitae ante vel erat feugiat scelerisque. Etiam nec urna sed enim blandit blandit non nec odio. Quisque lacinia tempus rhoncus. Mauris euismod leo ut velit lobortis feugiat. Phasellus ultrices nunc sit amet tortor pretium eu mollis neque condimentum. Fusce placerat bibendum diam eget euismod. Phasellus ultricies erat nibh, sed volutpat quam. Nunc quis mauris sed purus aliquet aliquam. Integer viverra rutrum arcu ac tempor.
and the sentence I want to split on is Mauris euismod leo ut velit lobortis feugiat.
then you can do this:
with open("file.txt") as file:  # open the file safely; it is closed automatically
    separator = "Mauris euismod leo ut velit lobortis feugiat."  # the sentence to split on
    for line in file:
        if separator in line:  # skip lines that do not contain the sentence
            text = line.split(separator, 1)[1]  # keep everything after the first match
            print(text)
>>> Phasellus ultrices nunc sit amet tortor pretium eu mollis neque condimentum. Fusce placerat bibendum diam eget euismod. Phasellus ultricies erat nibh, sed volutpat quam. Nunc quis mauris sed purus aliquet aliquam. Integer viverra rutrum arcu ac tempor.
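The loop above only finds the sentence when it sits entirely on one line. A minimal sketch of an alternative, assuming the file is small enough to read into memory in one go, is to search the whole text with `str.partition`. The helper name `text_after` and the shortened sample string are made up for illustration and are not part of the original answer.

```python
def text_after(text, separator):
    """Return everything after the first occurrence of separator, or None if absent."""
    # str.partition returns (before, separator, after); the middle item is
    # an empty string when the separator does not occur in the text.
    _before, found, after = text.partition(separator)
    return after.strip() if found else None


# Shortened sample standing in for the contents of file.txt (an assumption).
sample = (
    "Quisque lacinia tempus rhoncus. "
    "Mauris euismod leo ut velit lobortis feugiat. "
    "Phasellus ultrices nunc sit amet."
)
print(text_after(sample, "Mauris euismod leo ut velit lobortis feugiat."))
# prints: Phasellus ultrices nunc sit amet.
```

For a real file, the same helper could be called as `text_after(open_file.read(), separator)` inside a `with open("file.txt") as open_file:` block. Returning `None` when the sentence is missing avoids the `IndexError` that indexing the result of `split` would raise.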
Using BeautifulSoup
You can extract all of the text from the document like this. If you need something more specific, let me know.
from bs4 import BeautifulSoup
html = """<html><body><div style="DISPLAY: block; TEXT-INDENT: 0pt"><br/></div> <div align="justify" style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: Arial">Our Earnings are Significantly Affected by General Business and Economic Conditions</font></div></body></html>"""
soup = BeautifulSoup(html, "html.parser")  # parse the markup first; a plain string has no get_text()
print(soup.get_text())
Output:
Our Earnings are Significantly Affected by General Business and Economic Conditions
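If the sentence lives inside an HTML document, the two approaches can be combined: pull the visible text out with `get_text()` and then split it on the sentence. The following is only a sketch under assumptions. The markup, the sentence and the helper name `text_after_in_html` are invented for illustration, and it assumes that joining text nodes with single spaces is acceptable for your data.

```python
from bs4 import BeautifulSoup


def text_after_in_html(html, separator):
    """Return the visible text that follows separator in an HTML string, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Join the text nodes with single spaces and trim each one, so the tags
    # do not leave stray whitespace in the extracted text.
    text = soup.get_text(" ", strip=True)
    _before, found, after = text.partition(separator)
    return after.strip() if found else None


# Hypothetical markup: a bold heading followed by a paragraph of body text.
html = "<div><b>Risk Factors.</b></div><div>Our earnings depend on economic conditions.</div>"
print(text_after_in_html(html, "Risk Factors."))
# prints: Our earnings depend on economic conditions.
```

Note that a sentence split across several tags is rejoined with spaces here, so the separator string has to match that spacing.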