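I want to take everything in an HTML document and capitalize the sentences (within paragraph tags). The input file has everything in all caps.
My attempt has two flaws: first, it removes the paragraph tags themselves, and second, it simply lower-cases everything in the match groups. I don't quite know how capitalize() works, but I assumed that it would leave the first letter of sentences... capitalized.
There may be a much easier way to do this than regex, too. Here's what I have: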
import re

def replace(match):
    return match.group(1).capitalize()

with open('explanation.html', 'rbU') as inf:
    with open('out.html', 'wb') as outf:
        cont = inf.read()
        par = re.compile(r'(?s)\<p(.*?)\<\/p')
        s = re.sub(par, replace, cont)
        outf.write(s)
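Both flaws can be reproduced on a tiny input. The sketch below is only an illustration: the strings `text` and `html` are made up for the demonstration and are not taken from the original file, and it runs on Python 3 strings rather than the byte-mode files used above. It shows that `capitalize()` upper-cases only the first character and lower-cases the rest, and that `re.sub()` replaces the whole match, so returning `group(1)` alone drops the literal `<p` and `</p`.

```python
import re

# str.capitalize() upper-cases the first character and lower-cases the rest,
# so only the first sentence of a multi-sentence string keeps its capital.
text = "FIRST SENTENCE. SECOND SENTENCE."
capitalized = text.capitalize()
print(capitalized)

# Same pattern as in the attempt: the group starts right after "<p",
# so it begins with the ">" that closes the opening tag.
pattern = re.compile(r'(?s)\<p(.*?)\<\/p')
html = "<p>HELLO WORLD. BYE.</p>"

# re.sub() replaces the *entire* match with the callback's return value.
# Returning only group(1) therefore discards the matched "<p" and "</p".
result = pattern.sub(lambda m: m.group(1).capitalize(), html)
print(result)
```

Because the captured group begins with `>`, that is the "first character" `capitalize()` leaves alone, and every letter after it is lower-cased. This would explain why the output looks entirely lower-case.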
Answer 0 (score: 3)
An example using BeautifulSoup and NLTK:
from nltk.tokenize import PunktSentenceTokenizer
from bs4 import BeautifulSoup
html_doc = '''<html><head><title>abcd</title></head><body>
<p>i want to take everything in an HTML document and capitalize the sentences (within paragraph tags).
the input file has everything in all caps.</p>
<p>my attempt has two flaws - first, it removes the paragraph tags, themselves, and second, it simply lower-cases everything in the match groups.
i don't quite know how capitalize() works, but I assumed that it would leave the first letter of sentences... capitalized.</p>
<p>there may be a much easier way to do this than regex, too. Here's what I have:</p>
</body>
</html>'''
soup = BeautifulSoup(html_doc, 'html.parser')
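for paragraph in soup.find_all('p'):
    text = paragraph.get_text()
    # Passing text to the constructor trains Punkt on this paragraph.
    sent_tokenizer = PunktSentenceTokenizer(text)
    # capitalize() upper-cases each sentence's first letter and lower-cases the rest.
    sents = [x.capitalize() for x in sent_tokenizer.tokenize(text)]
    # Assigning .string replaces the paragraph's contents; the <p> tag itself is kept.
    paragraph.string = "\n".join(sents)
print(soup)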
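If installing NLTK is not an option, a rough standard-library alternative is possible. The following is a minimal sketch, not a replacement for a real sentence tokenizer: the names `SENTENCE_START`, `PARAGRAPH`, `sentence_case` and `fix_paragraphs` are hypothetical, and it assumes that a sentence starts either at the beginning of the paragraph or after `.`, `!` or `?` followed by whitespace. It captures the opening and closing tags in their own groups so they can be written back unchanged.

```python
import re

# Naive sentence boundary: start of the text, or ., ! or ? followed by whitespace.
SENTENCE_START = re.compile(r'(^\s*|[.!?]\s+)([a-z])')

# Capture the opening tag, the body and the closing tag separately so the
# tags can be written back unchanged.
PARAGRAPH = re.compile(r'(<p\b[^>]*>)(.*?)(</p>)', re.IGNORECASE | re.DOTALL)


def sentence_case(text):
    """Lower-case text, then upper-case the first letter of each sentence."""
    lowered = text.lower()
    return SENTENCE_START.sub(lambda m: m.group(1) + m.group(2).upper(), lowered)


def fix_paragraphs(html):
    """Apply sentence_case() to the body of every <p> element."""
    return PARAGRAPH.sub(
        lambda m: m.group(1) + sentence_case(m.group(2)) + m.group(3), html)


sample = "<p>THE INPUT IS IN ALL CAPS. SO IS THIS!\nAND THIS?</p>"
fixed = fix_paragraphs(sample)
print(fixed)
```

This heuristic has known limits: it capitalizes after an ellipsis or an abbreviation such as "e.g. ", it leaves the pronoun "i" in lower case, and it lower-cases any inline markup inside the paragraph (including attribute values such as URLs). For anything beyond simple text-only paragraphs, the parser-based approach above is likely the safer choice.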