Regex to capitalize paragraphs in HTML with Python

Time: 2015-09-09 18:40:59

Tags: python html regex python-2.7

I want to take everything in an HTML document and capitalize the sentences (within paragraph tags). The input file has everything in all caps.

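My attempt has two flaws: first, it removes the paragraph tags themselves, and second, it simply lowercases everything in the match groups. I don't quite know how capitalize() works, but I assumed that it would leave the first letter of each sentence capitalized.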

There may be a much easier way to do this than regex, too. Here's what I have:

import re

def replace(match):
    return match.group(1).capitalize()

with open('explanation.html', 'rbU') as inf:
    with open('out.html', 'wb') as outf:
        cont = inf.read()
        par = re.compile(r'(?s)\<p(.*?)\<\/p')
        s = re.sub(par, replace, cont)
        outf.write(s)

1 answer:

Answer 0 (score: 3)

An example using BeautifulSoup and nltk:

from nltk.tokenize import PunktSentenceTokenizer
from bs4 import BeautifulSoup

html_doc = '''<html><head><title>abcd</title></head><body>
<p>i want to take everything in an HTML document and capitalize the sentences (within paragraph tags).
the input file has everything in all caps.</p>
<p>my attempt has two flaws - first, it removes the paragraph tags, themselves, and second, it simply lower-cases everything in the match groups.
 i don't quite know how capitalize() works, but I assumed that it would leave the first letter of sentences... capitalized.</p>
<p>there may be a much easier way to do this than regex, too. Here's what I have:</p>
</body>
</html>'''

soup = BeautifulSoup(html_doc, 'html.parser')

for paragraph in soup.find_all('p'):
    text = paragraph.get_text()
    sent_tokenizer = PunktSentenceTokenizer(text)
    sents = [x.capitalize() for x in sent_tokenizer.tokenize(text)]
    paragraph.string = "\n".join(sents)

print(soup)
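
As for why the original attempt misbehaves: re.sub replaces the whole match, so returning only match.group(1) drops the `<p` and `</p` that sit outside the group. In addition, str.capitalize() uppercases only the first character of the string and lowercases everything else, and here the first character of the group is `>`, so no letter ends up uppercase. If you would rather stay with regex and the standard library, the following is a minimal sketch, not a robust HTML parser. The helper names (capitalize_sentences, fix_paragraph, capitalize_paragraphs) are made up for illustration, and the sentence rule (a letter at the start, or after `.`, `!` or `?` followed by whitespace) is a naive assumption that will mis-handle abbreviations and ellipses. It also lowercases any inline tags inside the paragraph.

```python
import re

# Capture the opening tag, the body, and the closing tag separately
# so the tags can be written back unchanged.
PARAGRAPH = re.compile(r'(?is)(<p\b[^>]*>)(.*?)(</p>)')

# A sentence start: the first letter of the text (after optional whitespace),
# or a letter following ., ! or ? plus whitespace. Naive by design.
SENTENCE_START = re.compile(r'(^\s*|[.!?]\s+)([a-z])')


def capitalize_sentences(text):
    """Lowercase the text, then uppercase the first letter of each sentence."""
    lowered = text.lower()
    return SENTENCE_START.sub(
        lambda m: m.group(1) + m.group(2).upper(), lowered)


def fix_paragraph(match):
    """Rebuild one <p>...</p> match, keeping both tags intact."""
    open_tag, body, close_tag = match.groups()
    return open_tag + capitalize_sentences(body) + close_tag


def capitalize_paragraphs(html):
    """Apply sentence capitalization inside every paragraph element."""
    return PARAGRAPH.sub(fix_paragraph, html)


sample = '<p class="x">FIRST SENTENCE HERE. SECOND ONE!\nTHIRD ONE?</p>'
print(capitalize_paragraphs(sample))
```

Text outside paragraph tags is left untouched, and `<pre>` is not matched because of the `\b` after `p`. For anything beyond simple, well-formed input, the BeautifulSoup approach above is the safer choice.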