Question

I have many HTML files in a folder. I need to check whether they all contain this tag:

<strong>QQ</strong>

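and I only need to extract "QQ" and its content. I started by reading one of the files as a test, but my regular expression does not seem to match. If I replace fo_read with the literal tag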

<strong>QQ</strong>

it does match.

import re

fo = open('4251-fu.html', "r")
fo_read = fo.read()
# Looks for the exact literal tag, with no attributes or whitespace inside it
m = re.search('<strong>(QQ)</strong>', fo_read)
if m:
    print('Match found:', m.group(1))
else:
    print('No match')
fo.close()

Answer 1

You could try BeautifulSoup:

from bs4 import BeautifulSoup
f = open('4251-fu.html', mode='r')
soup = BeautifulSoup(f, 'lxml')
# Serialize every <strong> element back to a string
search_result = [str(e) for e in soup.find_all('strong')]
print(search_result)
if '<strong>Question-and-Answer Session</strong>' in search_result:
    print('Match found')
else:
    print('No match')
f.close()

Output:

['<strong>Question-and-Answer Session1</strong>', '<strong>Question-and-Answer Session</strong>', '<strong>Question-and-Answer Session3</strong>']
Match found
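
The answer above checks a single file, while the question mentions a whole folder. A minimal sketch of a folder-wide check follows. It swaps BeautifulSoup for the standard library's html.parser so it runs without extra installs, and it compares stripped text rather than the serialized tag, so attributes or whitespace inside the tag do not matter. The names StrongTextCollector, strong_texts and files_missing are hypothetical, and the UTF-8 encoding is an assumption about the files.

```python
from html.parser import HTMLParser
from pathlib import Path


class StrongTextCollector(HTMLParser):
    """Collect the text content of every <strong> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a <strong> element
        self.parts = []     # text pieces of the element being read
        self.texts = []     # finished, whitespace-stripped texts

    def handle_starttag(self, tag, attrs):
        if tag == "strong":          # HTMLParser lowercases tag names
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "strong" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.texts.append("".join(self.parts).strip())
                self.parts = []

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def strong_texts(html):
    """Return the stripped text of each <strong> element in html."""
    collector = StrongTextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


def files_missing(folder, wanted):
    """Return names of *.html files with no <strong> whose text equals wanted."""
    missing = []
    for path in sorted(Path(folder).glob("*.html")):
        # Assumption: files are UTF-8; undecodable bytes are replaced
        html = path.read_text(encoding="utf-8", errors="replace")
        if wanted not in strong_texts(html):
            missing.append(path.name)
    return missing
```

Calling files_missing('.', 'Question-and-Answer Session') would then list the files in the current folder that lack the heading; an empty list means every file contains it.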

Answer 2

import re

# `soup` is the BeautifulSoup object built as in Answer 1
result = soup.find("strong", string=re.compile("Question-and-Answer Session"))
if result:
    print("Question-and-Answer Session")
    # for the rest of text in the parent
    rest = result.parent.text.split("Question-and-Answer Session")[-1].strip()
    print(rest)
else:
    print("no match")
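
If you would rather stay with a regex as in the question, one possible reason for the mismatch (an assumption, since the file itself is not shown) is that the tag in the real file carries attributes, uses different letter case, or has whitespace or line breaks around the text. A sketch of a more tolerant pattern follows; STRONG_RE and find_strong are hypothetical names. A regex is not a full HTML parser, so markup nested inside the strong element is not handled here.

```python
import re

# Tolerates attributes on the tag, upper/lower-case tag names, and
# whitespace or line breaks around the text.
STRONG_RE = re.compile(
    r"<strong\b[^>]*>\s*(.*?)\s*</strong\s*>",
    re.IGNORECASE | re.DOTALL,
)


def find_strong(html, wanted):
    """Return the text of the first <strong> whose content equals wanted, else None."""
    for m in STRONG_RE.finditer(html):
        # Collapse runs of whitespace inside the tag to single spaces
        text = " ".join(m.group(1).split())
        if text == wanted:
            return text
    return None
```

For example, find_strong(fo_read, 'QQ') could replace the re.search call in the question.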

Python regex: extracting HTML file content inside a tag

2 answers: