I have many HTML files in a folder. I need to check whether every one of them contains this tag:
<strong>QQ</strong>
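and I only need to extract "QQ" and its content. I started by reading one of the files as a test, but my regular expression does not seem to match. If I replace fo_read with the literal tag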
<strong>QQ</strong>
it does match.
import re

with open('4251-fu.html', 'r') as fo:
    fo_read = fo.read()

# Search for the literal tag; group(1) is the text inside <strong>
m = re.search('<strong>(QQ)</strong>', fo_read)
if m:
    print('Match found: ', m.group(1))
else:
    print('No match')
Answer 0 (score: 0)
You could try BeautifulSoup:
from bs4 import BeautifulSoup

with open('4251-fu.html', mode='r') as f:
    soup = BeautifulSoup(f, 'lxml')  # requires the lxml parser to be installed

# Serialize every <strong> element back to an HTML string
search_result = [str(e) for e in soup.find_all('strong')]
print(search_result)
if '<strong>Question-and-Answer Session</strong>' in search_result:
    print('Match found')
else:
    print('No match')
Output:
['<strong>Question-and-Answer Session1</strong>', '<strong>Question-and-Answer Session</strong>', '<strong>Question-and-Answer Session3</strong>']
Match found
Answer 1 (score: 0)
import re

# Assumes `soup` is a BeautifulSoup object built from the file,
# e.g. BeautifulSoup(f, 'lxml')
result = soup.find("strong", string=re.compile("Question-and-Answer Session"))
if result:
    print("Question-and-Answer Session")
    # Text that follows the heading inside the parent element
    rest = result.parent.text.split("Question-and-Answer Session")[-1].strip()
    print(rest)
else:
    print("no match")
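The question also asks about checking every file in the folder, which neither snippet covers. Below is a minimal standard-library sketch of that loop, not a definitive solution. It assumes the usual reason a literal pattern fails on a real file is extra whitespace, attributes on the tag, or different letter case, which may not be the actual cause here. The names `strong_pattern` and `scan_folder` are made up for illustration, and the regex is only a rough stand-in for a real HTML parser such as BeautifulSoup.

```python
import re
from pathlib import Path


def strong_pattern(label):
    """Build a tolerant regex for <strong>label</strong> (hypothetical helper)."""
    # Allow attributes on the opening tag, whitespace (including newlines)
    # around the label, and any letter case.
    return re.compile(
        r"<strong\b[^>]*>\s*(" + re.escape(label) + r")\s*</strong>",
        re.IGNORECASE,
    )


def scan_folder(folder, label):
    """Return {file name: True/False} for every .html file directly in `folder`."""
    pattern = strong_pattern(label)
    results = {}
    for path in sorted(Path(folder).glob("*.html")):
        # errors="replace" keeps the scan going on files with odd encodings
        text = path.read_text(encoding="utf-8", errors="replace")
        results[path.name] = pattern.search(text) is not None
    return results
```

Calling `scan_folder(".", "Question-and-Answer Session")` returns a dictionary you can loop over to print "Match found" or "No match" per file. If the tag can contain nested markup, the BeautifulSoup approach above is the safer choice.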