I'm learning BeautifulSoup, and I want to use a regular expression to filter strings.
For example, given this HTML markup:
<div>apple</div>
<div>android</div>
<div>windows</div>
This code works:
re_words = re.compile(u".*(apple|android).*")
for content in body.findAll("div"):
    if re_words.match(content.text):
        print(content.text)
But I want to add keywords to the regular expression dynamically, so I tried writing this code:
word0 = "apple"
word1 = "android"
regular = "u""\".*("
regular += word0
regular += "|"
regular += word1
regular += ").*\""
re_words = re.compile(regular)
for content in body.findAll("div"):
    if re_words.match(content.text):
        print(content.text)
This time I could not produce a valid pattern for re.compile(). Can anyone help?
Answer 0 (score: 0)
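First, you can pass a compiled regular expression as the text argument of the find_all() call. To build the regular expression dynamically, I would put a placeholder inside the parentheses and join the keywords with |:
keywords = ["apple", "android"]
pattern = r"(%s)" % "|".join(keywords)
for content in body.find_all("div", text=re.compile(pattern)):
    print(content.text)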
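One thing worth adding to the joined pattern: if a keyword can contain regex metacharacters, it probably should go through re.escape() first. Here is a minimal sketch using only the standard library, run against a plain list of strings instead of a parsed document. The keyword "c++" and the sample texts are made up for illustration. For contrast, it also prints the string that the concatenation in the question actually builds: the u and the quote characters end up inside the pattern, so it compiles but match() never succeeds on the div text.

```python
import re

# "c++" is a hypothetical keyword, included because "+" is a regex metacharacter.
keywords = ["apple", "android", "c++"]

# Escape each keyword so it is matched literally, then join with "|".
pattern = re.compile("|".join(re.escape(k) for k in keywords))

# Hypothetical sample texts standing in for the text of each <div>.
texts = ["apple", "android", "windows", "learning c++"]
matched = [t for t in texts if pattern.search(t)]
print(matched)

# The string built in the question: the quotes and the "u" are part of the
# pattern itself, so it only matches text that starts with u" literally.
broken = "u" "\".*(" + "apple" + "|" + "android" + ").*\""
print(broken)
```

Using search() here means the leading and trailing .* from the question are not needed.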
Alternatively, you can pass a callable as the text argument value:
keywords = ["apple", "android"]
for content in body.find_all("div",
                             text=lambda text: any(keyword in text
                                                   for keyword in keywords)):
    print(content.text)
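A possible refinement of the callable, as a sketch: as far as I can tell, the function can be handed None when a tag has no single string of its own (for example a div with nested tags), and `keyword in None` would raise a TypeError. The helper name contains_any below is hypothetical, not part of BeautifulSoup, and the None behaviour is an assumption worth checking against your version.

```python
def contains_any(keywords):
    """Return a predicate that is true when the text contains any keyword."""
    def predicate(text):
        # Assumption: text may be None for tags without a single string,
        # so check for that before doing the substring test.
        return text is not None and any(k in text for k in keywords)
    return predicate


has_keyword = contains_any(["apple", "android"])
print(has_keyword("android phone"))
print(has_keyword("windows"))
print(has_keyword(None))
```

The resulting function could then be passed in place of the lambda, as text=has_keyword.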
Also note that if you need to match the text exactly, you do not need a regular expression:
keywords = ["apple", "android"]
for content in body.find_all("div", text=keywords):
    print(content.text)