Question

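I am using re.search to check whether a string appears in the text of an HTML page. Sometimes it does not find the string, even though the string is definitely there. For example, I want to find: <div class="dlInfo-Speed">. Does anyone know how to write a regular expression that finds this string?

Does anyone know of a good alternative to re.search?

Thanks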

Answer 1

If you only want to determine whether a substring is present, you can use the in operator.

if some_substring in some_string: do_something_exciting()
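A minimal sketch of the same check applied to the div from the question. The names page, needle and contains, and the sample markup, are made up for illustration. Note that in is an exact, case-sensitive comparison, so any extra whitespace inside the tag makes it fail.

```python
# Hypothetical sample page; only the div tag itself comes from the question.
page = '<html><body><div class="dlInfo-Speed">1.2 MB/s</div></body></html>'
needle = '<div class="dlInfo-Speed">'


def contains(haystack, needle):
    """Return True if needle occurs verbatim in haystack."""
    return needle in haystack


found_exact = contains(page, needle)

# The same page with a newline inside the tag: the exact match no longer succeeds.
reformatted = page.replace('<div class', '<div\nclass')
found_reformatted = contains(reformatted, needle)

print(found_exact, found_reformatted)
```

This is likely why a plain search can miss a tag that is visibly present in the page source.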

Answer 2

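As for the regular expression, this is the best I have for now:

import re

# Single quotes around the pattern, so the double quotes inside it need no escaping.
if re.search(r'<[dD][iI][vV]\s+.*?class="dlInfo-Speed".*?>(.*?)</[dD][iI][vV]>',
             html_doc,
             re.DOTALL):
    print("found")
else:
    print("not found")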

http://regexr.com?37iqr
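As a variation on the pattern above, here is a sketch that uses re.IGNORECASE instead of spelling out each letter case, and [^>]* instead of .*? so that attribute matching stays inside the opening tag. This is a swapped-in variant, not the answerer's exact pattern, and the html_doc sample is invented for illustration.

```python
import re

# re.IGNORECASE replaces the [dD][iI][vV] spelling;
# re.DOTALL lets "." in the capture group cross newlines.
pattern = re.compile(
    r'<div\s+[^>]*?class="dlInfo-Speed"[^>]*>(.*?)</div>',
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical input with mixed-case tags and a newline between attributes.
html_doc = '<p>x</p>\n<DIV id="a"\n class="dlInfo-Speed">1.2 MB/s</Div>'

match = pattern.search(html_doc)
contents = match.group(1) if match else None
print(contents)
```

The capture group holds whatever sits between the opening and closing div tags, which stops at the first closing div and so does not handle nested divs.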

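I have found that regular expressions are usually not the best solution for this kind of problem.

My choice is BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/

Here is how to solve the problem using bs4: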

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")  # name the parser explicitly
tag = soup.find("div", class_="dlInfo-Speed")  # None if no such div exists

print(tag.string)  # one way to get the contents

Answer 3

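As mentioned above, the string may not be found because other HTML is mixed in with it. It may also be formatted so that there are line breaks between the tag's attributes, such as: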

some text goes here <div
class="dlInfo-Speed"> More text

or even

some text goes here <div class="dlInfo-Speed"
> More text

You can write a regular expression that allows for whitespace (including newlines and tabs) everywhere it may occur:

re.search(r'<div \s+ class="dlInfo-Speed" \s* >', text, re.VERBOSE)
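A short sketch that runs this kind of pattern against the layouts shown above. Note that re.search takes the pattern first and the string second. The samples list and the DIV_PATTERN name are assumptions made for the example.

```python
import re

# Under re.VERBOSE, literal spaces in the pattern are ignored,
# so only the \s tokens match whitespace in the input.
DIV_PATTERN = re.compile(r'<div \s+ class="dlInfo-Speed" \s* >', re.VERBOSE)

samples = [
    'some text goes here <div class="dlInfo-Speed"> More text',
    'some text goes here <div\nclass="dlInfo-Speed"> More text',
    'some text goes here <div class="dlInfo-Speed"\n> More text',
    'some text goes here <div class="other"> More text',
]

# True for each sample in which the tag is found.
results = [DIV_PATTERN.search(s) is not None for s in samples]
print(results)
```

The first three samples match and the last does not. The pattern still assumes that class is the only attribute and that the value is double-quoted.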

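Overall, though, I strongly agree with the comments: except for very simple, well-defined searches, it is usually best to parse the HTML properly and walk the document tree to find what you are looking for.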
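If installing a third-party parser is not an option, the standard library's html.parser module can do a basic version of this. The following is only a sketch: the DivFinder class, its attributes, and the sample html_doc are hypothetical, and it assumes the page is reasonably well formed.

```python
from html.parser import HTMLParser


class DivFinder(HTMLParser):
    """Collect the text of every <div> that carries a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # greater than 0 while inside a matching div
        self.matches = []   # text content of each matching div

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # a nested div inside a match
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.wanted_class in classes:
                self.depth = 1
                self.matches.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.matches[-1] += data


# Hypothetical input: upper-case tag name, newlines between attributes.
html_doc = 'some text <DIV\nclass="dlInfo-Speed"\n>1.2 <b>MB/s</b></div> more'
finder = DivFinder("dlInfo-Speed")
finder.feed(html_doc)
finder.close()
print(finder.matches)
```

Because the parser normalizes tag case and attribute whitespace, none of the formatting variations discussed above need special handling.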

Answer 4

The string may not be found because it is mixed in with some HTML tags:

<div>string you are <span class="x">looking</span> for</div>

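Perhaps you should try stripping the HTML tags (unless they contain the string you are searching for), so that the text is easier to search. A simple way to do this with a regular expression: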

text = re.sub('<[^<]+?>', '', html_page)
if some_substring in text:
    do_something(text)
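Applied to the mixed-tag example above, the substitution behaves as in this sketch. The strip_tags helper name is made up. The pattern is a rough heuristic: it would mishandle a ">" inside an attribute value, and it leaves script and style contents in place.

```python
import re

TAG_RE = re.compile(r'<[^<]+?>')  # same pattern as in the snippet above


def strip_tags(html_page):
    """Remove anything that looks like an HTML tag."""
    return TAG_RE.sub('', html_page)


html_page = '<div>string you are <span class="x">looking</span> for</div>'
text = strip_tags(html_page)

# The phrase is split by a tag in the raw HTML but contiguous after stripping.
in_raw = 'looking for' in html_page
in_text = 'looking for' in text
print(text, in_raw, in_text)
```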

As an alternative to re.search, you can use the string index method.

try:
    index = html_data.index(some_substring)
    do_something(html_data)
except ValueError:
    # string not found
    pass

Or even the find method:

if html_data.find(some_substring) >= 0:
    do_something(html_data)
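The two methods differ only in how they report a miss: find returns -1, while index raises ValueError. A small sketch with invented sample data and a hypothetical locate helper:

```python
def locate(haystack, needle):
    """Return the offset of needle using str.find, or None if it is absent."""
    pos = haystack.find(needle)
    return pos if pos >= 0 else None


# Hypothetical page content for illustration.
html_data = '<p>speed</p><div class="dlInfo-Speed">1.2 MB/s</div>'
some_substring = '<div class="dlInfo-Speed">'

offset = locate(html_data, some_substring)
missing = locate(html_data, '<div class="absent">')

# str.index signals a miss with an exception instead of -1.
try:
    html_data.index('<div class="absent">')
    raised = False
except ValueError:
    raised = True

print(offset, missing, raised)
```

Like the in operator, both methods require an exact, character-for-character match.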

Alternatives to Python's re.search
