Question

Hi, beginner here,

I'm trying to parse a forum. More specifically, the threads' names.

The threads are rendered by the forum engine (vBulletin) like this:

<a href="http://www.example.com/showthread.php?t=555555" id="thread_title_555555">NAME OF THE TITLE</a>

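Using Python and BeautifulSoup, I've had success with other fields. However, I can't manage to match the "id" attribute using a regular expression. I need the parser to go through these lines, find every "a" element whose id contains a six-digit number, and get the text from it.

Something like this: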

for elements in soup.findAll("a"):
    if re.match("thread_title_", element['id']) is not None:
        print element.text

Or, in pseudocode:

for elements in soup.finAll("a", {"id": "thread_title_".*}):
    print element.text

I've tried dozens of variations, but to no avail. What should I do?

Thanks in advance.

Answer 1

\D*(\d{6})

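Doesn't this do what you want? If not, what else have you tried?

Edited: if the part before the six-digit number can itself contain digits, the pattern above will not match, so consider using the regex \w*(\d{6}) instead.

The difference is that \D matches any non-digit character, while \w matches any letter, digit, or underscore.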
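
To make the difference concrete, here is a small sketch that runs both patterns through re.match. Only the first id comes from the question; the other two ids and the helper name six_digits are made up for illustration.

```python
import re

# The first id mirrors the question; the other two are made-up test cases.
ids = ["thread_title_555555", "thread_title_12345", "post_2_title_555555"]

non_digit_prefix = re.compile(r"\D*(\d{6})")  # prefix must contain no digits
word_prefix = re.compile(r"\w*(\d{6})")       # prefix may contain digits

def six_digits(pattern, value):
    """Return the captured six-digit group, or None when re.match fails."""
    m = pattern.match(value)
    return m.group(1) if m else None

# Map each id to (result with \D*, result with \w*).
results = {v: (six_digits(non_digit_prefix, v), six_digits(word_prefix, v)) for v in ids}
for value, pair in results.items():
    print(value, pair)
```

Because re.match anchors at the start of the string, the \D* version gives up on "post_2_title_555555" as soon as it hits the "2", while the \w* version backtracks until six digits remain at the end.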

Answer 2

You can match the ID against a regular expression in the call to findAll():

import re
# Keep only <a> tags whose id starts with "thread_title_"; soup is your parsed page.
for element in soup.findAll("a", id=re.compile("^thread_title_")):
    print(element.text)
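
If you also want to enforce the six-digit part, a stricter pattern is one option. The following is a minimal, self-contained sketch, assuming the newer bs4 package (where find_all is the current name for findAll) and the built-in "html.parser" backend. The sample markup beyond the first anchor is made up for illustration.

```python
import re
from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

# Made-up sample markup; only the first anchor follows the pattern from the question.
html = """
<a href="http://www.example.com/showthread.php?t=555555" id="thread_title_555555">NAME OF THE TITLE</a>
<a href="http://www.example.com/member.php?u=1" id="member_link_1">some user</a>
<a href="http://www.example.com/faq.php">FAQ</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Anchor both ends so that only "thread_title_" plus exactly six digits matches.
thread_id = re.compile(r"^thread_title_\d{6}$")

# Anchors with a non-matching id, or with no id at all, are skipped by the filter.
titles = [a.get_text() for a in soup.find_all("a", id=thread_id)]
print(titles)
```

Passing the compiled pattern as the id filter lets BeautifulSoup do the attribute check itself, so there is no need to index element['id'] by hand, which fails on anchors that have no id.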