Question

I want to use Python to extract a regular text string from a web page. The source code looks like this:

<br /><strong>Date: 06/12/2010</strong> <br />

It always starts with

<strong>Date:

and ends with

</strong>

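I have already scraped the text of the web page and only want to extract the date and similarly structured information. Any suggestions on how to do this? (Sorry, this is a newbie question!)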

Answer 1

You can use a regular expression:

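import re

text = "<br /><strong>Date: 06/12/2010</strong> <br />"  # sample input from the question

# The non-greedy named group captures everything between "Date:" and the closing tag.
# Pass re.DOTALL if the value can span several lines (re.MULTILINE only affects ^ and $).
pattern = re.compile(r'<strong>Date:(?P<date>.*?)</strong>')

# Then use it with
pattern.findall(text)  # returns a list of all matches
# or
match = pattern.search(text)  # grabs the first match, or None if there is none
match.groupdict()  # gives a dictionary with key 'date'
# or
match.groups()[0]  # gives you just the text of the match (note the leading space)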

Or try parsing it with Beautiful Soup.
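As a rough sketch of the parser route: the example below uses the standard library's html.parser rather than Beautiful Soup (a deliberate substitution, so that it runs without installing anything). The names StrongTextCollector and extract_dates are made up for illustration, and the code assumes the date always sits inside a <strong> element whose text starts with "Date:".

```python
from html.parser import HTMLParser


class StrongTextCollector(HTMLParser):
    """Collects the text found inside every <strong> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._depth = 0    # > 0 while we are inside a <strong> element
        self.chunks = []   # one string per outermost <strong> element

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            if self._depth == 0:
                self.chunks.append("")
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "strong" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current chunk.
        if self._depth > 0:
            self.chunks[-1] += data


def extract_dates(html_text):
    """Return the values of all <strong>Date: ...</strong> elements."""
    collector = StrongTextCollector()
    collector.feed(html_text)
    collector.close()
    prefix = "Date:"
    return [c[len(prefix):].strip() for c in collector.chunks
            if c.startswith(prefix)]


sample = "<br /><strong>Date: 06/12/2010</strong> <br />"  # sample from the question
print(extract_dates(sample))  # ['06/12/2010']
```

A parser copes better than a regular expression when the tag carries attributes or the markup is nested, at the cost of more code.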

This is a good place to test Python regular expressions.

Answer 2

import re

text = "<br /><strong>Date: 06/12/2010</strong> <br />"
# Capture the label and the value together, up to the first closing tag.
m = re.search(r"<strong>(Date:.*?)</strong>", text)
print(m.group(1))

Output

Date: 06/12/2010
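Since the question also asks about similarly structured information, the same idea can be sketched a little more generally. This assumes every field looks like <strong>Label: value</strong>; the "Venue" field and the name extract_fields are invented for illustration and do not come from the original page.

```python
import re

# Label = everything up to the first colon; value = the rest, up to the closing tag.
FIELD_RE = re.compile(
    r"<strong>\s*(?P<label>[^:<]+):\s*(?P<value>.*?)\s*</strong>",
    re.DOTALL,  # lets a value span several lines
)


def extract_fields(html_text):
    """Return a dict mapping each label to its value (later duplicates win)."""
    return {m.group("label").strip(): m.group("value")
            for m in FIELD_RE.finditer(html_text)}


sample = ("<br /><strong>Date: 06/12/2010</strong> <br />"
          "<strong>Venue: Main Hall</strong>")  # second field is made up
print(extract_fields(sample))  # {'Date': '06/12/2010', 'Venue': 'Main Hall'}
```

Because only the first colon separates label from value, a value that itself contains a colon (a time such as 12:30, say) is kept whole.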

Newbie Python regex question: pulling a date out of a web page
