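I have a very large defaultdict of dicts, and each inner dict contains the HTML from an email body. I only want to return one http string from the inner dict. What is the best way to extract it?
Do I need to convert the dict to some other data structure before using a regular expression? Is there a better way? I'm still fairly new to this and would appreciate any pointers.
For example, this is what I'm working with: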
defaultdict(<type 'dict'>, {16: {u'SEQ': 16, u'RFC822': u'Delivered-To:
somebody@email.com LOTS MORE HTML until http://the_url_I_want_to_extract.com' }}
One thing I tried was calling re.findall on the defaultdict, but that did not work:
confirmation_link = re.findall('Click this link to confirm your registration:<br />"(.*?)"', body)
for conf in confirmation_link:
    print conf
Error:
line 177, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
Answer 0 (score: 1)
Once you have iterated over the dictionary to reach the relevant value, you can simply apply a regular expression to it:
import re
from collections import defaultdict

d = defaultdict(dict, {16: {u'SEQ': 16, u'RFC822': u'Delivered-To: somebody@email.com LOTS MORE HTML until http://the_url_I_want_to_extract.com'}})

for k, v in d.items():
    # v is the inner dictionary that contains your HTML string
    str_with_html = v['RFC822']
    # this regular expression matches "http" and then keeps going
    # until a whitespace character is hit
    match = re.search(r"http[^\s]+", str_with_html)
    if match:
        print(match.group(0))
Output:
http://the_url_I_want_to_extract.com
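The TypeError in the question most likely comes from passing the dictionary itself to re.findall, which only accepts a string (or buffer); no conversion to another data structure is needed, only a lookup of the inner string first. If you need the link for every message rather than printing as you go, the same idea can be wrapped in a small function. This is only a sketch under a few assumptions: it targets Python 3, the name extract_first_urls and its field parameter are made up for illustration, and it assumes the body is stored as a str under the 'RFC822' key as in the example data.

```python
import re
from collections import defaultdict

# "http" or "https", then "://", then everything up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_first_urls(messages, field="RFC822"):
    """Map each message id to the first URL found in its `field` text.

    Hypothetical helper: messages whose field is missing or is not a
    str are skipped rather than raising TypeError.
    """
    found = {}
    for msg_id, parts in messages.items():
        text = parts.get(field)
        if not isinstance(text, str):
            continue  # re.search needs a string, not a dict or None
        match = URL_PATTERN.search(text)
        if match:
            found[msg_id] = match.group(0)
    return found


# Sample data shaped like the example in the question.
messages = defaultdict(dict)
messages[16] = {
    "SEQ": 16,
    "RFC822": "Delivered-To: somebody@email.com LOTS MORE HTML until "
              "http://the_url_I_want_to_extract.com",
}

print(extract_first_urls(messages))
# {16: 'http://the_url_I_want_to_extract.com'}
```

Because the pattern stops only at whitespace, a URL that sits inside an HTML attribute or is followed directly by punctuation may pick up trailing characters such as a quote or a closing tag, so the pattern may need tightening for real email bodies.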