Accessing values in a defaultdict and extracting the URL portion

Asked: 2014-06-19 15:34:51

Tags: python regex dictionary defaultdict

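I have a very large defaultdict whose values are dicts, and each inner dict contains the HTML from an email body. I only want to return the http string from the inner dict. What is the best way to extract it?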

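Do I need to convert the dict to some other data structure before using a regular expression? Is there a better way? I am still fairly new to this and would appreciate any pointers.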

For example, this is what I am working with:

defaultdict(<type 'dict'>, {16: {u'SEQ': 16, u'RFC822': u'Delivered-To: somebody@email.com      LOTS MORE HTML until http://the_url_I_want_to_extract.com' }})

One thing I tried was calling re.findall on the defaultdict, but that did not work:

confirmation_link = re.findall('Click this link to confirm your registration:<br />"(.*?)"', body)

for conf in confirmation_link:
    print conf

Error:

line 177, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
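
The traceback can be reproduced in a few lines. This is a minimal sketch, assuming `body` in the attempt above was the dictionary itself rather than the text inside it. The `messages` data and the example.com URL are made-up stand-ins. On Python 3 the message wording differs slightly, but the exception is still a TypeError.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the mailbox data shown above.
messages = defaultdict(dict, {
    16: {u'SEQ': 16, u'RFC822': u'Click here: http://example.com/confirm'},
})

pattern = r'http\S+'

try:
    # A dict is not a string, so the re module rejects it.
    re.findall(pattern, messages)
    raised = False
except TypeError:
    raised = True

# Passing the inner string instead of the dict works.
found = re.findall(pattern, messages[16][u'RFC822'])
print(raised)
print(found)
```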

1 Answer:

Answer 0 (score: 1)

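Once you have iterated over the dictionary to reach the corresponding value, you can simply apply the regular expression to it. The TypeError occurs because re.findall expects a string, and it was presumably handed the dict instead: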

import re
from collections import defaultdict

# The inner values are plain dicts; the regex must run on the string inside them.
d = defaultdict(dict, {16: {u'SEQ': 16, u'RFC822': u'Delivered-To: somebody@email.com      LOTS MORE HTML until http://the_url_I_want_to_extract.com'}})

for k, v in d.items():
    # v is the dictionary that contains your html string:
    str_with_html = v['RFC822']

    # this regular expression starts by matching http, and then
    # continues until a whitespace character is hit.
    match = re.search(r"http[^\s]+", str_with_html)
    if match:
        print(match.group(0))

Output:

http://the_url_I_want_to_extract.com
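
If the defaultdict holds many messages, the same idea can be wrapped in a small helper that collects every URL per message. This is only a sketch under stated assumptions: the function name `extract_urls`, the `field` parameter, and the second sample entry (key 17) are hypothetical and not part of the original data, and the pattern `https?://\S+` is one simple choice rather than a complete URL grammar.

```python
import re
from collections import defaultdict

# Compile once; matches "http://" or "https://" followed by non-whitespace.
URL_RE = re.compile(r'https?://\S+')

def extract_urls(messages, field=u'RFC822'):
    """Map each message key to the list of URLs found in its `field` text."""
    urls = {}
    for key, inner in messages.items():
        # Entries without the field yield an empty list instead of a KeyError.
        text = inner.get(field, u'')
        urls[key] = URL_RE.findall(text)
    return urls

sample = defaultdict(dict, {
    16: {u'SEQ': 16,
         u'RFC822': u'Delivered-To: somebody@email.com ... '
                    u'http://the_url_I_want_to_extract.com'},
    17: {u'SEQ': 17},  # hypothetical entry with no message body
})

result = extract_urls(sample)
print(result[16])
print(result[17])
```

No conversion to another data structure is needed; the dict only has to be walked down to the string. In real HTML a URL is often followed directly by a quote or a closing tag, so a stricter character class such as `[^\s"'<>]+` may be a better fit than `\S+`.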