Question

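I have a very large defaultdict whose values are dicts, and each inner dict holds the HTML from an email body. I only want to return an http string from the inner dict. What is the best way to extract it?

Do I need to convert the dict to another data structure before using a regex? Is there a better way? I'm still fairly new to this and would appreciate any pointers.

For example, this is what I'm working with: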

defaultdict(<type 'dict'>, {16: {u'SEQ': 16, u'RFC822': u'Delivered-To: somebody@email.com      LOTS MORE HTML until http://the_url_I_want_to_extract.com' }})

One thing I tried was calling re.findall on the defaultdict, but that did not work:

confirmation_link = re.findall('Click this link to confirm your registration:<br />"(.*?)"', body)

for conf in confirmation_link:
    print conf

Error:

line 177, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer

Answer 1

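The TypeError occurs because re.findall expects a string and was handed the dict itself. You don't need to convert the dict to anything else: once you have iterated over the dictionary down to the value that holds the HTML string, you can apply a regular expression to it directly: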

import re
from collections import defaultdict

# defaultdict(dict, ...) is the constructible form of the repr shown in the question
d = defaultdict(dict, {16: {u'SEQ': 16, u'RFC822': u'Delivered-To: somebody@email.com      LOTS MORE HTML until http://the_url_I_want_to_extract.com'}})

# items() and print(...) run on both Python 2 and Python 3
for k, v in d.items():
    # v is the inner dictionary that contains your html string
    str_with_html = v[u'RFC822']

    # this regular expression matches "http" and then continues
    # until a whitespace character is hit
    match = re.search(r"http\S+", str_with_html)
    if match:
        print(match.group(0))

Output:

http://the_url_I_want_to_extract.com
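
If some messages might lack an 'RFC822' entry, or a body might contain more than one link, a slightly more defensive variant could look like the sketch below. The helper name extract_urls, the URL_PATTERN constant, and the extra message with key 17 are made up for illustration and are not part of the question's data; the pattern is also an assumption, stopping at whitespace, quotes, and angle brackets so that a URL sitting inside an HTML attribute is not swallowed together with the markup after it.

```python
import re
from collections import defaultdict

# Assumed pattern: "http://" or "https://", then everything up to
# whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(messages, key=u'RFC822'):
    """Map each outer key to the list of URLs found in its inner dict's body."""
    found = {}
    for msg_id, inner in messages.items():
        body = inner.get(key)  # None when this message has no entry under `key`
        if body is None:
            continue
        found[msg_id] = URL_PATTERN.findall(body)
    return found


d = defaultdict(dict)
d[16] = {u'SEQ': 16,
         u'RFC822': u'Delivered-To: somebody@email.com      LOTS MORE HTML until http://the_url_I_want_to_extract.com'}
d[17] = {u'SEQ': 17}  # hypothetical message without a body

urls = extract_urls(d)
print(urls)
```

Using inner.get(key) instead of inner[key] skips messages without a body rather than raising KeyError, and findall returns every link in the body rather than only the first one.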
