Question

Thank you for submitting to ourdirectory.com
URL: http://myurlok.us
Please click the link below to confirm your submission.
http://www.ourdirectory.com/confirm.aspx?id=1247778154270076

Once we receive your comfirmation, your site will be included for process!
regards,

http://www.ourdirectory.com

Thank you!

I should make clear which URL I need to extract.

Answer 1

If it's an HTML email with hyperlinks, you can use the HTMLParser library as a shortcut.

from html.parser import HTMLParser


class parseLinks(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the hyperlinks we want
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print(value)  # the link target
                    print(self.get_starttag_text())  # the full <a ...> tag


someHtmlContainingLinks = ""
linkParser = parseLinks()
linkParser.feed(someHtmlContainingLinks)

Answer 2

@OP, if your email always has the same standard format:

with open("emailfile") as f:
    for line in f:
        if "confirm your submission" in line:
            # The confirmation URL sits on the line right after this sentence
            print(next(f, "").strip())

Answer 3

Not easy. One suggestion (taken from the RegexBuddy library):

\b(?:(?:https?|ftp|file)://|www\.|ftp\.)(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])

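This matches URLs (without mailto:, say so if you need that), even if they are enclosed in parentheses. It also matches URLs that lack http:// or ftp:// etc., as long as they start with www. or ftp.

A simpler version: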

\bhttps?://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]

It all depends on what your needs are and what your input looks like.
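For what it's worth, here is a minimal sketch of how the simpler pattern could be used from Python. Both patterns list only A-Z in their character classes, so I'm assuming they are meant to be applied case-insensitively. The `email_text` sample is just modelled on the email in the question, and the names `SIMPLE_URL` and `urls` are made up for illustration.

```python
import re

# The simpler pattern from this answer. Its character classes list only A-Z,
# so it is compiled with re.IGNORECASE (an assumption about the intended use).
SIMPLE_URL = re.compile(
    r"\bhttps?://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

# Sample text modelled on the email in the question.
email_text = """Thank you for submitting to ourdirectory.com
URL: http://myurlok.us
Please click the link below to confirm your submission.
http://www.ourdirectory.com/confirm.aspx?id=1247778154270076

regards,

http://www.ourdirectory.com
"""

# findall returns every non-overlapping match, in order of appearance.
urls = SIMPLE_URL.findall(email_text)
for url in urls:
    print(url)
```

Note that this finds every http(s) URL in the text, so you would still have to pick out the one you actually want.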

Answer 4

Regex:

"http://www.ourdirectory.com/confirm.aspx\?id=[0-9]+$"

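Or without a regex: parse the email line by line and test whether the string contains "http://www.ourdirectory.com/confirm.aspx?id="; if it does, that's your URL.

Of course, if your input is actually HTML source rather than the text you posted, all of this goes out the window.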
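A rough sketch of both approaches in Python, assuming the email is plain text already loaded into a string. In this version of the pattern the dots are escaped and `re.MULTILINE` is used so that `$` matches at the end of each line; the function names and the `email_text` sample are made up for illustration.

```python
import re

# Same idea as the regex above, with the dots escaped. re.MULTILINE makes
# $ match at the end of every line, not only at the end of the whole text.
CONFIRM_RE = re.compile(
    r"http://www\.ourdirectory\.com/confirm\.aspx\?id=[0-9]+$",
    re.MULTILINE,
)
CONFIRM_PREFIX = "http://www.ourdirectory.com/confirm.aspx?id="


def find_confirm_url_regex(text):
    """Return the first confirmation URL found by the regex, or None."""
    match = CONFIRM_RE.search(text)
    return match.group(0) if match else None


def find_confirm_url_plain(text):
    """Return the first confirmation URL found by a substring test, or None."""
    for line in text.splitlines():
        start = line.find(CONFIRM_PREFIX)
        if start != -1:
            # Keep everything from the prefix to the end of the line.
            return line[start:].strip()
    return None


# Sample text modelled on the email in the question.
email_text = """Please click the link below to confirm your submission.
http://www.ourdirectory.com/confirm.aspx?id=1247778154270076

regards,

http://www.ourdirectory.com
"""

print(find_confirm_url_regex(email_text))
print(find_confirm_url_plain(email_text))
```

The regex version will miss the link if the line has trailing whitespace or a stray carriage return after the digits; the substring version is more forgiving there.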

Answer 5

This solution only works if the source isn't HTML.

def extractURL(fileName):

    urlList = []

    # Open the file containing the email
    with open(fileName) as emailFile:
        for line in emailFile:
            # Split the line into whitespace-separated words
            wordsInLine = line.split()
            # For each word, try to split it at the first ":"
            for word in wordsInLine:
                scheme, separator, rest = word.partition(":")
                # Check to see if the word is a URL
                if separator and scheme in ("http", "https"):
                    urlList.append(word)

    return urlList

Answer 6

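Check this out.

I wrote a post about this. The code in that post extracts URLs from an email file, whether the content type is plain text or HTML, and whether the encoding is quoted-printable, base64, or 7bit.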

Python - How to extract URLs (plain/html, quote-printable/base64/7bit) from an email file
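The code from the linked post isn't reproduced here. As a rough sketch of the same general idea (not the post's actual code), assuming Python 3's standard `email` package: walk the MIME parts, let the library undo the transfer encoding, then run a URL regex over the decoded text. The function name, the deliberately loose `URL_RE` pattern, and the sample message are all made up for illustration.

```python
import email
import re
from email import policy

# Deliberately loose pattern (an assumption, not from the linked post):
# an http(s) URL runs until whitespace, a quote, or an angle bracket.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def extract_urls_from_email(raw_bytes):
    """Collect URLs from the text/plain and text/html parts of a raw email."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    urls = []
    for part in msg.walk():
        # Skip multipart containers and non-text parts such as attachments.
        if part.get_content_type() not in ("text/plain", "text/html"):
            continue
        # get_content() reverses quoted-printable/base64/7bit transfer
        # encoding and decodes the bytes using the part's declared charset.
        text = part.get_content()
        urls.extend(URL_RE.findall(text))
    return urls


# Hypothetical sample: a quoted-printable body, where "=" is sent as "=3D".
raw_email = (
    b"Subject: Please confirm\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"http://www.ourdirectory.com/confirm.aspx?id=3D1247778154270076\r\n"
)

print(extract_urls_from_email(raw_email))
```

The sample shows why decoding matters: a regex run over the raw file would pick up `id=3D1247...` instead of the real link. The loose pattern may also swallow trailing punctuation, so tighten it to suit your input.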

Extracting a URL from an email in Python
