Downloading the text from a URL in Python

Date: 2012-06-07 16:44:07

Tags: python web-scraping

I'm working on a school project whose goal is to analyze scam emails with the Natural Language Toolkit package. Basically, I want to compare scams from different years and try to find a trend - how their structure changes over time. I found a scam database: http://www.419scam.org/emails/ I'd like to download the contents of the links with Python, but I'm stuck. My code so far:

from BeautifulSoup import BeautifulSoup
import urllib2, re

html = urllib2.urlopen('http://www.419scam.org/emails/').read()
soup = BeautifulSoup(html)
links = soup.findAll('a')

links2 = soup.findAll(href=re.compile("index"))

print links2

So I can get the links, but I don't know how to download the content. Any ideas? Thanks a lot!

1 answer:

Answer 0: (score: 6)

You've got a good start, but right now you're only retrieving the index page and loading it into the BeautifulSoup parser. Now that you have the hrefs from the links, you essentially need to open all of those links and load their contents into a data structure you can then use for your analysis.
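One wrinkle worth flagging when opening those links: the hrefs on the index page are relative, so each one has to be resolved against the index URL before it can be fetched. A minimal sketch using the standard library's urljoin (Python 3 spelling shown; in Python 2 the same function lives in the urlparse module, and the example paths are invented for illustration):

```python
from urllib.parse import urljoin

base = 'http://www.419scam.org/emails/'
# Hypothetical relative hrefs, as they might appear on the index page
hrefs = ['2012-01/index.htm', '2011-12/index.htm']

# Resolve each relative href against the index URL before calling urlopen
urls = [urljoin(base, h) for h in hrefs]
print(urls[0])  # http://www.419scam.org/emails/2012-01/index.htm
```

Passing the raw relative href straight to urlopen would fail, which is an easy mistake to make with this kind of scraper.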

This essentially amounts to a very simple web crawler. If you can use someone else's code, you may find something suitable by googling "python web crawler". I've looked at a few of them, and they're simple enough, but probably overkill for this task. Most web crawlers use recursion to traverse the full tree of a given site. Something simpler should suffice for your needs.
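To make the recursion point concrete, here is a toy sketch of that traversal over an in-memory link graph - a plain dict standing in for live pages, so no HTTP is involved, and the page names are invented:

```python
def crawl(graph, start, visited=None):
    """Depth-first traversal of a link graph (page -> list of linked pages)."""
    if visited is None:
        visited = set()
    visited.add(start)
    for link in graph.get(start, []):
        if link not in visited:  # skip pages we've already seen
            crawl(graph, link, visited)
    return visited

# Toy site: the index links to yearly list pages, which link to email pages
site = {
    'index': ['2011', '2012'],
    '2011': ['email-a', 'email-b'],
    '2012': ['email-c'],
}
print(sorted(crawl(site, 'index')))
# ['2011', '2012', 'email-a', 'email-b', 'email-c', 'index']
```

A real crawler would fetch and parse each page instead of looking it up in a dict, but the visited-set bookkeeping is the same, and it is what keeps the recursion from looping on sites that link back to themselves.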

Given my unfamiliarity with BeautifulSoup, this basic structure will hopefully put you on the right path, or at least give you a sense of how web scraping is done:

from BeautifulSoup import BeautifulSoup
import urllib2, urlparse, re

emailContents = []

def analyze_emails():
    # this function and any sub-routines would analyze the emails after they
    # are loaded into a data structure, e.g. emailContents
    pass

def parse_email_page(link):
    print "opening " + link
    # open, soup, and parse the page.
    # The email itself looks like it is in a "blockquote" tag, so that may be
    # the starting place. From there you'll need to create arrays and/or
    # dictionaries of the emails' contents to do your analysis on,
    # e.g. emailContents

def parse_list_page(link):
    print "opening " + link
    html = urllib2.urlopen(link).read()
    soup = BeautifulSoup(html)
    email_page_links = []  # add your own code here to filter the list page
                           # soup down to the relevant links to email pages
    for sublink in email_page_links:
        # hrefs are relative, so resolve them against the list page's URL
        parse_email_page(urlparse.urljoin(link, sublink['href']))


def main():
    base = 'http://www.419scam.org/emails/'
    html = urllib2.urlopen(base).read()
    soup = BeautifulSoup(html)
    # I use '20' to filter links since all the relevant links seem to have a
    # 20XX year in them. Seemed to work.
    links = soup.findAll(href=re.compile("20"))

    for link in links:
        # again, resolve the relative href against the base URL
        parse_list_page(urlparse.urljoin(base, link['href']))

    analyze_emails()

if __name__ == "__main__":
    main()
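As a follow-up on the blockquote hint, the body of parse_email_page would pull the quoted text out of the fetched HTML. The sketch below uses only the standard library's html.parser (Python 3) instead of BeautifulSoup, just so the idea runs without dependencies; the sample HTML is made up:

```python
from html.parser import HTMLParser

class BlockquoteExtractor(HTMLParser):
    """Collects the text found inside <blockquote> tags."""
    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting level of currently open <blockquote> tags
        self.chunks = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag == 'blockquote':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'blockquote' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:  # only keep text that sits inside a blockquote
            self.chunks.append(data)

# Invented sample page standing in for a fetched email page
sample = ('<html><body><p>header</p>'
          '<blockquote>Dear friend, I am a prince.</blockquote>'
          '</body></html>')
parser = BlockquoteExtractor()
parser.feed(sample)
email_text = ''.join(parser.chunks).strip()
print(email_text)  # Dear friend, I am a prince.
```

With BeautifulSoup the equivalent would be finding the blockquote tag and taking its text; either way, the extracted string is what you would append to emailContents for the NLTK analysis.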