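  1. Best practice for cleaning the data: programmatically removing superfluous tags and '>>>>>>>>', and other meaningless communication flotsam and jetsam.
  2. Once it is cleaned, how do I package it up for use in django & sqlite?
    • Do I turn it into a csv keyed by date, person, subject, and words, and then feed that into the data classes in my database?
  3. Before I get into the database, I would like to be able to sort and cleanly display the data. I have rarely put data into a database; the closest I have come is working with XML, csv and JSON.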

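    I need to get ngrams by rank, e.g. how many times a given word shows up across a series of emails. I am trying to get closer to understanding how people talk to me about topics, à la Jon Kleinberg's work analyzing his own emails.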



    > My output currently looks like this: : 1, 'each': 1, 'I': 1, 'IN!\r\n\r\n\n2012/1/31': 1, 'calculator.\r\n>>>>>>> \r\n>>>>>>>': 1, 'people': 1, '=97MB\r\n> \r\n>>': 1, 'we': 2, 'wrote:\r\n>>>>>> \r\n>>>>>>': 1, '=\r\nwrote:\r\n>>>>> \r\n>>>>>>>': 1, '2012/1/31': 2, 'is': 1, '31,': 5, '=97MB\r\n>>>> \r\n>>>>': 1, '1:45': 1, 'is\r\n>>>>>': 1, 'Sent':


    import getpass, imaplib, email
    # NGramCounter builds a dictionary relating ngrams (as tuples) to the number
    # of times that ngram occurs in a text (as integers)
    class NGramCounter(object):
      # the counter is constructed with the raw text to analyze
      def __init__(self, text):
        self.text = text
        self.ngrams = dict()
      # tokenize breaks the stored text up into units (space-separated words)
      def tokenize(self):
        return self.text.split(" ")
      # parse tokenizes the text and visits every token in turn,
      # adding it to self.ngrams or incrementing its count there
      def parse(self):
        tokens = self.tokenize()
        #Moves through every individual word in the text, increments counter if already found
        #else sets count to 1
        for word in tokens:
            if word in self.ngrams:
                self.ngrams[word] += 1
            else:
                self.ngrams[word] = 1
      def get_ngrams(self):
        return self.ngrams
    #loading profile for login
    M = imaplib.IMAP4_SSL('imap.gmail.com')
    M.login("EMAIL", "PASS")
    new = open('liamartinez.txt', 'w')
    typ, data = M.search(None, 'FROM', 'SEARCHGOES_HERE') #Gets ALL messages
    def get_first_text_part(msg): #where should this be nested? 
        maintype = msg.get_content_maintype()
        if maintype == 'multipart':
            for part in msg.get_payload():
                if part.get_content_maintype() == 'text':
                    return part.get_payload()
        elif maintype == 'text':
            return msg.get_payload()
    for num in data[0].split(): #Loops through all messages
        typ, data = M.fetch(num, '(RFC822)') #Pulls Message
        msg = email.message_from_string(data[0][1]) #Puts message into easy to use python objects (data[0] is an (envelope, raw message) pair)
        _from =  msg['from'] #pull from
        _to = msg['to'] #pull to
        _subject = msg['subject'] #pull subject
        _body = get_first_text_part(msg) #pull body
        if _body:
            ngrams = NGramCounter(_body)
            ngrams.parse() #fill the counts before reading them
            _feed = ngrams.get_ngrams()
            # print "\n".join("\t".join(str(_feed) for col in row) for row in tab)
            print _feed
        # print 'Content-Type:',msg.get_content_type()
        #     print _from
        #     print _to
        #     print _subject
        #     print _body
        print '---------------------------------'

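There is nothing wrong with your main loop. The process is a bit slow because you have to retrieve every email from a remote server. What I would suggest is to download all the messages to the client once, save them to a database (sqlite, zodb, mongodb ... whichever you prefer), and then run all the analysis you need on the database objects. These two processes (downloading and analysis) are better kept separate from each other; otherwise tuning either one gets complicated and the complexity of the code grows.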
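A minimal sketch of the "download once, analyse later" split, using the standard-library `sqlite3` module. The table name `messages`, its columns, and the helpers `open_store`, `save_message` and `iter_bodies` are hypothetical names chosen for illustration; the download loop would call `save_message` once per fetched email, and the analysis code would only ever read from the database. It is written for Python 3.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the local message store. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " uid TEXT PRIMARY KEY, sender TEXT, recipient TEXT,"
        " subject TEXT, body TEXT)"
    )
    return conn


def save_message(conn, uid, sender, recipient, subject, body):
    """Store one message; re-running the download does not duplicate rows."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO messages VALUES (?, ?, ?, ?, ?)",
            (uid, sender, recipient, subject, body),
        )


def iter_bodies(conn):
    """Yield stored message bodies for the analysis step."""
    for (body,) in conn.execute("SELECT body FROM messages ORDER BY uid"):
        yield body
```

With this in place the IMAP loop only has to run when new mail needs to be pulled, and the n-gram code can be rerun against the local file as often as needed.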

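As for the cleaning, I would change this:

    if _body:
        ngrams = NGramCounter(_body)
        ngrams.parse()
        _feed = ngrams.get_ngrams()
        # print "\n".join("\t".join(str(_feed) for col in row) for row in tab)
        print _feed

to this, which collapses all runs of whitespace (including the `\r\n` sequences) into single spaces before counting:

    if _body:
        ngrams = NGramCounter(" ".join(_body.strip(">").split()))
        ngrams.parse()
        _feed = ngrams.get_ngrams()
        print _feed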
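Note that `str.strip(">")` only removes `>` characters at the very start and end of the body, so quote markers inside the text survive. If that turns out not to be enough, one possible approach is sketched below: drop quoted reply lines and "... wrote:" attribution lines altogether, then rank n-grams with `collections.Counter`. The helper names `clean_body` and `ngram_counts`, and the decision to discard quoted text rather than keep it, are assumptions made for illustration. It is written for Python 3.

```python
import re
from collections import Counter

# Words are runs of word characters, optionally joined by apostrophes.
WORD = re.compile(r"\w+(?:'\w+)*")


def clean_body(body):
    """Drop quoted replies and attribution lines, then normalise whitespace."""
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        if stripped.startswith(">"):       # quoted text from an earlier email
            continue
        if stripped.endswith("wrote:"):    # e.g. "2012/1/31 Someone wrote:"
            continue
        kept.append(stripped)
    return " ".join(" ".join(kept).split())


def ngram_counts(text, n=1):
    """Count n-grams (as tuples of lower-cased words) in the text."""
    tokens = [t.lower() for t in WORD.findall(text)]
    return Counter(zip(*(tokens[i:] for i in range(n))))


if __name__ == "__main__":
    raw = ("Sounds good, count me in!\r\n\r\n"
           "2012/1/31 Someone <s@example.com> wrote:\r\n"
           "> are you in?\r\n>> earlier text\r\n")
    counts = ngram_counts(clean_body(raw), n=1)
    for gram, count in counts.most_common(3):
        print(" ".join(gram), count)
```

`Counter.most_common(k)` gives the ranking directly, and summing the counters of several emails (`total += ngram_counts(body)`) gives counts across a whole series.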