How do I print organized ngrams from my emails?

Asked: 2012-04-12 07:08:28

Tags: python imaplib

At this point I need to do two things, and I need your help with them:

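  1. Best practices for cleaning the data: programmatically stripping the extra tags and '>>>>>>>>', along with the other meaningless flotsam and jetsam of correspondence.
  2. Once it is cleaned up, how to package it so it can be used with django & sqlite.
    • Do I turn it into a csv keyed by date, person, subject, and word, and then feed that into the data classes in my database?
  3. Well, before I get to the database, I'd like to be able to sort the results and display the data cleanly. I have rarely put data into a database; the closest I have come is working with XML, csv and JSON.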

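    I need the ngrams by rank, e.g. the number of times a given word appears across a series of emails. I'm trying to get closer to understanding how people talk to me about topics, etc.: a basic version of Jon Kleinberg's work analyzing his own emails.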

    Be gentle, be rough, but please help :)!

    > My output currently looks like this: :1, 'each':1, 'I':1, 'IN!\r\n\r\n\n2012/1/31':1, 'calculator.\r\n>>>>>> \r\n>>>>>>>':1, 'people':1, '=97MB\r\n>\r\n>':1, 'we':2, 'wrote:\r\n>>>>>> \r\n>>>>>>':1, '=\r\nwrote:\r\n>>>>> \r\n>>>>>>>':1, '2012/1/31':2, 'is':1, '31,':5, '=97MB\r\n>>>> \r\n>>>>':1, '1:45':1, 'is\r\n>>>>>':1, 'Sent':

      

    import getpass, imaplib, email
    
    # NGramCounter builds a dictionary mapping each token (a 1-gram) to the
    # number of times that token occurs in a text (as integers)
    class NGramCounter(object):
    
      # parameter text is the raw string to analyse
      def __init__(self, text):
        self.text = text
        self.ngrams = dict()
    
      # tokenize breaks the stored text into units by splitting on single spaces
      def tokenize(self):
        return self.text.split(" ")
    
      # parse tokenizes the text and visits every token in turn, adding it to
      # self.ngrams or incrementing its count there
      def parse(self):
    
        tokens = self.tokenize()
        #Moves through every individual word in the text, increments counter if already found
        #else sets count to 1
        for word in tokens:
            if word in self.ngrams:
                self.ngrams[word] += 1
            else:
                self.ngrams[word] = 1
    
      def get_ngrams(self):
        return self.ngrams
    
    #loading profile for login
    M = imaplib.IMAP4_SSL('imap.gmail.com')
    M.login("EMAIL", "PASS")
    M.select()
    new = open('liamartinez.txt', 'w')
    typ, data = M.search(None, 'FROM', 'SEARCHGOES_HERE') #Gets the numbers of all messages whose From header matches
    
    def get_first_text_part(msg): #where should this be nested? 
        maintype = msg.get_content_maintype()
        if maintype == 'multipart':
            for part in msg.get_payload():
                if part.get_content_maintype() == 'text':
                    return part.get_payload()
        elif maintype == 'text':
            return msg.get_payload()
    
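    for num in data[0].split(): #Loops through all matching message numbers
        typ, msg_data = M.fetch(num, '(RFC822)') #Pulls Message
        # msg_data[0] is an (envelope, raw message) tuple, so the raw text is at index 1
        msg = email.message_from_string(msg_data[0][1]) #Puts message into easy to use python objects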
        _from =  msg['from'] #pull from
        _to = msg['to'] #pull to
        _subject = msg['subject'] #pull subject
        _body = get_first_text_part(msg) #pull body
        if _body:
            ngrams = NGramCounter(_body)
            ngrams.parse()
            _feed = ngrams.get_ngrams()
            # print "\n".join("\t".join(str(_feed) for col in row) for row in tab)
            print _feed
        # print 'Content-Type:',msg.get_content_type()
        #     print _from
        #     print _to
        #     print _subject
        #     print _body
        #    
    
        new.write(_from + '\n') #one sender per line
    
        print '---------------------------------'
    
    new.close() #flush the sender list to disk
    M.close()
    M.logout()
    

2 Answers:

Answer 0 (score: 1)

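There is nothing wrong with your main loop. The process is somewhat slow because you have to retrieve every email from an external server. What I suggest is to download all the messages to the client once, save them into a database (sqlite, zodb, mongodb... whichever you prefer), and then run all the analysis you need on the db objects. The two processes (downloading and analysis) are best kept separate from each other; otherwise tuning them gets complicated and the code complexity grows.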
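A minimal sketch of that split, assuming Python 3 and the standard-library `sqlite3` module. The table name `messages`, its columns, and the helpers `open_store`, `save_messages` and `count_words` are hypothetical names chosen for illustration. The download step is stood in for by a list of already-fetched tuples, so the two phases stay independent.

```python
import sqlite3
from collections import Counter


def open_store(path=":memory:"):
    """Open (or create) the local message store; the schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " id INTEGER PRIMARY KEY,"
        " sender TEXT, recipient TEXT, subject TEXT, body TEXT)"
    )
    return conn


def save_messages(conn, messages):
    """Phase 1: persist (sender, recipient, subject, body) tuples once."""
    conn.executemany(
        "INSERT INTO messages (sender, recipient, subject, body)"
        " VALUES (?, ?, ?, ?)",
        messages,
    )
    conn.commit()


def count_words(conn, sender=None):
    """Phase 2: count words in stored bodies without touching the mail server."""
    if sender is None:
        rows = conn.execute("SELECT body FROM messages")
    else:
        rows = conn.execute(
            "SELECT body FROM messages WHERE sender = ?", (sender,)
        )
    counts = Counter()
    for (body,) in rows:
        counts.update(body.lower().split())
    return counts


if __name__ == "__main__":
    conn = open_store()
    save_messages(conn, [
        ("a@example.com", "me@example.com", "lunch", "pizza or pasta"),
        ("b@example.com", "me@example.com", "re: lunch", "pizza please"),
    ])
    # Most frequent words across every stored message
    print(count_words(conn).most_common(3))
```

Because the analysis only reads from the local table, it can be rerun with a different tokenizer or filter without contacting the IMAP server again.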

Answer 1 (score: 0)

Replace

    if _body:
        ngrams = NGramCounter(_body)
        ngrams.parse()
        _feed = ngrams.get_ngrams()
        # print "\n".join("\t".join(str(_feed) for col in row) for row in tab)
        print _feed

with

    if _body:
        ngrams = NGramCounter(" ".join(_body.strip(">").split()))
        ngrams.parse()
        _feed = ngrams.get_ngrams()
        print _feed
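
Note that `str.strip(">")` only removes `>` characters at the very start and end of the whole body, so quote markers at the start of interior lines survive. A minimal sketch that goes a step further, assuming Python 3: it drops the leading `>` run from every line, collapses whitespace, and ranks n-grams with `collections.Counter`. The names `clean_body` and `ngram_counts` and the token pattern are hypothetical choices for illustration, not part of the question's code.

```python
import re
from collections import Counter

# Leading quote markers such as ">>>>" or "> > >" at the start of a line
QUOTE_PREFIX = re.compile(r"^\s*(?:>\s*)+")
# A token is a run of lowercase letters, digits or apostrophes (an assumption)
TOKEN = re.compile(r"[a-z0-9']+")


def clean_body(body):
    """Strip quote markers from each line and collapse all whitespace."""
    lines = (QUOTE_PREFIX.sub("", line) for line in body.splitlines())
    return " ".join(" ".join(lines).split())


def ngram_counts(text, n=1):
    """Count n-grams (as tuples of n consecutive tokens) in the text."""
    tokens = TOKEN.findall(text.lower())
    # zip over n shifted copies of the token list yields every window of n tokens
    return Counter(zip(*(tokens[i:] for i in range(n))))


if __name__ == "__main__":
    raw = "We wrote:\r\n>>>>>> is this\r\n>>>>>>> is this ok\r\n"
    text = clean_body(raw)
    # Print the n-grams ranked by frequency, most common first
    for gram, count in ngram_counts(text, n=2).most_common(5):
        print(count, " ".join(gram))
```

To rank across a whole series of emails, the per-message counters can be summed with `Counter.update` before calling `most_common`.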