Question

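I need to do two things at this point, and I need your help with both:

Best practice for cleaning the data - programmatically removing extra tags and '>>>>>>>>', plus the other meaningless flotsam and jetsam of correspondence.
Once it is cleaned - how do I package it so that it works with Django and SQLite?
- Do I turn it into a CSV keyed by date, person, subject, and words, and then feed that into the data classes in my database?

Well, before I get to the database, I want to be able to sort and display the data cleanly - I have rarely put data into a database; the closest I have come is using XML, CSV and JSON.

I need to get ngrams by rank, for example the number of times a word appears across a series of emails. I am trying to get closer to understanding how people talk to me about topics, and so on, along the lines of Jon Kleinberg's work analyzing his own emails.

Be gentle, be rough, but please help :)!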

My output currently looks like this:

...: 1, 'every': 1, 'I': 1, 'IN!\r\n\r\n\n2012/1/31': 1, 'calculator.\r\n>>>>>>\r\n>>>>>>>': 1, 'people': 1, '=97MB\r\n>\r\n>': 1, 'we': 2, 'wrote:\r\n>>>>>>\r\n>>>>>>': 1, '=\r\nwrote:\r\n>>>>>\r\n>>>>>>>': 1, '2012/1/31': 2, 'is': 1, '31,': 5, '=97MB\r\n>>>>\r\n>>>>': 1, '1:45': 1, 'is\r\n>>>>>': 1, 'Sent':

import getpass, imaplib, email

# NGramCounter builds a dictionary relating ngrams (as tuples) to the number
# of times that ngram occurs in a text (as integers)
class NGramCounter(object):

  # parameter text is the string whose words will be counted
  def __init__(self, text):
    self.text = text
    self.ngrams = dict()

  # tokenize breaks the stored text up into units, splitting on single spaces only
  def tokenize(self):
    return self.text.split(" ")

  # parse method tokenizes the stored text and visits every token in turn,
  # adding it to self.ngrams or incrementing its count there
  def parse(self):

    tokens = self.tokenize()
    #Moves through every individual word in the text, increments counter if already found
    #else sets count to 1
    for word in tokens:
        if word in self.ngrams:
            self.ngrams[word] += 1
        else:
            self.ngrams[word] = 1

  def get_ngrams(self):
    return self.ngrams

#loading profile for login
M = imaplib.IMAP4_SSL('imap.gmail.com')
M.login("EMAIL", "PASS")
M.select()
new = open('liamartinez.txt', 'w')
typ, data = M.search(None, 'FROM', 'SEARCHGOES_HERE') #Gets all messages from the given sender

def get_first_text_part(msg): #where should this be nested? 
    maintype = msg.get_content_maintype()
    if maintype == 'multipart':
        for part in msg.get_payload():
            if part.get_content_maintype() == 'text':
                return part.get_payload()
    elif maintype == 'text':
        return msg.get_payload()

for num in data[0].split(): #Loops through all messages
    typ, data = M.fetch(num, '(RFC822)') #Pulls Message
    msg = email.message_from_string(data[0][1]) #Puts message into easy to use python objects; data[0] is a (header, raw message) pair
    _from =  msg['from'] #pull from
    _to = msg['to'] #pull to
    _subject = msg['subject'] #pull subject
    _body = get_first_text_part(msg) #pull body
    if _body:
        ngrams = NGramCounter(_body)
        ngrams.parse()
        _feed = ngrams.get_ngrams()
        # print "\n".join("\t".join(str(_feed) for col in row) for row in tab)
        print _feed
    # print 'Content-Type:',msg.get_content_type()
    #     print _from
    #     print _to
    #     print _subject
    #     print _body
    #    

    new.write(_from)

    print '---------------------------------'

M.close()
M.logout()

Answer 1

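There is nothing wrong with your main loop. The process is a bit slow because you have to retrieve every email from an external server. What I suggest is to download all the messages to the client once, save them in a database (sqlite, zodb, mongodb ... whichever you prefer), and then run all the analysis you need on the database objects. The two processes (download and analysis) are best kept separate from each other; otherwise tuning them becomes complicated and the code complexity grows.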
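A minimal sketch of that separation, assuming Python 3 and the standard-library `sqlite3` module. The `messages` table, its columns, and the helpers `store_messages` and `word_counts` are illustrative names, not anything from the question, and a small list of sample tuples stands in for the IMAP download.

```python
import sqlite3
from collections import Counter


def store_messages(conn, messages):
    """Save (sender, recipient, subject, body) tuples; hypothetical schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY, sender TEXT, recipient TEXT, "
        "subject TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO messages (sender, recipient, subject, body) "
        "VALUES (?, ?, ?, ?)",
        messages,
    )
    conn.commit()


def word_counts(conn, sender):
    """Count whitespace-separated words in all stored bodies from one sender."""
    counts = Counter()
    rows = conn.execute("SELECT body FROM messages WHERE sender = ?", (sender,))
    for (body,) in rows:
        counts.update(body.lower().split())
    return counts


# Step 1 (run once): download, then store. Sample data stands in for IMAP here.
conn = sqlite3.connect(":memory:")  # pass a file path instead to keep the data
store_messages(conn, [
    ("a@example.com", "me@example.com", "lunch", "we should get lunch"),
    ("a@example.com", "me@example.com", "re: lunch", "lunch works for me"),
    ("b@example.com", "me@example.com", "hello", "hello there"),
])

# Step 2 (run as often as needed): analyze without touching the mail server.
counts = word_counts(conn, "a@example.com")
print(counts.most_common(1))  # [('lunch', 2)]
```

Because the analysis only reads from the database, the counting logic can be changed and rerun without fetching anything from the mail server again.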

Answer 2

Replace

if _body:
    ngrams = NGramCounter(_body)
    ngrams.parse()
    _feed = ngrams.get_ngrams()
    # print "\n".join("\t".join(str(_feed) for col in row) for row in tab)
    print _feed

with

if _body:
    ngrams = NGramCounter(" ".join(_body.strip(">").split()))
    ngrams.parse()
    _feed = ngrams.get_ngrams()
    print _feed
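
Note that `_body.strip(">")` only removes `>` characters at the very start and end of the whole body, so quote markers at the start of inner lines can survive as separate tokens. A rough sketch of a per-line cleanup followed by ranking, assuming Python 3; `clean_body` and `ranked_ngrams` are hypothetical helpers, and the regular expression is only one guess at what counts as a quote marker.

```python
import re
from collections import Counter

# Matches reply-quote markers such as ">", ">>>" or "> > >" at the start of a line.
QUOTE_PREFIX = re.compile(r"^[>\s]+", re.MULTILINE)


def clean_body(body):
    """Drop leading quote markers from every line and collapse whitespace."""
    without_quotes = QUOTE_PREFIX.sub("", body)
    return " ".join(without_quotes.split())


def ranked_ngrams(text, n=1):
    """Return (ngram, count) pairs, most frequent first; ngrams are tuples."""
    tokens = text.lower().split()
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams).most_common()


sample = "We wrote:\r\n>>>>> we are people\r\n>>>>>> \r\n> we are"
cleaned = clean_body(sample)
print(cleaned)                        # We wrote: we are people we are
print(ranked_ngrams(cleaned, 1)[:2])  # [(('we',), 3), (('are',), 2)]
print(ranked_ngrams(cleaned, 2)[0])   # (('we', 'are'), 2)
```

The ranked list comes back as plain tuples, so it can be printed one pair per line or written out as CSV rows before anything goes into a database.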

How can I print organized ngrams from my emails?
