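At this point I need to do two things, and I need your help with both:
First, before I move on to a database, I want to be able to sort the results by rank and display the data cleanly. I have rarely put data into a database; the closest I have come is working with XML, CSV and JSON.
Second, I need to get ngrams by rank, for example how many times a given word occurs across a series of emails. I am trying to get closer to understanding how people talk to me about particular topics and so on: a basic version of Jon Kleinberg's work analyzing his own emails.
Be gentle, be brutal, but please help :)!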
My output currently looks like this:
: 1, 'each': 1, 'I': 1, 'IN!\r\n\r\n\n2012/1/31': 1, 'calculator.\r\n&n>>>>>>\r\n>>>>>>>': 1, 'people': 1, '= 97MB\r\n&n;\r\n&n>;': 1, 'we': 2, 'wrote:\r\n>>>>>>\r\n>>>>>>': 1, '=\r\nwrote:\r\n>>>>>\r\n>>>>>>>': 1, '2012/1/31': 2, 'is': 1, '31,': 5, '= 97MB\r\n>>>>\r\n>>>>': 1, '1:45': 1, 'is\r\n>>>>>': 1, 'Sent':
import getpass, imaplib, email

# NGramCounter builds a dictionary relating each token (a single word, i.e. a
# unigram, as a string) to the number of times it occurs in a text (as integers)
class NGramCounter(object):
    # parameter text is the string whose tokens will be counted
    def __init__(self, text):
        self.text = text
        self.ngrams = dict()

    # tokenize breaks the stored text up into units by splitting on single spaces
    def tokenize(self):
        return self.text.split(" ")

    # parse tokenizes the text and visits every token in turn,
    # adding it to self.ngrams or incrementing its count there
    def parse(self):
        tokens = self.tokenize()
        #Moves through every individual word in the text, increments counter if already found
        #else sets count to 1
        for word in tokens:
            if word in self.ngrams:
                self.ngrams[word] += 1
            else:
                self.ngrams[word] = 1

    def get_ngrams(self):
        return self.ngrams

#loading profile for login
M = imaplib.IMAP4_SSL('imap.gmail.com')
M.login("EMAIL", "PASS")
M.select()
new = open('liamartinez.txt', 'w')
typ, data = M.search(None, 'FROM', 'SEARCHGOES_HERE') #Gets all messages matching the FROM search

def get_first_text_part(msg): #where should this be nested?
    maintype = msg.get_content_maintype()
    if maintype == 'multipart':
        for part in msg.get_payload():
            if part.get_content_maintype() == 'text':
                return part.get_payload()
    elif maintype == 'text':
        return msg.get_payload()

for num in data[0].split(): #Loops through all messages
    typ, msg_data = M.fetch(num, '(RFC822)') #Pulls Message
    # fetch returns a list of (envelope, raw message) tuples, so the message text is at index 1
    msg = email.message_from_string(msg_data[0][1]) #Puts message into easy to use python objects
    _from = msg['from'] #pull from
    _to = msg['to'] #pull to
    _subject = msg['subject'] #pull subject
    _body = get_first_text_part(msg) #pull body
    if _body:
        ngrams = NGramCounter(_body)
        ngrams.parse()
        _feed = ngrams.get_ngrams()
        # print "\n".join("\t".join(str(_feed) for col in row) for row in tab)
        print _feed
    # print 'Content-Type:',msg.get_content_type()
    # print _from
    # print _to
    # print _subject
    # print _body
    #
    new.write(_from)
    print '---------------------------------'
new.close()
M.close()
M.logout()

Answer 0 (score: 1)
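There is nothing wrong with your main loop. The process is somewhat slow because every email has to be retrieved from a remote server. What I would suggest is downloading all the messages to the client once, saving them to a database (sqlite, zodb, mongodb ... whichever you prefer), and then running all the analysis you need against the database objects. The two processes (downloading and analyzing) are best kept separate from each other; otherwise tuning either one gets complicated and the complexity of the code grows.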
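As a rough illustration of keeping the two steps apart, here is a minimal sketch using sqlite from the standard library. It is written in Python 3 syntax, the table layout and the helper names `save_messages` and `count_words` are assumptions made up for this example, and the download step is stubbed with literal data rather than real IMAP calls.

```python
import sqlite3
from collections import Counter


def save_messages(conn, messages):
    """Store (sender, subject, body) tuples; this table layout is an assumption."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages (sender TEXT, subject TEXT, body TEXT)"
    )
    conn.executemany("INSERT INTO messages VALUES (?, ?, ?)", messages)
    conn.commit()


def count_words(conn):
    """Count whitespace-separated tokens over every stored message body."""
    counts = Counter()
    for (body,) in conn.execute("SELECT body FROM messages"):
        counts.update(body.split())
    return counts


# Download step, stubbed here with literal data instead of IMAP calls.
conn = sqlite3.connect(":memory:")  # pass a file path to keep data between runs
save_messages(conn, [
    ("a@example.com", "hi", "we are here"),
    ("b@example.com", "re: hi", "we were there"),
])

# Analysis step: reads from the database only, no network access needed.
word_counts = count_words(conn)
print(word_counts.most_common(1))
```

With this split, the slow IMAP download only has to run once, and the counting logic can be changed and rerun against the local database as often as needed.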
Answer 1 (score: 0)
Replace
    if _body:
        ngrams = NGramCounter(_body)
        ngrams.parse()
        _feed = ngrams.get_ngrams()
        # print "\n".join("\t".join(str(_feed) for col in row) for row in tab)
        print _feed
with
    if _body:
        ngrams = NGramCounter(" ".join(_body.strip(">").split()))
        ngrams.parse()
        _feed = ngrams.get_ngrams()
        print _feed
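For the ranking part of the question, one possible approach is `collections.Counter`, whose `most_common` method returns the counts sorted from most to least frequent. The sketch below is in Python 3 syntax and is not the code from the question: the helper name `rank_words`, the per-token stripping of `>` quote markers, and the lowercasing are all assumptions chosen for illustration.

```python
from collections import Counter


def rank_words(body, top=10):
    """Return the `top` most frequent tokens in one message body."""
    tokens = []
    for token in body.split():      # split() with no argument also drops \r\n
        token = token.strip(">")    # remove quote markers glued to a word
        if token:                   # skip tokens that were only ">" characters
            tokens.append(token.lower())  # assumption: treat "We" and "we" alike
    return Counter(tokens).most_common(top)


sample = "We wrote:\r\n>>> we are here\r\n>>>> we were there\r\n"
for word, count in rank_words(sample, top=3):
    print("%s\t%d" % (word, count))
```

Printing one tab-separated `word`/`count` pair per line gives a clean, ranked display that is also easy to paste into a CSV file or spreadsheet later.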