So I have a large csv file (5 million rows) with multiple columns. I am particularly interested in one column that contains text.
The input csv has the following format:
system_id,member_name,message,is_post
0157e407,member1011,"I have had problems with my lungs for years. It all started with an infection...",False
1915d457,member1055,"It looks like a lot of people take paracetamol for pain...",False
The column 'message' contains the text and is the one of interest.
Now the task is to concatenate all rows of this column into one single large text and then compute n-grams (n = 1,2,3,4,5) on it. The output should be 5 separate files, one per n, in the following format, for example:
bigram.csv
n-gram,count
"word1 word2",7
"word1 word3",11
trigram.csv
n-gram,count
"word1 word2 word3",22
"word1 word2 word4",24
Here is what I have tried so far:
from collections import OrderedDict
import csv
import re
import sys
import nltk
if __name__ == '__main__':
    if len(sys.argv) < 2:
        print "%d Arguments Given : Exiting..." % (len(sys.argv)-1)
        print "Usage: python %s <inp_file_path>" % sys.argv[0]
        exit(1)
    ifpath = sys.argv[1]
    with open(ifpath, 'r') as ifp:
        reader = csv.DictReader(ifp)
        all_msgs = []
        fieldnames = reader.fieldnames
        processed_rows = []
        for row in reader:
            msg = row['message']
            res = {'message': msg}
            txt = msg.decode('ascii', 'ignore')
            # some preprocessing
            txt = re.sub(r'[\.]{2,}', r". ", txt)
            txt = re.sub(r'([\.,;!?])([A-Z])', r'\1 \2', txt)
            sentences = nltk.tokenize.sent_tokenize(txt.strip())
            all_msgs.append(' '.join(sentences))

    text = ' '.join(all_msgs)
    tokens = nltk.word_tokenize(text)
    tokens = [token.lower() for token in tokens if len(token) > 1]

    bi_tokens = list(nltk.bigrams(tokens))
    tri_tokens = list(nltk.trigrams(tokens))

    bigrms = []
    for item in sorted(set(bi_tokens)):
        bb = OrderedDict()
        bb['bigrams'] = ' '.join(item)
        bb['count'] = bi_tokens.count(item)
        bigrms.append(bb)

    trigrms = []
    for item in sorted(set(tri_tokens)):
        tt = OrderedDict()
        tt['trigrams'] = ' '.join(item)
        tt['count'] = tri_tokens.count(item)
        trigrms.append(tt)

    with open('bigrams.csv', 'w') as ofp2:
        header = ['bigrams', 'count']
        dict_writer = csv.DictWriter(ofp2, header)
        dict_writer.writeheader()
        dict_writer.writerows(bigrms)

    with open('trigrams.csv', 'w') as ofp3:
        header = ['trigrams', 'count']
        dict_writer = csv.DictWriter(ofp3, header)
        dict_writer.writeheader()
        dict_writer.writerows(trigrms)

    tokens = nltk.word_tokenize(text)
    fourgrams = nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
    quadgrams = []
    for fourgram, freq in fourgrams.ngram_fd.items():
        dd = OrderedDict()
        dd['quadgram'] = " ".join(fourgram)
        dd['count'] = freq
        quadgrams.append(dd)

    with open('quadgram.csv', 'w') as ofp4:
        header = ['quadgram', 'count']
        dict_writer = csv.DictWriter(ofp4, header)
        dict_writer.writeheader()
        dict_writer.writerows(quadgrams)
This has now been running for 2 days on a 4-core machine. How can I make it more efficient (perhaps using pandas and/or multiprocessing) and speed it up as much as reasonably possible?
Answer 0 (score: 0)
I would make a few changes:
What is taking so long?
So the first thing I would do is create a clearer separation between the different steps, ideally in a way that lets you restart halfway through.
I would extract this into a separate function. From what I have read (e.g. here), pandas reads csv files a lot faster than the csv module. If reading the csv only takes 1 minute out of the 2 days this is probably not the problem, but I would do it like this:
def read_text(filename):  # you could add **kwargs to pass on to read_csv
    df = pd.read_csv(filename)  # add info on file encoding etc.
    message = df['message'].str.replace(r'[\.]{2,}', r". ")  # http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html
    message = message.str.replace(r'([\.,;!?])([A-Z])', r'\1 \2')
    message = message.str.strip()
    sentences = message.apply(nltk.tokenize.sent_tokenize)
    return ' '.join(sentences.apply(' '.join))
You could even do this in chunks and yield the sentences instead of returning them, turning the function into a generator, which might save memory.
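A minimal sketch of that chunked generator, assuming the same 'message' column and preprocessing as above; the chunk size and file names are illustrative placeholders, and regex=True is passed explicitly because recent pandas versions no longer treat the pattern as a regular expression by default:

import nltk
import pandas as pd

def read_text_chunked(filename, chunksize=100000):
    # yields one preprocessed string per chunk instead of building the whole text in memory
    for chunk in pd.read_csv(filename, chunksize=chunksize):
        message = chunk['message'].str.replace(r'[\.]{2,}', '. ', regex=True)
        message = message.str.replace(r'([\.,;!?])([A-Z])', r'\1 \2', regex=True)
        message = message.str.strip()
        sentences = message.apply(nltk.tokenize.sent_tokenize)
        yield ' '.join(sentences.apply(' '.join))

# usage: stream the chunks straight to disk rather than keeping them all around
with open('all_messages.txt', 'w') as fp:            # placeholder output path
    for part in read_text_chunked('input.csv'):      # placeholder input path
        fp.write(part + ' ')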
Is there a specific reason you join the sentences back together after sent_tokenize? I found this in the documentation:
The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank. This is the method that is invoked by word_tokenize(). It assumes that the text has already been segmented into sentences, e.g. using sent_tokenize().
So you would call it like this:
text = read_text(csv_file)
with open(text_file, 'w') as file:
    file.write(text)
print('finished reading text from file')  # or use logging
The same goes for the tokenization:
tokens = nltk.word_tokenize(text)
print('finished tokenizing the text')

def save_tokens(filename, tokens):
    # save the list somewhere, either json or pickle, so you can pick up later if something goes wrong
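One way to fill in that stub, assuming a plain JSON dump of the flat token list is acceptable (pickle would work just as well); load_tokens is a hypothetical companion for restarting from this step:

import json

def save_tokens(filename, tokens):
    # persist the token list so the n-gram steps can be rerun without re-tokenizing
    with open(filename, 'w') as fp:
        json.dump(tokens, fp)

def load_tokens(filename):
    # reload the saved tokens when picking up from this step
    with open(filename) as fp:
        return json.load(fp)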
Your code contains a lot of boilerplate that just does the same thing with a different function or file name, so I abstracted it into a list of tuples holding the name, the function that produces the n-grams, the function that counts them, and the file name to save to:
ngrams = [
    ('bigrams', nltk.bigrams, collections.Counter, 'bigrams.csv'),
    ('trigrams', nltk.trigrams, collections.Counter, 'trigrams.csv'),
    ('quadgrams', nltk.collocations.QuadgramCollocationFinder.from_words, parse_quadgrams, 'quadgrams.csv'),
]
If you want to count the number of items in a list, just use collections.Counter instead of building a (costly) collections.OrderedDict per item. If you want to do the counting yourself, plain tuples are better than OrderedDict. You could also use pd.Series.value_counts().
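As a toy illustration of the Counter approach (the token list here is made up), the count-per-distinct-bigram loop from the question collapses into a single pass over the token stream:

import collections
import nltk

tokens = ['my', 'lungs', 'have', 'been', 'a', 'problem', 'for', 'years']  # made-up example
bigram_counts = collections.Counter(nltk.bigrams(tokens))  # ('word1', 'word2') -> count
print(bigram_counts[('my', 'lungs')])  # 1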
def parse_quadgrams(quadgrams):
    return quadgrams.ngram_fd  # from what I see in the code this dict already contains the counts
for name, ngram_method, parse_method, output_file in ngrams:
    grams = ngram_method(tokens)
    print('finished generating ', name)
    # You could write this intermediate result to a temporary file in case something goes wrong
    count_df = pd.Series(parse_method(grams)).reset_index().rename(columns={'index': name, 0: 'count'})
    # if you need it sorted you can do this on the DataFrame
    print('finished counting ', name)
    count_df.to_csv(output_file)
    print('finished writing ', name, ' to file: ', output_file)
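If you do want the files ordered by frequency, and without the extra index column pandas writes by default, a small optional addition before the to_csv call (using the same count_df as above) would be:

count_df = count_df.sort_values('count', ascending=False)  # most frequent n-grams first
count_df.to_csv(output_file, index=False)                  # drop the numeric index column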