Concatenating all rows of a large CSV

Time: 2017-06-23 07:23:13

Tags: python csv pandas nltk python-multiprocessing

So I have a large CSV file (5 million rows) with multiple columns. The column I am particularly interested in is one that contains text.

The input CSV has the following format:

system_id,member_name,message,is_post

0157e407,member1011,"I have had problems with my lungs for years. It all started with an infection...",False

1915d457,member1055,"It looks like a lot of people take paracetamol for pain...",False

The column 'message' contains the text and is the one of interest.

The task now is to concatenate all rows of this column into one single large text and then compute n-grams on it (n = 1, 2, 3, 4, 5). The output should be 5 different files, one per n-gram, in the following format. For example:

bigram.csv

n-gram,count

"word1 word2",7

"word1 word3",11

trigram.csv

n-gram,count

"word1 word2 word3",22

"word1 word2 word4",24

Here is what I have tried so far:

from collections import OrderedDict
import csv
import re
import sys

import nltk


if __name__ == '__main__':
    if len(sys.argv) < 2:
        print "%d Arguments Given : Exiting..." % (len(sys.argv)-1)
        print "Usage: python %s <inp_file_path>" % sys.argv[0]
        exit(1)
    ifpath = sys.argv[1]
    with open(ifpath, 'r') as ifp:
        reader = csv.DictReader(ifp)
        all_msgs = []
        fieldnames = reader.fieldnames
        processed_rows = []
        for row in reader:
            msg = row['message']
            res = {'message': msg}
            txt = msg.decode('ascii', 'ignore')
            # some preprocessing
            txt = re.sub(r'[\.]{2,}', r". ", txt)
            txt = re.sub(r'([\.,;!?])([A-Z])', r'\1 \2', txt)
            sentences = nltk.tokenize.sent_tokenize(txt.strip())
            all_msgs.append(' '.join(sentences))
    text = ' '.join(all_msgs)

    tokens = nltk.word_tokenize(text)
    tokens = [token.lower() for token in tokens if len(token) > 1]
    bi_tokens = list(nltk.bigrams(tokens))
    tri_tokens = list(nltk.trigrams(tokens))
    bigrms = []
    for item in sorted(set(bi_tokens)):
        bb = OrderedDict()
        bb['bigrams'] = ' '.join(item)
        bb['count'] = bi_tokens.count(item)
        bigrms.append(bb)

    trigrms = []
    for item in sorted(set(tri_tokens)):
        tt = OrderedDict()
        tt['trigrams'] = ' '.join(item)
        tt['count'] = tri_tokens.count(item)
        trigrms.append(tt)

    with open('bigrams.csv', 'w') as ofp2:
        header = ['bigrams', 'count']
        dict_writer = csv.DictWriter(ofp2, header)
        dict_writer.writeheader()
        dict_writer.writerows(bigrms)

    with open('trigrams.csv', 'w') as ofp3:
        header = ['trigrams', 'count']
        dict_writer = csv.DictWriter(ofp3, header)
        dict_writer.writeheader()
        dict_writer.writerows(trigrms)

    tokens = nltk.word_tokenize(text)
    fourgrams = nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
    quadgrams = []
    for fourgram, freq in fourgrams.ngram_fd.items():
        dd = OrderedDict()
        dd['quadgram'] = " ".join(fourgram)
        dd['count'] = freq
        quadgrams.append(dd)
    with open('quadgram.csv', 'w') as ofp4:
        header = ['quadgram', 'count']
        dict_writer = csv.DictWriter(ofp4, header)
        dict_writer.writeheader()
        dict_writer.writerows(quadgrams)

This has now been running on a 4-core machine for 2 days. How can I make this more efficient (perhaps using pandas and/or multiprocessing) and speed it up as much as reasonably possible?

1 Answer:

Answer 0 (score: 0)

I would make a few changes:

Find out where the bottleneck is

What is it that takes so long?

  • Reading the CSV
  • Tokenizing
  • Making the n-grams
  • Counting the n-grams
  • Writing to disk

So the first thing I would do is create a cleaner separation between the different steps, ideally in a way that lets you restart the process halfway through.
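A minimal sketch of what that timing could look like (the timed wrapper and the stage wiring below are hypothetical, not part of the original code):

import logging
import time

logging.basicConfig(level=logging.INFO)

def timed(label, func, *args, **kwargs):
    # run a single pipeline stage and log how long it took
    start = time.time()
    result = func(*args, **kwargs)
    logging.info('%s took %.1f s', label, time.time() - start)
    return result

# hypothetical usage, wiring in your own stage functions:
# text = timed('reading csv', read_text, 'input.csv')
# tokens = timed('tokenizing', nltk.word_tokenize, text)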

Reading the text

I would extract this into a separate method. From what I have read (for example here), pandas reads csv files much faster than the csv module. If reading the csv takes only 1 minute of the 2 days this is probably not the problem, but I would still do something like this:

import nltk
import pandas as pd


def read_text(filename):  # you could add **kwargs to pass on to read_csv
    df = pd.read_csv(filename)  # add info on file encoding etc.
    message = df['message'].str.replace(r'[\.]{2,}', r". ")  # http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.replace.html
    message = message.str.replace(r'([\.,;!?])([A-Z])', r'\1 \2')

    message = message.str.strip()
    sentences = message.apply(nltk.tokenize.sent_tokenize)
    return ' '.join(sentences.apply(' '.join))

You could even do this in chunks and yield the sentences instead of returning them, turning it into a generator, which would probably save memory.
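A minimal sketch of that chunked variant, assuming the same 'message' column and preprocessing as above (the chunk size is an arbitrary choice):

import nltk
import pandas as pd


def iter_text(filename, chunksize=100000):
    # read the csv in chunks and yield the cleaned-up text of each chunk,
    # so the whole file never has to be held in memory at once
    for chunk in pd.read_csv(filename, chunksize=chunksize):
        message = chunk['message'].str.replace(r'[\.]{2,}', r". ")
        message = message.str.replace(r'([\.,;!?])([A-Z])', r'\1 \2')
        message = message.str.strip()
        sentences = message.apply(nltk.tokenize.sent_tokenize)
        yield ' '.join(sentences.apply(' '.join))

# text = ' '.join(iter_text(csv_file))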

Is there a specific reason why you join the sentences again after sent_tokenize? I found this in the documentation:


The Treebank tokenizer uses regular expressions to tokenize text as in the Penn Treebank. This is the method that word_tokenize() calls. It assumes that the text has already been segmented into sentences, e.g. using sent_tokenize().

So you would call it like this:

text = read_text(csv_file)
with open(text_file, 'w') as file:
    file.write(text)
print('finished reading text from file') # or use logging

Tokenizing

Keep this roughly the same:

tokens = nltk.word_tokenize(text)
print('finished tokenizing the text')

def save_tokens(filename, tokens):
    # save the list somewhere, either json or pickle, so you can pick up later if something goes wrong
    with open(filename, 'w') as fp:
        json.dump(tokens, fp)  # requires `import json`; pickle would work just as well
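If a later stage then fails, you can reload the saved tokens instead of re-tokenizing everything; a small sketch assuming the json variant above (load_tokens is a hypothetical helper name):

import json

def load_tokens(filename):
    # reload previously saved tokens so the pipeline can resume part-way through
    with open(filename) as fp:
        return json.load(fp)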

Making the n-grams, counting them and writing them to disk

Your code contains a lot of boilerplate that just does the same thing with a different function or a different filename, so I abstracted that away into a list of tuples holding the name, the function that builds the n-grams, the function that counts them, and the filename to save them to:

ngrams = [
    ('bigrams', nltk.bigrams, collections.Counter, 'bigrams.csv'),
    ('trigrams', nltk.trigrams, collections.Counter, 'trigrams.csv'),
    ('quadgrams', nltk.collocations.QuadgramCollocationFinder.from_words, parse_quadgrams, 'quadgrams.csv'),
]

If you only want to count the items in a list, just use collections.Counter instead of building an (expensive) collections.OrderedDict per item. If you do want to count them yourself, plain tuples are cheaper than OrderedDicts. You could also use pd.Series.value_counts().
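For illustration, a toy sketch of the difference (the token list is made up): the original bi_tokens.count(item) rescans the whole list once per distinct bigram, while collections.Counter counts everything in a single pass:

import collections

import nltk

tokens = ['word1', 'word2', 'word1', 'word2', 'word3']  # toy example
bigram_counts = collections.Counter(nltk.bigrams(tokens))
print(bigram_counts.most_common(1))  # [(('word1', 'word2'), 2)]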

def parse_quadgrams(quadgrams):
    return quadgrams.ngram_fd  # from what I see in the code, this dict already contains the counts

for name, ngram_method, parse_method, output_file in ngrams:
    grams = ngram_method(tokens)
    print('finished generating ', name)
    # You could write this intermediate result to a temporary file in case something goes wrong
    count_df = pd.Series(parse_method(grams)).reset_index().rename(columns={'index': name, 0: 'count'})
    # if you need it sorted you can do this on the DataFrame
    print('finished counting ', name)
    count_df.to_csv(output_file)
    print('finished writing ', name, ' to file: ', output_file)
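If you do want each file sorted by frequency before writing (the optional step mentioned in the comment above), a one-line addition inside the loop, just before the to_csv call, would do it:

    count_df = count_df.sort_values('count', ascending=False)

Passing index=False to to_csv also keeps the automatically generated row index out of the output files.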