A more efficient way to import and apply a function to text data in a Pandas DataFrame

Time: 2015-11-16 17:42:36

Tags: python pandas

Basically, my code reads in a text file, tokenizes it into sentences, and then finds the sentences in which two predefined anchor words occur. For each such sentence it measures the distance between the two words and appends that distance to a DataFrame whose column headers are the two predefined words and whose rows are the sentences of the file. If the two words do not both appear in a sentence, the value is null.

For example, if the text is 'The brown fox jumped. Over the lazy dog. Happy word.', the DataFrame looks like this:

  | brown+jumped | jumped+dog | the+dog | over+dog
1 | 1            | null       | null    | null
2 | null         | null       | 2       | 3
3 | null         | null       | null    | null

The code runs fine when parsing a short passage, but it takes much longer when processing larger text files. I know that the key to speed with DataFrames is to avoid for loops and to apply functions to the whole dataset at once.

My question is: when reading in the text file, is there a faster way to apply the function to the strings than going line by line and appending to the DataFrame?

In case it helps, this is what the code looks like:

import re
import pandas as pd

# sent_detector, anchor_words, sent_to_vecs, mean_treat, pdf and null_val
# are defined elsewhere in the script.
for filename in file_list:
    doc_df = pd.DataFrame()
    doc = open(doc_folder + filename, "r")
    doc_text = doc.read().replace('-\n\n ', '').replace('\n', '').replace('\x0c', '.')
    doc.close()
    sents = sent_detector.tokenize(doc_text)
    sent_count = 0
    for sent in sents:
        sent_l = sent.lower()
        sent_ws = set(re.findall(r'[A-Z]?[a-z]+|(?<= )[A-Z]+', sent_l))
        sent_anchs = anchor_words.intersection(sent_ws)  # anchor_words is a predefined set of words that I'm looking for
        if sent_anchs != set():
            sent_vecs = sent_to_vecs(sent_l, list(sent_anchs))  # sent_to_vecs vectorizes the words in the sentence, given a list of anchor words
            for sent_vec in sent_vecs:
                # Save the word that it was measured from
                base_word = sent_vec[0]
                df_dict = {}
                for each_tup in mean_treat(sent_vec)[1]:
                    if each_tup[0] in ['ROOT', 'a', 'an', 'the']:
                        continue
                    df_dict[base_word + '+' + each_tup[0]] = 1 / (each_tup[1])  # store 1/distance between the two words on this sentence's row
                sent_count += 1
                doc_df = doc_df.append(pd.DataFrame(df_dict, index=[sent_count]))
    doc_df = doc_df.append(pdf)  # pdf holds the column headers
    doc_df = doc_df.fillna(null_val)
    print('Saving {} to a csv.'.format(filename))
    doc_df.to_csv(doc_folder + filename[0:-4] + '.csv')
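
For what it's worth, would something along the lines of the sketch below be the right direction: collect the per-sentence dicts in a list and build the DataFrame once per file, instead of calling DataFrame.append for every sentence? (Rough, untested sketch that reuses the same helper functions as above; it leaves out the pdf header row.)

# Rough sketch: same logic, but rows are gathered in a list and the
# DataFrame is built once per file rather than appended to per sentence.
import re
import pandas as pd

for filename in file_list:
    with open(doc_folder + filename, "r") as doc:
        doc_text = doc.read().replace('-\n\n ', '').replace('\n', '').replace('\x0c', '.')
    rows = []
    for sent in sent_detector.tokenize(doc_text):
        sent_l = sent.lower()
        sent_ws = set(re.findall(r'[A-Z]?[a-z]+|(?<= )[A-Z]+', sent_l))
        sent_anchs = anchor_words.intersection(sent_ws)
        if not sent_anchs:
            continue
        for sent_vec in sent_to_vecs(sent_l, list(sent_anchs)):
            base_word = sent_vec[0]
            df_dict = {}
            for each_tup in mean_treat(sent_vec)[1]:
                if each_tup[0] in ['ROOT', 'a', 'an', 'the']:
                    continue
                df_dict[base_word + '+' + each_tup[0]] = 1 / each_tup[1]
            rows.append(df_dict)
    # Single DataFrame construction and fillna per file
    doc_df = pd.DataFrame(rows).fillna(null_val)
    doc_df.to_csv(doc_folder + filename[0:-4] + '.csv')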

1 Answer:

Answer 0 (score: 0)

I would suggest refactoring the code to reduce the number of nested loops.

Below is an example that uses TextBlob to identify words and sentences, and itertools to build the various word combinations. The results are appended to a pandas DataFrame.

import itertools
from textblob import TextBlob
import pandas as pd

data = TextBlob('The brown fox jumped. Over the lazy dog. Happy word.')

# Anchor words and each pair of them (in list order) that we want to measure.
anchors = ['brown', 'jumped', 'over', 'the', 'dog']
anchor_pairs = [x for x in itertools.combinations(anchors, 2)]

df = pd.DataFrame()
for sentence in data.sentences:
    word_list = sentence.words.lower()
    row = {}
    # Check every pair of words in the sentence against the anchor pairs
    # and record the distance between their positions in the sentence.
    for pair in itertools.combinations(word_list, 2):
        if pair in anchor_pairs:
            first, second = pair
            label = '%s+%s' % (first, second)
            row.update({label: word_list.index(second) - word_list.index(first)})
    df = df.append(pd.Series(row), ignore_index=True)

The result is:

    brown+jumped    over+dog    over+the    the+dog
0   2               NaN         NaN         NaN
1   NaN             3           1           2
2   NaN             NaN         NaN         NaN
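
If you need one CSV per input file as in your original loop, the same per-sentence logic can be wrapped in a small function and run once per document. A rough, untested sketch follows; anchor_distances is just an illustrative name, and file_list and doc_folder are the variables from your question.

import itertools
import pandas as pd
from textblob import TextBlob

def anchor_distances(text, anchor_pairs):
    """Return a DataFrame with one row per sentence and one column per anchor pair found."""
    rows = []
    for sentence in TextBlob(text).sentences:
        word_list = sentence.words.lower()
        row = {}
        for pair in itertools.combinations(word_list, 2):
            if pair in anchor_pairs:
                first, second = pair
                row['%s+%s' % (first, second)] = word_list.index(second) - word_list.index(first)
        rows.append(row)
    return pd.DataFrame(rows)

anchors = ['brown', 'jumped', 'over', 'the', 'dog']
anchor_pairs = list(itertools.combinations(anchors, 2))

for filename in file_list:                     # file_list and doc_folder as in the question
    with open(doc_folder + filename) as doc:
        df = anchor_distances(doc.read(), anchor_pairs)
    df.to_csv(doc_folder + filename[0:-4] + '.csv')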