Basically, my code reads in a text file, tokenizes it into sentences, and then finds the sentences in which two predefined anchor words both occur. For each such sentence it measures the distance between the two words and appends it to a DataFrame whose column headers are the word pairs and whose rows are the sentences in the file; if a pair does not occur in a sentence, the cell is null.
For example, if the text is 'The brown fox jumped. Over the lazy dog. Happy word.', the DataFrame looks like:
  | brown+jumped | jumped+dog | the+dog | over+dog
1 | 1            | null       | null    | null
2 | null         | null       | 2       | 3
3 | null         | null       | null    | null
The code runs fine when parsing a short passage, but on larger text files it takes much longer. I know that the key to speed with DataFrames is to avoid for loops and apply functions to the whole dataset at once.
My question is: when reading the text file, is there a faster way to apply a function to the strings than going line by line and appending the results to the DataFrame?
In case it helps, this is what the code looks like:
for filename in file_list:
    doc_df = pd.DataFrame()
    doc = open(doc_folder+filename, "r")
    doc_text = doc.read().replace('-\n\n ', '').replace('\n', '').replace('\x0c', '.')
    doc.close()
    sents = sent_detector.tokenize(doc_text)
    sent_count = 0
    for sent in sents:
        sent_l = sent.lower()
        sent_ws = set(re.findall(r'[A-Z]?[a-z]+|(?<= )[A-Z]+', sent_l))
        sent_anchs = anchor_words.intersection(sent_ws)  # anchor_words is a predefined list of words that I'm looking for
        if sent_anchs != set():
            sent_vecs = sent_to_vecs(sent_l, list(sent_anchs))  # sent_to_vecs vectorizes the words in the sentence, plus a list of anchor words
            for sent_vec in sent_vecs:
                # Save the word that it was measured from
                base_word = sent_vec[0]
                df_dict = {}
                for each_tup in mean_treat(sent_vec)[1]:
                    if each_tup[0] in ['ROOT', 'a', 'an', 'the']:
                        continue
                    df_dict[base_word+'+'+each_tup[0]] = 1/(each_tup[1])  # append distance between two words to the line the sentence is on
                sent_count += 1
                doc_df = doc_df.append(pd.DataFrame(df_dict, index=[sent_count]))
    doc_df = doc_df.append(pdf)  # pdf is the column header.
    doc_df = doc_df.fillna(null_val)
    print('Saving {} to a csv.'.format(filename))
    doc_df.to_csv(doc_folder+filename[0:-4]+'.csv')
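As an aside (this is an assumption about the bottleneck, not something stated in the question): calling DataFrame.append inside a loop copies every existing row on each call, so runtime grows quadratically with the number of matching sentences. A common fix is to collect one dict per sentence and build the DataFrame once at the end. A minimal sketch with hypothetical stand-in data:

```python
import pandas as pd

# Hypothetical sentences and anchor pairs standing in for the real pipeline.
sentences = ['the brown fox jumped', 'over the lazy dog', 'happy word']
anchor_pairs = {('brown', 'jumped'), ('the', 'dog'), ('over', 'dog')}

rows = []
for sent in sentences:
    words = sent.split()
    row = {}
    for i, w1 in enumerate(words):
        for j in range(i + 1, len(words)):
            if (w1, words[j]) in anchor_pairs:
                row['%s+%s' % (w1, words[j])] = j - i  # word-index distance
    rows.append(row)  # one dict per sentence; missing pairs become NaN later

# Build the DataFrame once, instead of appending inside the loop.
doc_df = pd.DataFrame(rows)
```

Missing cells come out as NaN automatically, so the final fillna(null_val) step still applies unchanged.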
Answer 0 (score: 0)
I suggest refactoring the code to reduce the number of nested loops.
Below is an example that uses TextBlob to identify words and sentences, and itertools.combinations to build the anchor-word pairs. The results are appended to a pandas DataFrame.
import itertools
from textblob import TextBlob
import pandas as pd

data = TextBlob('The brown fox jumped. Over the lazy dog. Happy word.')
anchors = ['brown', 'jumped', 'over', 'the', 'dog']
anchor_pairs = [x for x in itertools.combinations(anchors, 2)]

df = pd.DataFrame()
for idx, sentence in enumerate(data.sentences):
    word_list = sentence.words.lower()
    row = {}
    for pair in itertools.combinations(word_list, 2):
        if pair in anchor_pairs:
            first, second = pair
            label = '%s+%s' % (first, second)
            row.update({label: word_list.index(second) - word_list.index(first)})
    df = df.append(pd.Series(row), ignore_index=True)
The result is:
brown+jumped over+dog over+the the+dog
0 2 NaN NaN NaN
1 NaN 3 1 2
2 NaN NaN NaN NaN
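One caveat worth noting (my addition, not part of the answer above): word_list.index returns the position of the first occurrence only, and the inner itertools.combinations scan does O(n²) membership tests per sentence. A sketch of an alternative that records first-occurrence positions in a single pass, then looks up only the anchor pairs:

```python
import itertools

def pair_distances(words, anchor_pairs):
    """Word-index distance between the first occurrences of each anchor pair."""
    first_pos = {}
    for i, word in enumerate(words):
        first_pos.setdefault(word, i)  # keep only the first occurrence
    row = {}
    for first, second in anchor_pairs:
        if first in first_pos and second in first_pos:
            row['%s+%s' % (first, second)] = first_pos[second] - first_pos[first]
    return row

anchors = ['brown', 'jumped', 'over', 'the', 'dog']
pairs = list(itertools.combinations(anchors, 2))
# Matches the answer's second row: over+the 1, over+dog 3, the+dog 2.
print(pair_distances('over the lazy dog'.split(), pairs))
```

This keeps the per-sentence work linear in sentence length plus the number of anchor pairs, which matters once the anchor list grows.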