Question

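I have a dataframe with roughly 100,000 rows. Its "sequence" column holds long strings. I want to count how often a subsequence occurs in the "sequence" column, including overlapping occurrences (i.e., if the value is aaaaaaa and the substring is aa, the frequency should be 6, not 3).

Below is reproducible code that resembles my actual code.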

import pandas as pd
import re
import itertools
import time
from random import choice

# generate 100,000 random strings of fixed size for demo purpose
alphabet = "abcdefghijk"
str_list = []

for i in range(100000):
    str_list.append(''.join(choice(alphabet) for _ in range(100)))
# make pandas dataframe
df = pd.DataFrame(columns=['sequence'], data=str_list)

# get a list of substrings to count its frequency in the dataframe
# for the sake of demo, make substrings from "alphabet" with length of 3
# actual application can have up to substrings of length 5 (i.e. 11^5 substrings)
words = [''.join(p) for p in itertools.product(alphabet, repeat=3)]

# calculate frequency of words in the dataframe
for word in words:
    tic = time.time()
    df['frequency'] = df['sequence'].apply(lambda x: len(re.findall('(?={0})'.format(word), x)))
    print("{}s took for one iteration".format(time.time() - tic))

Note that the pandas built-in "pd.Series.str.count" counts non-overlapping occurrences of a plain substring, so I had to combine apply with a regular expression.
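To make the distinction concrete, here is a minimal sketch on toy data. The series `demo` and the pattern "aba" are illustrative assumptions, not part of the original code. It contrasts a plain substring pattern passed to `str.count` with the zero-width lookahead used in the loop above.

```python
import re
import pandas as pd

# Hypothetical toy data, chosen so the two counts differ.
demo = pd.Series(["ababa", "abcabc"])

# Plain pattern: after each match, scanning resumes past the matched text,
# so overlapping occurrences are skipped.
non_overlapping = demo.str.count("aba").tolist()

# Zero-width lookahead: matches at every start position without consuming
# characters, so overlapping occurrences are all counted.
overlapping = [len(re.findall("(?=aba)", s)) for s in demo]

print("non-overlapping:", non_overlapping)
print("overlapping:    ", overlapping)
```

The lookahead consumes no characters, which is why the regex engine can report a match at two start positions in "ababa" that share the middle "a".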

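The problem is that counting one substring takes about 0.5 seconds on my machine, and since there are between 11^3 and 11^5 substrings, the worst case could take about 80,000 seconds (roughly 22 hours).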
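The estimate can be checked with a few lines of arithmetic. This sketch assumes the 0.5 seconds per substring quoted above stays constant for every word length. The helper name `estimate_seconds` is hypothetical.

```python
SECONDS_PER_WORD = 0.5   # per-substring timing quoted in the post
ALPHABET_SIZE = 11       # len("abcdefghijk")


def estimate_seconds(k):
    """Total loop time if each of the 11**k words takes SECONDS_PER_WORD."""
    return ALPHABET_SIZE ** k * SECONDS_PER_WORD


for k in (3, 4, 5):
    total = estimate_seconds(k)
    print("k={}: {} words, {:.1f} s ({:.2f} h)".format(
        k, ALPHABET_SIZE ** k, total, total / 3600))
```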

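The lambda seems to be what slows the computation down, since it has to run in Python rather than in Cython (which would undoubtedly be faster), but I don't know how to avoid it.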

Is there a way to speed up this operation?
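One possible direction, offered only as an unbenchmarked sketch and not taken from the post: rather than scanning every row once per word, slide a window of length k over each row once and tally all words at the same time. Overlapping occurrences are counted naturally because the window advances one character at a time. The helper `kmer_counts` and the toy dataframe are hypothetical.

```python
from collections import Counter

import pandas as pd


def kmer_counts(seq, k):
    """Tally every length-k window of seq, advancing one character at a time."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical toy data standing in for the 100,000-row dataframe.
df = pd.DataFrame({"sequence": ["aaaaaaa", "abcabca"]})

# One pass per row yields the counts for all words of length 3 at once.
per_row = [kmer_counts(s, 3) for s in df["sequence"]]

# Looking up a single word is then a dictionary access per row;
# a Counter returns 0 for words that never occurred.
df["frequency"] = [c["abc"] for c in per_row]
print(df)
```

Whether this is actually faster on the real data would need to be measured. Storing the counts for every word and every row can also take considerable memory at k=5.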

P.S. I am using Python 3.5.2.

Python: Pandas, what is the fastest way to count substring occurrences in a dataframe

0 answers: