Question

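I need the fastest possible way to shingle strings from a data frame and then build a master list of the results.

Given the following data frame: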

import pandas as pd
d=['Hello', 'Helloworld']
f=pd.DataFrame({'strings':d})
f
    strings
0   Hello
1   Helloworld

I want to generate a list of shingled strings (each of length 3) like this, covering every contiguous 3-character substring:

[['Hel', 'ell', 'llo'],['Hel', 'ell', 'llo', 'low', 'owo', 'wor', 'orl', 'rld']]

...and a master list of all unique values, like this:

['wor', 'Hel', 'ell', 'owo', 'llo', 'rld', 'orl', 'low']

I can do it like this, but I suspect there is a faster way:

import itertools

# Shingle a word into substrings of exactly 3 characters
# (e.g. "Hello" -> "Hel", "ell", "llo")
def shingle(word):
    return [word[i:i + 3] for i in range(len(word) - 3 + 1)]

# Shingle every string in the column
r = [shingle(w) for w in f['strings']]
# Get all elements into one list:
colsunq = list(itertools.chain.from_iterable(r))
# Remove duplicates:
colsunq = list(set(colsunq))
colsunq

['wor', 'Hel', 'ell', 'owo', 'llo', 'rld', 'orl', 'low']

Thanks in advance!

Answer 1

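I'm 4 years late, but here is an answer. I don't think it is possible to determine the "fastest" way, because that depends heavily on the hardware and the algorithm. (It may fall under something similar to ...)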
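Since the result depends on the machine, one option is simply to time candidate approaches on your own hardware. The following is a minimal sketch using the sample strings from the question; the function names `unique_via_chain` and `unique_via_set_comprehension` are made up for illustration, and the set-comprehension variant is only one possible alternative, not a claim about which is faster.

```python
import timeit
from itertools import chain

strings = ['Hello', 'Helloworld']  # same sample data as in the question

def shingle(word, k=3):
    # All contiguous substrings of length k
    return [word[i:i + k] for i in range(len(word) - k + 1)]

def unique_via_chain(strings, k=3):
    # Close to the question's approach: shingle each string,
    # flatten the results, then deduplicate with a set
    return set(chain.from_iterable(shingle(s, k) for s in strings))

def unique_via_set_comprehension(strings, k=3):
    # Hypothetical alternative: add shingles to a set directly,
    # without building the intermediate per-string lists
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}

# Both variants must agree before their timings are worth comparing
assert unique_via_chain(strings) == unique_via_set_comprehension(strings)

for fn in (unique_via_chain, unique_via_set_comprehension):
    seconds = timeit.timeit(lambda: fn(strings), number=10_000)
    print(f"{fn.__name__}: {seconds:.4f} s for 10,000 runs")
```

Timings on two short strings say little about a large column, so it is worth repeating the measurement on a realistic sample of your data.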

However, I needed to process more than 11 million files. I put each word into a numpy array and ran the following code.

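# words: 1-D numpy array with one word per element
# w: shingle width, i.e. the number of consecutive words per shingle
shingles = set()

for i in range(words.shape[0] - w + 1):
    a = words[i:i + w]
    shingles.add(tuple(a))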

This code processed 27.2 billion words in about 6 hours.
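For anyone who wants to try the loop above, here is a self-contained sketch of the same idea. The sample sentence, the function name `word_shingles`, and the width of 3 are assumptions for illustration; the answer does not say what `words` and `w` actually held.

```python
import numpy as np

def word_shingles(words, w):
    """Return the set of unique w-word shingles, each as a tuple of str."""
    shingles = set()
    for i in range(words.shape[0] - w + 1):
        # .tolist() turns the numpy strings into plain Python str objects,
        # so the stored tuples are ordinary hashable values
        shingles.add(tuple(words[i:i + w].tolist()))
    return shingles

# Assumed sample input: one word per array element
words = np.array("the quick brown fox jumps over the quick brown dog".split())

result = word_shingles(words, 3)
print(len(result))  # 7: eight windows, one of which repeats
```

Note that this shingles over words, while the question shingles over the characters of each string; the sliding-window idea is the same, only the unit differs.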
