Given the following DataFrame:
import pandas as pd
d=['Hello', 'Helloworld']
f=pd.DataFrame({'strings':d})
f
      strings
0       Hello
1  Helloworld
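I would like to split each string into chunks of 3 characters and use those chunks as column headers to build a matrix of 1s and 0s, depending on whether a given row contains that 3-character chunk.
Like this: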
      Strings  Hel  low  orl
0       Hello    1    0    0
1  Helloworld    1    1    1
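Note that the string "Hello" gets 0 in the "low" column, because a 1 is assigned only for exact chunk matches, not partial ones. If there is more than one match (e.g. if the string were "HelHel"), it should still be assigned just 1 (although it would also be nice to know how to count the occurrences and assign 2 instead).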
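To make the matching rule concrete, here is a minimal sketch in plain Python. The helper name `chunks` is hypothetical (it is not part of the question's code), and it assumes that a trailing piece shorter than the chunk size is simply discarded, which is why "Hello" never produces "low".

```python
from collections import Counter

def chunks(s, k):
    # Non-overlapping pieces of exactly k characters; a shorter tail is dropped.
    return [s[i:i + k] for i in range(0, len(s) - k + 1, k)]

hello = Counter(chunks("Hello", 3))    # only 'Hel' is a full chunk; 'lo' is dropped
helhel = Counter(chunks("HelHel", 3))  # 'Hel' occurs twice

low_in_hello = hello["low"]              # a Counter returns 0 for a missing key
hel_count = helhel["Hel"]                # occurrence count
hel_flag = int(helhel["Hel"] > 0)        # presence flag (1 or 0)
print(low_in_hello, hel_count, hel_flag)
```

The count and the 0/1 flag come from the same tally, so switching between the two behaviours is only a matter of whether the count is capped at 1.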
Ultimately, I am trying to prepare my data for use with LSHForest in SKLearn, so I expect many different string values.
This is what I have tried so far:
#Split into chunks of exactly 3
def split(s, chunk_size):
    a = zip(*[s[i::chunk_size] for i in range(chunk_size)])
    return [''.join(t) for t in a]
cols=[split(s,3) for s in f['strings']]
cols
[['Hel'], ['Hel', 'low', 'orl']]
#Get all elements into one list:
import itertools
colsunq=list(itertools.chain.from_iterable(cols))
#Remove duplicates:
colsunq=list(set(colsunq))
colsunq
['orl', 'Hel', 'low']
Now all I need to do is create a column in f for each element in colsunq, and add a 1 if the string in the 'strings' column contains the chunk given by that column header.
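One possible way to sketch that last step, staying close to the variables above: loop over the unique chunks and test membership against each row's chunk list. This is an illustration under assumptions rather than a definitive solution; `split` is rewritten here with slicing so the block is self-contained, and the unique chunks are sorted only to make the column order predictable.

```python
import itertools
import pandas as pd

f = pd.DataFrame({'strings': ['Hello', 'Helloworld']})

def split(s, chunk_size):
    # Non-overlapping chunks of exactly chunk_size characters.
    return [s[i:i + chunk_size]
            for i in range(0, len(s) - chunk_size + 1, chunk_size)]

cols = [split(s, 3) for s in f['strings']]
colsunq = sorted(set(itertools.chain.from_iterable(cols)))

# One indicator column per unique chunk: 1 if the row's chunk list contains it.
for chunk in colsunq:
    f[chunk] = [int(chunk in row_chunks) for row_chunks in cols]

print(f)
```

Adding columns one at a time is easy to follow but may become slow with many distinct chunks; building the whole matrix in one step is likely a better fit for large inputs.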
Thanks in advance!
Note: in case shingling is preferred:
#Shingle into strings of exactly 3
def shingle(word):
    a = [word[i:i + 3] for i in range(len(word) - 3 + 1)]
    return [''.join(t) for t in a]
#Shingle (i.e. "hello" -> "hel","ell",'llo')
a=[shingle(w) for w in f['strings']]
#Get all elements into one list:
import itertools
colsunq=list(itertools.chain.from_iterable(a))
#Remove duplicates:
colsunq=list(set(colsunq))
colsunq
['wor', 'Hel', 'ell', 'owo', 'llo', 'rld', 'orl', 'low']
Answer 0 (score: 2)
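def str_chunk(s, k):
    # Yield consecutive, non-overlapping chunks of exactly k characters
    i, j = 0, k
    while j <= len(s):
        yield s[i:j]
        i, j = j, j + k

def chunkit(s, k):
    return list(str_chunk(s, k))

def count_chunks(s, k):
    # Series.value_counts() replaces the deprecated top-level pd.value_counts
    return pd.Series(chunkit(s, k)).value_counts()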
Demo
f.strings.apply(chunkit, k=3)
0              [Hel]
1    [Hel, low, orl]
Name: strings, dtype: object
f.strings.apply(count_chunks, k=3).fillna(0)
Shingling
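def str_shingle(s, k):
    # Yield overlapping windows of k characters, advancing one character at a time
    i, j = 0, k
    while j <= len(s):
        yield s[i:j]
        i, j = i + 1, j + 1

def shingleit(s, k):
    return list(str_shingle(s, k))

def count_shingles(s, k):
    # Series.value_counts() replaces the deprecated top-level pd.value_counts
    return pd.Series(shingleit(s, k)).value_counts()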
f.strings.apply(count_shingles, k=3).fillna(0)
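Since chunking and shingling differ only in how far the window advances, both can be sketched with a single helper that takes a step size. The names `pieces` and `count_matrix` below are hypothetical and not part of the answer above; the sketch swaps in `collections.Counter` for `value_counts`, and caps the counts at 1 to get the 0/1 matrix the question asks for.

```python
from collections import Counter
import pandas as pd

def pieces(s, k, step):
    # Substrings of length k starting every `step` characters:
    # step == k gives non-overlapping chunks, step == 1 gives shingles.
    return [s[i:i + k] for i in range(0, len(s) - k + 1, step)]

def count_matrix(strings, k, step):
    # One row per input string, one column per distinct piece, values are counts.
    rows = [dict(Counter(pieces(s, k, step))) for s in strings]
    return pd.DataFrame(rows, index=strings.index).fillna(0).astype(int)

f = pd.DataFrame({'strings': ['Hello', 'Helloworld']})

chunk_flags = count_matrix(f['strings'], k=3, step=3).clip(upper=1)
shingle_flags = count_matrix(f['strings'], k=3, step=1).clip(upper=1)

print(f.join(chunk_flags))
print(f.join(shingle_flags))
```

Dropping the `.clip(upper=1)` call keeps the raw counts, which covers the "assign 2 for HelHel" variant mentioned in the question.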