Question

Given the following DataFrame:

import pandas as pd
d=['Hello', 'Helloworld']
f=pd.DataFrame({'strings':d})
f
    strings
0   Hello
1   Helloworld

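I would like to split each string into chunks of 3 characters and use those chunks as column headers to build a matrix of 1s and 0s, depending on whether a given row contains that 3-character chunk.

Like this: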

    Strings     Hel     low     orl
0   Hello         1       0       0
1   Helloworld    1       1       1

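Note that the string "Hello" has a 0 in the "low" column, because a 1 is assigned only when a whole chunk matches exactly. If a chunk occurs more than once (e.g. if the string were "HelHel"), it should still be assigned only 1 (although it would also be nice to know how to count the occurrences and assign 2 instead).

Ultimately, I am trying to prepare my data for use with LSHForest in SKLearn, so I expect many different string values.

This is what I have tried so far: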

#Split into chunks of exactly 3
def split(s, chunk_size):
    a = zip(*[s[i::chunk_size] for i in range(chunk_size)])
    return [''.join(t) for t in a]
cols=[split(s,3) for s in f['strings']]
cols

[['Hel'], ['Hel', 'low', 'orl']]

#Get all elements into one list:
import itertools
colsunq=list(itertools.chain.from_iterable(cols))
#Remove duplicates:
colsunq=list(set(colsunq))
colsunq

['orl', 'Hel', 'low']

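Now all I need to do is create a column in f for each element of colsunq, and put a 1 in it whenever the string in the 'strings' column contains the chunk named by that column header.

Thanks in advance!

Note: if shingling is wanted instead: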

#Shingle into strings of exactly 3
def shingle(word):
    a = [word[i:i + 3] for i in range(len(word) - 3 + 1)]
    return [''.join(t) for t in a]
#Shingle (i.e. "hello" -> "hel","ell",'llo')
a=[shingle(w) for w in f['strings']]
#Get all elements into one list:
import itertools
colsunq=list(itertools.chain.from_iterable(a))
#Remove duplicates:
colsunq=list(set(colsunq))
colsunq
['wor', 'Hel', 'ell', 'owo', 'llo', 'rld', 'orl', 'low']

Answer 1

def str_chunk(s, k):
    i, j = 0, k
    while j <= len(s):
        yield s[i:j]
        i, j = j, j + k

def chunkit(s, k):
    return list(str_chunk(s, k))

def count_chunks(s, k):
    # Series.value_counts replaces the deprecated top-level pd.value_counts
    return pd.Series(chunkit(s, k)).value_counts()

Demo

f.strings.apply(chunkit, k=3)

0              [Hel]
1    [Hel, low, orl]
Name: strings, dtype: object

f.strings.apply(count_chunks, k=3).fillna(0)
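
The call above returns counts, while the question asks for 0/1 indicators. Below is a minimal sketch of one way to get both and attach them to the original frame. `chunk_counts` is a hypothetical helper that mirrors `count_chunks`, and the extra 'HelHel' row is added only to illustrate the difference between counting and flagging.

```python
import pandas as pd

def chunk_counts(s, k):
    # Hypothetical helper: non-overlapping chunks of exactly k characters;
    # a trailing remainder shorter than k is dropped.
    chunks = [s[i:i + k] for i in range(0, len(s) - k + 1, k)]
    return pd.Series(chunks, dtype=object).value_counts()

# 'HelHel' is an illustrative extra row, not part of the original data.
f = pd.DataFrame({'strings': ['Hello', 'Helloworld', 'HelHel']})

# One column per chunk, holding how often the chunk occurs in each row.
counts = f['strings'].apply(chunk_counts, k=3).fillna(0).astype(int)

# Collapse counts to 0/1 indicators.
binary = (counts > 0).astype(int)

# Attach the indicator columns to the original frame.
result = f.join(binary)
print(result)
```

Here `counts` holds 2 for 'Hel' in the 'HelHel' row, while `binary` holds 1, so either behaviour described in the question is available.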

shingling

def str_shingle(s, k):
    i, j = 0, k
    while j <= len(s):
        yield s[i:j]
        i, j = i + 1, j + 1

def shingleit(s, k):
    return list(str_shingle(s, k))

def count_shingles(s, k):
    # Series.value_counts replaces the deprecated top-level pd.value_counts
    return pd.Series(shingleit(s, k)).value_counts()

f.strings.apply(count_shingles, k=3).fillna(0)
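
Since the question expects many different strings, building one Series per row may become slow. The following is a sketch of an alternative, not the method used above: it puts the shingles in long form with `Series.explode` and tabulates them with `pd.crosstab`. The names `shingles`, `long`, `row` and `shingle` are illustrative choices.

```python
import pandas as pd

f = pd.DataFrame({'strings': ['Hello', 'Helloworld']})
k = 3

# One list of overlapping k-character shingles per row, then one shingle per line.
shingles = f['strings'].apply(
    lambda s: [s[i:i + k] for i in range(len(s) - k + 1)]
).explode()

# Long form: the original row label next to each shingle.
long = shingles.rename('shingle').rename_axis('row').reset_index()

# Count occurrences of each shingle per row; rows that produced no
# shingles (strings shorter than k) are restored as all zeros.
counts = pd.crosstab(long['row'], long['shingle']).reindex(f.index, fill_value=0)

# 0/1 indicators, if counts are not wanted.
binary = (counts > 0).astype(int)

result = f.join(binary)
print(result)
```

The count table stays dense here; for a very large vocabulary of shingles a sparse representation may be a better fit before handing the data to scikit-learn.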
