Pandas N-Grams to Columns

Time: 2016-11-16 22:55:04

Tags: python pandas

Given the following DataFrame:

import pandas as pd
d=['Hello', 'Helloworld']
f=pd.DataFrame({'strings':d})
f
    strings
0   Hello
1   Helloworld

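I would like to split each string into chunks of 3 characters and use those chunks as column headers to build a matrix of 1s and 0s, depending on whether a given row contains that 3-character chunk.

Like this: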

    Strings     Hel     low     orl
0   Hello         1       0       0
1   Helloworld    1       1       1

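Note that the string "Hello" gets a 0 in the "low" column, because a 1 is assigned only for an exact chunk match. If a chunk matches more than once (for example, if the string were "HelHel"), it should still be assigned just 1 (although it would also be good to know how to count the occurrences and assign 2 instead).

Ultimately, I am trying to prepare my data for use with LSHForest in SKLearn, so I expect many distinct string values.

Here is what I have tried so far: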
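To make the presence-versus-count distinction concrete, here is a minimal sketch in plain Python. The helper name `chunks` is hypothetical and not part of the original attempt; it assumes non-overlapping chunks of exactly 3 characters, with any shorter trailing remainder dropped.

```python
from collections import Counter

def chunks(s, k=3):
    """Split s into consecutive, non-overlapping pieces of exactly k characters.

    Hypothetical helper for illustration; a trailing remainder shorter
    than k is dropped.
    """
    return [s[i:i + k] for i in range(0, len(s) - k + 1, k)]

pieces = chunks("HelHel")                 # the same chunk appears twice
presence = {c: 1 for c in set(pieces)}    # 1/0 style: duplicates collapse to 1
counts = dict(Counter(pieces))            # count style: duplicates add up

print(pieces)
print(presence)
print(counts)
```

The only difference between the two encodings is whether repeated chunks are collapsed with `set` or tallied with `Counter`.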

#Split into chunks of exactly 3
def split(s, chunk_size):
    a = zip(*[s[i::chunk_size] for i in range(chunk_size)])
    return [''.join(t) for t in a]
cols=[split(s,3) for s in f['strings']]
cols

[['Hel'], ['Hel', 'low', 'orl']]

#Get all elements into one list:
import itertools
colsunq=list(itertools.chain.from_iterable(cols))
#Remove duplicates:
colsunq=list(set(colsunq))
colsunq

['orl', 'Hel', 'low']

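Now, all I need to do is create a column in f for each element of colsunq, and put a 1 in it whenever the string in the 'strings' column contains the chunk named by that column header.

Thanks in advance!

Note: in case shingling is needed instead: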

#Shingle into strings of exactly 3
def shingle(word):
    # Each slice is already a string, so no join is needed
    return [word[i:i + 3] for i in range(len(word) - 3 + 1)]
#Shingle (i.e. "hello" -> "hel","ell",'llo')
a=[shingle(w) for w in f['strings']]
#Get all elements into one list:
import itertools
colsunq=list(itertools.chain.from_iterable(a))
#Remove duplicates:
colsunq=list(set(colsunq))
colsunq
['wor', 'Hel', 'ell', 'owo', 'llo', 'rld', 'orl', 'low']

1 Answer:

Answer 0: (score: 2)

def str_chunk(s, k):
    i, j = 0, k
    while j <= len(s):
        yield s[i:j]
        i, j = j, j + k

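def chunkit(s, k):
    return list(str_chunk(s, k))

def count_chunks(s, k):
    # The top-level pd.value_counts is deprecated in recent pandas;
    # the Series method gives the same counts
    return pd.Series(chunkit(s, k)).value_counts()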

Demo

f.strings.apply(chunkit, k=3)

0              [Hel]
1    [Hel, low, orl]
Name: strings, dtype: object

f.strings.apply(count_chunks, k=3).fillna(0)
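The call above yields counts (as floats, because of the `NaN` fill). Since the question asks for a 1/0 matrix attached to the original frame, here is a minimal sketch of one way to get there. It assumes the same non-overlapping chunking; `chunk_counts` is a hypothetical stand-in for `count_chunks`, and the extra "HelHel" row is added only to show a repeated chunk.

```python
import pandas as pd

def chunk_counts(s, k=3):
    # Hypothetical stand-in for count_chunks: count non-overlapping k-char chunks
    pieces = [s[i:i + k] for i in range(0, len(s) - k + 1, k)]
    return pd.Series(pieces).value_counts()

# Sample data from the question plus one string with a repeated chunk
f = pd.DataFrame({'strings': ['Hello', 'Helloworld', 'HelHel']})

# One column per chunk; missing chunks become 0, then cast back to integers
counts = f['strings'].apply(chunk_counts).fillna(0).astype(int)

# Cap every count at 1 to get the presence/absence matrix
binary = counts.clip(upper=1)

# Attach the indicator columns to the original frame
result = f.join(binary)
print(result)
```

Keeping `counts` around gives the "assign 2" variant mentioned in the question, while `binary` gives the 1/0 variant.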


Shingling

def str_shingle(s, k):
    i, j = 0, k
    while j <= len(s):
        yield s[i:j]
        i, j = i + 1, j + 1

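def shingleit(s, k):
    return list(str_shingle(s, k))

def count_shingles(s, k):
    # The top-level pd.value_counts is deprecated in recent pandas;
    # the Series method gives the same counts
    return pd.Series(shingleit(s, k)).value_counts()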

f.strings.apply(count_shingles, k=3).fillna(0)
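Because the question expects many distinct strings, building one Series per row through `apply` may become slow. As an alternative sketch (not the answer's method), the same shingle counts can be assembled from plain dictionaries and handed to the `DataFrame` constructor once. The names `shingles` and `matrix` are hypothetical.

```python
from collections import Counter
import pandas as pd

def shingles(s, k=3):
    # Overlapping k-character windows, e.g. "Hello" -> "Hel", "ell", "llo"
    return [s[i:i + k] for i in range(len(s) - k + 1)]

strings = ['Hello', 'Helloworld']

# One dict of shingle counts per input string
rows = [dict(Counter(shingles(s))) for s in strings]

# Build the matrix in one step; shingles absent from a row become 0
matrix = pd.DataFrame(rows, index=strings).fillna(0).astype(int)
print(matrix)
```

Applying `matrix.clip(upper=1)` would turn the counts into presence indicators if only 1/0 values are wanted.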
