Question

I'm new to scikit-learn and NumPy. How can I represent a dataset made up of lists/arrays of strings, for example

[["aa bb","a","bbb","à"], ["bb cc","c","ddd","à"], ["kkk","a","","a"]]

as an array of dtype float?

Answer 1

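I think what you're looking for is a numeric representation of your words. You can use gensim to map each word to a token id and build your NumPy arrays from those ids, like this: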

import numpy as np
from gensim import corpora 

toconvert = [["aa bb","a","bbb","à"], ["bb", "cc","c","ddd","à"], ["kkk","a","","a"]]

# Build a dictionary that maps each distinct string to an integer token id.
# For example, 'aa bb' might get id 2 and 'a' id 1 (the actual ids may differ).
tdict = corpora.Dictionary(toconvert)

# The input is nested, so build one NumPy array per inner list
newlist = []
for row in toconvert:
    tmplist = []
    for word in row:
        # look up the id of the current word and collect it
        tmplist.append(tdict.token2id[word])
    # convert the ids to a float array and append it to the main list
    newlist.append(np.array(tmplist).astype(float))

# Prints a list with one float array of token ids per inner list,
# e.g. [array([...]), array([...]), array([...])]; the exact ids depend on
# the order in which gensim numbers the tokens.
print(newlist)

# To see which string a given id stands for:
print(tdict[0])  # e.g. 'a'
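
If you'd rather not depend on gensim, roughly the same id mapping can be sketched with NumPy alone. This is only a sketch and it assumes every inner list has the same length, as in the question's data (the gensim example above uses rows of different lengths, which would not form a 2-D array). The names rows, vocab and encoded are illustrative, not part of any library:

```python
import numpy as np

# Data from the question: three rows of four strings each
rows = [["aa bb", "a", "bbb", "à"],
        ["bb cc", "c", "ddd", "à"],
        ["kkk", "a", "", "a"]]

arr = np.array(rows)  # 2-D array of strings, shape (3, 4)

# np.unique returns the sorted distinct strings and, for every element,
# the index of its string in that sorted vocabulary
vocab, inverse = np.unique(arr.ravel(), return_inverse=True)

# Restore the original 2-D layout and switch to dtype float
encoded = inverse.reshape(arr.shape).astype(float)

print(encoded)
print(vocab[encoded.astype(int)])  # maps the ids back to the strings
```

Keep in mind that these ids are arbitrary labels: their numeric order only reflects how the strings sort, so a model may read meaning into distances between them that isn't there.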
