Question

I want to get one-hot data from sklearn's transform, with one row per inner list rather than one row per element.

Code:

from sklearn.feature_extraction.text import CountVectorizer
from itertools import chain


x = [['1234', '5678', '910', 'baba'], ['8', '1'], 
     [], ['9', '3'], [], ['7', '6'], [], []]
vector = CountVectorizer(token_pattern=r".+",  min_df=1, max_df=1.0, lowercase=False,
                 max_features=None)
vec = [xxx for xx in x for xxx in xx]
vector.fit(chain.from_iterable([vec]))
print(vector.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
new = []
for xx in x:
    new.append(vector.transform(xx))
for x in new:
    for xx in x.toarray():
        print(xx)

Current output:

['1', '1234', '3', '5678', '6', '7', '8', '9', '910', 'baba']
[0 1 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0]
[0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 1 0 0 0]
[1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0]
[0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0]

My expected output:

['1', '1234', '3', '5678', '6', '7', '8', '9', '910', 'baba']
[0 1 0 1 0 0 0 0 1 1]
[1 0 0 0 0 0 1 0 0 0]
[0 0 1 0 0 0 0 1 0 0]
[0 0 0 0 1 1 0 0 0 0]

Is there a way to do this with my code? I've tried changing it many times, but sadly no luck. Somehow my brain has stopped processing anything right now.

Answer 1

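You don't need an explicit for loop for this task. You can use MultiLabelBinarizer from the sklearn library. Empty lists would come out as all-zero rows, which the expected output doesn't include, so filter them out first.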

Here's an example with pandas:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

L = [['1234', '5678', '910', 'baba'], ['8', '1'], 
     [], ['9', '3'], [], ['7', '6'], [], []]

s = pd.Series(list(filter(None, L)))

mlb = MultiLabelBinarizer()

res = pd.DataFrame(mlb.fit_transform(s),
                   columns=mlb.classes_,
                   index=s.index)

print(res)

   1  1234  3  5678  6  7  8  9  910  baba
0  0     1  0     1  0  0  0  0    1     1
1  1     0  0     0  0  0  1  0    0     0
2  0     0  1     0  0  0  0  1    0     0
3  0     0  0     0  1  1  0  0    0     0
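
If the DataFrame isn't needed, the same encoding can be sketched with MultiLabelBinarizer alone. This is a minimal sketch on the list from the question; the names `non_empty` and `encoded` are illustrative and not part of the answer above.

```python
from sklearn.preprocessing import MultiLabelBinarizer

L = [['1234', '5678', '910', 'baba'], ['8', '1'],
     [], ['9', '3'], [], ['7', '6'], [], []]

# Drop the empty lists so they don't show up as all-zero rows.
non_empty = [labels for labels in L if labels]

mlb = MultiLabelBinarizer()
# Dense 0/1 NumPy array with one row per remaining list.
encoded = mlb.fit_transform(non_empty)

print(list(mlb.classes_))  # column labels, in sorted order
print(encoded)
```

`mlb.classes_` holds the column order, so it plays the role of the feature-name list printed in the question.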

Answer 2

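You can try using set intersection and np.isin.

The intersection gives the elements the two collections have in common, and np.isin builds a boolean array from them.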

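import numpy as np

# x is the list of lists from the question
mask = ['1', '1234', '3', '5678', '6', '7', '8', '9', '910', 'baba']
for xx in x:
    if xx:  # skip only the empty lists (len(xx) > 1 would also skip single-element lists)
        print(np.isin(mask, list(set(xx).intersection(mask))).astype(int))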

Out:

[0 1 0 1 0 0 0 0 1 1]
[1 0 0 0 0 0 1 0 0 0]
[0 0 1 0 0 0 0 1 0 0]
[0 0 0 0 1 1 0 0 0 0]

Flattening the list

# If you have nested lists of elements, you can flatten them with
sum(x,[])

Out:

['1234', '5678', '910', 'baba', '8', '1', '9', '3', '7', '6']
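
As a side note, `sum(x, [])` builds a new list at every step, so for many sublists `itertools.chain.from_iterable` is usually the cheaper way to flatten. The sketch below also derives `mask` from the data instead of typing it by hand. The names `flat` and `rows` are assumptions of this sketch, not part of the answer above.

```python
from itertools import chain

import numpy as np

x = [['1234', '5678', '910', 'baba'], ['8', '1'],
     [], ['9', '3'], [], ['7', '6'], [], []]

# Flatten the nested lists in a single pass.
flat = list(chain.from_iterable(x))

# Sorted unique elements serve as the column order.
mask = sorted(set(flat))

# One 0/1 row per non-empty inner list.
rows = [np.isin(mask, xx).astype(int) for xx in x if xx]

print(mask)
for row in rows:
    print(row)
```

Sorting the strings gives the same column order as the feature list in the question.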

Answer 3

For future readers:

I solved it in a super naive way.

Here is the code:

from sklearn.feature_extraction.text import CountVectorizer
from itertools import chain

x = [['1234', '5678', '910', 'baba'], ['8', '1'], 
     [], ['9', '3'], [], ['7', '6'], [], []]
# \S+ keeps every whitespace-separated token; \S*\d+\S* would drop 'baba' because it has no digit
vector = CountVectorizer(token_pattern=r"\S+",  min_df=1, max_df=1.0, lowercase=False,
                 max_features=None)
vec = [xxx for xx in x for xxx in xx]
vector.fit(chain.from_iterable([vec]))
print(vector.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
new = []
for xx in x:
    new.append(" ".join(xx))

neww = vector.transform(new)

print(neww.toarray())
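
A possible variation that skips the join step: CountVectorizer accepts a callable `analyzer`, so each inner list can be passed in as an already-tokenized document. This is only a sketch under that assumption; `docs`, `vectorizer`, `onehot` and `feature_names` are names made up for the illustration, and the empty lists are filtered out here to match the expected output in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

x = [['1234', '5678', '910', 'baba'], ['8', '1'],
     [], ['9', '3'], [], ['7', '6'], [], []]

# Keep only the non-empty lists; each one is treated as a document.
docs = [tokens for tokens in x if tokens]

# The callable analyzer returns the list unchanged, so every element is one token.
# binary=True caps each count at 1, which gives 0/1 rows.
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens, binary=True)
onehot = vectorizer.fit_transform(docs)

# vocabulary_ maps each token to its column index.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(feature_names)
print(onehot.toarray())
```

Because no regular expression is involved, tokens such as 'baba' are kept regardless of what characters they contain.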
