I want to get one-hot encoded data with one row per inner list (rather than one row per element) when using an sklearn transform.
Code:
from sklearn.feature_extraction.text import CountVectorizer
from itertools import chain
x = [['1234', '5678', '910', 'baba'], ['8', '1'],
     [], ['9', '3'], [], ['7', '6'], [], []]
vector = CountVectorizer(token_pattern=r".+", min_df=1, max_df=1.0, lowercase=False,
                         max_features=None)
vec = [xxx for xx in x for xxx in xx]
vector.fit(chain.from_iterable([vec]))
print(vector.get_feature_names())
new = []
for xx in x:
    new.append(vector.transform(xx))
for x in new:
    for xx in x.toarray():
        print(xx)
Current output:
['1', '1234', '3', '5678', '6', '7', '8', '9', '910', 'baba']
[0 1 0 0 0 0 0 0 0 0]
[0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0]
[0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 1 0 0 0]
[1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0]
[0 0 1 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0]
My expected output:
['1', '1234', '3', '5678', '6', '7', '8', '9', '910', 'baba']
[0 1 0 1 0 0 0 0 1 1]
[1 0 0 0 0 0 1 0 0 0]
[0 0 1 0 0 0 0 1 0 0]
[0 0 0 0 1 1 0 0 0 0]
Is there a way to do this with my code? I have tried changing it many times, but with no luck. Somehow my brain has stopped processing anything right now.
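Answer 0 (score: 1)
You don't need an explicit for loop for this task. You can use MultiLabelBinarizer from the sklearn library instead. It turns empty lists into all-zero rows rather than dropping them, so filter them out first.
Here is an example with pandas: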
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
L = [['1234', '5678', '910', 'baba'], ['8', '1'],
     [], ['9', '3'], [], ['7', '6'], [], []]
s = pd.Series(list(filter(None, L)))
mlb = MultiLabelBinarizer()
res = pd.DataFrame(mlb.fit_transform(s),
                   columns=mlb.classes_,
                   index=s.index)
print(res)
   1  1234  3  5678  6  7  8  9  910  baba
0  0     1  0     1  0  0  0  0    1     1
1  1     0  0     0  0  0  1  0    0     0
2  0     0  1     0  0  0  0  1    0     0
3  0     0  0     0  1  1  0  0    0     0
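To make the encoding itself concrete, here is a minimal plain-Python sketch of what a multi-hot encoder computes. It is an illustration only, not sklearn's implementation, and the names non_empty, classes, column, and multi_hot are made up for this example.

```python
# Plain-Python sketch of multi-hot encoding (illustrative, not sklearn's code).
L = [['1234', '5678', '910', 'baba'], ['8', '1'],
     [], ['9', '3'], [], ['7', '6'], [], []]

# Drop empty lists, mirroring filter(None, L) in the pandas example.
non_empty = [labels for labels in L if labels]

# Sorted vocabulary of every distinct label; one output column per label.
classes = sorted({label for labels in non_empty for label in labels})
column = {label: i for i, label in enumerate(classes)}


def multi_hot(labels):
    """Return a 0/1 row with a 1 in the column of each label present."""
    row = [0] * len(classes)
    for label in labels:
        row[column[label]] = 1
    return row


rows = [multi_hot(labels) for labels in non_empty]

print(classes)
for row in rows:
    print(row)
```

Each inner list becomes exactly one row, which is the behavior the question asks for.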
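Answer 1 (score: 1)
You can try set intersection together with np.isin.
intersection returns the elements the two sets have in common, and np.isin builds a boolean array from them.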
import numpy as np

# x is the list of lists from the question
mask = ['1', '1234', '3', '5678', '6', '7', '8', '9', '910', 'baba']
for xx in x:
    if xx:  # skip empty lists
        print(np.isin(mask, np.array(list(set(xx).intersection(set(mask))))).astype(int))
Out:
[0 1 0 1 0 0 0 0 1 1]
[1 0 0 0 0 0 1 0 0 0]
[0 0 1 0 0 0 0 1 0 0]
[0 0 0 0 1 1 0 0 0 0]
Flattening the list
# if you have a big list of lists, you can flatten it with
sum(x, [])
Out:
['1234', '5678', '910', 'baba', '8', '1', '9', '3', '7', '6']
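As an alternative sketch for the flattening step, itertools.chain.from_iterable (already imported in the question's code) gives the same result. The variable name flat is an assumption for this example.

```python
from itertools import chain

x = [['1234', '5678', '910', 'baba'], ['8', '1'],
     [], ['9', '3'], [], ['7', '6'], [], []]

# chain.from_iterable walks each inner list once, so the cost grows linearly
# with the total number of elements; sum(x, []) builds a new list at every step.
flat = list(chain.from_iterable(x))
print(flat)
```

For a handful of short lists the difference is negligible; it matters only when the outer list is long.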
Answer 2 (score: 0)
For future readers:
I solved it in a rather naive way.
Here is the code:
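from sklearn.feature_extraction.text import CountVectorizer
from itertools import chain

x = [['1234', '5678', '910', 'baba'], ['8', '1'],
     [], ['9', '3'], [], ['7', '6'], [], []]
# \S+ treats every whitespace-free string as one token; a pattern such as
# \S*\d+\S* requires a digit and would drop 'baba'
vector = CountVectorizer(token_pattern=r"\S+", min_df=1, max_df=1.0, lowercase=False,
                         max_features=None)
vec = [xxx for xx in x for xxx in xx]
vector.fit(chain.from_iterable([vec]))
# on scikit-learn older than 1.0, call get_feature_names() instead
print(vector.get_feature_names_out())
new = []
for xx in x:
    # join each inner list into one document so it becomes a single row
    new.append(" ".join(xx))
neww = vector.transform(new)
print(neww.toarray())  # empty lists come out as all-zero rows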
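A possible variation on the same approach, sketched under assumptions that are not in the original answer: passing a callable as analyzer lets CountVectorizer consume the token lists directly, so neither joining nor token_pattern is needed. binary=True caps every count at 1, and empty lists are dropped up front to match the expected output in the question. The names non_empty, vectorizer, features, and encoded are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

x = [['1234', '5678', '910', 'baba'], ['8', '1'],
     [], ['9', '3'], [], ['7', '6'], [], []]

# Assumption: empty lists should be dropped rather than kept as zero rows.
non_empty = [tokens for tokens in x if tokens]

# Identity analyzer: each inner list is already a sequence of tokens.
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens, binary=True)
encoded = vectorizer.fit_transform(non_empty).toarray()

# vocabulary_ maps each token to its column index; sort tokens by that index.
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(features)
print(encoded)
```

Because the tokens never pass through a regular expression, strings containing spaces or punctuation would also survive intact.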