Question

Example 1, where each row of a is a list of words, renders:

          a
0  [w1, w3]
1  [w2, w4]
[[1 0 1 0]
 [0 1 0 1]]

Example 2 code:

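import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

a = ['w1 w2', 'w1 w3']

df = pd.DataFrame({'a': a})
print(df)

mlb = MultiLabelBinarizer()
# Series.as_matrix() was removed in pandas 1.0; use df['a'].to_numpy() there
print(np.array(mlb.fit_transform(df['a'].as_matrix())))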

Example 2 output:

       a
0  w1 w2
1  w1 w3
[[1 1 1 0 1]
 [1 1 0 1 1]]

Example 1 appears to one-hot encode the DataFrame at the word level. What is Example 2 computing? It also looks like a one-hot encoding, but not at the word level. I initially thought it worked at the character level, but the DataFrame only contains the characters 1,2,3,w, which is 4 characters, whereas each row of

[[1 1 1 0 1] [1 1 0 1 1]]

has length 5.

The code above uses MultiLabelBinarizer: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html

Answer 1

I think your Example 2 is encoded at the character level:

df['a'].as_matrix()
array(['w1 w2', 'w1 w3'], dtype=object)

Every distinct character, including the space, gets its own one-hot column, so there are five columns.
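As a quick check of that count, here is a minimal pure-Python sketch. It is not scikit-learn itself; it assumes the columns follow the sorted order of the distinct characters, which is consistent with the output shown in the question.

```python
a = ['w1 w2', 'w1 w3']

# Distinct characters across both strings, in sorted order (assumed column order)
chars = sorted(set("".join(a)))
print(chars)  # [' ', '1', '2', '3', 'w']

# One indicator row per string: 1 if the character occurs in it, else 0
rows = [[int(c in s) for c in chars] for s in a]
print(rows)  # [[1, 1, 1, 0, 1], [1, 1, 0, 1, 1]]
```

The space is the fifth class, which explains why each row has length 5 rather than 4.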

Answer 2

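MultiLabelBinarizer expects y to be an iterable of iterables, as stated in the documentation.

It does not check the types of the values it is given; it simply uses itertools.chain.from_iterable to collect the elements.

From the source code: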

classes = sorted(set(itertools.chain.from_iterable(y)))

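So when you supply a list of words (strings) instead of a list of lists, each string is itself iterated, and the resulting classes are the characters of those words.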

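import itertools

# Single word: the string itself is iterated, giving its characters
classes = set(itertools.chain.from_iterable('word'))
print(classes)
# Output (set order may vary): {'d', 'o', 'r', 'w'}

# List of words: each word is iterated, again giving characters
classes = set(itertools.chain.from_iterable(['word1', 'word2']))
print(classes)
# Output (set order may vary): {'1', '2', 'd', 'o', 'r', 'w'}

# List of lists of words: each inner list is iterated, giving whole words
classes = set(itertools.chain.from_iterable([['word1', 'word2'], ['word3']]))
print(classes)
# Output (set order may vary): {'word1', 'word2', 'word3'}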
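To get word-level columns for Example 2, one option is to split each string into a list of words before binarizing. The sketch below is only an illustration: binarize is a hypothetical helper, not scikit-learn's implementation, and it mirrors nothing more than the sorted(set(chain.from_iterable(y))) step quoted above.

```python
import itertools

def binarize(y):
    # Hypothetical helper: collect the sorted distinct labels, then build
    # one 0/1 indicator row per sample.
    classes = sorted(set(itertools.chain.from_iterable(y)))
    rows = [[int(c in set(labels)) for c in classes] for labels in y]
    return classes, rows

# Example 1: rows are already lists of words
ex1_classes, ex1_rows = binarize([['w1', 'w3'], ['w2', 'w4']])
print(ex1_classes)  # ['w1', 'w2', 'w3', 'w4']
print(ex1_rows)     # [[1, 0, 1, 0], [0, 1, 0, 1]]

# Example 2: split each string on whitespace first
a = ['w1 w2', 'w1 w3']
word_classes, word_rows = binarize([s.split() for s in a])
print(word_classes)  # ['w1', 'w2', 'w3']
print(word_rows)     # [[1, 1, 0], [1, 0, 1]]
```

With the question's DataFrame, the same idea would presumably be to pass df['a'].str.split() to fit_transform instead of the raw strings.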

Is this one-hot encoding?
