Question

Here is my implementation of one-hot encoding:

%reset -f

import numpy as np 
import pandas as pd

sentences = []
s1 = 'this is sentence 1'
s2 = 'this is sentence 2'

sentences.append(s1)
sentences.append(s2)

def get_all_words(sentences) : 

  unf = [s.split(' ') for s in sentences]

  all_words = []

  for f in unf : 
    for f2 in f : 
      all_words.append(f2)

  return all_words



def get_one_hot(s , s1 , all_words) : 
  flattened = []
  one_hot_encoded_df = pd.get_dummies(list(set(all_words)))
  for a in [np.array(one_hot_encoded_df[s]) for s in s1.split(' ')] : 
    for aa in a : 
      flattened.append(aa)

  return flattened

all_words = get_all_words(sentences)

print(get_one_hot(sentences , s1 , all_words))

print(get_one_hot(sentences , s2 , all_words))

This returns:

[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]

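As you can see, a sparse vector is returned even for these short sentences. It looks as if the encoding is happening at the character level rather than at the word level? How can I correctly one-hot encode the words?

I think the encodings should be: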

s1 -> 1, 1, 1, 1
s2 -> 1, 1, 1, 0

Answer 1

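Encoding at character level

The encoding is not actually happening at the character level. In the loop

  for f in unf : 
    for f2 in f : 
      all_words.append(f2)

each f is a list of words (the result of s.split(' ')), so f2 iterates over whole words, not over characters. The loop only flattens the nested list and keeps duplicates. You can rewrite the whole function, removing the duplicates at the same time, as: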

def get_all_words(sentences) :
  unf = [s.split(' ') for s in sentences]
  return list(set([word for sen in unf for word in sen]))
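
A quick way to check what the flattening collects is to run it on the two sentences from the question. This is only a sketch: get_all_words_loop and get_all_words_set are names made up here to keep the two versions side by side.

```python
sentences = ['this is sentence 1', 'this is sentence 2']

def get_all_words_loop(sentences):
    # the nested loop from the question: flattens the list of word lists
    unf = [s.split(' ') for s in sentences]
    all_words = []
    for f in unf:
        for f2 in f:
            all_words.append(f2)
    return all_words

def get_all_words_set(sentences):
    # the rewritten version: same flattening, duplicates removed
    unf = [s.split(' ') for s in sentences]
    return list(set(word for sen in unf for word in sen))

loop_words = get_all_words_loop(sentences)
set_words = get_all_words_set(sentences)

print(loop_words)         # 8 tokens, each one a whole word
print(sorted(set_words))  # sorted only to make the printed order stable
```

Both versions produce whole words. The only difference is that the set-based one keeps each word once, and in no guaranteed order.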

Correct one-hot encoding

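This loop

  for a in [np.array(one_hot_encoded_df[s]) for s in s1.split(' ')] : 
    for aa in a : 
      flattened.append(aa)

is actually building one very long vector: it concatenates one 5-element column per word, which is why a 4-word sentence yields 4 x 5 = 20 entries. Let's look at the output of one_hot_encoded_df = pd.get_dummies(list(set(all_words))):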

   1  2  is  sentence  this
0  0  1   0         0     0
1  0  0   0         0     1
2  1  0   0         0     0
3  0  0   1         0     0
4  0  0   0         1     0

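The loop above selects the corresponding columns from this data frame and appends them to the output flattened. My suggestion is to simply use the pandas feature that lets you subset several columns at once, then sum them and clip the result to 0 or 1, to get a single 0/1 vector for the sentence: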

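def get_one_hot(s , s1 , all_words) :
  # s is unused; it is kept so the calls in the question still work
  one_hot_encoded_df = pd.get_dummies(list(set(all_words)))
  # select one column per word, sum across the columns, clip to 0/1
  return one_hot_encoded_df[s1.split(' ')].T.sum().clip(0,1).values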

The output will be:

[0 1 1 1 1]
[1 1 0 1 1]

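for the two sentences respectively. To interpret these: from the row index of the one_hot_encoded_df data frame, we know that position 0 stands for 2, position 1 for this, position 2 for 1, and so on. So the output [0 1 1 1 1] means every item in the bag of words except 2, which you can confirm against the input 'this is sentence 1'.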
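
Because list(set(all_words)) has no guaranteed order, the position assigned to each word can change from one run to the next. Below is a minimal sketch of one way to pin the order down, assuming a sorted vocabulary is acceptable. The names vocabulary and encode_sentence are introduced here for illustration only. dtype=int is passed because recent pandas versions return boolean dummies by default.

```python
import pandas as pd

sentences = ['this is sentence 1', 'this is sentence 2']

# Assumption: sorting the vocabulary gives every word a stable position
vocabulary = sorted({word for s in sentences for word in s.split(' ')})

# Row i of this frame corresponds to the vocabulary word at position i
one_hot_encoded_df = pd.get_dummies(vocabulary, dtype=int)

def encode_sentence(sentence):
    # pick one column per word, add them row-wise, and clip to 0/1
    words = sentence.split(' ')
    return one_hot_encoded_df[words].sum(axis=1).clip(0, 1).to_numpy()

encoded_s1 = encode_sentence('this is sentence 1')
encoded_s2 = encode_sentence('this is sentence 2')

print(vocabulary)   # ['1', '2', 'is', 'sentence', 'this']
print(encoded_s1)   # [1 0 1 1 1]
print(encoded_s2)   # [0 1 1 1 1]
```

With the sorted vocabulary, position 0 is always 1, position 1 is always 2, and so on, so the two vectors differ only in the first two entries.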

One hot encoding sentences
