我想将一个句子转换为一个单热矢量数组。 这些向量将是字母表的一个热门表示。 它看起来如下:
"hello" # h=7, e=4 l=11 o=14
会变成
[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
不幸的是,sklearn的OneHotEncoder不接受输入字符串。
答案 0 :(得分:6)
这是回归神经网络中的一项常见任务,如果你想使用它,tensorflow
就是为了这个目的而有一个特定的功能。
In [24]: alphabets = {'a' : 0, 'b': 1, 'c':2, 'd':3, 'e':4, 'f':5, 'g':6, 'h':7, 'i':8, 'j':9, 'k':10, 'l':11, 'm':12, 'n':13, 'o':14}
In [25]: idxs = [alphabets[ch] for ch in 'hello']
In [26]: idxs
Out[26]: [7, 4, 11, 11, 14]
# @divakar's approach
In [153]: idxs = np.fromstring("hello",dtype=np.uint8)-97
In [13]: one_hot = tf.one_hot(idxs, 26, dtype=tf.uint8)
In [14]: sess = tf.InteractiveSession()
In [15]: one_hot.eval()
Out[15]:
array([[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)
答案 1 :(得分:5)
只需将传递的字符串中的字母与给定的字母表进行比较:
def string_vectorizer(strng, alphabet=string.ascii_lowercase):
vector = [[0 if char != letter else 1 for char in alphabet]
for letter in strng]
return vector
请注意,使用自定义字母表(例如" defbcazk"时,将按照每个元素出现在原始列表中的顺序排列列。)
string_vectorizer('hello')
的输出:
[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
答案 2 :(得分:2)
你问过"句子"但是你的例子只提供了一个单词,所以我不确定你想对空格做些什么。但就单个单词而言,您的示例可以通过以下方式实现:
def onehot(ltr):
return [1 if i==ord(ltr) else 0 for i in range(97,123)]
def onehotvec(s):
return [onehot(c) for c in list(s.lower())]
onehotvec("hello")
[[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
答案 3 :(得分:2)
使用pandas,您可以通过传递分类系列来使用pd.get_dummies:
<payloadFactory media-type="xml">
<format>
<book category="cooking">
<title lang="en">$1</title>
<author>$2</author>
<year>$3</year>
</book>
</format>
<args>
<arg evaluator="xml" expression="//book/title/text()"/>
<arg evaluator="xml" expression="//book/author/text()"/>
<arg evaluator="xml" expression="//book/year/text()"/>
</args>
</payloadFactory>
答案 4 :(得分:2)
这是一种使用NumPy broadcasting
为我们提供(N,26)
形数组的矢量化方法 -
ints = np.fromstring("hello",dtype=np.uint8)-97
out = (ints[:,None] == np.arange(26)).astype(int)
如果您正在寻找性能,我建议使用初始化数组然后分配 -
out = np.zeros((len(ints),26),dtype=int)
out[np.arange(len(ints)), ints] = 1
示例运行 -
In [153]: ints = np.fromstring("hello",dtype=np.uint8)-97
In [154]: ints
Out[154]: array([ 7, 4, 11, 11, 14], dtype=uint8)
In [155]: out = (ints[:,None] == np.arange(26)).astype(int)
In [156]: print out
[[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]]