Create logical columns from token observations

Time: 2020-11-03 22:47:55

Tags: python awk

Here is an excerpt from the corpus:

w1, a1, c, f, s, w, q , p 
w2, p, z, l, c, t, w, k, e, d, a2 
w3, z, s, b, t, l
w4, a3, l, h, k, s, e, b
...

I am looking for the following output:

lemma, a1, a2, a3, b, c, d, e, f, h, k, l, p, q, s, t, w, z
w1,    T,  F,  F,  F, T, F, F, T, F, F, F, T, T, T, F, T, F
w2,    F,  T,  F,  F, T, T, T, F, F, T, T, T, F, F, T, T, T
...

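Normally I would use collections.Counter in Python, but the words in the dictionary and many of my lemmas have the same value (they are the same lemma). A different Python implementation, or even an awk implementation, would help.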

1 answer:

Answer 0 (score: 0)

sklearn.preprocessing.MultiLabelBinarizer can do this:

data = '''w1, a1, c, f, s, w, q , p 
w2, p, z, l, c, t, w, k, e, d, a2 
w3, z, s, b, t, l
w4, a3, l, h, k, s, e, b'''

# Assume the first comma-separated field of each row is the row label (index)
index = [row.split(',')[0].strip() for row in data.split('\n')]
# MultiLabelBinarizer expects a list of lists, so split each row into
# its remaining fields and strip the surrounding whitespace
data = [[token.strip() for token in row.split(',')[1:]] for row in data.split('\n')]

from sklearn.preprocessing import MultiLabelBinarizer

enc = MultiLabelBinarizer()
X = enc.fit_transform(data)

import pandas as pd
df = pd.DataFrame(data=X, columns=enc.classes_, index=index)
print(df)
    a1  a2  a3  b  c  d  e  f  h  k  l  p  q  s  t  w  z
w1   1   0   0  0  1  0  0  1  0  0  0  1  1  1  0  1  0
w2   0   1   0  0  1  1  1  0  0  1  1  1  0  0  1  1  1
w3   0   0   0  1  0  0  0  0  0  0  1  0  0  1  1  0  1
w4   0   0   1  1  0  0  1  0  1  1  1  0  0  1  0  0  0

If you want the vectors represented as booleans, just use:

df = pd.DataFrame(data=X.astype(bool), columns=enc.classes_, index=index)
print(df)
       a1     a2     a3      b      c      d      e      f      h      k      l      p      q      s      t      w      z
w1   True  False  False  False   True  False  False   True  False  False  False   True   True   True  False   True  False
w2  False   True  False  False   True   True   True  False  False   True   True   True  False  False   True   True   True
w3  False  False  False   True  False  False  False  False  False  False   True  False  False   True   True  False   True
w4  False  False   True   True  False  False   True  False   True   True   True  False  False   True  False  False  False
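
If you would rather avoid the sklearn and pandas dependencies, the same table can be built with the standard library alone. The following is only a minimal sketch, not a drop-in replacement: the names corpus, build_table and to_rows are hypothetical, and it assumes that a lemma appearing on more than one row should get the union of its tokens, which may or may not be what the question means by lemmas sharing a value. It prints T/F in the comma-separated layout asked for in the question.

```python
corpus = """w1, a1, c, f, s, w, q , p
w2, p, z, l, c, t, w, k, e, d, a2
w3, z, s, b, t, l
w4, a3, l, h, k, s, e, b"""


def build_table(text):
    """Return (sorted token columns, {lemma: set of tokens}) for comma-separated rows."""
    observed = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if not fields[0]:
            continue  # skip blank lines
        lemma, tokens = fields[0], fields[1:]
        # Assumption: rows that repeat a lemma are merged by taking the union of tokens.
        observed.setdefault(lemma, set()).update(tok for tok in tokens if tok)
    columns = sorted(set().union(*observed.values()))
    return columns, observed


def to_rows(columns, observed):
    """Yield a header row, then one T/F row per lemma in first-seen order."""
    yield ["lemma"] + columns
    for lemma, tokens in observed.items():
        yield [lemma] + ["T" if col in tokens else "F" for col in columns]


columns, observed = build_table(corpus)
rows = list(to_rows(columns, observed))
for row in rows:
    print(", ".join(row))
```

Keeping the tokens of each lemma in a set makes the membership test for every column cheap, and sorting the union of all sets gives the same column order as enc.classes_ above.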