Here is an excerpt from my corpus:
w1, a1, c, f, s, w, q , p
w2, p, z, l, c, t, w, k, e, d, a2
w3, z, s, b, t, l
w4, a3, l, h, k, s, e, b
...
I am looking for the following output:
lemma, a1, a2, a3, b, c, d, e, f, h, k, l, p, q, s, t, w, z
w1, T, F, F, F, T, F, F, T, F, F, F, T, T, T, F, T, F
w2, F, T, F, F, T, T, T, F, F, T, T, T, F, F, T, T, T
...
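Normally I would use collections.Counter in Python, but the words in the dictionary and many of my lemmas have the same value (they are the same lemma). A different Python implementation, or even an awk implementation, would be helpful.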
Answer 0 (score: 0)
Using sklearn.preprocessing.MultiLabelBinarizer could work:
data = '''w1, a1, c, f, s, w, q , p
w2, p, z, l, c, t, w, k, e, d, a2
w3, z, s, b, t, l
w4, a3, l, h, k, s, e, b'''
# Assume the row label ("index") is the first element of each row; split on commas
index = [row.split(',')[0].strip() for row in data.split('\n')]
# MultiLabelBinarizer expects a list-of-lists representation, so split the rest
# of each row into labels and strip the surrounding whitespace
data = [[label.strip() for label in row.split(',')[1:]] for row in data.split('\n')]
from sklearn.preprocessing import MultiLabelBinarizer
enc = MultiLabelBinarizer()
X = enc.fit_transform(data)
import pandas as pd
df = pd.DataFrame(data=X, columns=enc.classes_, index=index)
print(df)
    a1  a2  a3  b  c  d  e  f  h  k  l  p  q  s  t  w  z
w1   1   0   0  0  1  0  0  1  0  0  0  1  1  1  0  1  0
w2   0   1   0  0  1  1  1  0  0  1  1  1  0  0  1  1  1
w3   0   0   0  1  0  0  0  0  0  0  1  0  0  1  1  0  1
w4   0   0   1  1  0  0  1  0  1  1  1  0  0  1  0  0  0
If you want the vectors represented as booleans, just use:
df = pd.DataFrame(data=X.astype(bool), columns=enc.classes_, index=index)
print(df)
       a1     a2     a3      b      c      d      e      f      h      k      l      p      q      s      t      w      z
w1   True  False  False  False   True  False  False   True  False  False  False   True   True   True  False   True  False
w2  False   True  False  False   True   True   True  False  False   True   True   True  False  False   True   True   True
w3  False  False  False   True  False  False  False  False  False  False   True  False  False   True   True  False   True
w4  False  False   True   True  False  False   True  False   True   True   True  False  False   True  False  False  False
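If you would rather not depend on sklearn or pandas, the same table can be built with the standard library alone. What follows is only a minimal sketch, not a drop-in replacement. The helper names build_presence_table and format_table are made up for illustration. It also assumes that a lemma appearing on several lines should get the union of its labels, which is one possible reading of the "same lemma" remark in the question. The output uses the T/F layout the question asks for.

```python
corpus = """w1, a1, c, f, s, w, q , p
w2, p, z, l, c, t, w, k, e, d, a2
w3, z, s, b, t, l
w4, a3, l, h, k, s, e, b"""


def build_presence_table(lines):
    """Return (sorted labels, {lemma: set of labels}) from comma-separated lines."""
    labels_by_lemma = {}
    for line in lines:
        fields = [field.strip() for field in line.split(",")]
        if not fields[0]:
            continue  # skip blank lines
        lemma, labels = fields[0], fields[1:]
        # Assumption: a lemma repeated on several lines gets the union of its labels.
        labels_by_lemma.setdefault(lemma, set()).update(
            label for label in labels if label
        )
    # Column order: every distinct label, sorted alphabetically.
    all_labels = sorted(set().union(*labels_by_lemma.values()))
    return all_labels, labels_by_lemma


def format_table(all_labels, labels_by_lemma):
    """Render the header plus one T/F row per lemma, in first-seen order."""
    rows = [", ".join(["lemma"] + all_labels)]
    for lemma, labels in labels_by_lemma.items():
        flags = ["T" if label in labels else "F" for label in all_labels]
        rows.append(", ".join([lemma] + flags))
    return "\n".join(rows)


all_labels, labels_by_lemma = build_presence_table(corpus.splitlines())
table = format_table(all_labels, labels_by_lemma)
print(table)
```

For a real corpus you would pass the open file object instead of corpus.splitlines(), since the function only needs an iterable of lines. Sets keep the membership test cheap, and sorting the union of all labels gives the same column order as enc.classes_ above.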