Question

I have a dictionary of lists, for example:

dictionary_test = {'A': ['hello', 'byebye', 'howdy'], 'B': ['bonjour', 'hello', 'ciao'], 'C': ['ciao', 'hello', 'byebye']}

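I would like to convert it into a boolean affiliation matrix for further analysis, preferably with the dictionary keys as column names and the list items as row names: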

         A    B    C
  hello  1    1    1
 byebye  1    0    1
  howdy  1    0    0
bonjour  0    1    0
   ciao  0    1    1

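Is it possible to do this in Python (ideally in a way that lets me write the matrix to a .csv file)? I assume this is something I would do with numpy, right?

An additional problem is that the size of the dictionary is unknown (both the number of keys and the number of elements in each list vary).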

Answer 1

You can use pandas. Here is an example.

>>> import pandas as pd
>>> dictionary_test = {'A': ['hello', 'byebye', 'howdy'], 'B': ['bonjour', 'hello', 'ciao'], 'C': ['ciao', 'hello', 'byebye']}
>>> values = list(set([ x for y in dictionary_test.values() for x in y]))
>>> data = {}
>>> for key, items in dictionary_test.items():
...     data[key] = [value in items for value in values]
... 
>>> pd.DataFrame(data, index=values)
             A      B      C
ciao     False   True   True
howdy     True  False  False
bonjour  False   True  False
hello     True   True   True
byebye    True  False   True

If you want the rows in a particular order, just set values manually.
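As a rough sketch of what setting values manually might look like, the snippet below keeps the words in order of first appearance and writes the result as CSV, since the question asked for that. The names matrix, buffer and csv_text are my own, and the cast to 0/1 is an assumption about the desired output format.

```python
import io
import pandas as pd

dictionary_test = {
    'A': ['hello', 'byebye', 'howdy'],
    'B': ['bonjour', 'hello', 'ciao'],
    'C': ['ciao', 'hello', 'byebye'],
}

# Collect each word once, in order of first appearance
# (dict.fromkeys keeps insertion order and drops duplicates).
values = list(dict.fromkeys(x for y in dictionary_test.values() for x in y))

# One boolean membership list per key.
data = {key: [value in items for value in values]
        for key, items in dictionary_test.items()}

# Cast True/False to 1/0 (assumed to be the layout wanted in the question).
matrix = pd.DataFrame(data, index=values).astype(int)

# Write the CSV text to an in-memory buffer; passing a file path such as
# 'matrix.csv' (hypothetical name) to to_csv would write to disk instead.
buffer = io.StringIO()
matrix.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With this ordering the rows come out as hello, byebye, howdy, bonjour, ciao, which is the layout shown in the question.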

Answer 2

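This is similar to Xin's answer, but it instead iterates over each index entry (each word) and checks whether the given column of the original dictionary_test contains that word.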

import pandas as pd

dictionary_test = {'A': ['hello', 'byebye', 'howdy'], 'B': ['bonjour', 'hello', 'ciao'], 'C': ['ciao', 'hello', 'byebye']}

df = pd.DataFrame(dictionary_test)

# all possible words (these become the row index)
words = {word for col in df.columns for word in df[col]}

# create a new DataFrame with the words as the index
# (pass a list rather than a set so the index is built from a sequence)
d = pd.DataFrame(index=list(words))

# check whether a given column in your raw data contains a given index
# 1 if yes, 0 if no
for idx in d.index:
    for col in df.columns:
        d.loc[idx, col] = 1 if idx in set(df[col]) else 0

Result:

d
Out[6]: 
           A    B    C
hello    1.0  1.0  1.0
byebye   1.0  0.0  1.0
bonjour  0.0  1.0  0.0
howdy    1.0  0.0  0.0
ciao     0.0  1.0  1.0
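The values show up as 1.0 and 0.0, most likely because each column is created by assigning into a frame that starts with no columns, so pandas fills the not-yet-assigned cells with NaN and stores the column as floats. A minimal sketch, assuming integer 0/1 output is what you want, is to cast once the loop has finished. The name d_int is mine, and I loop over the dictionary directly here just to keep the snippet short.

```python
import pandas as pd

dictionary_test = {
    'A': ['hello', 'byebye', 'howdy'],
    'B': ['bonjour', 'hello', 'ciao'],
    'C': ['ciao', 'hello', 'byebye'],
}

# Unique words in order of first appearance.
words = list(dict.fromkeys(w for items in dictionary_test.values() for w in items))

# Same cell-by-cell fill as in the answer.
d = pd.DataFrame(index=words)
for idx in d.index:
    for col, items in dictionary_test.items():
        d.loc[idx, col] = 1 if idx in items else 0

# Every cell is filled by now, so the cast to int is safe.
d_int = d.astype(int)
print(d_int)
```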

Edit: In response to the ValueError: arrays must all be same length that you get when a key's value is an empty list, you can do the following:

# find how long the longest list is
longest_list_len = max(map(len, dictionary_test.values()))
dictionary_test = {key: value + [None] * (longest_list_len - len(value)) for key, value in dictionary_test.items()}

You are basically just padding out the difference in length between the lists in dictionary_test. Then change the line that assigns words to:

# Exclude the `None`s we added above to ensure equal length
# (pd.notna also skips NaN, in case pandas stores the padding that way)
words = {word for col in df.columns for word in df[col] if pd.notna(word)}

Then carry on with the rest of the code!
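Putting the padding step and the loop together, an end-to-end run might look roughly like the sketch below. The input with an empty list under 'D' is a made-up example, padded is my own name, and using pd.notna for the missing-value check plus dict.fromkeys for a stable row order are substitutions on my part rather than part of the original answer.

```python
import pandas as pd

# Hypothetical input: lists of different lengths, including an empty one.
dictionary_test = {
    'A': ['hello', 'byebye', 'howdy'],
    'B': ['bonjour', 'hello'],
    'C': ['ciao'],
    'D': [],
}

# Pad every list with None up to the length of the longest list.
longest_list_len = max(map(len, dictionary_test.values()))
padded = {key: value + [None] * (longest_list_len - len(value))
          for key, value in dictionary_test.items()}

df = pd.DataFrame(padded)

# Unique words in order of first appearance, skipping the padding.
words = list(dict.fromkeys(
    word for col in df.columns for word in df[col] if pd.notna(word)
))

# Fill the matrix cell by cell: 1 if the column contains the word, else 0.
d = pd.DataFrame(index=words)
for idx in d.index:
    for col in df.columns:
        d.loc[idx, col] = 1 if idx in set(df[col]) else 0

print(d)
```

The empty key still gets its own column (all zeros here), so the number of columns always matches the number of keys.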
