Question

I have a data frame df with 4211 rows and 1 column:

     bow
0   [(6,1),(8,3),(9,1),...]   
1   [(1,1),(3,1),(10,1),...]   
2   [(9,2),(12,3),(13,1),...]
...

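Each row represents a document, and the list in bow holds the word ids that occur in that document together with their occurrence counts, in bag-of-words format. For example, in the first document the word with id 6 occurs once and the word with id 8 occurs 3 times. There are 5000 words and 4211 documents in total. Now I want to convert this data frame into a large doc-word matrix of size 4211 * 5000, where m_ij = n means that word j occurs n times in document i. How can I do this quickly? Thanks in advance!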

Answer 1

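Converting to a numpy array should speed things up (though I have not tested it on data of your type and size).

I am assuming that a word id does not appear more than once within a single row.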

import numpy as np

# 1. allocate space for the output array (5000 = vocabulary size;
#    word ids are assumed to be valid column indices, i.e. 0..4999):
output_arr = np.zeros(shape=(len(df), 5000), dtype=int)
# 2. convert the DataFrame to an np.array (arr_df will have shape (len(df), 1)):
arr_df = np.array(df)
# 3. iterate over the documents:
for i in range(len(arr_df)):
    # arr_df[i] is an np.array holding one list, so arr_df[i][0] gets us to the tuples:
    idx, values = zip(*arr_df[i][0])
    # fancy indexing writes all counts of document i in one assignment:
    output_arr[i, list(idx)] = values
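
If the assumption above does not hold and a word id can repeat within a row, plain index assignment keeps only the last count. Below is a minimal sketch of one possible alternative, not tested on the real data: it flattens all (word id, count) pairs and accumulates them with np.add.at, which sums repeated positions. The function name bow_to_dense and the toy data are hypothetical, and word ids are assumed to be 0-based and smaller than n_words.

```python
import numpy as np

def bow_to_dense(bows, n_words):
    """Build a dense doc-word count matrix from a list of bag-of-words rows."""
    # Number of (word_id, count) pairs in each document.
    lengths = [len(doc) for doc in bows]
    # Row index for every pair: document i is repeated once per pair it holds.
    rows = np.repeat(np.arange(len(bows)), lengths)
    # Flatten all pairs into one (total_pairs, 2) integer array.
    pairs = np.array(
        [pair for doc in bows for pair in doc], dtype=np.int64
    ).reshape(-1, 2)
    out = np.zeros((len(bows), n_words), dtype=np.int64)
    # np.add.at accumulates in place, so repeated (row, word_id) pairs are summed.
    np.add.at(out, (rows, pairs[:, 0]), pairs[:, 1])
    return out

# Toy data (hypothetical): 3 documents, word ids 0..14.
toy_bows = [
    [(6, 1), (8, 3), (9, 1)],
    [(1, 1), (3, 1), (10, 1)],
    [(9, 2), (12, 3), (9, 1)],  # word id 9 is repeated on purpose
]
matrix = bow_to_dense(toy_bows, n_words=15)
```

With the data frame from the question, the call would presumably be bow_to_dense(df['bow'].tolist(), 5000). If the word ids actually run from 1 to 5000, subtract 1 from the ids before indexing.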

How can I quickly convert a data frame into a large matrix with python-pandas?
