Question

So, I have a dictionary:

d = {'col1': [1,2,3,4,5], 'colnames':['a','b','d','e']}
list_of_ids = [1,2]

I am trying to create a DataFrame like this:

id, col1, colnames
1,  1,    a
1,  2,    b
1,  3,    c

...
2,  1,    a
2,  1,    b

... and so on

So basically, for each element in the list, generate every possible combination of the column entries.

How can I do this with pandas?

Answer 1

If I understand correctly, you can simply use itertools.product.

from itertools import product

import pandas as pd

# Cartesian product of the ids and both dictionary lists, one tuple per row
df = pd.DataFrame(list(product(list_of_ids, d['col1'], d['colnames'])),
                  columns=['id', 'col1', 'colnames'])

#     id col1  colnames
# 0    1    1         a
# 1    1    1         b
# 2    1    1         d
# 3    1    1         e
# 4    1    2         a
# ...

For your current input size, this approach seems reasonable enough. However, if you plan to do this on larger datasets, you will want a NumPy solution like piRSquared's.
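
A pandas-only variant of the same idea, offered as a sketch rather than as part of the original answer: pd.MultiIndex.from_product builds the Cartesian product of the input lists, and to_frame(index=False) turns it into ordinary columns. The data is taken from the question.

```python
import pandas as pd

d = {'col1': [1, 2, 3, 4, 5], 'colnames': ['a', 'b', 'd', 'e']}
list_of_ids = [1, 2]

# Build every (id, col1, colnames) combination as a MultiIndex ...
index = pd.MultiIndex.from_product(
    [list_of_ids, d['col1'], d['colnames']],
    names=['id', 'col1', 'colnames'],
)

# ... then convert the index levels into regular DataFrame columns
df = index.to_frame(index=False)
```

The row order is the same as with itertools.product: the last list varies fastest.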

Answer 2

Using numpy.repeat

import numpy as np
import pandas as pd

# the data
d = {'col1': np.arange(1, 6), 'colnames': list('abde'), 'id': [1, 2]}

# calculate length of each sub-list
lengths = {k: len(v) for k, v in d.items()}

# calculate product of all lengths...
# ... then the product of all but current.
# this provides the value we must repeat by.
# (np.prod is used because np.product was removed in NumPy 2.0)
p = np.prod(list(lengths.values()))
p_ = {k: p // v for k, v in lengths.items()}

# perform the repeat within a dictionary comprehension
# and pass to the dataframe constructor
pd.DataFrame({k: np.repeat(v, p_[k]) for k, v in d.items()})

    col1 colnames  id
0      1        a   1
1      1        a   1
2      1        a   1
3      1        a   1
4      1        a   1
5      1        a   1
6      1        a   1
7      1        a   1
8      2        a   1
9      2        a   1
10     2        b   1
11     2        b   1
...
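
The output above contains identical rows (rows 0 to 7 are all 1, a, 1), because np.repeat on its own stretches each column independently. If the goal is the full Cartesian product asked for in the question, one possible variation, which is not part of the original answer, is to combine np.repeat with np.tile. A minimal sketch, where the names total, block, tiles and columns are illustrative:

```python
import numpy as np
import pandas as pd

d = {'id': [1, 2], 'col1': np.arange(1, 6), 'colnames': list('abde')}

lengths = [len(v) for v in d.values()]
total = int(np.prod(lengths))  # number of rows in the full product

columns = {}
block = total
for (key, values), n in zip(d.items(), lengths):
    block //= n                    # consecutive repeats of each value
    tiles = total // (block * n)   # how often the repeated pattern cycles
    columns[key] = np.tile(np.repeat(values, block), tiles)

df = pd.DataFrame(columns)
```

The first key varies slowest and the last key varies fastest, so the rows come out in the same order as itertools.product over the same lists.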

Timings on the given data

Timings on larger data
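
Timings depend on the machine and library versions, so no figures are given here. A rough sketch of how the approaches from Answers 1 and 2 could be compared with the standard-library timeit module; the function names and number=100 are arbitrary choices for illustration:

```python
import timeit
from itertools import product

import numpy as np
import pandas as pd

list_of_ids = [1, 2]
col1 = [1, 2, 3, 4, 5]
colnames = ['a', 'b', 'd', 'e']

def with_product():
    # Answer 1: itertools.product over the three lists
    rows = list(product(list_of_ids, col1, colnames))
    return pd.DataFrame(rows, columns=['id', 'col1', 'colnames'])

def with_repeat():
    # Answer 2 as written: np.repeat each column up to the total length
    data = {'col1': np.array(col1), 'colnames': colnames, 'id': list_of_ids}
    total = int(np.prod([len(v) for v in data.values()]))
    return pd.DataFrame(
        {k: np.repeat(v, total // len(v)) for k, v in data.items()}
    )

# Total seconds for 100 calls of each function
t_product = timeit.timeit(with_product, number=100)
t_repeat = timeit.timeit(with_repeat, number=100)
print(f"itertools.product: {t_product:.4f}s  np.repeat: {t_repeat:.4f}s")
```

To see how the gap changes with size, the three input lists can be replaced with longer ones before timing.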

Answer 3

A shorter approach, but

import pandas as pd

out = []
for x in range(1, 3):          # ids 1 and 2
    for y in range(1, 6):      # col1 values 1 to 5
        for z in 'abde':       # colnames
            out.append([x, y, z])
df = pd.DataFrame(out)

Substitute your own list and dictionary lookups where appropriate and you should be fine.
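
With the list and dictionary from the question substituted in, the nested loops might look like the following sketch. The list comprehension and the explicit column names are additions for illustration, not part of the original answer.

```python
import pandas as pd

d = {'col1': [1, 2, 3, 4, 5], 'colnames': ['a', 'b', 'd', 'e']}
list_of_ids = [1, 2]

# One row per (id, col1 value, colnames value) combination
rows = [
    [i, value, name]
    for i in list_of_ids
    for value in d['col1']
    for name in d['colnames']
]
df = pd.DataFrame(rows, columns=['id', 'col1', 'colnames'])
```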

How to create a new DataFrame from a dictionary and a list in pandas

3 answers: