Question

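I have a raw dataset in which the information is stored as a list of dicts inside a column (it is a MongoDB extract). Here is what the column looks like: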

[{u'domain_id': ObjectId('A'),  u'p': 1}, 
{u'domain_id': ObjectId('B'),  u'p': 2},
{u'domain_id': ObjectId('B'),  u'p': 3},
... 
{u'domain_id': ObjectId('CG'),  u'p': 101}]

I am only interested in the first 10 dicts ('p' values from 1 to 10). The output DataFrame should look like this:

index |  A  | ... |  B
------------------------
0     |  1  | ... | 2
1     | Nan | ... | Nan
2     | Nan | ... | 3

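That is: for each row of the original DataFrame, I create one column per domain_id and fill it with the corresponding 'p' value. The same domain_id can appear with several 'p' values; in that case I keep only the first one (the smallest 'p').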
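To make the per-row rule concrete, here is a minimal sketch of it in plain Python. The helper name `first_p_per_domain` and the string ids standing in for `ObjectId` values are assumptions made for illustration; they are not part of the original data.

```python
def first_p_per_domain(positions, limit=10):
    """Map each domain_id to its first 'p' among the first `limit` entries."""
    seen = {}
    for entry in positions[:limit]:  # slicing is safe when the list is shorter than `limit`
        if entry["domain_id"] not in seen:  # keep only the first (smallest) 'p'
            seen[entry["domain_id"]] = entry["p"]
    return seen


# Hypothetical row: 'B' appears twice, so only its first 'p' is kept.
row = [
    {"domain_id": "A", "p": 1},
    {"domain_id": "B", "p": 2},
    {"domain_id": "B", "p": 3},
]
result = first_p_per_domain(row)
```

Each key of the returned dict would become a column of the output row, and every other column would stay NaN.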

Here is my current code, which may be easier to understand:

import pandas as pd

first = True
for i in df.index:  # for each row of the original DataFrame
    temp_list = df["positions"][i]  # the column holding the list of dicts
    col_list = []
    data_list = []
    for j in range(10):  # look at the first 10 entries only
        try:
            if temp_list[j]["domain_id"] not in col_list:  # skip a domain_id already seen in this row
                col_list.append(temp_list[j]["domain_id"])
                data_list.append(temp_list[j]["p"])
        except IndexError as e:  # this row has fewer than 10 entries
            print(e)
    df_temp = pd.DataFrame([data_list], columns=col_list)  # one-row DataFrame for this row
    if first:
        df_kw = df_temp
        first = False
    else:
        # append this row to the output; df_kw ends up with as many rows as the original DataFrame
        df_kw = pd.concat([df_kw, df_temp], axis=0, ignore_index=True)

All of this works, but it is very slow, because I have 15k rows and end up with 10k columns.

I am sure (or at least I very much hope) that there is a simpler and faster solution: any suggestion would be highly appreciated.

Answer 1

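I found a decent solution: the slow part is the concatenation, so it is more efficient to create the DataFrame first and then update its values.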

Create the DataFrame:

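col_list = []  # every domain_id found in the first 10 entries of any row
for i in df.index:
    temp_list = df["positions"][i]
    for j in range(10):
        try:
            col_list.append(temp_list[j]["domain_id"])
        except IndexError as e:  # this row has fewer than 10 entries
            print(e)

# one column per distinct domain_id; list(...) because newer pandas versions do not accept a set for columns
df_total = pd.DataFrame(index=df.index, columns=list(set(col_list)))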

Update the values:

for i in df.index:
    temp_list = df["positions"][i]
    col_list = []  # domain_ids already written for this row
    for j in range(10):
        try:
            if temp_list[j]["domain_id"] not in col_list:  # avoid overwriting the first value
                df_total.loc[i, temp_list[j]["domain_id"]] = temp_list[j]["p"]
                col_list.append(temp_list[j]["domain_id"])
        except IndexError as e:  # this row has fewer than 10 entries
            print(e)

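On my machine, creating a 15k x 6k DataFrame takes about 6 seconds, and filling it takes 27 seconds. I killed the previous solution after it had been running for more than an hour, so this really is much faster.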
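A further variation, which is not what was timed above, is to skip the cell-by-cell `.loc` assignment: build one plain dict per row and hand the whole list to the `pd.DataFrame` constructor in a single call. The sketch below assumes toy data with string ids in place of `ObjectId` values; the names `records` and `df_wide` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the real data: one list of dicts per row.
df = pd.DataFrame({
    "positions": [
        [{"domain_id": "A", "p": 1}, {"domain_id": "B", "p": 2}],
        [{"domain_id": "C", "p": 1}],
        [{"domain_id": "B", "p": 1}, {"domain_id": "B", "p": 3}, {"domain_id": "A", "p": 4}],
    ]
})

records = []
for positions in df["positions"]:
    row = {}
    for entry in positions[:10]:  # first 10 entries only; no IndexError on short lists
        # setdefault keeps the first 'p' seen for a given domain_id
        row.setdefault(entry["domain_id"], entry["p"])
    records.append(row)

# A list of dicts becomes one row per dict; missing keys are filled with NaN.
df_wide = pd.DataFrame(records, index=df.index)
```

Whether this beats the two-pass version on a 15k x 6k frame would need to be measured on the real data; the point of the sketch is only that the column union and NaN filling can be left to the constructor.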
