Question

I have a DataFrame of the following form:

import pandas as pd

test = {0:['loc_a','loc_b','loc_c'],1:['new_list','new_list','change_list'],2:['abc','abc','abc'],3:['def','change_list','def'],4:['ghi','def','ghi'],5:['change_list','ghi','jkl'],6:['jkl','jkl','mno'],7:['mno','pqr','pqr']}
test = pd.DataFrame(test)

I need to reshape it into the following format:

test2 = {'location':['loc_a','loc_a','loc_a','loc_a','loc_a'],'list_type':['new_list','new_list','new_list','change_list','change_list'],'value':['abc','def','ghi','jkl','mno']}
test2 = pd.DataFrame(test2)

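I would really appreciate some help, as I am stuck here. Is it possible to convert the data into the desired DataFrame form by processing each row?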

Answer 1

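IIUC, this isn't really the kind of data you want to hold in a DataFrame. If you have control over how this data is collected and stored, I would try to do some preprocessing before storing it in a tabular format. The only solution I can think of is to iterate over the rows and store the new output so that it can be converted to a DataFrame afterwards.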

data = []

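for row in test.itertuples():
    # itertuples() puts the index at position 0, so the location is at position 1.
    loc = row[1]

    # Position of the first value after each marker (hence the + 1).
    # Assumes every row contains "change_list"; otherwise this raises ValueError.
    changelist_idx = row.index("change_list") + 1
    try:
        newlist_idx = row.index("new_list") + 1
    except ValueError:
        # No "new_list" marker in this row: make the new_list slice empty.
        newlist_idx = changelist_idx

    # Values between the two markers, then values after "change_list".
    newlist_values = row[newlist_idx:changelist_idx - 1]
    changelist_values = row[changelist_idx:]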
    
    for value in newlist_values:
        data.append([loc, "new_list", value])
    for value in changelist_values:
        data.append([loc, "change_list", value])
    
out = pd.DataFrame(data, columns=["location", "list_type", "value"])
print(out)

   location    list_type value
0     loc_a     new_list   abc
1     loc_a     new_list   def
2     loc_a     new_list   ghi
3     loc_a  change_list   jkl
4     loc_a  change_list   mno
5     loc_b     new_list   abc
6     loc_b  change_list   def
7     loc_b  change_list   ghi
8     loc_b  change_list   jkl
9     loc_b  change_list   pqr
10    loc_c  change_list   abc
11    loc_c  change_list   def
12    loc_c  change_list   ghi
13    loc_c  change_list   jkl
14    loc_c  change_list   mno
15    loc_c  change_list   pqr

Answer 2

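The data needs to be reshaped to get the desired format. First melt it, then do a groupby (with sort=False, so that the current order is preserved).

I suggest running each line on its own so that you understand each stage of the transformation; you may even come up with a better way, or drop steps that turn out to be unnecessary: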

(
    test.melt(0)
    .groupby(0, sort=False)["value"]
    .agg(",".join)
    .str.split(",")
    .explode()
    .rename_axis("location")
    .reset_index()
    .assign(list_type=lambda x: x.loc[x["value"].str.contains("list"), "value"])
    .ffill()
    .query("value!=list_type")
)



  location  value   list_type
1   loc_a   abc     new_list
2   loc_a   def     new_list
3   loc_a   ghi     new_list
5   loc_a   jkl     change_list
6   loc_a   mno     change_list
8   loc_b   abc     new_list
10  loc_b   def     change_list
11  loc_b   ghi     change_list
12  loc_b   jkl     change_list
13  loc_b   pqr     change_list
15  loc_c   abc     change_list
16  loc_c   def     change_list
17  loc_c   ghi     change_list
18  loc_c   jkl     change_list
19  loc_c   mno     change_list
20  loc_c   pqr     change_list
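
As a possible variant of the chain above that skips the join/split round trip, here is a minimal sketch. It is not the answerer's method: it assumes the first column holds one unique location per row, that the only markers are 'new_list' and 'change_list', and the names `markers`, `long`, `is_marker` and `out` are made up for illustration.

```python
import pandas as pd

test = pd.DataFrame({
    0: ["loc_a", "loc_b", "loc_c"],
    1: ["new_list", "new_list", "change_list"],
    2: ["abc", "abc", "abc"],
    3: ["def", "change_list", "def"],
    4: ["ghi", "def", "ghi"],
    5: ["change_list", "ghi", "jkl"],
    6: ["jkl", "jkl", "mno"],
    7: ["mno", "pqr", "pqr"],
})

# Assumption: these two strings are the only list markers in the data.
markers = ["new_list", "change_list"]

# Flatten everything except the location column row by row (C order),
# repeating each location once per flattened cell.
values = test.drop(columns=0).to_numpy()
long = pd.DataFrame({
    "location": test[0].repeat(values.shape[1]).to_numpy(),
    "value": values.ravel(),
})

# Keep only the marker cells, then carry each marker forward within its
# location so every value is labelled with the most recent marker.
is_marker = long["value"].isin(markers)
long["list_type"] = (
    long["value"].where(is_marker).groupby(long["location"]).ffill()
)

# Drop the marker rows themselves and renumber the index.
out = (
    long.loc[~is_marker, ["location", "list_type", "value"]]
    .reset_index(drop=True)
)
print(out)
```

Grouping the forward fill by location keeps a marker from leaking into the next row if a row happens to start with a plain value; such values end up with a missing list_type instead.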

Answer 3

Here is a simple answer using pandas' .iterrows(). I hope it is useful as a reference.

# Collect the items you want to stack in a list.
test['lst'] = test[range(1, 8)].values.tolist()

df_all = pd.DataFrame()
# Create the desired DataFrame for each location.
for index, row in test.iterrows():
    loc = row[0]
    df_by_loc = pd.DataFrame()
    for lst in row['lst']:
        df = pd.DataFrame({'location': [loc]})
        # If lst is a marker ('new_list' or 'change_list'), remember it as the current list_type.
        # Compare with ==, not `is`: identity checks on strings are unreliable.
        if lst == 'new_list':
            list_type = 'new_list'
        elif lst == 'change_list':
            list_type = 'change_list'
        else:
            # Otherwise lst is an actual value: fill in 'list_type' and 'value'.
            df['list_type'] = list_type
            df['value'] = lst

        # Build up the DataFrame for each location by concatenating.
        if lst not in ['new_list', 'change_list']:
            df_by_loc = pd.concat([df_by_loc, df], axis=0)
    
    df_all = pd.concat([df_all, df_by_loc], axis=0)

The result is as follows.

print(df_all)
  location    list_type value
0    loc_a     new_list   abc
0    loc_a     new_list   def
0    loc_a     new_list   ghi
0    loc_a  change_list   jkl
0    loc_a  change_list   mno
0    loc_b     new_list   abc
0    loc_b  change_list   def
0    loc_b  change_list   ghi
0    loc_b  change_list   jkl
0    loc_b  change_list   pqr
0    loc_c  change_list   abc
0    loc_c  change_list   def
0    loc_c  change_list   ghi
0    loc_c  change_list   jkl
0    loc_c  change_list   mno
0    loc_c  change_list   pqr
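
Every row in the output above carries index 0, because each one-row DataFrame is concatenated as-is. The same row-by-row idea can be sketched by collecting plain records and building the DataFrame once at the end, which also gives a clean 0..n-1 index. This is only an illustration under the assumption that 'new_list' and 'change_list' are the only markers; `markers`, `records` and `out` are hypothetical names.

```python
import pandas as pd

test = pd.DataFrame({
    0: ["loc_a", "loc_b", "loc_c"],
    1: ["new_list", "new_list", "change_list"],
    2: ["abc", "abc", "abc"],
    3: ["def", "change_list", "def"],
    4: ["ghi", "def", "ghi"],
    5: ["change_list", "ghi", "jkl"],
    6: ["jkl", "jkl", "mno"],
    7: ["mno", "pqr", "pqr"],
})

# Assumption: these two strings are the only list markers in the data.
markers = {"new_list", "change_list"}
records = []

for row in test.itertuples(index=False):
    # First cell is the location, the rest are markers and values.
    location, *items = row
    list_type = None  # no marker seen yet in this row
    for item in items:
        if item in markers:
            # A marker switches the list type for the values that follow.
            list_type = item
        else:
            records.append(
                {"location": location, "list_type": list_type, "value": item}
            )

# Build the DataFrame once instead of concatenating inside the loop.
out = pd.DataFrame(records, columns=["location", "list_type", "value"])
print(out)
```

Appending to a list and constructing the frame once avoids copying the growing DataFrame on every pd.concat call.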

Rearranging a pandas DataFrame

3 answers: