I have a pandas DataFrame in which one column contains data nested in the following way:
First row:
result = [np.empty(0, dtype=int)] * 3000 # Empty array, so OK to use same reference
for i, j in enumerate(arr[locs]):
result[j] = coords[i]
return result
Second row:
[('QT', 0, 2, 'PERSON'), ('Billionaire Jack Ma', 102, 121, 'PERSON'), ('$14 million', 131, 142, 'MONEY'), ('U.S.', 204, 208, 'GPE'), ('33', 226, 228, 'MONEY')]
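I need to unnest each 4-element group into 4 separate columns, keeping the original row index as an identifier variable.
Desired output:
Is this possible?
Thanks in advance
Answer 0: (score: 0)
I can't speak to performance, but if I have understood your question correctly, this should work: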
# Repeat each index label once per tuple in that row, then pull each tuple field into its own column
result_df = pd.DataFrame(data={'org_id': [idx_val for idx_val in org_df.index for i in range(len(org_df.loc[idx_val, 'target_col']))],
'col_1': [single_tuple[0] for row_value in org_df['target_col'] for single_tuple in row_value],
'col_2': [single_tuple[1] for row_value in org_df['target_col'] for single_tuple in row_value],
'col_3': [single_tuple[2] for row_value in org_df['target_col'] for single_tuple in row_value],
'col_4': [single_tuple[3] for row_value in org_df['target_col'] for single_tuple in row_value]})
EDIT: a better-performing version that avoids repeating the comprehension:
data = {}
# For each index value, repeat n = len(row_list) times
data['org_id'] = [idx_val for idx_val in org_df.index for i in range(len(org_df.loc[idx_val, 'target_col']))]
# Extract each value of each tuple in a specific column
data['col_1'], data['col_2'], data['col_3'], data['col_4'] = zip(*[single_tuple for row_value in org_df['target_col'] for single_tuple in row_value])
result_df = pd.DataFrame(data=data)
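For comparison, here is a minimal sketch of the same reshaping done with Series.explode instead of nested comprehensions. This is a different technique from the one above, not a restatement of it. It assumes pandas 0.25 or later (where explode is available), and it assumes the names org_df and target_col from the snippets above. The first sample row is the one quoted in the question; the second row is made up purely for illustration.

```python
import pandas as pd

# Sample frame: the first row is the row quoted in the question,
# the second row is hypothetical and only here to show a second index label.
org_df = pd.DataFrame({
    'target_col': [
        [('QT', 0, 2, 'PERSON'), ('Billionaire Jack Ma', 102, 121, 'PERSON'),
         ('$14 million', 131, 142, 'MONEY'), ('U.S.', 204, 208, 'GPE'),
         ('33', 226, 228, 'MONEY')],
        [('Paris', 0, 5, 'GPE')],
    ]
})

# explode() yields one tuple per row and repeats the original index label
# for every tuple; rows whose list is empty become NaN and are dropped.
exploded = org_df['target_col'].explode().dropna()

# Split each 4-tuple into four columns, then attach the original index label.
result_df = pd.DataFrame(exploded.tolist(),
                         columns=['col_1', 'col_2', 'col_3', 'col_4'])
result_df.insert(0, 'org_id', exploded.index.to_numpy())

print(result_df)
```

Unlike the zip(*...) version, this sketch should also cope with a frame in which every list is empty, since building a DataFrame from an empty list with explicit column names is allowed.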