Nested list in a DataFrame column

Time: 2020-04-21 18:38:55

Tags: python pandas spacy

I have a pandas DataFrame in which one column contains data nested in the following way:

Row 1: result = [np.empty(0, dtype=int)] * 3000 # Empty array, so OK to use same reference for i, j in enumerate(arr[locs]): result[j] = coords[i] return result

Row 2: [('QT', 0, 2, 'PERSON'), ('Billionaire Jack Ma', 102, 121, 'PERSON'), ('$14 million', 131, 142, 'MONEY'), ('U.S.', 204, 208, 'GPE'), ('33', 226, 228, 'MONEY')]

I need to unnest each 4-element tuple into 4 separate columns, keeping the original row index as an identifier variable.

Desired output: [image not available]

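Is this possible?

Thanks in advance.

1 answer:

Answer 0: (score: 0)

I can't speak to performance, but if I have understood your question correctly, this should work: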

import pandas as pd

# org_df is the original DataFrame; 'target_col' holds the lists of 4-tuples
result_df = pd.DataFrame(data={'org_id': [idx_val for idx_val in org_df.index for i in range(len(org_df.loc[idx_val, 'target_col']))], 
                     'col_1': [single_tuple[0] for row_value in org_df['target_col'] for single_tuple in row_value], 
                     'col_2': [single_tuple[1] for row_value in org_df['target_col'] for single_tuple in row_value],
                     'col_3': [single_tuple[2] for row_value in org_df['target_col'] for single_tuple in row_value], 
                     'col_4': [single_tuple[3] for row_value in org_df['target_col'] for single_tuple in row_value]})

EDIT: a better-performing version that avoids repeating the comprehension:

data = {}
# For each index value, repeat n = len(row_list) times
data['org_id'] = [idx_val for idx_val in org_df.index for i in range(len(org_df.loc[idx_val, 'target_col']))]
# Extract each value of each tuple in a specific column
data['col_1'], data['col_2'], data['col_3'], data['col_4'] = zip(*[single_tuple for row_value in org_df['target_col'] for single_tuple in row_value])
result_df = pd.DataFrame(data=data)
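
For comparison, here is a minimal sketch of the same reshaping using Series.explode (available in pandas 0.25 and later), which is not part of the original answer. The sample frame is hypothetical: it reuses a few tuples from the question, and the names target_col, org_id and col_1 to col_4 simply follow the code above.

```python
import pandas as pd

# Hypothetical sample frame; 'target_col' holds a list of 4-tuples per row.
org_df = pd.DataFrame({
    'target_col': [
        [('QT', 0, 2, 'PERSON'), ('U.S.', 204, 208, 'GPE')],
        [('$14 million', 131, 142, 'MONEY')],
    ]
})

# explode() gives each tuple its own row and repeats the original index value.
# dropna() discards the NaN placeholder that explode() emits for an empty list.
exploded = org_df['target_col'].explode().dropna()

# Split each 4-tuple into four columns, keeping the repeated index aligned,
# then expose that index as the 'org_id' identifier column.
result_df = pd.DataFrame(
    exploded.tolist(),
    columns=['col_1', 'col_2', 'col_3', 'col_4'],
    index=exploded.index,
).rename_axis('org_id').reset_index()

print(result_df)
```

This assumes every tuple has exactly four elements. Rows whose list is empty are dropped from the result rather than kept as empty rows.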