Question

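I need to read data from a Postgres server and put it into an array/DataFrame. Each row has a source field and a destination field, and I need to add them cumulatively to an array. As I iterate over the DataFrame, if a row's source and destination are not already in the accounts column, they need to be added to it.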

This is what my code currently looks like (the Postgres part is omitted for brevity):


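import numpy as np
import pandas as pd

# Load the data (sql_command and conn come from the omitted Postgres setup)
data = pd.read_sql(sql_command, conn)

# Take a subset of the data until the algorithm is perfected.
np.random.seed(42)  # seed() returns None, so there is nothing to assign

n = data.shape[0]
ix = np.random.choice(n, 10000)
df_tmp = data.iloc[ix].copy()  # copy so that adding a column does not raise SettingWithCopyWarning

# Combine the source and destination into a list in another column
df_tmp['accounts'] = df_tmp.apply(lambda x: [x['source'], x['destination']], axis=1)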

# Attempt at cumulatively adding accounts to the column
for index, row in df_tmp.iterrows():
    if 'accounts' not in df_tmp:
        df_tmp['accounts'] = df_tmp.apply(lambda x: [x['accounts'], x['source'], x['destination']], axis=1)
    else:
        df_tmp['accounts'] = df_tmp['accounts']

This is what my data looks like:

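Questions:

Is this the right approach?
The last row will hold about 1 million accounts, which makes this very expensive. Is there a more efficient way to represent it?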

Answer 1

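You can use cumsum on the accounts column to build a cumulative concatenation of the account values. Then convert each cumulative list to a set, so that only the unique values are kept.

There is a similar question here: Cumulative Set in PANDAS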

df_tmp['accounts_acc'] = df_tmp['accounts'].cumsum().apply(set)
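
To make the one-liner above concrete, here is a minimal sketch on a toy frame. The four rows are made-up values standing in for the real Postgres data. The second half is not part of the answer: it is one possible compact representation, assuming that downstream code only needs to know when each account first appears and how many distinct accounts have been seen so far. The names `first_seen_row` and `n_accounts_seen` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sampled transactions (hypothetical values).
df = pd.DataFrame({
    "source":      ["a", "b", "a", "d"],
    "destination": ["b", "c", "c", "a"],
})

# Pair each row's source and destination in a list, as in the question.
df["accounts"] = df[["source", "destination"]].values.tolist()

# The answer's approach: cumsum concatenates the lists row by row,
# then set() keeps only the unique accounts seen so far.
df["accounts_acc"] = df["accounts"].cumsum().apply(set)

# Compact alternative (an assumption, not from the answer): instead of
# storing a growing set on every row, record the first row position at
# which each account appears.
flat = df[["source", "destination"]].to_numpy().ravel()  # row-major order
rows = np.repeat(np.arange(len(df)), 2)                  # row position of each value
seen = pd.Series(rows, index=flat)
first_seen_row = seen[~seen.index.duplicated(keep="first")]

# Count of distinct accounts seen up to and including each row.
new_per_row = np.bincount(first_seen_row.to_numpy(), minlength=len(df))
df["n_accounts_seen"] = np.cumsum(new_per_row)
```

The set of accounts known at row position `k` can then be recovered on demand with `set(first_seen_row[first_seen_row <= k].index)`, which avoids materialising a set of up to a million accounts on every row.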

Iterate through a DataFrame and add values if they do not already exist in that column
