Pandas: sum values if column values coincide

Date: 2016-07-22 09:42:17

Tags: python pandas

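I'm trying to do a simple operation in Python:

I have several datasets, say 6, and I want to sum the values of one column wherever the values of two other columns coincide. After that, I want to divide the summed column by the number of datasets I have, in this case 6 (i.e. compute the arithmetic mean). If the values of the other columns do not coincide in some dataset, I want to add 0 for it.

Here are two data frames as an example: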

   Code1  Code2  Distance
0   15.0   15.0         2
1   15.0   60.0         3
2   15.0   69.0         2
3   15.0  434.0         1
4   15.0  842.0         0

   Code1  Code2  Distance
0   14.0   15.0         4
1   14.0   60.0         7
2   15.0   15.0         0
3   15.0   60.0         1
4   15.0   69.0         9

The first column is the df.index. I want to sum the "Distance" column only where the "Code1" and "Code2" columns coincide. In this case, the desired output would be:

   Code1  Code2  Distance
0   14.0   15.0       2
1   14.0   60.0       3.5
2   15.0   15.0       1
3   15.0   60.0       2
4   15.0   69.0       5.5
5   15.0  434.0       0.5
6   15.0  842.0       0

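I tried to do this with conditionals, but it gets really hard with more than two data frames. Is there a faster way to do this in pandas?

Any help would be appreciated :-)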

1 Answer:

Answer 0: (score: 1)

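You can put all your data frames in a list and then use reduce to either append or merge them all. Take a look at the documentation for reduce (in Python 3 it lives in functools).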
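
As a quick illustration of what reduce does, here is a minimal sketch (the names total, frames and stacked are made up for this example). It folds a list from the left, first on plain integers and then on three tiny data frames:

```python
from functools import reduce

import pandas

# reduce(f, [a, b, c, d]) evaluates f(f(f(a, b), c), d)
total = reduce(lambda x, y: x + y, [1, 2, 3, 4])
print(total)

# The same folding works on data frames: stack three one-row frames row-wise
frames = [pandas.DataFrame({'x': [k]}) for k in range(3)]
stacked = reduce(lambda a, b: pandas.concat([a, b], ignore_index=True), frames)
print(stacked['x'].tolist())
```

For plain stacking, passing the whole list to pandas.concat once would likely be cheaper than folding pairwise, since each pairwise step copies the accumulated frame.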

First, some functions that generate sample data are defined below.

from functools import reduce  # reduce is not a builtin in Python 3

import pandas
import numpy as np

# GENERATE DATA
# Code 1 between 13 and 15
def generate_code_1(n):
    return np.floor(np.random.rand(n,1) * 3 + 13)

# Code 2 between 1 and 1000
def generate_code_2(n):
    return np.floor(np.random.rand(n,1) * 1000) + 1

# Distance between 0 and 9
def generate_distance(n):
    return np.floor(np.random.rand(n,1) * 10)

# Generate a data frame as hstack of 3 arrays
def generate_data_frame(n):
    data = np.hstack([
         generate_code_1(n)
        ,generate_code_2(n)
        ,generate_distance(n)
    ])
    df = pandas.DataFrame(data=data, columns=['Code 1', 'Code 2', 'Distance'])
    # Remove possible duplications of Code 1 and Code 2. Take smallest distance in case of duplications.
    # Duplications will break merge method however will not break append method
    df = df.groupby(['Code 1', 'Code 2'], as_index=False)
    df = df.aggregate('min')
    return df

# Generate n data frames each with m rows in a list
def generate_data_frames(n, m, with_count=False):
    df_list = []
    for k in range(0, n):
        df = generate_data_frame(m)
        # Add count column, needed for merge method to keep track of how many cases we have seen
        if with_count:
            df['Count'] = 1
        df_list.append(df)
    return df_list

Append method (faster, shorter, better)

df_list = generate_data_frames(94, 5)

# Append all data frames together using reduce
# (DataFrame.append was removed in pandas 2.0, so pandas.concat is used for each step)
df_append = reduce(lambda df_1, df_2 : pandas.concat([df_1, df_2]), df_list)

# Aggregate by Code 1 and Code 2
df_append_grouped = df_append.groupby(['Code 1', 'Code 2'], as_index=False)
df_append_result = df_append_grouped.aggregate('mean')
df_append_result
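
Note that the mean above is taken only over the frames in which a given (Code 1, Code 2) pair appears. If the pair should count as 0 in the frames where it is missing, as in the question's desired output, one option is to sum per pair and divide by the number of frames. A minimal sketch on the two sample frames from the question (the names df_a, df_b, frames, stacked and result are made up for this example):

```python
import pandas

# The two sample frames from the question
df_a = pandas.DataFrame({
    'Code1': [15.0, 15.0, 15.0, 15.0, 15.0],
    'Code2': [15.0, 60.0, 69.0, 434.0, 842.0],
    'Distance': [2, 3, 2, 1, 0],
})
df_b = pandas.DataFrame({
    'Code1': [14.0, 14.0, 15.0, 15.0, 15.0],
    'Code2': [15.0, 60.0, 15.0, 60.0, 69.0],
    'Distance': [4, 7, 0, 1, 9],
})
frames = [df_a, df_b]

# Stack all frames and sum Distance per (Code1, Code2) pair
stacked = pandas.concat(frames, ignore_index=True)
result = stacked.groupby(['Code1', 'Code2'], as_index=False)['Distance'].sum()

# Divide by the total number of frames, so a missing pair contributes 0
result['Distance'] = result['Distance'] / len(frames)
print(result)
```

On this data the Distance column comes out as 2, 3.5, 1, 2, 5.5, 0.5, 0, which matches the output asked for in the question.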

Merge method

df_list = generate_data_frames(94, 5, with_count=True)

# Function to be passed to reduce. Merge 2 data frames and update Distance and Count
def merge_dfs(df_1, df_2):
    df = pandas.merge(df_1, df_2, on=['Code 1', 'Code 2'], how='outer', suffixes=('', '_y'))
    df = df.fillna(0)
    df['Distance'] = df['Distance'] + df['Distance_y']
    df['Count'] = df['Count'] + df['Count_y']
    del df['Distance_y']
    del df['Count_y']
    return df

# Use reduce to apply merge over the list of data frames
df_merge_result = reduce(merge_dfs, df_list)

# Replace distance with its mean and drop Count
df_merge_result['Distance'] = df_merge_result['Distance'] / df_merge_result['Count']
del df_merge_result['Count']
df_merge_result
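
To see what a single merge step does, here is a minimal sketch on two tiny hand-made frames (left, right and merged are made-up names; the values are illustrative only). It repeats the same steps as merge_dfs: the outer merge keeps pairs found in either frame, fillna(0) turns the missing side into 0, and Count records in how many frames each pair was seen.

```python
import pandas

left = pandas.DataFrame({
    'Code 1': [15.0, 15.0],
    'Code 2': [15.0, 60.0],
    'Distance': [2.0, 3.0],
    'Count': [1, 1],
})
right = pandas.DataFrame({
    'Code 1': [14.0, 15.0],
    'Code 2': [15.0, 15.0],
    'Distance': [4.0, 0.0],
    'Count': [1, 1],
})

# Outer merge: pairs missing on one side get NaN in that side's columns
merged = pandas.merge(left, right, on=['Code 1', 'Code 2'], how='outer', suffixes=('', '_y'))
merged = merged.fillna(0)

# Accumulate the running sum and the number of frames containing each pair
merged['Distance'] = merged['Distance'] + merged['Distance_y']
merged['Count'] = merged['Count'] + merged['Count_y']
merged = merged.drop(columns=['Distance_y', 'Count_y'])

# Sort explicitly so the row order does not depend on the pandas version
merged = merged.sort_values(['Code 1', 'Code 2']).reset_index(drop=True)
print(merged)
```

Only the pair (15.0, 15.0) appears in both frames, so it is the only row whose Count ends up at 2.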