How do I flatten individual pandas DataFrames and stack them to achieve a new one?

Time: 2017-08-19 18:42:58

Tags: python pandas numpy scikit-learn

I have a function that takes in the data for a particular year and returns a DataFrame.

For example:

DF

year    fruit    license     grade
1946    apple       XYZ        1
1946    orange      XYZ        1
1946    apple       PQR        3
1946    orange      PQR        1
1946    grape       XYZ        2
1946    grape       PQR        1
..
2014    grape       LMN        1

Note: 1) A particular license value occurs only in one particular year, and only once per fruit (e.g., XYZ occurs only in 1946, once each for apple, orange, and grape). 2) The grade values are categorical.

I realize the following function is not a very efficient way to achieve its intended goal, but it is what I am currently using.

from itertools import combinations
import numpy as np
import pandas as pd

def func(df, year):
    #1. Filter out only the data for the year needed

    df_year=df[df['year']==year]
    '''
    2. Transform DataFrame to the form:
              XYZ    PQR    ..     LMN
    apple      1      3             1
    orange     1      1             3
    grape      2      1             1
    Note that 'LMN' is just used for representation purposes. 
    It won't logically appear here because it can only appear for the year 2014.
    '''
    df_year = df_year.pivot(index='fruit',columns='license',values='grade')    

    #3. Remove all licenses (columns) that have ANY NaN values
    df_year=df_year.dropna(axis=1, how="any")

    #4. Some additional filtering

    #5. Function to calculate similarity between fruits
    def similarity_score(fruit1, fruit2):
        agreements=np.sum(  ( (fruit1 == 1) & (fruit2 == 1) ) | \
        (  (fruit1 == 3) & (fruit2 == 3) ))

        disagreements=np.sum(  ( (fruit1 == 1) & (fruit2 == 3) ) |\
        (  (fruit1 == 3) & (fruit2 == 1) ))

        return (((agreements - disagreements) / float(len(fruit1))) + 1) / 2

    #6. Create Network dataframe
    network_df=pd.DataFrame(columns=['Source','Target','Weight'])

    for i, c in enumerate(combinations(df_year.index, 2)):
        c1 = df_year.loc[c[0]].values
        c2 = df_year.loc[c[1]].values
        network_df.loc[i] = [c[0], c[1], similarity_score(c1, c2)]

    return network_df

Running the above:

df_1946=func(df,1946)
df_1946.head()

Source    Target    Weight
Apple     Orange     0.6
Apple     Grape      0.3
Orange    Grape      0.7

I want to flatten the above into a single row:

       (Apple,Orange)  (Apple,Grape)  (Orange,Grape)  
1946        0.6             0.3            0.7

Note that there won't actually be just 3 columns as shown above; in reality there will be around 5,000.

Ultimately, I want to stack the transformed DataFrame rows to get something like:

df_all_years

       (Apple,Orange)  (Apple,Grape)  (Orange,Grape)  
1946        0.6             0.3            0.7
1947        0.7             0.25           0.8
..
2015        0.75            0.3            0.65

What is the best way to do this?
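For concreteness, the transformation I'm after can be sketched on the small example above (a minimal illustration with a hand-built `df_1946`; in my real data there would be one such row per year and thousands of pair columns):

```python
import pandas as pd

# Turn one year's Source/Target/Weight frame into a single row labelled by
# the year, then stack the per-year rows with pd.concat.
def flatten_year(network_df, year):
    s = network_df.set_index(['Source', 'Target'])['Weight']
    s.index = ['({},{})'.format(a, b) for a, b in s.index]  # "(Apple,Orange)"-style labels
    return s.to_frame(year).T

df_1946 = pd.DataFrame({'Source': ['Apple', 'Apple', 'Orange'],
                        'Target': ['Orange', 'Grape', 'Grape'],
                        'Weight': [0.6, 0.3, 0.7]})
df_all_years = pd.concat([flatten_year(df_1946, 1946)])
```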

2 Answers:

Answer 0: (score: 2)

I would rearrange the computation a bit. Instead of looping over the years:

for year in range(1946, 2015):
    partial_result = func(df, year)

and then concatenating the partial results, you can get better performance by doing as much of the work as possible on the entire DataFrame df before calling df.groupby(...). Also, if you can express the computation in terms of built-in aggregators such as sum and count, it can be done faster than with a custom function passed to groupby/apply.
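That point can be seen on a toy example (not part of the code below): the built-in aggregators run in compiled code, while apply calls back into Python once per group, yet both give the same result here.

```python
import pandas as pd

g = pd.DataFrame({'year': [1946, 1946, 1947],
                  'x': [1.0, 3.0, 2.0]}).groupby('year')['x']

fast = g.sum() / g.count()                      # built-in aggregators
slow = g.apply(lambda s: s.sum() / s.count())   # Python-level callback per group
```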

import itertools as IT
import numpy as np
import pandas as pd
np.random.seed(2017)

def make_df():
    N = 10000
    df = pd.DataFrame({'fruit': np.random.choice(['Apple', 'Orange', 'Grape'], size=N),
                       'grade': np.random.choice([1,2,3], p=[0.7,0.1,0.2], size=N),
                       'year': np.random.choice(range(1946,1950), size=N)})
    df['manufacturer'] = (df['year'].astype(str) + '-' 
                          + df.groupby(['year', 'fruit'])['fruit'].cumcount().astype(str))
    df = df.sort_values(by=['year'])
    return df

def similarity_score(df):
    """
    Compute the score between each pair of columns in df
    """
    agreements = {}
    disagreements = {}
    for col in IT.combinations(df,2):
        fruit1 = df[col[0]].values
        fruit2 = df[col[1]].values
        agreements[col] = ( ( (fruit1 == 1) & (fruit2 == 1) )
                            | ( (fruit1 == 3) & (fruit2 == 3) ))
        disagreements[col] = ( ( (fruit1 == 1) & (fruit2 == 3) ) 
                               | ( (fruit1 == 3) & (fruit2 == 1) ))
    agreements = pd.DataFrame(agreements, index=df.index)
    disagreements = pd.DataFrame(disagreements, index=df.index)
    numerator = agreements.astype(int)-disagreements.astype(int)
    grouped = numerator.groupby(level='year')
    total = grouped.sum()
    count = grouped.count()
    score = ((total/count) + 1)/2
    return score

df = make_df()
df2 = df.set_index(['year','fruit','manufacturer'])['grade'].unstack(['fruit'])
df2 = df2.dropna(axis=0, how="any")

print(similarity_score(df2))

which yields

         Grape    Orange          
         Apple     Apple     Grape
year                              
1946  0.629111  0.650426  0.641900
1947  0.644388  0.639344  0.633039
1948  0.613117  0.630566  0.616727
1949  0.634176  0.635379  0.637786
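The score frame above already has one row per year and one column per fruit pair; if you want the "(Grape,Apple)"-style flat labels from the question, the two column levels can be joined (a small sketch on a hand-built stand-in for the score frame):

```python
import pandas as pd

# Stand-in for the MultiIndex-columned score frame produced above
score = pd.DataFrame([[0.629, 0.650, 0.642]],
                     index=pd.Index([1946], name='year'),
                     columns=pd.MultiIndex.from_tuples(
                         [('Grape', 'Apple'), ('Orange', 'Apple'), ('Orange', 'Grape')]))

# Collapse the two column levels into single "(a,b)" string labels
score.columns = ['({},{})'.format(a, b) for a, b in score.columns]
```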

Answer 1: (score: 1)

Here's one way of doing a pandas routine to pivot the table in the manner you cited; it handles about 5,000 columns (combined from the two initially separate categories) quickly enough, with the bottleneck step taking about 20 seconds on my quad-core MacBook, though for larger scaling there are certainly faster strategies. The data in this example are quite sparse (5K columns with 5K random samples over the 70 year rows, 1947-2016), so execution times may run a few seconds longer with a fuller DataFrame.

from itertools import chain
import pandas as pd
import numpy as np
import random  # using python3 .choices()
import re

# Make bivariate data w/ 5000 total combinations (1000x5 categories)
# Also choose 5,000 randomly; some combinations may have >1 values or NaN
random_sample_data = np.array(
    [random.choices(['Apple', 'Orange', 'Lemon', 'Lime'] +
                    ['of Fruit' + str(i) for i in range(1000)],
                    k=5000),
     random.choices(['Grapes', 'Are Purple', 'And Make Wine',
                     'From the Yeast', 'That Love Sugar'],
                    k=5000),
     [random.random() for _ in range(5000)]]
).T
df = pd.DataFrame(random_sample_data, columns=[
                  "Source", "Target", "Weight"])
df['Year'] = random.choices(range(1947, 2017), k=df.shape[0])

# Three views of resulting df in jupyter notebook:
df
df[df.Year == 1947]
df.groupby(["Source", "Target"]).count().unstack()


To flatten the data grouped by year, since groupby needs a function to apply, you can use a temporary intermediary df to:

  1. Push all of data.groupby("Year") into separate rows, but with a separate DataFrame per row for the two columns "Target" + "Source" (to be unpacked later) plus "Weight".
  2. Use zip and pd.core.reshape.util.cartesian_product to create an empty, correctly shaped pivot df, which will become the final table produced from temp_df.
  3. E.g.,

    df_temp = df.groupby("Year").apply(
        lambda s: pd.DataFrame([(s.Target, s.Source, s.Weight)],
                               columns=["Target", "Source", "Weight"])
    ).sort_index()
    df_temp.index = df_temp.index.droplevel(1)  # reduce MultiIndex to 1-d
    
    # Predetermine all possible pairwise column category combinations
    product_ts = [*zip(*(pd.core.reshape.util.cartesian_product(
        [df.Target.unique(), df.Source.unique()])
    ))]
    
    ts_combinations = [str(x + ' ' + y) for (x, y) in product_ts]
    
    ts_combinations
    

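As an aside, pd.core.reshape.util.cartesian_product is a private pandas helper and may move between versions; the same labels can be built with the standard library alone (a sketch on small hand-made category lists):

```python
from itertools import product

targets = ['Grapes', 'Are Purple']
sources = ['Apple', 'Orange']

# Same "Target Source" label layout, without reaching into pandas internals
ts_combinations = [t + ' ' + s for t, s in product(targets, sources)]
```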

    Finally, fill the pivot with a simple nested for loop (again, not the fastest, though pd.DataFrame.iterrows may help speed things up, as shown). Because the random sampling was done with replacement, I had to handle multiple values per cell, so you may want to remove the conditionals under the second for loop. This is the step that compresses each year of the three separate DataFrames into its corresponding single row, filling every cell through the pivoted ("Weight") x ("Target" + "Source") relationship.

    df_pivot = pd.DataFrame(np.zeros((70, 5000)),
                            columns=ts_combinations)
    df_pivot.index = df_temp.index
    
    for year, values in df_temp.iterrows():
    
        for (target, source, weight) in zip(*values):
    
            bivar_pair = str(target + ' ' + source)
            curr_weight = df_pivot.loc[year, bivar_pair]
    
            if curr_weight == 0.0:
                df_pivot.loc[year, bivar_pair] = [weight] 
            # append additional values if encountered 
            elif type(curr_weight) == list:
                df_pivot.loc[year, bivar_pair] = str(curr_weight +
                                                     [weight])
    


    # Spotcheck:
    # Verifies matching data in pivoted table vs. original for Target+Source
    # combination "And Make Wine of Fruit614" across all 70 years 1947-2016
    df
    df_pivot['And Make Wine of Fruit614']
    df[(df.Year == 1947) & (df.Target == 'And Make Wine') & (df.Source == 'of Fruit614')]
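If duplicate (Year, Target, Source) combinations can be aggregated rather than collected into lists, the nested loops above can be replaced by a single pivot_table call (a sketch on hand-made data; taking the mean of duplicates is an assumption, swap in whatever aggregation fits):

```python
import pandas as pd

df = pd.DataFrame({'Source': ['Apple', 'Apple', 'Orange'],
                   'Target': ['Orange', 'Grape', 'Grape'],
                   'Weight': [0.6, 0.3, 0.7],
                   'Year':   [1946, 1946, 1946]})

# One row per year, one column per "Target Source" pair, NaN where no sample
df_pivot = df.pivot_table(index='Year',
                          columns=df['Target'] + ' ' + df['Source'],
                          values='Weight', aggfunc='mean')
```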