I have a function that takes the data for a particular year and returns a dataframe.
For example:
    DF
    year  fruit   license  grade
    1946  apple   XYZ      1
    1946  orange  XYZ      1
    1946  apple   PQR      3
    1946  orange  PQR      1
    1946  grape   XYZ      2
    1946  grape   PQR      1
    ..
    2014  grape   LMN      1
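Note: 1) A given license value exists only for one particular year, and appears only once per fruit (e.g. XYZ exists only for 1946, and only once each for apple, orange and grape). 2) The grade values are categorical.
I realize the function below is not a very efficient way to achieve its intended goal, but this is what I am currently working with.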
    import numpy as np
    import pandas as pd
    from itertools import combinations

    def func(df, year):
        # 1. Filter out only the data for the year needed
        df_year = df[df['year'] == year]
        '''
        2. Transform DataFrame to the form:
                    XYZ  PQR  ..  LMN
            apple     1    3        1
            orange    1    1        3
            grape     2    1        1
        Note that 'LMN' is just used for representation purposes.
        It won't logically appear here because it can only appear for the year 2014.
        '''
        df_year = df_year.pivot(index='fruit', columns='license', values='grade')
        # 3. Remove all licenses (columns) that have ANY NaN values
        df_year = df_year.dropna(axis=1, how="any")
        # 4. Some additional filtering
        # 5. Function to calculate similarity between fruits
        def similarity_score(fruit1, fruit2):
            agreements = np.sum(((fruit1 == 1) & (fruit2 == 1)) |
                                ((fruit1 == 3) & (fruit2 == 3)))
            disagreements = np.sum(((fruit1 == 1) & (fruit2 == 3)) |
                                   ((fruit1 == 3) & (fruit2 == 1)))
            return (((agreements - disagreements) / float(len(fruit1))) + 1) / 2
        # 6. Create Network dataframe
        network_df = pd.DataFrame(columns=['Source', 'Target', 'Weight'])
        # Fruits are the rows of df_year, so pair up the index labels
        for i, c in enumerate(combinations(df_year.index, 2)):
            # Grade vectors as NumPy arrays so the == comparisons are element-wise
            c1 = df_year.loc[c[0]].values
            c2 = df_year.loc[c[1]].values
            network_df.loc[i] = [c[0], c[1], similarity_score(c1, c2)]
        return network_df
Running the above:

    df_1946 = func(df, 1946)
    df_1946.head()

    Source  Target  Weight
    Apple   Orange  0.6
    Apple   Grape   0.3
    Orange  Grape   0.7
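I would like to flatten the above into a single row:

          (Apple,Orange)  (Apple,Grape)  (Orange,Grape)
    1946  0.6             0.3            0.7

Note that the real result will not have 3 columns but roughly 5000.
Eventually, I would like to stack the transformed dataframe rows to get something like:

    df_all_years
          (Apple,Orange)  (Apple,Grape)  (Orange,Grape)
    1946  0.6             0.3            0.7
    1947  0.7             0.25           0.8
    ..
    2015  0.75            0.3            0.65

What would be the best way to do this?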
Answer 0 (score: 2)
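I would rearrange the computation a bit. Instead of looping over the years:

    for year in range(1946, 2015):
        partial_result = func(df, year)

and then concatenating the partial results, you can get better performance by doing as much work as possible on the whole DataFrame, df, before calling df.groupby(...). Also, if you can express the computation in terms of built-in aggregators such as sum and count, it can be done more quickly than with custom functions passed through groupby/apply.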
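To illustrate the point about built-in aggregators, here is a minimal sketch on made-up data. The `toy` frame and its `value` column are hypothetical; they stand in for the per-row agreement (+1), disagreement (-1) or neither (0) values. Both routes give the same score, but the first uses only `sum` and `count`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one +1/0/-1 value per row, tagged with a year
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "year": rng.choice([1946, 1947, 1948], size=300),
    "value": rng.choice([-1, 0, 1], size=300),
})

grouped = toy.groupby("year")["value"]

# Route 1: built-in aggregators only
score_builtin = ((grouped.sum() / grouped.count()) + 1) / 2

# Route 2: the same formula as a custom Python function called once per group
score_apply = grouped.apply(lambda s: ((s.sum() / len(s)) + 1) / 2)

# Both routes agree on the result
assert np.allclose(score_builtin.to_numpy(), score_apply.to_numpy())
```

On a frame this small the timing difference is negligible; it tends to matter as the number of groups and columns grows.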
    import itertools as IT
    import numpy as np
    import pandas as pd
    np.random.seed(2017)

    def make_df():
        # Build random sample data shaped like the question's DataFrame
        N = 10000
        df = pd.DataFrame({'fruit': np.random.choice(['Apple', 'Orange', 'Grape'], size=N),
                           'grade': np.random.choice([1,2,3], p=[0.7,0.1,0.2], size=N),
                           'year': np.random.choice(range(1946,1950), size=N)})
        # A manufacturer label that is unique within each (year, fruit) group
        df['manufacturer'] = (df['year'].astype(str) + '-'
                              + df.groupby(['year', 'fruit'])['fruit'].cumcount().astype(str))
        df = df.sort_values(by=['year'])
        return df

    def similarity_score(df):
        """
        Compute the score between each pair of columns in df
        """
        agreements = {}
        disagreements = {}
        for col in IT.combinations(df,2):
            fruit1 = df[col[0]].values
            fruit2 = df[col[1]].values
            agreements[col] = ( ( (fruit1 == 1) & (fruit2 == 1) )
                                | ( (fruit1 == 3) & (fruit2 == 3) ))
            disagreements[col] = ( ( (fruit1 == 1) & (fruit2 == 3) )
                                   | ( (fruit1 == 3) & (fruit2 == 1) ))
        agreements = pd.DataFrame(agreements, index=df.index)
        disagreements = pd.DataFrame(disagreements, index=df.index)
        # +1 per agreement, -1 per disagreement, 0 otherwise
        numerator = agreements.astype(int)-disagreements.astype(int)
        # Aggregate per year with the built-in sum and count
        grouped = numerator.groupby(level='year')
        total = grouped.sum()
        count = grouped.count()
        score = ((total/count) + 1)/2
        return score

    df = make_df()
    # Rows: (year, manufacturer); columns: one per fruit
    df2 = df.set_index(['year','fruit','manufacturer'])['grade'].unstack(['fruit'])
    # Keep only rows that have a grade for every fruit
    df2 = df2.dropna(axis=0, how="any")
    print(similarity_score(df2))
yields

             Grape    Orange
             Apple     Apple     Grape
    year
    1946  0.629111  0.650426  0.641900
    1947  0.644388  0.639344  0.633039
    1948  0.613117  0.630566  0.616727
    1949  0.634176  0.635379  0.637786
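The result above keeps each fruit pair as a two-level column index, whereas the question asked for columns labelled by tuples such as (Apple,Orange). A minimal sketch of one way to collapse the levels, assuming a pandas version that provides `MultiIndex.to_flat_index`. The small `score` frame here is a hypothetical stand-in whose values are borrowed from the question's example, not output of the code above:

```python
import pandas as pd

# Hypothetical stand-in for the score frame: rows are years,
# columns are a two-level MultiIndex of fruit pairs
cols = pd.MultiIndex.from_tuples(
    [("Apple", "Orange"), ("Apple", "Grape"), ("Orange", "Grape")])
score = pd.DataFrame([[0.6, 0.3, 0.7], [0.7, 0.25, 0.8]],
                     index=pd.Index([1946, 1947], name="year"),
                     columns=cols)

# Collapse the two column levels into one level of (fruit1, fruit2) tuples
flat = score.copy()
flat.columns = flat.columns.to_flat_index()

print(flat.columns.tolist())
```

The data is untouched; only the column labels change, so each year stays a single row.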
Answer 1 (score: 1)
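Here is one way to pivot the table in the way you describe using pandas routines. It handles roughly 5,000 columns (formed by combining two initially separate categories) fast enough: the bottleneck step took about 20 seconds on my quad-core MacBook. For much larger scales there are definitely faster strategies. The data in this example is quite sparse (5K columns, filled with 5K random samples spread over 70 rows of years [1947-2016]), so with a fuller dataframe the execution time may be some seconds longer.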
    from itertools import chain
    import pandas as pd
    import numpy as np
    import random  # using python3 .choices()
    import re

    # Make bivariate data w/ 5000 total combinations (1000x5 categories)
    # Also choose 5,000 randomly; some combinations may have >1 values or NaN
    random_sample_data = np.array(
        [random.choices(['Apple', 'Orange', 'Lemon', 'Lime'] +
                        ['of Fruit' + str(i) for i in range(1000)],
                        k=5000),
         random.choices(['Grapes', 'Are Purple', 'And Make Wine',
                         'From the Yeast', 'That Love Sugar'],
                        k=5000),
         [random.random() for _ in range(5000)]]
    ).T
    df = pd.DataFrame(random_sample_data, columns=[
        "Source", "Target", "Weight"])
    df['Year'] = random.choices(range(1947, 2017), k=df.shape[0])

    # Three views of resulting df in jupyter notebook:
    df
    df[df.Year == 1947]
    df.groupby(["Source", "Target"]).count().unstack()
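To flatten the data grouped by year, and since groupby needs a function to apply, you can use a temporary df as an intermediary:
1) Push the result of data.groupby("Year") into separate rows, one per year, where each row holds separate objects for the two columns "Target" and "Source" (to be expanded later) plus "Weight".
2) Use zip and pd.core.reshape.util.cartesian_product to create an empty pivot df of the right shape, which will become the final table filled from temp_df.
For example: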
    df_temp = df.groupby("Year").apply(
        lambda s: pd.DataFrame([(s.Target, s.Source, s.Weight)],
                               columns=["Target", "Source", "Weight"])
    ).sort_index()
    df_temp.index = df_temp.index.droplevel(1)  # reduce MultiIndex to 1-d

    # Predetermine all possible pairwise column category combinations
    product_ts = [*zip(*(pd.core.reshape.util.cartesian_product(
        [df.Target.unique(), df.Source.unique()])
    ))]
    ts_combinations = [str(x + ' ' + y) for (x, y) in product_ts]
    ts_combinations
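The cartesian_product helper used above lives under pd.core, which pandas treats as internal, so it may not be available under that path in every version. As a hedged alternative, the same kind of "Target Source" labels can be sketched with the standard library's `itertools.product`. The `toy` frame below is a hypothetical miniature, not the data generated earlier:

```python
from itertools import product

import pandas as pd

# Hypothetical miniature of the Source/Target frame
toy = pd.DataFrame({
    "Source": ["Apple", "Orange", "Apple"],
    "Target": ["Grapes", "Grapes", "Are Purple"],
})

# Every (Target, Source) pairing of the distinct values in each column
pairs = list(product(toy.Target.unique(), toy.Source.unique()))

# One "Target Source" label per pairing
labels = [f"{t} {s}" for t, s in pairs]
print(labels)
```

With 2 distinct targets and 2 distinct sources this produces 4 labels; in the answer's data the same construction gives one label per possible column.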
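Finally, use a simple nested for-for iteration (again, not the fastest approach, though pd.DataFrame.iterrows may help speed things up, as shown). Because the random sampling was done with replacement, I had to handle cells that receive multiple values, so you will probably want to remove the conditional under the second for loop. That loop is the step where the three separate objects for each year are zipped together and written into a single row, pivoting all cells by the ("Weight") x ("Target"-"Source") relation.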
    df_pivot = pd.DataFrame(np.zeros((70, 5000)),
                            columns=ts_combinations)
    df_pivot.index = df_temp.index
    for year, values in df_temp.iterrows():
        for (target, source, weight) in zip(*values):
            bivar_pair = str(target + ' ' + source)
            curr_weight = df_pivot.loc[year, bivar_pair]
            if curr_weight == 0.0:
                df_pivot.loc[year, bivar_pair] = [weight]
            # append additional values if encountered
            elif type(curr_weight) == list:
                df_pivot.loc[year, bivar_pair] = str(curr_weight +
                                                     [weight])

    # Spotcheck:
    # Verifies matching data in pivoted table vs. original for Target+Source
    # combination "And Make Wine of Fruit614" across all 70 years 1947-2016
    df
    df_pivot['And Make Wine of Fruit614']
    df[(df.Year == 1947) & (df.Target == 'And Make Wine') & (df.Source == 'of Fruit614')]
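As one of the possibly faster strategies mentioned earlier, the nested loop could perhaps be replaced by a single `pivot_table` call. This is only a sketch on a hypothetical miniature frame (`toy`), and it averages duplicate (Target, Source) pairs within a year, which is an assumption on my part; the loop above collects them instead:

```python
import pandas as pd

# Hypothetical miniature of the Source/Target/Weight/Year frame
toy = pd.DataFrame({
    "Year":   [1947, 1947, 1947, 1948],
    "Source": ["Apple", "Apple", "Lime", "Apple"],
    "Target": ["Grapes", "Grapes", "Grapes", "Grapes"],
    "Weight": [0.2, 0.4, 0.9, 0.5],
})

# One row per year, one column per (Target, Source) pair.
# Duplicates within a year are averaged (an assumption, see above).
wide = toy.pivot_table(index="Year", columns=["Target", "Source"],
                       values="Weight", aggfunc="mean")

# Collapse the two column levels into "Target Source" labels
wide.columns = [f"{t} {s}" for t, s in wide.columns]
print(wide)
```

Pairs that never occur in a given year come out as NaN rather than 0.0, and only pairs that occur at least once get a column.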