I have this code:
for index, row in df.iterrows():
    for index1, row1 in df1.iterrows():
        if df['budget'].iloc[index] == 0:
            if (df['production_companies'].iloc[index] == df1['production_companies'].iloc[index1]
                    and df['release_date'].iloc[index].year == df1['release_year'].iloc[index1]):
                df['budget'].iloc[index] = df1['mean'].iloc[index1]
It works, but it takes a very long time to finish. How can I make it run faster? I also tried:
df.where((df['budget'] != 0 and df['production_companies'] != df1['production_companies']
and df['release_date'] != df1['release_year']),
other = pd.replace(to_replace = df['budget'],
value = df1['mean'], inplace = True))
That should be faster, but it doesn't work. How can I achieve this? Thanks!
df looks like this:
budget; production_companies; release_date; title
0; Villealfa Filmproduction Oy; 10/21/1988; Ariel
0; Villealfa Filmproduction Oy; 10/16/1986; Shadows in Paradise
4000000; Miramax Films; 12/25/1995; Four Rooms
0; Universal Pictures; 10/15/1993; Judgment Night
42000; inLoops; 1/1/2006; Life in Loops (A Megacities RMX)
...
and df1 looks like this:
production_companies; release_year; mean
Metro-Goldwyn-Mayer (MGM); 1998; 17500000
Metro-Goldwyn-Mayer (MGM); 1999; 12500000
Metro-Goldwyn-Mayer (MGM); 2000; 12000000
Metro-Goldwyn-Mayer (MGM); 2001; 43500000
Metro-Goldwyn-Mayer (MGM); 2002; 12000000
Metro-Goldwyn-Mayer (MGM); 2003; 36000000
Metro-Goldwyn-Mayer (MGM); 2004; 27500000
...
If the year and the production company are the same, I want to replace the 0 values in df with the "mean" value from df1.
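Answer 0 (score: 1)
Get rid of all the loops; you can do this efficiently with a merge. I've provided some sample data here, since none of the data you posted would actually merge. Make sure that release_date in df is a datetime (if it isn't one already).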
import pandas as pd
import numpy as np
df = pd.DataFrame({'budget': [0, 100, 0, 1000, 0],
'production_company': ['Villealfa Filmproduction Oy', 'Villealfa Filmproduction Oy',
'Villealfa Filmproduction Oy', 'Miramax Films', 'Miramax Films'],
'release_date': ['10/21/1988', '10/18/1986', '12/25/1955', '1/1/2006', '4/13/2017'],
'title': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE']})
df1 = pd.DataFrame({'production_companies': ['Villealfa Filmproduction Oy', 'Villealfa Filmproduction Oy',
'Villealfa Filmproduction Oy', 'Miramax Films', 'Miramax Films'],
'release_year': [1988, 1986, 1955, 2006, 2017],
'mean': [1000000, 2000000, 30000000, 4000000, 5000000]})
df['release_date'] = pd.to_datetime(df.release_date, format='%m/%d/%Y')
# budget production_company release_date title
#0 0 Villealfa Filmproduction Oy 1988-10-21 AAA
#1 100 Villealfa Filmproduction Oy 1986-10-18 BBB
#2 0 Villealfa Filmproduction Oy 1955-12-25 CCC
#3 1000 Miramax Films 2006-01-01 DDD
#4 0 Miramax Films 2017-04-13 EEE
Then, wherever the production company and the year match, you want to replace a budget of 0 with the mean. With a merge, that is:
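# Left-merge on (company, release year), then copy the matched mean into the rows whose budget is 0.
merged = df.merge(df1, left_on=['production_company', df.release_date.dt.year],
                  right_on=['production_companies', 'release_year'], how='left')
df.loc[df.budget == 0, 'budget'] = merged.loc[df.budget == 0, 'mean']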
# budget production_company release_date title
#0 1000000 Villealfa Filmproduction Oy 1988-10-21 AAA
#1 100 Villealfa Filmproduction Oy 1986-10-18 BBB
#2 30000000 Villealfa Filmproduction Oy 1955-12-25 CCC
#3 1000 Miramax Films 2006-01-01 DDD
#4 5000000 Miramax Films 2017-04-13 EEE
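If there is no mean data for a given production company and year, the 0 in budget is replaced by np.NaN, so you can either leave those values as they are or set them back to 0 if you prefer.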
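If you do want the unmatched rows to go back to 0, here is a minimal sketch of one way to do it. The sample frames are made up for illustration, and the names lookup, merged, replacement and is_zero are mine, not part of the answer above. It renames the df1 key column so both frames share key names, and passes validate='many_to_one' so the merge fails loudly if df1 ever has duplicate (company, year) pairs.

```python
import pandas as pd

# Made-up sample data: the third row has no matching (company, year) in df1.
df = pd.DataFrame({
    'budget': [0, 100, 0],
    'production_company': ['Villealfa Filmproduction Oy', 'Miramax Films', 'Universal Pictures'],
    'release_date': pd.to_datetime(['10/21/1988', '1/1/2006', '10/15/1993'], format='%m/%d/%Y'),
})
df1 = pd.DataFrame({
    'production_companies': ['Villealfa Filmproduction Oy', 'Miramax Films'],
    'release_year': [1988, 2006],
    'mean': [1000000, 4000000],
})

# Give both frames the same key names, then left-merge so every row of df is kept.
lookup = df1.rename(columns={'production_companies': 'production_company'})
merged = (df.assign(release_year=df['release_date'].dt.year)
            .merge(lookup, on=['production_company', 'release_year'],
                   how='left', validate='many_to_one'))

# Unmatched rows have NaN in 'mean'; turn those back into 0 before assigning.
replacement = pd.Series(merged['mean'].fillna(0).astype('int64').to_numpy(), index=df.index)
is_zero = df['budget'].eq(0)
df.loc[is_zero, 'budget'] = replacement[is_zero]

print(df['budget'].tolist())
```

Filling before the assignment also keeps budget as an integer column, whereas assigning NaN would turn it into floats.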
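Answer 1 (score: 1)
Do not use loops for this task
The main benefit of pandas is its vectorised functionality.
One way to vectorise the calculation is to align the indices and then call map on the resulting index (pd.Index.map). To extract the year, you first need to convert to datetime.
Data from @ALollz.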
# convert release_date to datetime and calculate year
df['release_date'] = pd.to_datetime(df['release_date'])
df['year'] = df['release_date'].dt.year
# create mapping from df1
s = df1.set_index(['production_companies', 'release_year'])['mean']
# use map on selected condition
mask = df['budget'] == 0
df.loc[mask, 'budget'] = df[mask].set_index(['production_company', 'year']).index.map(s.get)
print(df)
# budget production_company release_date title year
# 0 1000000 Villealfa Filmproduction Oy 1988-10-21 AAA 1988
# 1 100 Villealfa Filmproduction Oy 1986-10-18 BBB 1986
# 2 30000000 Villealfa Filmproduction Oy 1955-12-25 CCC 1955
# 3 1000 Miramax Films 2006-01-01 DDD 2006
# 4 5000000 Miramax Films 2017-04-13 EEE 2017
Answer 2 (score: -1)
You can quickly skip the inner loop for most rows by moving the if statement ahead of it:
for index, row in df.iterrows():
    if df['budget'].iloc[index] == 0:
        for index1, row1 in df1.iterrows():
            if df['production_companies'].iloc[index] == df1['production_companies'].iloc[index1] and df['release_date'].iloc[index].year == df1['release_year'].iloc[index1] :
                df['budget'].iloc[index] = df1['mean'].iloc[index1]
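If you would rather keep an explicit loop, you can go one step further and drop the inner loop entirely by building a dictionary from df1 once. This is only a sketch under my own assumptions: the sample data is made up, lookup and years are names I introduced, and release_date is assumed to already be a datetime column.

```python
import pandas as pd

# Made-up sample data: the last row has no (company, year) entry in df1.
df = pd.DataFrame({
    'budget': [0, 100, 0],
    'production_companies': ['Villealfa Filmproduction Oy', 'Miramax Films', 'Miramax Films'],
    'release_date': pd.to_datetime(['10/21/1988', '1/1/2006', '4/13/2017'], format='%m/%d/%Y'),
})
df1 = pd.DataFrame({
    'production_companies': ['Villealfa Filmproduction Oy', 'Miramax Films'],
    'release_year': [1988, 2006],
    'mean': [1000000, 4000000],
})

# Build the (company, year) -> mean dictionary once instead of scanning df1 for every row.
lookup = dict(zip(zip(df1['production_companies'].tolist(), df1['release_year'].tolist()),
                  df1['mean'].tolist()))
years = df['release_date'].dt.year

# Visit only the rows whose budget is 0; each lookup is a constant-time dict access.
for idx in df.index[df['budget'] == 0]:
    key = (df.at[idx, 'production_companies'], int(years.at[idx]))
    if key in lookup:
        df.at[idx, 'budget'] = lookup[key]

print(df['budget'].tolist())
```

Rows with no entry in the dictionary keep their 0. Writing through df.at also avoids the chained df['budget'].iloc[index] = ... assignment, which pandas does not guarantee will modify the original frame.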