I am analyzing the TMDB dataset on Kaggle. For some entries, the year in the variable release_date is shifted by 100 years compared with the variable release_year:
import pandas as pd
# Change to pandas datetime
tmdb_df['release_date'] = pd.to_datetime(tmdb_df['release_date'])
# Show entries dated after 2015-12-31
tmdb_df.loc[tmdb_df['release_date'] > pd.Timestamp('2015-12-31'), ['release_date', 'release_year']].head()
# Output:
#      release_date  release_year
# 9849   2062-10-04          1962
# 9850   2062-12-10          1962
# 9851   2062-06-13          1962
# 9852   2062-12-25          1962
# 9853   2062-10-24          1962
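To see how many rows are affected and by how much, the year of the parsed date can be compared with release_year directly. The following is only a minimal sketch: the frame `sample` and its three rows are hypothetical stand-ins for `tmdb_df`, not data taken from the dataset.

```python
import pandas as pd

# Hypothetical sample with the same two columns as the TMDB frame.
sample = pd.DataFrame({
    'release_date': pd.to_datetime(['2062-10-04', '1980-12-10', '2015-06-09']),
    'release_year': [1962, 1980, 2015],
})

# Difference in years between the parsed date and release_year.
year_diff = sample['release_date'].dt.year - sample['release_year']

# Rows where the two columns disagree.
mismatched = sample[year_diff != 0]
print(mismatched)

# Distribution of the shift; a single non-zero value suggests a systematic offset.
print(year_diff.value_counts())
```

If the only non-zero difference is 100, the shift is probably systematic rather than random noise in the data.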
I fixed this using apply:
# Check for movies where the year in `release_date` is shifted
# when compared with `release_year`
import pandas as pd
# Change to pandas datetime
tmdb_df['release_date'] = pd.to_datetime(tmdb_df['release_date'])
def aux_func(row):
    """Replace the year of release_date with release_year when they differ."""
    if row['release_date'].year != row['release_year']:
        return row['release_date'].replace(year=row['release_year'])
    else:
        return row['release_date']
# Apply fix row by row
tmdb_df['release_date'] = tmdb_df[['release_date', 'release_year']].apply(aux_func, axis=1)
But I would like to know whether this can be solved with pandas' transform, or whether there is another way.
Answer 0 (score: 1)
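If you want the years to match, first concatenate release_year with the date string stripped of its year (this relies on release_date still being a string, not yet converted to datetime):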
df = pd.DataFrame({'release_date': ['2062-10-04', '1980-12-10'],
                   'release_year': [1962, 1980]})
print(df)
  release_date  release_year
0   2062-10-04          1962
1   1980-12-10          1980
df['release_date'] = pd.to_datetime(df['release_year'].astype(str) +
                                    df['release_date'].str[4:])
print(df)
  release_date  release_year
0   1962-10-04          1962
1   1980-12-10          1980
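If release_date has already been converted to datetime, as in the question, string slicing no longer applies. One possible alternative, shown here only as a sketch, is to rebuild the date from its components: `pd.to_datetime` accepts a DataFrame with `year`, `month` and `day` columns. The helper frame `parts` is a name introduced for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'release_date': pd.to_datetime(['2062-10-04', '1980-12-10']),
    'release_year': [1962, 1980],
})

# Assemble the components: year from release_year, month and day from the parsed date.
parts = pd.DataFrame({
    'year': df['release_year'],
    'month': df['release_date'].dt.month,
    'day': df['release_date'].dt.day,
})

# pd.to_datetime builds one timestamp per row from the year/month/day columns.
df['release_date'] = pd.to_datetime(parts)
print(df)
```

This avoids the row-wise apply. One caveat: a 29 February date combined with a non-leap release_year would raise an error; passing `errors='coerce'` to `pd.to_datetime` would turn such rows into NaT instead.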