An alternative solution using pandas `transform`

Time: 2017-12-19 14:52:27

Tags: python pandas dataframe

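I am analyzing the TMDB dataset on Kaggle. For some entries, the year in the release_date variable is shifted relative to the release_year variable (by 100 years in the rows shown below):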

import pandas as pd

# Convert to pandas datetime
tmdb_df['release_date'] = pd.to_datetime(tmdb_df['release_date'])

# Show rows whose release_date falls after 2015-12-31
tmdb_df[tmdb_df['release_date'] > pd.Timestamp(2015, 12, 31)][['release_date', 'release_year']].head()
# Output:
#       release_date  release_year
# 9849    2062-10-04          1962
# 9850    2062-12-10          1962
# 9851    2062-06-13          1962
# 9852    2062-12-25          1962
# 9853    2062-10-24          1962
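
To gauge how widespread the mismatch is, the year component of release_date can be compared with release_year directly. The following is only a minimal sketch: `sample` is a small hypothetical stand-in for tmdb_df, built from a few of the values shown above, and is not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for tmdb_df with only the two relevant columns
sample = pd.DataFrame({
    "release_date": pd.to_datetime(["2062-10-04", "1980-12-10", "2062-12-25"]),
    "release_year": [1962, 1980, 1962],
})

# Per-row difference between the parsed year and the recorded year
year_gap = sample["release_date"].dt.year - sample["release_year"]
mismatched = year_gap != 0

print(mismatched.sum())               # number of affected rows
print(year_gap[mismatched].unique())  # distinct offsets observed
```

Running the same comparison on the full frame would show whether every affected row carries the same offset.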

I came up with a solution using `apply`:

# Check for movies where the year in `release_date` is shifted
# when compared with `release_year`
import pandas as pd

# Convert to pandas datetime
tmdb_df['release_date'] = pd.to_datetime(tmdb_df['release_date'])

def aux_func(row):
    """Return release_date with its year replaced by release_year if they differ."""
    if row['release_date'].year != row['release_year']:
        # int() guards against NumPy integer types being passed to replace()
        return row['release_date'].replace(year=int(row['release_year']))
    else:
        return row['release_date']

# Apply the fix row by row
tmdb_df['release_date'] = tmdb_df[['release_date', 'release_year']].apply(aux_func, axis=1)

But I would like to know whether this can be solved with pandas' `transform`, or whether there is another approach.

1 answer:

Answer 0 (score: 1)

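If you want the year to match release_year, first join release_year with the date string stripped of its year. This assumes release_date still holds strings rather than datetimes, since `.str[4:]` is a string operation: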

import pandas as pd

df = pd.DataFrame({'release_date': ['2062-10-04', '1980-12-10'],
                   'release_year': [1962, 1980]})
print(df)
  release_date  release_year
0   2062-10-04          1962
1   1980-12-10          1980

# Prepend release_year to the '-MM-DD' tail of the original string, then parse
df['release_date'] = pd.to_datetime(df['release_year'].astype(str) +
                                    df['release_date'].str[4:])

print(df)

  release_date  release_year
0   1962-10-04          1962
1   1980-12-10          1980
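
The question converts release_date to datetime before applying the fix, in which case the string slicing above would not apply directly. One possible vectorized variant for that situation is sketched below. It relies on `pd.to_datetime` accepting a DataFrame with `year`, `month` and `day` columns; the helper frame `parts` is a name introduced here for illustration only.

```python
import pandas as pd

# Same sample data as above, but with release_date already parsed to datetime64
df = pd.DataFrame({
    "release_date": pd.to_datetime(["2062-10-04", "1980-12-10"]),
    "release_year": [1962, 1980],
})

# Take the year from release_year and keep month/day from release_date
parts = pd.DataFrame({
    "year": df["release_year"],
    "month": df["release_date"].dt.month,
    "day": df["release_date"].dt.day,
})

# Assemble the corrected dates from their components
df["release_date"] = pd.to_datetime(parts)
print(df)
```

This avoids the row-wise `apply` and rebuilds every date, so rows whose year already matched come out unchanged.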