我是Pandas的新手,希望对包含字符串的列进行排序并生成一个数值来唯一标识字符串。我的数据框看起来像这样:
df = pd.DataFrame({'key': range(8), 'year_week': ['2015_10', '2015_1', '2015_11', '2016_9', '2016_10','2016_3', '2016_9', '2016_10']})
首先,我希望对'year_week'
列进行排序,以按升序排列(2015_1, 2016_9, '2016_9', 2016_10, 2016_11, 2016_3, 2016_10, 2016_10)
,然后为每个唯一的'year_week'
字符串生成数值。
答案 0 :(得分:3)
您可以先转换to_datetime
列year_week
,然后按sort_values
排序,最后使用factorize
:
df = pd.DataFrame({'key': range(8), 'year_week': ['2015_10', '2015_1', '2015_11', '2016_9', '2016_10','2016_3', '2016_9', '2016_10']})
#http://stackoverflow.com/a/17087427/2901002
df['date'] = pd.to_datetime(df.year_week + '-0', format='%Y_%W-%w')
#sort by column date
df.sort_values('date', inplace=True)
#create numerical values
df['num'] = pd.factorize(df.year_week)[0]
print (df)
key year_week date num
1 1 2015_1 2015-01-11 0
0 0 2015_10 2015-03-15 1
2 2 2015_11 2015-03-22 2
5 5 2016_3 2016-01-24 3
3 3 2016_9 2016-03-06 4
6 6 2016_9 2016-03-06 4
4 4 2016_10 2016-03-13 5
7 7 2016_10 2016-03-13 5
答案 1 :(得分:0)
## 1st method :-- This apply for large dataset
## Split the "year_week" column into 2 columns
df[['year', 'week']] =df['year_week'].str.split("_",expand=True)
## Change the datatype of newly created columns
df['year'] = df['year'].astype('int')
df['week'] = df['week'].astype('int')
## Sort the dataframe by newly created column
df= df.sort_values(['year','week'],ascending=True)
## Drop years & months column
df.drop(['year','week'],axis=1,inplace=True)
## Sorted dataframe
df
## 2nd method:--
## This apply for small dataset
## Change the datatype of column
df['year_week'] = df['year_week'].astype('str')
## Categories the string, the way you want
cats = ['2015_1', '2015_10','2015_11','2016_3','2016_9', '2016_10']
# Use pd.categorical() to categories it
df['year_week']=pd.Categorical(df['year_week'],categories=cats,ordered=True)
## Sort the 'year_week' column
df= df.sort_values('year_week')
## Sorted dataframe
df