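I have a Twitter dataset that I'm trying to analyze with pandas, but I can't figure out how to convert relative values (for example "2 days", "24 hours", "2 months", or "5 years") to datetime format.
I used the following code: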
for value in df_merge['last_tweet']:
    n = int(value.split(" ")[0])
    d = value.split(" ")[1]
    if d in ["years", "year"]:
        n_days = n * 365
    elif d in ["months", "month"]:
        n_days = n * 30
Answer 0 (score: 2)
You may want to write a helper function:
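import numpy as np
import pandas as pd

def ym2nptimedelta(delta):
    # Map month/year words to NumPy timedelta64 unit codes.
    delta_cfg = {
        'month': 'M',
        'months': 'M',
        'year': 'Y',
        'years': 'Y'
    }
    n, item = delta.lower().split()
    # NumPy's 'M' and 'Y' are average lengths, not calendar months/years.
    return np.timedelta64(int(n), delta_cfg[item])

# pd.datetime was removed in pandas 2.0; pd.Timestamp is used instead.
print(pd.Timestamp.today() - pd.Timedelta('2 days'))
print(pd.Timestamp.today() - pd.Timedelta('24 hours'))
# Note: recent pandas releases may reject the ambiguous 'Y' and 'M' units here.
print(pd.Timestamp.now() - ym2nptimedelta('2 years'))
print(pd.Timestamp.now() - ym2nptimedelta('5 years'))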
Output:
2016-03-08 20:39:34.315969
2016-03-09 20:39:34.315969
2014-03-11 09:01:10.316969
2011-03-11 15:33:34.317969
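The year-based results above land at 09:01 and 15:33 rather than at the original time of day. That is consistent with NumPy converting the 'Y' unit to an average Gregorian year of 365.2425 days (31,556,952 seconds) when it is combined with a finer-grained timestamp. A minimal standard-library sketch that reproduces the arithmetic, assuming the reference time implied by the "2 days" line of the output (2016-03-10 20:39:34.315969); the constant and function names are illustrative only:

```python
from datetime import datetime, timedelta

# Average Gregorian year: 365.2425 days expressed in seconds.
AVG_YEAR_SECONDS = 31_556_952

def subtract_average_years(now, years):
    """Subtract `years` average-length years from `now`."""
    return now - timedelta(seconds=years * AVG_YEAR_SECONDS)

# Reference time inferred from the "2 days" line of the output above.
now = datetime(2016, 3, 10, 20, 39, 34, 315969)
print(subtract_average_years(now, 2))  # 2014-03-11 09:01:10.315969
print(subtract_average_years(now, 5))  # 2011-03-11 15:33:34.315969
```

The printed values differ from the output above only in the microseconds, because the original calls read the clock separately for each line.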
UPDATE1 (this helper function handles the time-delta units that NumPy accepts):
import numpy as np
import pandas as pd

def deltastr2date(delta):
    # Map human-readable unit names to NumPy timedelta64 unit codes.
    delta_cfg = {
        'year': 'Y',
        'years': 'Y',
        'month': 'M',
        'months': 'M',
        'week': 'W',
        'weeks': 'W',
        'day': 'D',
        'days': 'D',
        'hour': 'h',
        'hours': 'h',
        'min': 'm',
        'minute': 'm',
        'minutes': 'm',
        'sec': 's',
        'second': 's',
        'seconds': 's',
    }
    n, item = delta.lower().split()
    # pd.datetime was removed in pandas 2.0; pd.Timestamp.now() replaces it.
    # Note: 'Y' and 'M' are average lengths, and recent pandas releases may reject them.
    return pd.Timestamp.now() - np.timedelta64(int(n), delta_cfg[item])

print(deltastr2date('2 days'))
print(deltastr2date('24 hours'))
print(deltastr2date('2 years'))
print(deltastr2date('5 years'))
print(deltastr2date('1 week'))
print(deltastr2date('10 hours'))
print(deltastr2date('45 minutes'))
Output:
2016-03-08 20:50:01.701853
2016-03-09 20:50:01.702853
2014-03-11 09:11:37.702853
2011-03-11 15:44:01.703853
2016-03-03 20:50:01.704854
2016-03-10 10:50:01.705854
2016-03-10 20:05:01.705854
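Because the 'Y' and 'M' units are average lengths, the year results above drift in time of day. If calendar-exact results are preferred (same day of month, same clock time), one possible alternative is to step back whole calendar months. The following is a standard-library sketch, not the answer's method; `deltastr2date_calendar`, `subtract_months`, and the two lookup tables are hypothetical names introduced for illustration:

```python
import calendar
from datetime import datetime, timedelta

# Units with a fixed length map directly onto datetime.timedelta keywords.
FIXED_UNITS = {
    'week': 'weeks', 'weeks': 'weeks',
    'day': 'days', 'days': 'days',
    'hour': 'hours', 'hours': 'hours',
    'min': 'minutes', 'minute': 'minutes', 'minutes': 'minutes',
    'sec': 'seconds', 'second': 'seconds', 'seconds': 'seconds',
}
# Calendar units expressed as a number of months.
MONTHS_PER_UNIT = {'month': 1, 'months': 1, 'year': 12, 'years': 12}

def subtract_months(moment, months):
    """Step back whole calendar months, clamping the day to the month's end."""
    total = moment.year * 12 + (moment.month - 1) - months
    year, month = divmod(total, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return moment.replace(year=year, month=month, day=min(moment.day, last_day))

def deltastr2date_calendar(delta, now=None):
    """Hypothetical calendar-aware variant of deltastr2date."""
    now = datetime.now() if now is None else now
    n, unit = delta.lower().split()
    n = int(n)
    if unit in MONTHS_PER_UNIT:
        return subtract_months(now, n * MONTHS_PER_UNIT[unit])
    if unit in FIXED_UNITS:
        return now - timedelta(**{FIXED_UNITS[unit]: n})
    raise ValueError('unsupported unit: {!r}'.format(unit))

ref = datetime(2016, 3, 10, 20, 50, 1)
print(deltastr2date_calendar('2 years', ref))   # 2014-03-10 20:50:01
print(deltastr2date_calendar('3 months', ref))  # 2015-12-10 20:50:01
print(deltastr2date_calendar('1 week', ref))    # 2016-03-03 20:50:01
```

Which behavior is right depends on the data: "2 years ago" on Twitter is itself an approximation, so the average-length result may be just as acceptable.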
UPDATE2 (shows how to apply the helper function to a DataFrame column):
import numpy as np
import pandas as pd

def deltastr2date(delta):
    # Map human-readable unit names to NumPy timedelta64 unit codes.
    delta_cfg = {
        'year': 'Y',
        'years': 'Y',
        'month': 'M',
        'months': 'M',
        'week': 'W',
        'weeks': 'W',
        'day': 'D',
        'days': 'D',
        'hour': 'h',
        'hours': 'h',
        'min': 'm',
        'minute': 'm',
        'minutes': 'm',
        'sec': 's',
        'second': 's',
        'seconds': 's',
    }
    n, item = delta.lower().split()
    # pd.datetime was removed in pandas 2.0; pd.Timestamp.now() replaces it.
    # Note: 'Y' and 'M' are average lengths, and recent pandas releases may reject them.
    return pd.Timestamp.now() - np.timedelta64(int(n), delta_cfg[item])

N = 20
dt_units = ['seconds', 'minutes', 'hours', 'days', 'weeks', 'months', 'years']

# generate random list of deltas
deltas = ['{0[0]} {0[1]}'.format(tup) for tup in zip(np.random.randint(1, 5, N), np.random.choice(dt_units, N))]
df = pd.DataFrame({'delta': pd.Series(deltas)})

# add new column
df['last_tweet_dt'] = df['delta'].apply(deltastr2date)
print(df)
Output:
        delta              last_tweet_dt
0     3 hours 2016-03-10 20:32:49.252525
1      4 days 2016-03-06 23:32:49.252525
2   3 seconds 2016-03-10 23:32:46.253525
3     1 weeks 2016-03-03 23:32:49.253525
4   1 minutes 2016-03-10 23:31:49.253525
5   2 minutes 2016-03-10 23:30:49.253525
6      4 days 2016-03-06 23:32:49.254525
7     1 years 2015-03-11 17:43:37.254525
8   2 seconds 2016-03-10 23:32:47.254525
9   3 minutes 2016-03-10 23:29:49.254525
10    1 hours 2016-03-10 22:32:49.255525
11  2 seconds 2016-03-10 23:32:47.255525
12  3 minutes 2016-03-10 23:29:49.255525
13   3 months 2015-12-10 16:05:31.255525
14    4 weeks 2016-02-11 23:32:49.256526
15   3 months 2015-12-10 16:05:31.256526
16    4 hours 2016-03-10 19:32:49.256526
17    1 years 2015-03-11 17:43:37.256526
18    2 years 2014-03-11 11:54:25.257526
19  1 minutes 2016-03-10 23:31:49.257526
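The microseconds in the column above creep from .252525 to .257526 because the helper reads the clock once per row. If every row should be measured from the same instant (for example, the time the tweets were scraped), one option is to capture the reference time once and pass it in. A minimal standard-library sketch under that assumption; `delta_to_datetime` and `UNIT_SECONDS` are hypothetical names, and months and years are deliberately left out because their length depends on the calendar:

```python
from datetime import datetime, timedelta

# Seconds per fixed-length unit (singular spelling).
UNIT_SECONDS = {'second': 1, 'minute': 60, 'hour': 3600, 'day': 86400, 'week': 604800}

def delta_to_datetime(delta, now):
    """Hypothetical helper: resolve '<n> <unit>' against a caller-supplied time."""
    n, unit = delta.lower().split()
    unit = unit.rstrip('s')  # accept both singular and plural spellings
    if unit not in UNIT_SECONDS:
        raise ValueError('unsupported unit: {!r}'.format(unit))
    return now - timedelta(seconds=int(n) * UNIT_SECONDS[unit])

# Captured once and reused for every row.
now = datetime(2016, 3, 10, 23, 32, 49, 252525)
deltas = ['3 hours', '4 days', '3 seconds', '1 weeks']
for delta in deltas:
    print(delta, delta_to_datetime(delta, now))
```

With a DataFrame, the same idea would presumably look like `df['delta'].apply(lambda s: delta_to_datetime(s, now))`, so that all rows share one reference time.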