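Calculating elapsed days from a string column in a Pandas DataFrame

Time: 2016-05-06 18:23:14

Tags: python python-2.7 pandas

I have a Pandas DataFrame that stores people's travel dates. I want to add a column showing the length of each stay. To do that, I need to parse the string, convert both parts to datetime, and subtract. Pandas seems to apply the datetime conversion to the whole Series rather than to the individual strings, so I get TypeError: must be string, not Series. I would prefer a non-loop option because the actual dataset is very large, but I need a bit of help.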

import pandas as pd
from datetime import datetime

df = pd.DataFrame(data=[['Bob', '12 Mar 2015 - 31 Mar 2015'], ['Jessica', '27 Mar 2015 - 31 Mar 2015']], columns=['Names', 'Day of Visit'])
df['Length of Stay'] = (datetime.strptime(df['Day of Visit'][:11], '%d %b %Y') - datetime.strptime(df['Day of Visit'][-11:], '%d %b %Y')).days + 1
print df

Desired output:

    Names               Day of Visit  Length of Stay
0      Bob  12 Mar 2015 - 31 Mar 2015              20
1  Jessica  27 Mar 2015 - 31 Mar 2015               5

2 answers:

Answer 0 (score: 4)

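Use Series.str.extract to split the Day of Visit column into two separate columns. Then use pd.to_datetime to parse those columns as dates. The Length of Stay can then be computed by subtracting the date columns and adding 1:

import numpy as np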
import pandas as pd

df = pd.DataFrame(data=[['Bob', '12 Mar 2015 - 31 Mar 2015'], ['Jessica', '27 Mar 2015 - 31 Mar 2015']], columns=['Names', 'Day of Visit'])
tmp = df['Day of Visit'].str.extract(r'([^-]+)-(.*)', expand=True).apply(pd.to_datetime)
df['Length of Stay'] = (tmp[1] - tmp[0]).dt.days + 1
print(df)

which yields

     Names               Day of Visit  Length of Stay
0      Bob  12 Mar 2015 - 31 Mar 2015              20
1  Jessica  27 Mar 2015 - 31 Mar 2015               5

The regex pattern ([^-]+)-(.*) means

(              # start group #1
  [            # begin character class
    ^-         # any character except a literal minus sign `-`
  ]            # end character class 
   +           # match 1-or-more characters from the character class
)              # end group #1
-              # match a literal minus sign 
(              # start group #2
  .*           # match 0-or-more of any character
)              # end group #2

.str.extract returns a DataFrame whose columns contain the text matched by groups #1 and #2.
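To make the intermediate result visible, here is a minimal sketch that stops after the extract step and inspects it. The names parts, start, end and length_of_stay are only illustrative, and the explicit format string assumes every date looks like '12 Mar 2015'. Note that the captured text keeps the spaces around the minus sign, so this sketch strips them before parsing with a fixed format.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['Bob', '12 Mar 2015 - 31 Mar 2015'],
          ['Jessica', '27 Mar 2015 - 31 Mar 2015']],
    columns=['Names', 'Day of Visit'])

# One column per capture group; unnamed groups get the labels 0 and 1.
parts = df['Day of Visit'].str.extract(r'([^-]+)-(.*)', expand=True)

# Group #1 ends with a space and group #2 starts with one, so strip both.
parts = parts.apply(lambda col: col.str.strip())
print(parts)

# Assumption: all dates follow the 'DD Mon YYYY' layout of the sample data.
start = pd.to_datetime(parts[0], format='%d %b %Y')
end = pd.to_datetime(parts[1], format='%d %b %Y')

# Subtracting two datetime Series gives a timedelta Series; .dt.days is element-wise.
length_of_stay = (end - start).dt.days + 1
print(length_of_stay)
```

Passing an explicit format is optional here; it mainly avoids per-value format guessing on large columns.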

Answer 1 (score: 1)

Solution

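from datetime import datetime

def length_of_stay(x):
    # Split '12 Mar 2015 - 31 Mar 2015' into its start and end dates.
    start, end = [datetime.strptime(d, '%d %b %Y') for d in x.split(' - ')]
    # Count both the first and the last day, as in the desired output.
    return (end - start).days + 1

df['Length of Stay'] = df['Day of Visit'].apply(length_of_stay)
print(df)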
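Since the question asks for a non-loop option and .apply still calls the function once per row, a vectorized variant may be worth considering. The sketch below is not from either answer: it is the element-wise counterpart of the slicing attempted in the question, using the .str accessor. It assumes every date is exactly 11 characters wide ('DD Mon YYYY'), which holds for the sample data but may not hold for the real dataset.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['Bob', '12 Mar 2015 - 31 Mar 2015'],
          ['Jessica', '27 Mar 2015 - 31 Mar 2015']],
    columns=['Names', 'Day of Visit'])

visits = df['Day of Visit']

# visits[:11] would slice rows; visits.str[:11] slices each string instead.
# Assumption: fixed-width dates, so the first and last 11 characters are the dates.
start = pd.to_datetime(visits.str[:11], format='%d %b %Y')
end = pd.to_datetime(visits.str[-11:], format='%d %b %Y')

# Inclusive day count, computed for the whole column at once.
df['Length of Stay'] = (end - start).dt.days + 1
print(df)
```

If the dates are not fixed-width, splitting on the separator or the regex from the first answer is the safer choice.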