Converting text durations in varying formats to seconds in Pandas

Time: 2016-04-28 19:09:43

Tags: python python-2.7 datetime pandas

I have a DataFrame that stores trip durations as text values, as shown in the driving_duration_text column below.

print df

                                              yelp_id driving_duration_text  \
0                    alexander-rubin-photography-napa        1 hour 43 mins   
1                             jumas-automotive-napa-2        1 hour 32 mins   
2                       larson-brothers-painting-napa        1 hour 30 mins   
3                            preferred-limousine-napa        1 hour 32 mins   
4                            cardon-y-el-tirano-miami        1 day  16 hours   
5                                    sweet-dogs-miami        1 day  3  hours 

As you can see, some durations are expressed in hours and minutes, and others in days and hours. How can I convert this format to seconds?

2 answers:

Answer 0: (score: 2)

UPDATE:

In [150]: df['seconds'] = (pd.to_timedelta(df['driving_duration_text']
   .....:                                    .str.replace(' ', '')
   .....:                                    .str.replace('mins', 'min'))
   .....:                    .dt.total_seconds())

In [151]: df
Out[151]:
                            yelp_id driving_duration_text   seconds
0  alexander-rubin-photography-napa        1 hour 43 mins    6180.0
1           jumas-automotive-napa-2        1 hour 32 mins    5520.0
2     larson-brothers-painting-napa        1 hour 30 mins    5400.0
3          preferred-limousine-napa        1 hour 32 mins    5520.0
4          cardon-y-el-tirano-miami       1 day  16 hours  144000.0
5                  sweet-dogs-miami       1 day  3  hours   97200.0

OLD answer:

You can do it this way:

from collections import defaultdict
import re

def humantime2seconds(s):
    d = {
      'w':      7*24*60*60,
      'week':   7*24*60*60,
      'weeks':  7*24*60*60,
      'd':      24*60*60,
      'day':    24*60*60,
      'days':   24*60*60,
      'h':      60*60,
      'hr':     60*60,
      'hour':   60*60,
      'hours':  60*60,
      'm':      60,
      'min':    60,
      'mins':   60,
      'minute': 60,
      'minutes':60
    }
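    # Unknown units fall back to a multiplier of 1, i.e. they count as seconds.
    mult_items = defaultdict(lambda: 1)
    mult_items.update(d)

    # Match a leading number followed by its unit, e.g. '1hour' in '1hour43mins'.
    parts = re.search(r'^(\d+)([^\d]*)', s.lower().replace(' ', ''))
    if parts:
        # Convert this number/unit pair, then recurse on the rest of the string.
        return int(parts.group(1)) * mult_items[parts.group(2)] + humantime2seconds(re.sub(r'^(\d+)([^\d]*)', '', s.lower()))
    else:
        return 0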

df['seconds'] = df.driving_duration_text.map(humantime2seconds)

Output:

In [64]: df
Out[64]:
                            yelp_id driving_duration_text  seconds
0  alexander-rubin-photography-napa        1 hour 43 mins     6180
1           jumas-automotive-napa-2        1 hour 32 mins     5520
2     larson-brothers-painting-napa        1 hour 30 mins     5400
3          preferred-limousine-napa        1 hour 32 mins     5520
4          cardon-y-el-tirano-miami       1 day  16 hours   144000
5                  sweet-dogs-miami       1 day  3  hours    97200
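
The same unit-lookup idea can also be sketched without recursion by collecting every number/unit pair in one pass with re.findall. This is only a minimal illustration, not part of the original answer: the names UNIT_SECONDS, PAIR and duration_to_seconds are hypothetical, and unlike the function above it raises KeyError on an unrecognised unit instead of counting it as seconds.

```python
import re

# Seconds per unit, keyed by singular form (hypothetical helper table).
UNIT_SECONDS = {
    'w': 604800, 'week': 604800,
    'd': 86400, 'day': 86400,
    'h': 3600, 'hr': 3600, 'hour': 3600,
    'm': 60, 'min': 60, 'minute': 60,
}

# A run of digits, optional whitespace, then a run of letters.
PAIR = re.compile(r'(\d+)\s*([a-z]+)')

def duration_to_seconds(text):
    """Sum every '<number> <unit>' pair found in text, in seconds."""
    total = 0
    for count, unit in PAIR.findall(text.lower()):
        # Accept plural forms such as 'hours' or 'mins' by dropping the trailing 's'.
        key = unit if unit in UNIT_SECONDS else unit.rstrip('s')
        total += int(count) * UNIT_SECONDS[key]  # KeyError for unknown units
    return total

print(duration_to_seconds('1 hour 43 mins'))
print(duration_to_seconds('1 day  16 hours'))
```

It could be applied to the column in the same way, with df.driving_duration_text.map(duration_to_seconds).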

Answer 1: (score: 1)

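Given that the text appears to follow a standardized format, this is relatively simple. We need to split the string, pair each count with its unit, and then convert each pair.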

def parse_duration(duration):
    items = duration.split()
    words = items[1::2]
    counts = items[::2]
    seconds = 0
    for i, each in enumerate(words):
        seconds += get_seconds(each, counts[i])
    return seconds

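def get_seconds(word, count):
    counts = {
        'second': 1,
        'min': 60,
        'minute': 60,
        'hour': 3600,
        'day': 86400
        # and so on
    }
    # Handle plurals: try the word without its last letter first ('hours' -> 'hour'),
    # then the word as given; unknown units contribute 0.
    base = counts.get(word[:-1], counts.get(word, 0))
    # count arrives as a string from split(), so convert it before multiplying
    return base * int(count)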
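
To make the splitting step concrete, here is a small sketch of what the slicing produces for one of the sample values. The variable pairs is only illustrative. Note that the counts come back as strings, so they need int() before any arithmetic.

```python
# One of the sample values, including its irregular spacing.
items = '1 day  3  hours '.split()   # split() collapses repeated whitespace

counts = items[::2]    # every even position: the numbers, still as strings
words = items[1::2]    # every odd position: the unit names

# Pair each count with its unit (illustrative only).
pairs = list(zip(counts, words))
print(pairs)
```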