I have a DataFrame that stores trip durations as text values, as shown in the driving_duration_text column below.
print df
                            yelp_id  driving_duration_text
0  alexander-rubin-photography-napa         1 hour 43 mins
1           jumas-automotive-napa-2         1 hour 32 mins
2     larson-brothers-painting-napa         1 hour 30 mins
3          preferred-limousine-napa         1 hour 32 mins
4          cardon-y-el-tirano-miami         1 day 16 hours
5                  sweet-dogs-miami          1 day 3 hours
As you can see, some durations are expressed in hours and others in days. How can I convert this format to seconds?
Answer 0 (score: 2)
UPDATE:
In [150]: df['seconds'] = (pd.to_timedelta(df['driving_duration_text']
   .....:                                    .str.replace(' ', '')
   .....:                                    .str.replace('mins', 'min'))
   .....:                    .dt.total_seconds())
In [151]: df
Out[151]:
                            yelp_id  driving_duration_text   seconds
0  alexander-rubin-photography-napa         1 hour 43 mins    6180.0
1           jumas-automotive-napa-2         1 hour 32 mins    5520.0
2     larson-brothers-painting-napa         1 hour 30 mins    5400.0
3          preferred-limousine-napa         1 hour 32 mins    5520.0
4          cardon-y-el-tirano-miami         1 day 16 hours  144000.0
5                  sweet-dogs-miami          1 day 3 hours   97200.0
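The totals above can be cross-checked by hand with the standard library. This is only a minimal sketch and not part of the original answer: the `expected` dictionary is a hypothetical name, and it simply restates the durations from the question as `datetime.timedelta` arguments.

```python
from datetime import timedelta

# Durations from the question, written out by hand as timedelta arguments.
expected = {
    "1 hour 43 mins": timedelta(hours=1, minutes=43),
    "1 hour 32 mins": timedelta(hours=1, minutes=32),
    "1 hour 30 mins": timedelta(hours=1, minutes=30),
    "1 day 16 hours": timedelta(days=1, hours=16),
    "1 day 3 hours": timedelta(days=1, hours=3),
}

# total_seconds() returns a float, like the 'seconds' column above.
totals = {text: td.total_seconds() for text, td in expected.items()}

for text, secs in totals.items():
    print(f"{text:>15} -> {secs}")
```

For example, 1 hour 43 mins is 1*3600 + 43*60 = 6180 seconds, and 1 day 16 hours is 86400 + 16*3600 = 144000 seconds.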
OLD answer:
You can do it this way:
from collections import defaultdict
import re

def humantime2seconds(s):
    # Seconds per unit, covering common abbreviations and plurals.
    d = {
        'w': 7*24*60*60,
        'week': 7*24*60*60,
        'weeks': 7*24*60*60,
        'd': 24*60*60,
        'day': 24*60*60,
        'days': 24*60*60,
        'h': 60*60,
        'hr': 60*60,
        'hour': 60*60,
        'hours': 60*60,
        'm': 60,
        'min': 60,
        'mins': 60,
        'minute': 60,
        'minutes': 60
    }
    # Units missing from the table fall back to a multiplier of 1 (seconds).
    mult_items = defaultdict(lambda: 1)
    mult_items.update(d)

    # Normalize once so the match and the remainder use the same string.
    s = s.lower().replace(' ', '')
    # Leading "<number><unit>" pair, e.g. "1hour" in "1hour43mins".
    parts = re.search(r'^(\d+)([^\d]*)', s)
    if parts:
        # Convert this pair, then recurse on the rest of the string.
        rest = s[parts.end():]
        return int(parts.group(1)) * mult_items[parts.group(2)] + humantime2seconds(rest)
    else:
        return 0

df['seconds'] = df.driving_duration_text.map(humantime2seconds)
Output:
In [64]: df
Out[64]:
                            yelp_id  driving_duration_text  seconds
0  alexander-rubin-photography-napa         1 hour 43 mins     6180
1           jumas-automotive-napa-2         1 hour 32 mins     5520
2     larson-brothers-painting-napa         1 hour 30 mins     5400
3          preferred-limousine-napa         1 hour 32 mins     5520
4          cardon-y-el-tirano-miami         1 day 16 hours   144000
5                  sweet-dogs-miami          1 day 3 hours    97200
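One design detail worth noting in the function above: because the unit table is wrapped in `defaultdict(lambda: 1)`, a suffix that is not in the table gets a multiplier of 1, so it is counted as seconds instead of raising an error. The sketch below isolates that behavior. The `units` table here is a cut-down, hypothetical one used only for illustration.

```python
from collections import defaultdict

# Cut-down unit table (illustrative only, not the full table from the answer).
units = defaultdict(lambda: 1)
units.update({"hour": 60 * 60, "min": 60})

known = units["hour"]   # present in the table
bare = units[""]        # a bare number such as "90" has an empty suffix
typo = units["hors"]    # a misspelled unit silently counts as seconds

print(known, bare, typo)  # prints: 3600 1 1
```

If that silent fallback is not wanted, a plain dict lookup would raise `KeyError` on an unknown unit instead.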
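Answer 1 (score: 1)
Given that the text appears to follow a standardized format, this is relatively straightforward. We need to split the string apart, group it into the relevant parts, and then process them.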
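def parse_duration(duration):
    # "1 hour 43 mins" -> ['1', 'hour', '43', 'mins']
    items = duration.split()
    words = items[1::2]   # unit names sit at the odd positions
    counts = items[::2]   # numbers sit at the even positions
    seconds = 0
    for i, each in enumerate(words):
        seconds += get_seconds(each, counts[i])
    return seconds

def get_seconds(word, count):
    counts = {
        'second': 1,
        'min': 60,
        'minute': 60,
        'hour': 3600,
        'day': 86400
        # and so on
    }
    # Bit complicated here to handle plurals:
    # try the word without its last letter first, then the word as given.
    base = counts.get(word[:-1], counts.get(word, 0))
    # count arrives as a string from split(), so convert before multiplying.
    return base * int(count)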
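As a variation on the same split-and-look-up idea, here is a minimal sketch that pairs each number with its unit using a regular expression and raises on units it does not know, rather than counting them as 0. It is not from the original answer: `SECONDS_PER_UNIT`, `PAIR` and `duration_to_seconds` are hypothetical names, and the unit table is an assumption that should be extended to match the real data.

```python
import re

# Seconds per unit, keyed by singular name; 'min' covers the "mins" abbreviation.
SECONDS_PER_UNIT = {
    "second": 1,
    "sec": 1,
    "minute": 60,
    "min": 60,
    "hour": 3600,
    "day": 86400,
    "week": 604800,
}

# One "<count> <unit>" pair, e.g. "43 mins".
PAIR = re.compile(r"(\d+)\s*([a-z]+)")

def duration_to_seconds(text):
    """Convert text such as '1 day 16 hours' to a whole number of seconds."""
    total = 0
    for count, unit in PAIR.findall(text.lower()):
        # Try the unit as written, then with a trailing plural 's' removed.
        singular = unit[:-1] if unit.endswith("s") else unit
        if unit in SECONDS_PER_UNIT:
            key = unit
        elif singular in SECONDS_PER_UNIT:
            key = singular
        else:
            raise ValueError(f"unknown unit: {unit!r}")
        total += int(count) * SECONDS_PER_UNIT[key]
    return total

print(duration_to_seconds("1 hour 43 mins"))
print(duration_to_seconds("1 day 16 hours"))
```

With a DataFrame like the one in the question, this could be applied as df['seconds'] = df.driving_duration_text.map(duration_to_seconds).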