Question

我有一个如下的python列表，

list_fs = ['drwxrwx---   - uname 0 2017-08-25 12:10 hdfs://filepath=2011-01-31 16%3A06%3A09.0',
 'drwxrwx---   - uname 0 2017-08-29 14:12 hdfs://filepath=2011-02-28 10%3A00%3A00',
 'drwxrwx---   - uname 0 2017-08-29 14:20 hdfs://filepath=2011-03-31 10%3A00%3A00',
 'drwxrwx---   - uname 0 2017-08-29 14:32 hdfs://filepath=2011-04-30 10%3A00%3A00',
 'drwxrwx---   - uname 0 2018-02-20 13:57 hdfs://filepath=2011-05-31 08%3A00%3A00',
 'drwxrwx---   - uname 0 2017-08-29 15:02 hdfs://filepath=2011-05-31 10%3A00%3A00',
 'drwxrwx---   - uname 0 2017-08-29 15:06 hdfs://filepath=2011-06-30 10%3A00%3A00',
 'drwxrwx---   - uname 0 2017-08-31 10:38 hdfs://filepath=2011-07-31 10%3A00%3A00',
 'drwxrwx---   - uname 0 2017-08-31 10:42 hdfs://filepath=2011-08-31 10%3A00%3A00',
 'drwxrwx---   - uname 0 2017-08-31 11:08 hdfs://filepath=2011-09-30 10%3A00%3A00',
 'drwxrwx---   - uname 0 2017-08-31 11:11 hdfs://filepath=2011-10-31 10%3A00%3A00',
 'drwxrwx---   - uname 0 2017-08-31 11:15 hdfs://filepath=2011-11-30 10%3A00%3A00',
 'drwxrwx---   - uname 0 2017-08-31 11:16 hdfs://filepath=2011-12-31 10%3A00%3A00']

我需要将时间戳和文件路径提取到pandas数据帧中。 timestamp列需要采用timestamp数据类型和如下所示。

这样做的最佳方式是什么？

Answer 1

import pandas as pd

df = pd.DataFrame(list_fs)

df['Timestamp_ordered'] = [re.findall('\d+-\d+-\d+ \d+:\d+',i)[0] for i in list_fs]
df['FilePath'] = [re.findall('hdfs:.*', i)[0] for i in list_fs]

df = df[['Timestamp_ordered', 'FilePath']].sort_values('Timestamp_ordered')

Answer 2

我认为使用regex可能是一个很好的方法（使用str.findall）：

import pandas as pd

df = pd.DataFrame(list_fs)

df['Timestamp_ordered'] = df[0].str.findall('\d{4}-\d{2}-\d{2} \d{2}:\d{2}').apply(pd.Series)
df['FilePath'] = df[0].str.findall('(\w{4}://.*)').apply(pd.Series)
# Sort and get rid of old column
df = df[['Timestamp_ordered', 'FilePath']].sort_values('Timestamp_ordered')

返回：

>>> df
   Timestamp_ordered                                   FilePath
0   2017-08-25 12:10  hdfs://filepath=2011-01-31 16%3A06%3A09.0
1   2017-08-29 14:12    hdfs://filepath=2011-02-28 10%3A00%3A00
2   2017-08-29 14:20    hdfs://filepath=2011-03-31 10%3A00%3A00
3   2017-08-29 14:32    hdfs://filepath=2011-04-30 10%3A00%3A00
5   2017-08-29 15:02    hdfs://filepath=2011-05-31 10%3A00%3A00
6   2017-08-29 15:06    hdfs://filepath=2011-06-30 10%3A00%3A00
7   2017-08-31 10:38    hdfs://filepath=2011-07-31 10%3A00%3A00
8   2017-08-31 10:42    hdfs://filepath=2011-08-31 10%3A00%3A00
9   2017-08-31 11:08    hdfs://filepath=2011-09-30 10%3A00%3A00
10  2017-08-31 11:11    hdfs://filepath=2011-10-31 10%3A00%3A00
11  2017-08-31 11:15    hdfs://filepath=2011-11-30 10%3A00%3A00
12  2017-08-31 11:16    hdfs://filepath=2011-12-31 10%3A00%3A00
4   2018-02-20 13:57    hdfs://filepath=2011-05-31 08%3A00%3A00

Answer 3

我认为一个非常直接的解决方案是：

df = pd.DataFrame({'list_fs': list_fs})
df['TimeStamp_Ordered'] = df.list_fs.str.split(' ').str[6]  +' ' + df.list_fs.str.split(' ').str[7]
df['TimeStamp_Ordered'] = pd.to_datetime(df['TimeStamp_Ordered']
df['FilePath'] = df.list_fs.str.split(' ').str[8] + ' ' + df.list_fs.str.split(' ').str[9]
df = df.drop('list_fs', axis=1)
df = df.sort_values(by='TimeStamp_Ordered',ascending=True)

输出：

     TimeStamp_Ordered                                   FilePath
0  2017-08-25 12:10:00  hdfs://filepath=2011-01-31 16%3A06%3A09.0
1  2017-08-29 14:12:00    hdfs://filepath=2011-02-28 10%3A00%3A00
2  2017-08-29 14:20:00    hdfs://filepath=2011-03-31 10%3A00%3A00
3  2017-08-29 14:32:00    hdfs://filepath=2011-04-30 10%3A00%3A00
5  2017-08-29 15:02:00    hdfs://filepath=2011-05-31 10%3A00%3A00
6  2017-08-29 15:06:00    hdfs://filepath=2011-06-30 10%3A00%3A00
7  2017-08-31 10:38:00    hdfs://filepath=2011-07-31 10%3A00%3A00
8  2017-08-31 10:42:00    hdfs://filepath=2011-08-31 10%3A00%3A00
9  2017-08-31 11:08:00    hdfs://filepath=2011-09-30 10%3A00%3A00
10 2017-08-31 11:11:00    hdfs://filepath=2011-10-31 10%3A00%3A00
11 2017-08-31 11:15:00    hdfs://filepath=2011-11-30 10%3A00%3A00
12 2017-08-31 11:16:00    hdfs://filepath=2011-12-31 10%3A00%3A00
4  2018-02-20 13:57:00    hdfs://filepath=2011-05-31 08%3A00%3A00

从List提取值到Pandas DF

3 个答案: