I have parsed a text file to pull out the relevant data. I then combined the variables (dlOrbit2, imageId3, imageStart4, imageEnd4) to build a list of 4 strings for each record.
combined = str(','.join([dlOrbit2, imageId3, imageStart4, imageEnd4]))
strSplit = combined.split(',')
print strSplit
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39']
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39']
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39']
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39']
['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37']
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57']
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57']
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57']
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57']
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57']
['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53']
['46290', '514628', '2016-10-26 13:12:54', '2016-10-26 13:13:13']
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']
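I want to match and group rows on the first column, which gives 46284 x 4, 46288 x 6, 46290 x 2 and 46291 x 4. Within each group, I want the earliest time from element 2 and the latest time from element 3. So the desired output would be: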
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39']
['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:54:57']
['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:13:13']
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']
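Each list will always have 4 elements, but the number of rows in each group (first column) will vary.
I am going to export these results to a CSV file. However, I only need help with the part above.
Answer 0: (score: 1)
Using pandas: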
import pandas as pd
dat = [['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53'],
['46290', '514629', '2016-10-26 13:12:54', '2016-10-26 13:13:13'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']]
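# Drop exact duplicate rows before aggregating.
df = pd.DataFrame(dat).drop_duplicates()
# Per group (column 0): earliest start (column 2) and latest end (column 3).
df_times = df.groupby([0]).agg({2: 'min', 3: 'max'}).reset_index()
# Merge back on (group, earliest start) to recover column 1; '3_x' is the aggregated end time.
df_times.merge(df, on=[0, 2])[[0, 1, 2, '3_x']]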
Output:
0 46284 514607 2016-10-26 02:43:46 2016-10-26 02:48:39
1 46288 514626 2016-10-26 09:48:26 2016-10-26 09:54:57
2 46290 514628 2016-10-26 13:12:34 2016-10-26 13:13:13
3 46291 514738 2016-10-26 14:56:39 2016-10-26 14:59:06
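Answer 1: (score: 1)
As a Python newcomer, I would like to see examples that use basic Python features before reaching for the Big Hammers.
If this can be done in under a dozen lines of code without importing any modules, that is the version I would hope to learn first.
Perhaps indexing into a list of lists with a double index is not well understood?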
combined = [['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'], ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'], ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'], ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'], ['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'], ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'], ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'], ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'], ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'], ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'], ['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53'], ['46290', '514629', '2016-10-26 13:12:54', '2016-10-26 13:13:13'], ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'], ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'], ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'], ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']]
combined[0][0] # double index
Out[28]: '46284'
combined[2][2:] # slice
Out[29]: ['2016-10-26 02:43:46', '2016-10-26 02:48:39']
max(combined[2][2:]) # strings compare lexicographically, which matches time order for this format
Out[30]: '2016-10-26 02:48:39'
Why not define a function that applies these basic Python tools to the input lists before grouping?
Answer 2: (score: 0)
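One way to follow this suggestion is sketched below, assuming the rows have already been collected into a list of lists like `combined` above. The helper name `summarize` is made up for illustration. It uses only built-in types and relies on the fact that timestamps in `YYYY-MM-DD HH:MM:SS` form sort correctly as plain strings.

```python
def summarize(rows):
    """Collapse rows that share the same first column into one row each.

    For every key, keep the second column of the row with the earliest
    start time, the earliest start time, and the latest end time.
    """
    groups = {}  # key -> [key, image_id, earliest_start, latest_end]
    order = []   # keys in the order they were first seen
    for orbit, image_id, start, end in rows:
        if orbit not in groups:
            groups[orbit] = [orbit, image_id, start, end]
            order.append(orbit)
            continue
        best = groups[orbit]
        if start < best[2]:
            # An earlier start also decides which image id is reported.
            best[1], best[2] = image_id, start
        if end > best[3]:
            best[3] = end
    return [groups[orbit] for orbit in order]


rows = [
    ['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53'],
    ['46290', '514629', '2016-10-26 13:12:54', '2016-10-26 13:13:13'],
]
for row in summarize(rows):
    print(row)
```

Because the grouping goes through a dictionary, this sketch does not require rows with the same first column to be adjacent in the input.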
data = [
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53'],
['46290', '514629', '2016-10-26 13:12:54', '2016-10-26 13:13:13'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']
]
from itertools import groupby, tee
import pprint
res = []
for k, g in groupby(data, key=lambda x: x[0]):
it1, it2, it3 = tee(g, 3)
res.append(next(it1)[:2] + [min(x[2] for x in it2), max(x[3] for x in it3)])
pprint.pprint(res)
Output:
[['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:54:57'],
['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:13:13'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']]
for k, g in groupby(data, key=lambda x: x[0])
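groups consecutive rows by their first column. Each iteration yields a tuple whose first item is the key used for grouping and whose second item is an iterator over the rows in that group.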
it1, it2, it3 = tee(g, 3)
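splits the group iterator into three independent iterators, each of which yields exactly the same items. Finally, the result is built by taking the first two columns from the first item of the group and running min and max over the other two iterators.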
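Because `groupby` only merges rows that sit next to each other, the approach above depends on the input already being ordered by the first column. The following is a minimal sketch for the case where that cannot be assumed: it sorts the rows first and turns each group into a list instead of using `tee`. The helper name `collapse` and the sample `unsorted_rows` are hypothetical, and sorting by start time within a group is a choice made here so that the first two columns come from the earliest row.

```python
from itertools import groupby
from operator import itemgetter


def collapse(rows):
    """Return one row per key: first two columns of the earliest row,
    the earliest start time and the latest end time."""
    result = []
    # Sort by key, then by start time, so every group is contiguous
    # and begins with its earliest row.
    ordered = sorted(rows, key=itemgetter(0, 2))
    for _, group in groupby(ordered, key=itemgetter(0)):
        items = list(group)  # a list can be scanned more than once
        earliest = min(r[2] for r in items)
        latest = max(r[3] for r in items)
        result.append(items[0][:2] + [earliest, latest])
    return result


unsorted_rows = [
    ['46290', '514629', '2016-10-26 13:12:54', '2016-10-26 13:13:13'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53'],
    ['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'],
]
print(collapse(unsorted_rows))
```

Materializing each group with `list` trades a little memory for simpler code; for groups of a handful of rows, as in the question, that cost is negligible.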