I have a DataFrame like this:
In[2]: import pandas as pd
...: flow = {
...: 'Date':['09/19','09/19','09/19','09/19','09/19','09/19','10/19','10/19','10/19','10/19','10/19','10/19','10/19'],
...: 'Time':['23:00','23:10','23:20','23:30','23:40','23:50','00:00','00:10','00:20','00:30','00:40','00:50','01:00'],
...: 'Name':['P10 ','P10 ','P10 ','P10 ','P5 ','P5 ','P5 ','P10 ','P10 ','P10 ','P6 ','P6 ','P6 '],
...: 'Data':['10000','10002','10004','10005','10007','10008','10010','10012','10013','10014','10020','10022','10023']
...: }
...: flowdata = pd.DataFrame(flow)
...: flowdata = flowdata[['Date', 'Time', 'Name', 'Data']] # To preserve the columns order
...:
In[3]: flowdata
Out[3]:
Date Time Name Data
0 09/19 23:00 P10 10000
1 09/19 23:10 P10 10002
2 09/19 23:20 P10 10004
3 09/19 23:30 P10 10005
4 09/19 23:40 P5 10007
5 09/19 23:50 P5 10008
6 10/19 00:00 P5 10010
7 10/19 00:10 P10 10012
8 10/19 00:20 P10 10013
9 10/19 00:30 P10 10014
10 10/19 00:40 P6 10020
11 10/19 00:50 P6 10022
12 10/19 01:00 P6 10023
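I want to split it into separate DataFrames based on runs of "consecutive" rows that share the same value in the 'Name' column.
I tried the following code and got this: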
In[3]: flowdata[flowdata['Name'] == 'P5 ']
Out[3]:
Date Time Name Data
4 09/19 23:40 P5 10007
5 09/19 23:50 P5 10008
6 10/19 00:00 P5 10010
The problem appears when I try to slice by the name 'P10 ' (in this case). The result has a jump in date and time (from index 3 to 7).
In[4]: flowdata[flowdata['Name'] == 'P10 ']
Out[4]:
Date Time Name Data
0 09/19 23:00 P10 10000
1 09/19 23:10 P10 10002
2 09/19 23:20 P10 10004
3 09/19 23:30 P10 10005
7 10/19 00:10 P10 10012
8 10/19 00:20 P10 10013
9 10/19 00:30 P10 10014
I would like to get two DataFrames, one for each run of "consecutive" rows with the same value in the 'Name' column. Like this:
DataFrame 1 for First Name "P10":
Date Time Name Data
0 09/19 23:00 P10 10000
1 09/19 23:10 P10 10002
2 09/19 23:20 P10 10004
3 09/19 23:30 P10 10005
DataFrame 2 for Second Name "P10":
Date Time Name Data
7 10/19 00:10 P10 10012
8 10/19 00:20 P10 10013
9 10/19 00:30 P10 10014
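I looked for a built-in function or method to do this and did not find one. So I decided to iterate over the rows, check the condition, and build a list of indexes to slice the main DataFrame with. I ended up with this code: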
In[6]: name_list_with_start_end_indexes = []
...: current_name = flowdata.iloc[0]['Name']
...: current_start_index = flowdata.index[0]
...: for i in flowdata.index:
...:     next_name = flowdata.loc[i]['Name']
...:     if not (current_name == next_name):
...:         current_end_index = i - 1
...:         name_list_with_start_end_indexes.append([current_name, current_start_index, current_end_index])
...:         current_start_index = i
...:         current_name = next_name
...: name_list_with_start_end_indexes.append([current_name, current_start_index, i])
...:
In[7]: name_list_with_start_end_indexes
Out[7]:
[['P10 ', 0, 3],
['P5 ', 4, 6],
['P10 ', 7, 9],
['P6 ', 10, 12]]
In[8]: name_A = name_list_with_start_end_indexes[2]
In[9]: name_A
Out[9]:
['P10 ', 7, 9]
In[10]: flowdata[name_A[1]:name_A[2]+1]
Out[10]:
Date Time Name Data
7 10/19 00:10 P10 10012
8 10/19 00:20 P10 10013
9 10/19 00:30 P10 10014
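The problem is that this code runs slowly on 13,000 rows (files with this data usually have that many rows and 11 columns).
Does anyone know a better way to get the same result, but faster?
Thanks in advance.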
Answer 0 (score: 2)
How about labeling these groups?
If you are fine with that, you can do this:
In [20]: flowdata['group'] = (flowdata['Name'] != flowdata['Name'].shift()).astype(int).cumsum()
In [21]: flowdata
Out[21]:
Date Time Name Data group
0 09/19 23:00 P10 10000 1
1 09/19 23:10 P10 10002 1
2 09/19 23:20 P10 10004 1
3 09/19 23:30 P10 10005 1
4 09/19 23:40 P5 10007 2
5 09/19 23:50 P5 10008 2
6 10/19 00:00 P5 10010 2
7 10/19 00:10 P10 10012 3
8 10/19 00:20 P10 10013 3
9 10/19 00:30 P10 10014 3
10 10/19 00:40 P6 10020 4
11 10/19 00:50 P6 10022 4
12 10/19 01:00 P6 10023 4
Then you can access these groups by doing the following:
In [24]: flowdata[flowdata['group'] == 1]
Out[24]:
Date Time Name Data group
0 09/19 23:00 P10 10000 1
1 09/19 23:10 P10 10002 1
2 09/19 23:20 P10 10004 1
3 09/19 23:30 P10 10005 1
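If you would rather get every consecutive run as its own DataFrame in one pass, instead of filtering one label at a time, one option is to hand the same run labels to groupby. This is only a sketch and not part of the answer above: the names run_id and pieces are made up for illustration, and the sample frame is a shortened version of the one in the question with the trailing spaces in 'Name' dropped.

```python
import pandas as pd

# Shortened version of the question's data (assumption: trailing spaces removed).
flowdata = pd.DataFrame({
    'Time': ['23:00', '23:10', '23:40', '23:50', '00:10', '00:20', '00:40'],
    'Name': ['P10', 'P10', 'P5', 'P5', 'P10', 'P10', 'P6'],
    'Data': [10000, 10002, 10007, 10008, 10012, 10013, 10020],
})

# Label each run of consecutive identical names: 1, 1, 2, 2, 3, 3, 4.
run_id = (flowdata['Name'] != flowdata['Name'].shift()).cumsum()

# One DataFrame per run, keyed by its run label (hypothetical name: pieces).
pieces = {label: chunk for label, chunk in flowdata.groupby(run_id)}

# The two separate 'P10' runs end up under different keys.
first_p10 = pieces[1]
second_p10 = pieces[3]
print(second_p10)
```

Because the labels are computed with vectorized operations rather than a Python loop over the rows, this should scale comfortably to the 13,000 rows mentioned in the question, though I have not timed it on that data.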
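The idea here is to compare each row with the previous one, thanks to shift: if the Name of the row differs from the previous one, the comparison is True, which is then converted to 1 thanks to .astype(int).
Then we use cumsum to keep a running count of the 1s (that is, of the True values, as explained above).
To put it more simply, we are really counting the number of changes: the count goes up by one every time you switch from one group to the next.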
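To see the three steps (comparison with shift, conversion to integers, cumulative sum) one at a time, here is a small sketch on a toy Series. The variable names changed, as_int and group are only there for illustration.

```python
import pandas as pd

names = pd.Series(['P10', 'P10', 'P5', 'P5', 'P10', 'P6'])

# Step 1: compare each value with the previous one.
# The first row has no predecessor (shift gives NaN), so it counts as a change.
changed = names != names.shift()

# Step 2: turn True/False into 1/0.
as_int = changed.astype(int)

# Step 3: the running total of the changes is the group label.
group = as_int.cumsum()

print(changed.tolist())  # [True, False, True, False, True, True]
print(as_int.tolist())   # [1, 0, 1, 0, 1, 1]
print(group.tolist())    # [1, 1, 2, 2, 3, 4]
```

Note that the second run of 'P10' gets label 3, not 1, which is exactly what keeps it apart from the first run.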