Question

I have a DataFrame like this:

In[2]: import pandas as pd
  ...: flow = {
  ...:     'Date':['09/19','09/19','09/19','09/19','09/19','09/19','10/19','10/19','10/19','10/19','10/19','10/19','10/19'],
  ...:     'Time':['23:00','23:10','23:20','23:30','23:40','23:50','00:00','00:10','00:20','00:30','00:40','00:50','01:00'],
  ...:     'Name':['P10  ','P10  ','P10  ','P10  ','P5   ','P5   ','P5   ','P10  ','P10  ','P10  ','P6   ','P6   ','P6   '],
  ...:     'Data':['10000','10002','10004','10005','10007','10008','10010','10012','10013','10014','10020','10022','10023']
  ...: }
  ...: flowdata = pd.DataFrame(flow)
  ...: flowdata = flowdata[['Date', 'Time', 'Name', 'Data']]  # To preserve the columns order
  ...: 

In[3]: flowdata
Out[3]:   
     Date   Time   Name   Data
0   09/19  23:00  P10    10000
1   09/19  23:10  P10    10002
2   09/19  23:20  P10    10004
3   09/19  23:30  P10    10005
4   09/19  23:40  P5     10007
5   09/19  23:50  P5     10008
6   10/19  00:00  P5     10010
7   10/19  00:10  P10    10012
8   10/19  00:20  P10    10013
9   10/19  00:30  P10    10014
10  10/19  00:40  P6     10020
11  10/19  00:50  P6     10022
12  10/19  01:00  P6     10023

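I want to split it into other DataFrames based on "continuous" rows, that is, consecutive rows with the same value in the 'Name' column. I tried the following code and got this: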

In[3]: flowdata[flowdata['Name'] == 'P5   ']
Out[3]: 
    Date   Time   Name   Data
4  09/19  23:40  P5     10007
5  09/19  23:50  P5     10008
6  10/19  00:00  P5     10010

The problem shows up when I slice by the name 'P10  ' (in this case): there is a jump in date and time (from index 3 to index 7).

In[4]: flowdata[flowdata['Name'] == 'P10  ']
Out[4]: 
    Date   Time   Name   Data
0  09/19  23:00  P10    10000
1  09/19  23:10  P10    10002
2  09/19  23:20  P10    10004
3  09/19  23:30  P10    10005
7  10/19  00:10  P10    10012
8  10/19  00:20  P10    10013
9  10/19  00:30  P10    10014

I would like to get two DataFrames based on the "continuous" rows of values in the 'Name' column, like this:

DataFrame 1 for First Name "P10":
        Date   Time   Name   Data
    0  09/19  23:00  P10    10000
    1  09/19  23:10  P10    10002
    2  09/19  23:20  P10    10004
    3  09/19  23:30  P10    10005

DataFrame 2 for Second Name "P10":
        Date   Time   Name   Data
    7  10/19  00:10  P10    10012
    8  10/19  00:20  P10    10013
    9  10/19  00:30  P10    10014

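I looked for a built-in function or method that does this and did not find one. So I decided to iterate over the rows, check the condition, and build a list of indexes for slicing the main DataFrame. I ended up with this code: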

In[6]: name_list_with_start_end_indexes = []
  ...: current_name = flowdata.iloc[0]['Name']
  ...: current_start_index = flowdata.index[0]
  ...: for i in flowdata.index:
  ...:     next_name = flowdata.loc[i]['Name']
  ...:     if not (current_name == next_name):
  ...:         current_end_index = i - 1
  ...:         name_list_with_start_end_indexes.append([current_name, current_start_index, current_end_index])
  ...:         current_start_index = i
  ...:         current_name = next_name
  ...: name_list_with_start_end_indexes.append([current_name,current_start_index, i])
  ...: 
In[7]: name_list_with_start_end_indexes
Out[7]: 
    [['P10  ', 0, 3], 
     ['P5   ', 4, 6], 
     ['P10  ', 7, 9], 
     ['P6   ', 10, 12]]

In[8]: name_A = name_list_with_start_end_indexes[2]
In[9]: name_A
Out[9]: 
['P10  ', 7, 9]
In[10]: flowdata[name_A[1]:name_A[2]+1]
Out[10]: 

    Date   Time   Name   Data
7  10/19  00:10  P10    10012
8  10/19  00:20  P10    10013
9  10/19  00:30  P10    10014

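The problem is that this code runs slowly on 13000 rows (files with this data usually have about that many rows and 11 columns).

Does anyone know a better way to get the same result, but faster?

Thanks in advance.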

Answer 1

How about labeling these groups?

If you are fine with that, you can do this:

In [20]: flowdata['group'] = (flowdata['Name'] != flowdata['Name'].shift()).astype(int).cumsum()

In [21]: flowdata
Out[21]:
     Date   Time   Name   Data  group
0   09/19  23:00  P10    10000      1
1   09/19  23:10  P10    10002      1
2   09/19  23:20  P10    10004      1
3   09/19  23:30  P10    10005      1
4   09/19  23:40  P5     10007      2
5   09/19  23:50  P5     10008      2
6   10/19  00:00  P5     10010      2
7   10/19  00:10  P10    10012      3
8   10/19  00:20  P10    10013      3
9   10/19  00:30  P10    10014      3
10  10/19  00:40  P6     10020      4
11  10/19  00:50  P6     10022      4
12  10/19  01:00  P6     10023      4

Then you can access the groups by doing the following:

In [24]: flowdata[flowdata['group'] == 1]
Out[24]:
    Date   Time   Name   Data  group
0  09/19  23:00  P10    10000      1
1  09/19  23:10  P10    10002      1
2  09/19  23:20  P10    10004      1
3  09/19  23:30  P10    10005      1

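The idea here is to compare each row with the previous one, thanks to shift: if the Name of the row differs from the previous one, the comparison is True, which is then converted to 1 thanks to .astype(int). Then we use cumsum to incrementally count the number of 1s (hence the group values seen above).

To make it simpler to understand: we are actually counting the number of Name changes, incrementing each time we switch from one group to another.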
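
To see each step on its own, the expression can be split into intermediate values. This is a small sketch on a made-up six-row Series (the variable names previous, changed, as_int and group are illustrative only), not the data from the question.

```python
import pandas as pd

names = pd.Series(['P10', 'P10', 'P5', 'P5', 'P10', 'P6'])

previous = names.shift()       # every value moved down one row; first row is missing
changed = names != previous    # True wherever a new run starts
as_int = changed.astype(int)   # True -> 1, False -> 0
group = as_int.cumsum()        # running total of changes = run number

# Side-by-side view of the intermediate steps.
print(pd.DataFrame({'Name': names, 'previous': previous,
                    'changed': changed, 'group': group}))
# group is 1, 1, 2, 2, 3, 4
```

The first row always counts as a change because it is compared against a missing value, which is why the numbering starts at 1.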
