The large log file (.txt) from my experiment, which contains up to 100,000 entries, has the following structure:
ROUTINE TEMPERATURE VOLTAGE WAVELENGTH
_______________________________________________
CHANGE T 75 0 560
CHANGE T 80 0 560
CHANGE T 85 0 560
CHANGE T 90 0 560
OSL 75 20 570
OSL 75 20 580
OSL 75 20 590
OSL 75 20 600
CHANGE T 75 0 560
CHANGE T 80 0 560
CHANGE T 85 0 560
CHANGE T 90 0 560
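I load the log file into Python with read_table from pandas. I want to split the resulting DataFrame into smaller DataFrames based on the value in the first column, so that the result looks like this: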
**DATAFRAME 1:**
CHANGE T 75 0 560
CHANGE T 80 0 560
CHANGE T 85 0 560
CHANGE T 90 0 560
**DATAFRAME 2:**
OSL 75 20 570
OSL 75 20 580
OSL 75 20 590
OSL 75 20 600
**DATAFRAME 3:**
CHANGE T 75 0 560
CHANGE T 80 0 560
CHANGE T 85 0 560
CHANGE T 90 0 560
First, I tried splitting them using the indices at which the value in the first column changes:
indexSplit = []  # list containing the boundary indices
prevRoutine = log['ROUTINE'][0]  # log is the complete dataframe
i = 1
while i < len(log):
    if prevRoutine != log['ROUTINE'][i]:
        indexSplit.append(i)
        prevRoutine = log['ROUTINE'][i]
    i += 1  # advance to the next row; without this the loop never ends
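However, given the size of the log file, this (obviously) takes a long time. Is there an elegant way to do this with pandas? The problem I keep running into is that the same first-column value occurs in several separate series of rows, so I always get dataframe 1 and dataframe 3 merged into a single result.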
Answer 0 (score: 3)
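You can use a list comprehension with `groupby` to loop over the groups, grouping by a helper Series `s`. To build `s`, compare the column with its `shift`ed copy using `ne` (the method form of `!=`), then take the `cumsum` of the result: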
s = df['ROUTINE'].ne(df['ROUTINE'].shift()).cumsum()
print (s)
0 1
1 1
2 1
3 1
4 2
5 2
6 2
7 2
8 3
9 3
10 3
11 3
Name: ROUTINE, dtype: int32
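To make the three chained steps easier to follow, here is a minimal sketch that puts each intermediate result in its own column. The five-row frame and the column names `previous`, `changed` and `run_id` are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical miniature of the log: two runs of "CHANGE T" separated by "OSL".
df = pd.DataFrame({"ROUTINE": ["CHANGE T", "CHANGE T", "OSL", "OSL", "CHANGE T"]})

steps = pd.DataFrame({
    "ROUTINE": df["ROUTINE"],
    # shift() moves every value down one row; the first row becomes NaN
    "previous": df["ROUTINE"].shift(),
    # True wherever a row differs from the row above it (always True for row 0)
    "changed": df["ROUTINE"].ne(df["ROUTINE"].shift()),
})
# A running sum of the True flags gives each consecutive run its own integer
steps["run_id"] = steps["changed"].cumsum()
print(steps)
```

Every `True` in `changed` marks the start of a new run, so the cumulative sum increases by one exactly at each boundary and stays constant inside a run.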
dfs = [g for i,g in df.groupby(df['ROUTINE'].ne(df['ROUTINE'].shift()).cumsum())]
print (dfs)
[ ROUTINE TEMPERATURE VOLTAGE WAVELENGTH
0 CHANGE T 75 0 560
1 CHANGE T 80 0 560
2 CHANGE T 85 0 560
3 CHANGE T 90 0 560, ROUTINE TEMPERATURE VOLTAGE WAVELENGTH
4 OSL 75 20 570
5 OSL 75 20 580
6 OSL 75 20 590
7 OSL 75 20 600, ROUTINE TEMPERATURE VOLTAGE WAVELENGTH
8 CHANGE T 75 0 560
9 CHANGE T 80 0 560
10 CHANGE T 85 0 560
11 CHANGE T 90 0 560]
print (dfs[0])
ROUTINE TEMPERATURE VOLTAGE WAVELENGTH
0 CHANGE T 75 0 560
1 CHANGE T 80 0 560
2 CHANGE T 85 0 560
3 CHANGE T 90 0 560
print (dfs[1])
ROUTINE TEMPERATURE VOLTAGE WAVELENGTH
4 OSL 75 20 570
5 OSL 75 20 580
6 OSL 75 20 590
7 OSL 75 20 600
print (dfs[2])
ROUTINE TEMPERATURE VOLTAGE WAVELENGTH
8 CHANGE T 75 0 560
9 CHANGE T 80 0 560
10 CHANGE T 85 0 560
11 CHANGE T 90 0 560
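If the split is needed in more than one place, the one-liner can be wrapped in a small helper. This is only a sketch: the function name `split_on_change` and the sample frame `log` below are assumptions, not part of the original answer.

```python
import pandas as pd

def split_on_change(frame, column):
    """Return a list of sub-frames, one per consecutive run of equal values in `column`."""
    # Integer label that increases by one each time the value changes
    run_id = frame[column].ne(frame[column].shift()).cumsum()
    # The labels are already increasing, so the groups come out in file order
    return [group for _, group in frame.groupby(run_id, sort=False)]

# Hypothetical sample shaped like the log in the question
log = pd.DataFrame({
    "ROUTINE": ["CHANGE T", "CHANGE T", "OSL", "OSL", "OSL", "CHANGE T"],
    "TEMPERATURE": [75, 80, 75, 75, 75, 75],
    "VOLTAGE": [0, 0, 20, 20, 20, 0],
    "WAVELENGTH": [560, 560, 570, 580, 590, 560],
})

parts = split_on_change(log, "ROUTINE")
for part in parts:
    print(part, end="\n\n")
```

Each sub-frame keeps its original row index, as in the printed output above. Calling `reset_index(drop=True)` on a piece would renumber it from zero if that is preferred.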
The extra step is needed because grouping by the first column alone with `groupby` yields only 2 groups:
dfs = [g for i,g in df.groupby('ROUTINE')]
print (dfs)
[ ROUTINE TEMPERATURE VOLTAGE WAVELENGTH
0 CHANGE T 75 0 560
1 CHANGE T 80 0 560
2 CHANGE T 85 0 560
3 CHANGE T 90 0 560
8 CHANGE T 75 0 560
9 CHANGE T 80 0 560
10 CHANGE T 85 0 560
11 CHANGE T 90 0 560, ROUTINE TEMPERATURE VOLTAGE WAVELENGTH
4 OSL 75 20 570
5 OSL 75 20 580
6 OSL 75 20 590
7 OSL 75 20 600]
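The difference between the two groupings can also be checked side by side. The sketch below, on a hypothetical six-row frame, groups once by value and once by run, and keeps the routine name next to the run number so that each piece can be looked up by a readable key. The name `run_id` is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical sample: "CHANGE T" appears in two separate runs
df = pd.DataFrame({
    "ROUTINE": ["CHANGE T", "CHANGE T", "OSL", "OSL", "CHANGE T", "CHANGE T"],
    "TEMPERATURE": [75, 80, 75, 75, 75, 80],
})

# Run label, renamed so it does not share a name with the ROUTINE column
run_id = df["ROUTINE"].ne(df["ROUTINE"].shift()).cumsum().rename("run_id")

# Grouping by value merges both "CHANGE T" runs into one group
by_value = {key: group for key, group in df.groupby("ROUTINE")}

# Grouping by (run label, value) keeps the runs apart and labels each one
by_run = {key: group for key, group in df.groupby([run_id, "ROUTINE"])}

print(len(by_value), len(by_run))
print(list(by_run))
```

Here `by_value` has two entries while `by_run` has three, one per consecutive run, which matches the three dataframes asked for in the question.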