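I have some messy sensor reading data that looks like this. Each record (records vary in length) is terminated by a "----" line, and the records are stacked on top of each other in a single column. Is there a way to condense this into a DataFrame where each row is one record?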
test = pd.DataFrame({"Messy":["21/12/2017 11:12:48","Port:4","Reading 1: 1","----","21/12/2017 11:13:48","Port:4","Reading 1: 2","Reading 2: 2.5","----"]})
test
Messy
0 21/12/2017 11:12:48
1 Port:4
2 Reading 1: 1
3 ----
4 21/12/2017 11:13:48
5 Port:4
6 Reading 1: 2
7 Reading 2: 2.5
8 ----
What I would like to end up with is something like this:
target = pd.DataFrame({"Time":["21/12/2017 11:12:48","21/12/2017 11:13:48"],"Port":["Port:4","Port:4"],"Field1":['Reading 1: 1','Reading 1: 2'],"Field2":['','Reading 2: 2.5']})
target
         Field1          Field2    Port                 Time
0  Reading 1: 1                  Port:4  21/12/2017 11:12:48
1  Reading 1: 2  Reading 2: 2.5  Port:4  21/12/2017 11:13:48
Answer 0 (score: 2)
Obviously it does depend on the data, but you can try:
# flag the separator rows
m = test['Messy'].str.startswith('----')
# cumulative count of separators gives each record its own group id
test['g'] = m.cumsum()
# drop the separator rows
df = test[~m].copy()
# position of each line within its record
df['c'] = df.groupby('g').cumcount()
print(df)
Messy g c
0 21/12/2017 11:12:48 0 0
1 Port:4 0 1
2 Reading 1: 1 0 2
4 21/12/2017 11:13:48 1 0
5 Port:4 1 1
6 Reading 1: 2 1 2
7 Reading 2: 2.5 1 3
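# pivot: one row per record, one column per position
# (keyword arguments are required by recent pandas versions)
df = df.pivot(index='g', columns='c', values='Messy')
print(df)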
c                    0       1             2               3
g
0  21/12/2017 11:12:48  Port:4  Reading 1: 1            None
1  21/12/2017 11:13:48  Port:4  Reading 1: 2  Reading 2: 2.5
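The pivoted frame still has numeric column labels and a missing value where the asker's target has named columns and an empty string. A minimal sketch of that last step follows, assuming the first two lines of every record are always the timestamp and the port; the `Time`/`Port`/`FieldN` naming scheme and the variable names `wide`, `names` and `result` are illustrative choices, not part of the original answer.

```python
import pandas as pd

test = pd.DataFrame({"Messy": ["21/12/2017 11:12:48", "Port:4", "Reading 1: 1", "----",
                               "21/12/2017 11:13:48", "Port:4", "Reading 1: 2",
                               "Reading 2: 2.5", "----"]})

# Same steps as in the answer: flag separators, number the records, drop separators.
m = test["Messy"].str.startswith("----")
df = test.assign(g=m.cumsum())[~m].copy()
df["c"] = df.groupby("g").cumcount()
wide = df.pivot(index="g", columns="c", values="Messy")

# Assumed naming scheme: the first two positions are Time and Port, the rest FieldN.
names = ["Time", "Port"] + [f"Field{i}" for i in range(1, wide.shape[1] - 1)]
wide.columns = names

# Replace missing readings with empty strings and drop the group index.
result = wide.fillna("").reset_index(drop=True)
print(result)
```

Because the number of `FieldN` names is derived from the width of the pivoted frame, the same code should cope with records that carry more readings, as long as the timestamp and port still come first.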
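Answer 1 (score: 2)
Here is one solution. Your data is messy. This method assumes that all of your data is organized in blocks of 4 rows, which become the 4 columns of each record.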
import numpy as np, pandas as pd
test = pd.DataFrame({"Messy":["21/12/2017 11:12:48","Port:4","Reading 1: 1","----","21/12/2017 11:13:48","Port:4","Reading 1: 2","Reading 2: 2.5","----"]})
lst = [np.hstack(np.hstack(i)) for i in zip((test.iloc[4*i:4*i+4].values \
for i in range(int(len(test.index)/4))))]
df = pd.DataFrame(lst, columns=['Date', 'Port', 'Field1', 'Field2']).replace({'----': ''})
# Date Port Field1 Field2
# 0 21/12/2017 11:12:48 Port:4 Reading 1: 1
# 1 21/12/2017 11:13:48 Port:4 Reading 1: 2 Reading 2: 2.5
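The fixed block size of 4 happens to fit the sample because the first record plus its separator is also four lines long. If records can have other lengths, one possible alternative is to split on the separator itself. The sketch below uses `itertools.groupby` for that; it is a different technique from the answer above, and the names `messy`, `records`, `width` and `columns` are illustrative.

```python
from itertools import groupby

import pandas as pd

messy = ["21/12/2017 11:12:48", "Port:4", "Reading 1: 1", "----",
         "21/12/2017 11:13:48", "Port:4", "Reading 1: 2", "Reading 2: 2.5", "----"]

# Group consecutive lines by "is this a separator?" and keep the non-separator runs.
records = [list(chunk)
           for is_sep, chunk in groupby(messy, key=lambda s: s.startswith("----"))
           if not is_sep]

# Pad shorter records with empty strings so every row has the same width.
width = max(len(r) for r in records)
padded = [r + [""] * (width - len(r)) for r in records]

# Assumed naming scheme: Date, Port, then Field1..FieldN.
columns = ["Date", "Port"] + [f"Field{i}" for i in range(1, width - 1)]
df = pd.DataFrame(padded, columns=columns)
print(df)
```

This version assumes that two separator lines never appear back to back with real data expected between them; an empty record would simply be skipped.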
Answer 2 (score: 2)
Assuming you have at most 4 columns and all records are in the same order, using pandas, re and io is another solution:
import pandas as pd
import io
import re
d = {"Messy":["21/12/2017 11:12:48","Port:4","Reading 1: 1","----",
"21/12/2017 11:13:48","Port:4","Reading 1: 2","Reading 2: 2.5",
"----"]}
test = pd.read_csv(io.StringIO(re.sub(r',----,?','\n', ','.join(d['Messy']))),
names=['Time','Port','Field1','Field2'])
In [13]:
print(test)
Out[13]:
Time Port Field1 Field2
0 21/12/2017 11:12:48 Port:4 Reading 1: 1 NaN
1 21/12/2017 11:13:48 Port:4 Reading 1: 2 Reading 2: 2.5
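You can extend this solution by adding more column names to the names argument of the read_csv function; for example, if the records in your data have at most 10 columns, just map them to 10 column names.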
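If the maximum number of columns is not known in advance, the list of names could be derived from the data instead of typed by hand. The following is only a sketch: it assumes that no value contains a comma, and the `Time`/`Port`/`FieldN` naming scheme and the variables `text`, `width` and `names` are illustrative.

```python
import io
import re

import pandas as pd

messy = ["21/12/2017 11:12:48", "Port:4", "Reading 1: 1", "----",
         "21/12/2017 11:13:48", "Port:4", "Reading 1: 2", "Reading 2: 2.5", "----"]

# Join everything with commas, then turn each separator into a line break.
text = re.sub(r",----,?", "\n", ",".join(messy))

# The widest record decides how many column names are needed.
width = max(line.count(",") + 1 for line in text.splitlines())
names = ["Time", "Port"] + [f"Field{i}" for i in range(1, width - 1)]

test = pd.read_csv(io.StringIO(text), names=names)
print(test)
```

Records shorter than the widest one are padded with NaN by read_csv, as in the output shown in the answer.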