I am trying to use pandas to parse a ~500 MB tab-delimited data file in the following format:
+-------+---------+-------+---------+-------+---------+
| Time1 | Sensor1 | Time2 | Sensor2 | Time3 | Sensor3 |
+-------+---------+-------+---------+-------+---------+
| 0 | x | 0 | y | 0 | z |
| 1 | x | 2 | y | 0.5 | z |
| 2 | x | 4 | y | 1 | z |
| 3 | x | | | 1.5 | z |
| 4 | x | | | 2 | z |
| 5 | x | | | 2.5 | z |
| | | | | 3 | z |
| | | | | 3.5 | z |
| | | | | 4 | z |
| | | | | 4.5 | z |
| | | | | 5 | z |
+-------+---------+-------+---------+-------+---------+
I would like to get all sensor values on a single time axis, like this:
+------+---------+---------+---------+
| Time | Sensor1 | Sensor2 | Sensor3 |
+------+---------+---------+---------+
| 0 | x | y | z |
| 0.5 | NaN | NaN | z |
| 1 | x | NaN | z |
| 1.5 | NaN | NaN | z |
| 2 | x | y | z |
| 2.5 | NaN | NaN | z |
| 3 | x | NaN | z |
| 3.5 | NaN | NaN | z |
| 4 | x | y | z |
| 4.5 | NaN | NaN | z |
| 5 | x | NaN | z |
+------+---------+---------+---------+
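I have started with the following code. The loop part works fine (although it takes quite a long time). However, the concat part produces many duplicate time index entries and does not merge the sensor values for one timestamp into a single row.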
import pandas as pd
dfList = []
numberOfChannels = 3
for x in range(0, numberOfChannels):
    columns = [numberOfChannels]
    frame = pd.read_table('testinput.csv',
                          usecols=[x*2, x*2+1],
                          index_col=0)
    frame.index.name = 'time'
    frame.index = pd.to_timedelta(frame.index, unit='ms')
    dfList.append(frame)
df = pd.concat(dfList)
Is there a better way to achieve this?
Answer 0 (score: 1)
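You can create a list of series and then use pandas.concat with axis=1 to combine them into a single dataframe.
This solution is functionally the same as @DyZ's, but laid out differently.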
series_list = [df.set_index('Time' + str(i))['Sensor' + str(i)].dropna()
               for i in range(1, len(df.columns) // 2 + 1)]
res = pd.concat(series_list, axis=1)\
        .rename_axis('Time').reset_index()
Setup
import numpy as np
import pandas as pd
df = pd.DataFrame({'Time1': [0, 1, 2, 3, 4, 5, np.nan, np.nan, np.nan, np.nan, np.nan],
                   'Sensor1': ['x', 'x', 'x', 'x', 'x', 'x', np.nan, np.nan, np.nan, np.nan, np.nan],
                   'Time2': [0, 2, 4, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                   'Sensor2': ['y', 'y', 'y', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                   'Time3': [0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5],
                   'Sensor3': ['z', 'z', 'z', 'z', 'z', 'z', 'z', 'z', 'z', 'z', 'z']})
Result
print(res)
Time Sensor1 Sensor2 Sensor3
0 0.0 x y z
1 0.5 NaN NaN z
2 1.0 x NaN z
3 1.5 NaN NaN z
4 2.0 x y z
5 2.5 NaN NaN z
6 3.0 x NaN z
7 3.5 NaN NaN z
8 4.0 x y z
9 4.5 NaN NaN z
10 5.0 x NaN z
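As for why the concat in the question produced duplicate time entries: with its default axis=0, pd.concat appends the frames row-wise, whereas axis=1 aligns them on the index. The small sketch below illustrates the difference on two made-up series (the names s1, s3, stacked and aligned are only for illustration). The explicit sort_index() is there so the row order does not depend on how a given pandas version orders the combined index.

```python
import pandas as pd

# Two toy sensor series on different time grids (values are made up).
s1 = pd.Series(['x', 'x', 'x'], index=[0.0, 1.0, 2.0], name='Sensor1')
s3 = pd.Series(['z', 'z', 'z', 'z', 'z'],
               index=[0.0, 0.5, 1.0, 1.5, 2.0], name='Sensor3')

# Default axis=0 appends rows, so timestamps shared by both sensors
# appear more than once in the resulting index.
stacked = pd.concat([s1.to_frame(), s3.to_frame()])

# axis=1 aligns on the index and gives each sensor its own column;
# timestamps missing for a sensor are filled with NaN.
aligned = pd.concat([s1, s3], axis=1).sort_index()

print(stacked.index.has_duplicates)
print(aligned)
```

Here stacked keeps one row per original sample, while aligned has one row per distinct timestamp, which is the layout asked for in the question.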
Answer 1 (score: 0)
The following code works for me:
df = pd.read_table('testinput.csv')
pd.concat([df[['Time{}'.format(i), 'Sensor{}'.format(i)]]
             .set_index('Time{}'.format(i))
           for i in range(1, numberOfChannels + 1)], axis=1)\
  .dropna(how='all')
# Sensor1 Sensor2
#0.0 1.0 1.0
#1.0 2.0 NaN
#2.0 1.0 2.0
#3.0 2.0 NaN
#4.0 1.0 1.0
#5.0 2.0 NaN
#6.0 1.0 NaN
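A possible end-to-end variant is sketched below. It reads the file once rather than once per channel, drops the rows where a channel has no timestamp before indexing, and converts the merged index to timedeltas as the question does. The inline string raw stands in for the real file and number_of_channels is a hypothetical setting; both are assumptions for illustration. The rows with empty time cells are dropped up front on the assumption that repeated NaN index labels could otherwise get in the way of alignment.

```python
import io
import pandas as pd

# Stand-in for the real tab-separated file. Empty cells mark rows where a
# channel has already run out of samples. The contents are illustrative only.
raw = (
    "Time1\tSensor1\tTime2\tSensor2\n"
    "0\tx\t0\ty\n"
    "1\tx\t0.5\ty\n"
    "2\tx\t1\ty\n"
    "\t\t1.5\ty\n"
)

number_of_channels = 2  # hypothetical; set this to the real channel count

# Read once, then slice column pairs instead of re-reading per channel.
df = pd.read_csv(io.StringIO(raw), sep="\t")

series_list = []
for i in range(1, number_of_channels + 1):
    time_col, sensor_col = f"Time{i}", f"Sensor{i}"
    # Drop rows without a timestamp so the index has no NaN labels.
    pair = df[[time_col, sensor_col]].dropna(subset=[time_col])
    series_list.append(pair.set_index(time_col)[sensor_col])

# Align all channels on one time axis and order the rows by time.
merged = pd.concat(series_list, axis=1).sort_index()

# Interpret the time values as milliseconds, as in the question's code.
merged.index = pd.to_timedelta(merged.index, unit="ms")
merged.index.name = "time"

print(merged)
```

For the real data, replacing io.StringIO(raw) with the file path should be enough, provided the columns follow the same TimeN/SensorN naming.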