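The "top" and "bottom" temperature/humidity sensors on device "MOR4" were swapped between August 10 and August 11.
What is the most Pythonic way to correct this in a "long format" dataset?
Data structure:
Data sample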
data.head()
bottom_temperature bottom_humidity top_temperature top_humidity external_temperature published_at external_humidity short_id weight
0 34.48 44.81 33.56 47.62 17.88 2017-10-07 23:11:27 17.88 MOR1 NaN
1 34.89 42.89 33.89 43.86 18.06 2017-10-09 03:16:05 18.06 MOR5 NaN
2 34.87 41.90 33.81 42.88 18.19 2017-10-09 03:31:41 18.19 MOR5 NaN
3 34.79 43.05 33.93 44.68 18.00 2017-10-09 03:00:37 18.00 MOR20 NaN
4 34.92 42.53 34.04 44.68 18.19 2017-10-09 03:47:11 18.19 MOR6 NaN
df.dtypes
bottom_temperature float64
bottom_humidity float64
top_temperature float64
top_humidity float64
external_temperature float64
published_at datetime64[ns]
external_humidity float64
short_id object
weight float64
dtype: object
Plot with vertical lines marking when the sensors were switched:
# MOR4 - bottom and top sensors were switched on Aug 10 and switched back on the 11th
import datetime
import matplotlib.pyplot as plt

mor4 = df.loc[df['short_id'] == 'MOR4']  # filter the device once instead of per plot call
fig, axarr = plt.subplots()
fig.autofmt_xdate()
axarr.plot(mor4['published_at'], mor4['bottom_temperature'], label="Bottom Temperature C")
axarr.plot(mor4['published_at'], mor4['top_temperature'], label="Top Temperature")
axarr.plot(mor4['published_at'], mor4['bottom_humidity'], label="Bottom Humidity %")
axarr.plot(mor4['published_at'], mor4['top_humidity'], label="Top Humidity %")
axarr.plot(mor4['published_at'], mor4['weight'], label="Weight kg")
# vertical line where the sensors were switched
axarr.axvline(datetime.datetime(2017, 8, 10, 13, 10))
# vertical line where the sensors were switched back
axarr.axvline(datetime.datetime(2017, 8, 11, 14, 10))
# restrict the x-axis to the affected dates
axarr.set_xlim([datetime.date(2017, 8, 10), datetime.date(2017, 8, 12)])
# add title, legend
#plt.title('MOR1, Noticed on Aug 23')
axarr.legend(loc='best', prop={'size': 6})
plt.show()
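Question:
In the data frame, how do I swap the values of "bottom_humidity" and "bottom_temperature" with those of "top_humidity" and "top_temperature" between the given dates (first date: 2017-8-10, 13:10; second date: 2017-8-11, 14:10)?
In other words:
Between the two vertical lines, the green line is actually the dark blue line and vice versa, and the same applies to the light blue and red lines. I would like to change this in the data frame between the two identified dates.
Answer 0: (score: 1)
Here are two ways...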
df = pd.DataFrame({'top': [5,6,3,4,5, 2,2,1,3,1, 7,6,5],
'bottom':[2,2,1,3,1, 5,6,3,4,5, 1,2,1],
'other': [1,2,3,4,5,6,7,8,9,10,11,12,13]})
1) If top is always greater than ... then use max / min:
df['new_top'] = df[['top', 'bottom']].max(axis=1)
df['new_bottom'] = df[['top', 'bottom']].min(axis=1)
2) (Very dirty) Manually identify the switch points and build the columns:
# in the toy frame, the swapped rows are positions 5..9, so the slices break at 5 and 10
df['new_top2'] = pd.concat([ df.iloc[:5]['top'], df.iloc[5:10]['bottom'], df.iloc[10:]['top'] ])
df['new_bottom2'] = pd.concat([ df.iloc[:5]['bottom'], df.iloc[5:10]['top'], df.iloc[10:]['bottom'] ])
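A slightly less manual variant of option 2, as a rough sketch: mark the swapped rows with a boolean mask and let `Series.where` pick the value from the other column on those rows. The row range 5..9 and the column names `new_top3` / `new_bottom3` are assumptions chosen only to match the toy frame above.

```python
import pandas as pd

df = pd.DataFrame({'top':    [5, 6, 3, 4, 5,  2, 2, 1, 3, 1,  7, 6, 5],
                   'bottom': [2, 2, 1, 3, 1,  5, 6, 3, 4, 5,  1, 2, 1],
                   'other':  list(range(1, 14))})

# Assumption: rows 5..9 (inclusive) are the ones recorded with the sensors swapped.
swapped = (df.index >= 5) & (df.index <= 9)

# Keep the original value on unaffected rows, otherwise take the other column.
df['new_top3'] = df['top'].where(~swapped, df['bottom'])
df['new_bottom3'] = df['bottom'].where(~swapped, df['top'])
```

Unlike the max / min approach, this does not rely on top always reading higher than bottom; it only needs the affected rows to be known.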
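Given the limited information you provided, and since you have not shown anything you have tried, it is hard to give you a good answer...
Answer 1: (score: 1)
You can use a boolean mask to select the relevant rows: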
m = (df['published_at'] >= '2017-8-10 13:10') & (df['published_at'] <= '2017-8-11 14:10') & (df['short_id'] == 'MOR4')
Then simply swap the fields for those rows:
cols_orig = ['bottom_temperature', 'bottom_humidity', 'top_temperature', 'top_humidity']
cols_mod = ['top_temperature', 'top_humidity', 'bottom_temperature', 'bottom_humidity']
df.loc[m, cols_orig] = df.loc[m, cols_mod].values
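To see the mask-and-swap idea end to end, here is a minimal sketch on a made-up miniature of the frame. The sample values and the two-device layout are invented for illustration; only the column names and the two cut-off timestamps come from the question. `Series.between` is used as an equivalent, inclusive spelling of the two comparisons.

```python
import pandas as pd

# Hypothetical miniature of the real frame: two devices, three readings each.
df = pd.DataFrame({
    'published_at': pd.to_datetime(
        ['2017-08-10 12:00', '2017-08-10 18:00', '2017-08-11 20:00'] * 2),
    'short_id': ['MOR4'] * 3 + ['MOR1'] * 3,
    'bottom_temperature': [34.0, 30.0, 34.5, 34.1, 34.2, 34.3],
    'top_temperature':    [33.0, 35.0, 33.5, 33.1, 33.2, 33.3],
    'bottom_humidity':    [44.0, 48.0, 44.5, 44.1, 44.2, 44.3],
    'top_humidity':       [47.0, 43.0, 47.5, 47.1, 47.2, 47.3],
})

start = pd.Timestamp('2017-08-10 13:10')
end = pd.Timestamp('2017-08-11 14:10')

# Rows inside the time window (inclusive on both ends) that belong to MOR4.
m = df['published_at'].between(start, end) & (df['short_id'] == 'MOR4')

cols_orig = ['bottom_temperature', 'bottom_humidity', 'top_temperature', 'top_humidity']
cols_mod = ['top_temperature', 'top_humidity', 'bottom_temperature', 'bottom_humidity']

# .to_numpy() drops the column labels, so the values are assigned by position;
# without it pandas would align on column names and nothing would be swapped.
df.loc[m, cols_orig] = df.loc[m, cols_mod].to_numpy()
```

Only the MOR4 reading at 18:00 on August 10 falls inside the window, so it is the only row whose top and bottom values change places; the MOR1 rows are untouched.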
Answer 2: (score: 1)
Things get easier if you first set the timestamp as the index:
# sort the index so that label-based time slicing works on out-of-order timestamps
data = data.set_index('published_at').sort_index()
Then you can change the problematic segment like this:
data.loc['2017-8-10 13:10':'2017-8-11 14:10','bottom_humidity'] = \
data.loc['2017-8-10 13:10':'2017-8-11 14:10','top_humidity'].values
If you like, you can define a time slice for this and reuse it:
snafu = slice('2017-8-10 13:10','2017-8-11 14:10')
data.loc[snafu,'bottom_humidity'] = data.top_humidity
data.loc[snafu,'bottom_temperature'] = data.top_temperature
Or swap the values like this:
data.loc[snafu, ['bottom_temperature','top_temperature']] = \
data.loc[snafu, ['top_temperature','bottom_temperature']].values
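One caveat with a pure time slice is that it selects every device in that window, while only MOR4 was affected. A possible sketch that keeps the timestamp index but also filters on `short_id` is shown below; the sample rows are invented for illustration, and the mask construction is one option among several rather than the answer's own method.

```python
import pandas as pd

# Hypothetical sample: three MOR4 readings (out of order) and one MOR1 reading.
data = pd.DataFrame({
    'published_at': pd.to_datetime(['2017-08-11 20:00', '2017-08-10 18:00',
                                    '2017-08-10 12:00', '2017-08-10 18:05']),
    'short_id': ['MOR4', 'MOR4', 'MOR4', 'MOR1'],
    'bottom_temperature': [34.5, 30.0, 34.0, 34.2],
    'top_temperature':    [33.5, 35.0, 33.0, 33.2],
})

# Timestamp index, sorted chronologically.
data = data.set_index('published_at').sort_index()

start = pd.Timestamp('2017-08-10 13:10')
end = pd.Timestamp('2017-08-11 14:10')

# Inside the time window AND recorded by the affected device.
in_window = (data.index >= start) & (data.index <= end)
mask = in_window & (data['short_id'] == 'MOR4').to_numpy()

cols = ['bottom_temperature', 'top_temperature']
# Reverse the column order on the right-hand side and assign by position.
data.loc[mask, cols] = data.loc[mask, cols[::-1]].to_numpy()
```

The MOR1 reading at 18:05 lies inside the window but is left alone because of the device filter.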