Using this code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.options.mode.chained_assignment = None
pd.options.display.float_format = '{:.2f}'.format
dateparse = lambda x: pd.datetime.strptime(x,'%Y%m%d%H%M')
a = pd.read_csv(r'C:\Users\Leonardo\Desktop\Nova pasta\TU_boia0401.out', parse_dates = ['data'], index_col = 0, date_parser = dateparse)
The output looks like this:
index hs
2015-02-23 14:50:00 0.99
2015-02-23 15:50:00 0.96
2015-02-23 16:50:00 1.04
2015-02-23 17:50:00 0.96
. .
. .
. .
2017-09-01 12:40:00 1.25
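Everything is fine up to this point, but when I plotted the data to analyze it, I noticed a problem:
Around 2015-03-06 there are many repeated values that should not be there. Looking at the DataFrame, this is what shows up: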
2015-03-04 10:50:00 1.18
2015-03-04 11:50:00 1.18
2015-03-04 12:50:00 1.18
2015-03-04 13:50:00 1.18
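This happens many times throughout the DataFrame. The main goal is to filter out this bad data: wherever a value repeats 3 or more consecutive times anywhere in the DataFrame, the repeats should be set to np.nan. The expected output looks like this: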
index hs
2015-02-23 14:50:00 0.99
2015-02-23 15:50:00 0.96
2015-02-23 16:50:00 1.04
2015-02-23 17:50:00 0.96
. .
. .
. .
2015-03-04 10:50:00 1.18
2015-03-04 11:50:00 nan
2015-03-04 12:50:00 nan
2015-03-04 13:50:00 nan
. .
. .
. .
2016-01-20 12:40:00 0.98
2016-01-20 12:50:00 nan
2016-01-20 13:00:00 nan
2016-01-20 13:10:00 nan
. .
. .
. .
2017-09-01 12:40:00 1.25
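If anyone can help, I would be very grateful.
Answer 0: (score: 3)
This sets all forward duplicates to NaN wherever there are n or more of them in a row (e.g. n = 3).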
# Set-up.
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 1), index=pd.date_range(start='2017-01-01', freq='min', periods=10), columns=['hs'])
# Positional slices: rows 3-5 repeat row 2, rows 8-9 repeat row 7.
df.iloc[3:6] = df.iat[2, 0]
df.iloc[8:10] = df.iat[7, 0]
>>> df
hs
2017-01-01 00:00:00 1.764052
2017-01-01 00:01:00 0.400157
2017-01-01 00:02:00 0.978738
2017-01-01 00:03:00 0.978738 # Duplicate x3
2017-01-01 00:04:00 0.978738 # Duplicate x3
2017-01-01 00:05:00 0.978738 # Duplicate x3
2017-01-01 00:06:00 0.950088
2017-01-01 00:07:00 -0.151357
2017-01-01 00:08:00 -0.151357 # Duplicate x2
2017-01-01 00:09:00 -0.151357 # Duplicate x2
# Set forward duplicates to NaN.
n = 3
bool_mask = df.hs.shift() == df.hs
df = df.assign(
    mask=bool_mask,
    group=(bool_mask != bool_mask.shift()).cumsum())
filter_groups = df.groupby('group')[['mask']].sum().query('mask >= {}'.format(n)).index
df.loc[df.group.isin(filter_groups), 'hs'] = np.nan
df = df[['hs']]
>>> df
hs
2017-01-01 00:00:00 1.764052
2017-01-01 00:01:00 0.400157
2017-01-01 00:02:00 0.978738
2017-01-01 00:03:00 NaN
2017-01-01 00:04:00 NaN
2017-01-01 00:05:00 NaN
2017-01-01 00:06:00 0.950088
2017-01-01 00:07:00 -0.151357
2017-01-01 00:08:00 -0.151357
2017-01-01 00:09:00 -0.151357
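It creates a boolean mask that checks for duplicates and adds it to the DataFrame as a column via assign. The code also adds a group column that identifies runs of consecutive duplicates (built with the shift-cumsum pattern). A groupby on group then sums the mask booleans, which gives the number of consecutive repeats in each run. These results are filtered with query to find the runs whose repeat count is greater than or equal to n (e.g. 3).
Finally, for the groups with at least n consecutive repeats, loc sets hs to NaN. The temporary columns are dropped by keeping only hs via df = df[['hs']].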
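The shift-cumsum pattern described above can also be wrapped in a small helper. The sketch below is one possible variant rather than the answer's exact code: the name `mask_repeats` and its parameters are made up for illustration, and it broadcasts the per-run count with `groupby(...).transform('sum')` instead of filtering with `query`. As in the answer, `n` counts forward duplicates, so `n = 3` corresponds to a run of four equal values.

```python
import numpy as np
import pandas as pd


def mask_repeats(ser, n=3):
    """Return a copy of ser where forward duplicates are NaN in runs with n or more of them.

    Hypothetical helper for illustration only.
    """
    # True where a value equals the one directly before it.
    is_repeat = ser.eq(ser.shift())
    # A new run starts wherever the repeat flag flips (shift-cumsum pattern).
    run_id = (is_repeat != is_repeat.shift(fill_value=False)).cumsum()
    # Number of forward duplicates in the run each row belongs to.
    repeats_in_run = is_repeat.astype(int).groupby(run_id).transform('sum')
    return ser.mask(is_repeat & (repeats_in_run >= n))


ser = pd.Series([1.0, 2.0, 3.0, 3.0, 3.0, 3.0, 4.0, 5.0, 5.0, 5.0])
result = mask_repeats(ser, n=3)
print(result)
```

Here the three repeats of 3.0 are masked, while the two repeats of 5.0 stay because that run is shorter than `n`.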
Answer 1: (score: 2)
Setup
Borrowing @Alexander's DataFrame
np.random.seed(0)
df = pd.DataFrame(
    np.random.randn(10, 1),
    pd.date_range(start='2017-01-01', freq='min', periods=10),
    ['hs'])
# Positional slice: rows 4-5 repeat row 3.
df.iloc[4:6] = df.iat[3, 0]
Solution
Use pd.DataFrame.mask and pd.DataFrame.diff.
Note: this is a general solution that performs the same task for all columns at once.
df.mask(df.diff() == 0)
hs
2017-01-01 00:00:00 1.764052
2017-01-01 00:01:00 0.400157
2017-01-01 00:02:00 0.978738
2017-01-01 00:03:00 2.240893
2017-01-01 00:04:00 NaN
2017-01-01 00:05:00 NaN
2017-01-01 00:06:00 0.950088
2017-01-01 00:07:00 -0.151357
2017-01-01 00:08:00 -0.103219
2017-01-01 00:09:00 0.410599
A larger example
np.random.seed([3,1415])
df = pd.DataFrame(
    np.random.randint(5, size=(10, 5)).astype(float),
    pd.date_range(start='2017-01-01', freq='min', periods=10),
).add_prefix('col')
df
col0 col1 col2 col3 col4
2017-01-01 00:00:00 0.0 3.0 2.0 3.0 2.0
2017-01-01 00:01:00 2.0 3.0 2.0 3.0 0.0
2017-01-01 00:02:00 2.0 0.0 0.0 4.0 0.0
2017-01-01 00:03:00 2.0 2.0 0.0 4.0 1.0
2017-01-01 00:04:00 3.0 2.0 4.0 4.0 4.0
2017-01-01 00:05:00 4.0 3.0 3.0 3.0 4.0
2017-01-01 00:06:00 3.0 1.0 3.0 0.0 4.0
2017-01-01 00:07:00 4.0 2.0 2.0 0.0 2.0
2017-01-01 00:08:00 4.0 0.0 4.0 1.0 4.0
2017-01-01 00:09:00 4.0 2.0 2.0 0.0 2.0
df.mask(df.diff() == 0)
col0 col1 col2 col3 col4
2017-01-01 00:00:00 0.0 3.0 2.0 3.0 2.0
2017-01-01 00:01:00 2.0 NaN NaN NaN 0.0
2017-01-01 00:02:00 NaN 0.0 0.0 4.0 NaN
2017-01-01 00:03:00 NaN 2.0 NaN NaN 1.0
2017-01-01 00:04:00 3.0 NaN 4.0 NaN 4.0
2017-01-01 00:05:00 4.0 3.0 3.0 3.0 NaN
2017-01-01 00:06:00 3.0 1.0 NaN 0.0 NaN
2017-01-01 00:07:00 4.0 2.0 2.0 NaN 2.0
2017-01-01 00:08:00 NaN 0.0 4.0 1.0 4.0
2017-01-01 00:09:00 NaN 2.0 2.0 0.0 2.0
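One thing worth keeping in mind: `df.diff() == 0` is true for every value that equals its predecessor, so this approach masks repeats in runs of any length, not only runs of three or more. A minimal sketch on a made-up series (the data is purely illustrative):

```python
import numpy as np
import pandas as pd

# A run of two 1.0s and a run of three 2.0s.
ser = pd.Series([1.0, 1.0, 2.0, 2.0, 2.0, 3.0])

# Every value equal to the one before it becomes NaN,
# including the single repeat of 1.0.
masked = ser.mask(ser.diff() == 0)
print(masked)
```

If only longer runs should be removed, a run-length check has to be added on top of the `diff` comparison.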
Answer 2: (score: 2)
If you want to control the window size, you can use a rolling object. The idea is that if n consecutive elements are identical, their standard deviation is 0. The rest is back-filling that flag over the window and assigning NaN.
For the series
ser = pd.Series([1, 1, 1, 2, 2, 2, 1, 1, 3, 3, 3, 3, 1, 2, 1, 3, 2, 1, 1, 1], dtype=float)
the following sets every element that belongs to a run of three or more equal values to NaN:
successive = (ser.where(np.isclose(ser.rolling(3).std(), 0, atol=10**-6))
                 .bfill(limit=2).notnull())
ser[successive] = np.nan
ser
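To make the window size an explicit parameter, the rolling idea can be sketched as a function. This is a rough illustration under stated assumptions: the name `mask_runs_rolling` and its arguments are hypothetical, `n` is assumed to be at least 2, and the input is assumed to be numeric. Unlike the expected output in the question, it masks the whole run, including its first element.

```python
import numpy as np
import pandas as pd


def mask_runs_rolling(ser, n=3, atol=1e-6):
    """Set every element of a run of n or more equal values to NaN.

    Hypothetical helper for illustration; assumes n >= 2.
    """
    ser = ser.astype(float)
    # A window of n identical values has (numerically) zero standard deviation.
    window_is_flat = np.isclose(ser.rolling(n).std(), 0, atol=atol)
    # The flag sits on the last element of each flat window, so spread it
    # back over the n - 1 elements before it.
    in_run = ser.where(window_is_flat).bfill(limit=n - 1).notna()
    return ser.mask(in_run)


ser = pd.Series([1, 1, 1, 2, 2, 2, 1, 1, 3, 3, 3, 3, 1, 2, 1, 3, 2, 1, 1, 1])
result = mask_runs_rolling(ser, n=3)
print(result)
```

The tolerance in `np.isclose` guards against tiny floating-point residue in the rolling standard deviation.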
Answer 3: (score: 1)
You can write a for loop that keeps track of how many times a value has been repeated:
# `values` is assumed to be a 1-D float NumPy array, e.g. values = df['hs'].to_numpy(copy=True)
replacement_value = np.nan
last_value = None
number_of_repetitions = 0
for index in range(len(values)):
    value = values[index]  # read the value before it can be overwritten below
    if value == last_value:
        if number_of_repetitions == 2:
            # if we previously had 2 repetitions, we should replace both the current and the previous values
            values[index-1] = replacement_value
            values[index] = replacement_value
        if number_of_repetitions == 3:
            # if this is the third or more repetition, we've already replaced the previous value, so we just need to handle the current one
            values[index] = replacement_value
        else:
            # if it hasn't reached 3 yet, we should increment every time we see a repetition
            # but we don't need to keep track after 3
            number_of_repetitions = number_of_repetitions + 1
    else:
        # if this is a new value, we should reset
        number_of_repetitions = 1
        last_value = value