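Pandas resample problem: resampling from minutes to milliseconds

Time: 2018-06-06 15:31:30

Tags: python pandas interpolation missing-data resampling

I am having trouble with the pandas resample function. I have minute-sampled data that I am trying to resample at a frequency of 0.7 seconds. I tried the '700L' option, but the behaviour is not what I expected. Here is an example: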

import pandas as pd
from datetime import datetime
import pytz
import numpy as np
import matplotlib.pyplot as plt

def convert_2_datetime(timestamp, timezoneid):
    """

    :param timestamp: UTC format in milliseconds (data.index.values)
    :param timezoneid: timezone object from CTX (for example pytz.timezone(ctx.inp.assets[0].properties['timezoneid']))
    :return: vector of datetimes
    """

    if isinstance(timestamp,int) or isinstance(timestamp,float):
        utctime = datetime.utcfromtimestamp(timestamp / 1000).replace(tzinfo=pytz.utc)
        output = utctime.astimezone(pytz.timezone(timezoneid.zone))
    else:
        utctime = [datetime.utcfromtimestamp(i / 1000).replace(tzinfo=pytz.utc) for i in timestamp]
        output = [i.astimezone(pytz.timezone(timezoneid.zone)) for i in utctime]

    return output

# minute sampled data
v1 = [0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0]
data = pd.DataFrame({'v1':np.array(v1)},index=np.arange(start=1,stop=len(v1)+1)*60000)

data['ts']= convert_2_datetime(timestamp=data.index.values,timezoneid=pytz.timezone('UTC'))
data.set_index('ts',inplace=True)
data07 = data.resample(rule='700L',closed={'right','left'}).interpolate(method='linear')
data06 = data.resample(rule='600L',closed={'right','left'}).interpolate(method='linear')
data11 = data.resample(rule='1100L',closed={'right','left'}).interpolate(method='linear')

data07.v1.plot(style='.',label='700 ms')
data06.v1.plot(style='.', label = '600 ms')
data11.v1.plot(style='.', label = '1100 ms')
data.v1.plot(style='x',label='original')
plt.legend()
plt.show()

print('Finish!')

If I resample with '600L', the final result is correct (data06 in the example); with '700L' it is not (data07 in the example). See the figure below:

[figure: plot of the original and resampled series]

Am I missing something about the resample function?

Thank you all very much!

1 answer:

Answer 0 (score: 3)

Workaround

In your case, I think you should resample with a method before interpolating, such as mean. I think this just has to do with the way the output of resample is read by interpolate. For example, the following seems to work:

data07 = data.resample('700L').mean().interpolate()
data06 = data.resample('600L').mean().interpolate()
data10 = data.resample('1000L').mean().interpolate()

This plot shows that it works:

data07.v1.plot(style='.',label='700 ms', alpha=0.75, ms=3,zorder=2)
data06.v1.plot(style='^',label='600 ms', alpha=0.5, zorder=1)
data10.v1.plot(style='^',label='1000 ms', alpha=0.5, zorder=0, ms=10)
data.v1.plot(style='x',label='original', ms=10)
plt.legend()

[figure: plot of the original and resampled series]

Explanation (kind of...)

When you resample your data with any method, including mean(), you get NaN wherever a resampled timestamp has no data:

>>> data.resample('700L').mean().head()
                                   v1
ts
1970-01-01 00:00:59.500000+00:00  0.0
1970-01-01 00:01:00.200000+00:00  NaN
1970-01-01 00:01:00.900000+00:00  NaN
1970-01-01 00:01:01.600000+00:00  NaN
1970-01-01 00:01:02.300000+00:00  NaN

When you call interpolate on this, it fills the NaN values with the appropriate linear interpolation:

>>> data.resample('700L').mean().interpolate().head()
                                   v1
ts
1970-01-01 00:00:59.500000+00:00  0.0
1970-01-01 00:01:00.200000+00:00  0.0
1970-01-01 00:01:00.900000+00:00  0.0
1970-01-01 00:01:01.600000+00:00  0.0
1970-01-01 00:01:02.300000+00:00  0.0

When you call interpolate directly on the output of resample, its behaviour does not seem to be as expected: it gives a bunch of NaN values at the beginning and then slopes gradually downwards from the maximum (1). I am not sure why.
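
A possible explanation, offered as a sketch rather than a confirmed diagnosis: the pandas documentation describes Resampler.interpolate as first reindexing to the target timestamps (as asfreq does) and only then interpolating. On a 700 ms grid anchored at midnight, minute-spaced samples land on the grid only every 7 minutes, so most of the original samples would be dropped before interpolation. The snippet below rebuilds the question's data under a few assumptions that are not in the original post: tz-naive timestamps, the '700ms' alias instead of '700L', and the illustrative names on_grid, via_mean, grid and exact. Behaviour may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Same values as in the question, sampled once per minute (tz-naive for brevity).
v1 = [0.0] * 5 + [1.0] * 8 + [0.0] * 5
index = pd.to_datetime(np.arange(1, len(v1) + 1) * 60_000, unit="ms")
data = pd.DataFrame({"v1": v1}, index=index)

# asfreq() keeps only the original samples that fall exactly on the new grid.
# With a 700 ms step only minutes 7 and 14 qualify (lcm of 700 ms and 60 s is 7 min).
on_grid = data.resample("700ms").asfreq().dropna()
print(on_grid)

# Workaround from the answer: aggregate first so every sample lands in a bin,
# then interpolate across the empty bins. Each sample is moved to the start
# of its bin, so the result is shifted by up to one step.
via_mean = data.resample("700ms").mean().interpolate(method="linear")

# Alternative sketch: interpolate on the union of the old and new timestamps,
# weighting by time, then keep only the new grid points.
grid = pd.date_range(data.index[0], data.index[-1], freq="700ms")
exact = (
    data.reindex(data.index.union(grid))
    .interpolate(method="time")
    .reindex(grid)
)
print(exact.head())
```

With the question's data, on_grid contains just two rows (value 1.0 at minute 7 and 0.0 at minute 14), which would be consistent with the leading NaN values and the slow downward slope described above. The union-based variant avoids the bin shift of the mean() workaround at the cost of building a temporarily larger index.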