Comparing with the next/previous value in a loop over a Python DataFrame

Time: 2018-03-02 09:44:09

Tags: python loops dataframe comparison

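How can I compare a value with the next or previous item in a loop? I need to count the consecutive repeated occurrences in a column.

After that I need to create a "frequency table", so that dfoutput should look like the picture at the bottom.

This code does not work because I cannot compare a value with the other items.

Maybe there is another simple way to do this without a loop?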

import pandas as pd

sumrep=0

df = pd.DataFrame(data = {'1' : [0,0,1,0,1,1,0,1,1,0,1,1,1,1,0],'2' : [0,0,1,1,1,1,0,0,1,0,1,1,0,1,0]})
df.index= [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]   # It will be easier to assign repetitions in output df - index will be equal to number of repetitions

dfoutput = pd.DataFrame(0,index=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],columns=['1','2'])

#example for column 1
for val1 in df.columns[1]:                           
    if val1 == 1 and val1 ==0:   #can't find the way to check NEXT val1 (one row below) in column 1 :/
        if sumrep==0:            
            dfoutput.loc[1,1]=dfoutput.loc[1,1]+1   #count only SINGLE occurrences of values and assign it to proper row number 1 in dfoutput
        if sumrep>0:
            dfoutput.loc[sumrep,1]=dfoutput.loc[sumrep,1]+1   #count repeated occurrences greater than 1 and assign them to proper row in dfoutput
            sumrep=0
    elif val1 == 1 and df[val1+1]==1 :
        sumrep=sumrep+1

Desired output table for column 1 (dfoutput):

[image not preserved]

I don't understand why there is no simple way to move through a data frame, like the Offset function in Excel VBA :/

1 Answer:

Answer 0: (score: 1)

You can use the function defined here to perform a fast run-length encoding:

import numpy as np


def rlencode(x, dropna=False):
    """
    Run length encoding.
    Based on http://stackoverflow.com/a/32681075, which is based on the rle 
    function from R.

    Parameters
    ----------
    x : 1D array_like
        Input array to encode
    dropna: bool, optional
        Drop all runs of NaNs.

    Returns
    -------
    start positions, run lengths, run values

    """
    where = np.flatnonzero
    x = np.asarray(x)
    n = len(x)
    if n == 0:
        return (np.array([], dtype=int), 
                np.array([], dtype=int), 
                np.array([], dtype=x.dtype))

    starts = np.r_[0, where(~np.isclose(x[1:], x[:-1], equal_nan=True)) + 1]
    lengths = np.diff(np.r_[starts, n])
    values = x[starts]

    if dropna:
        mask = ~np.isnan(values)
        starts, lengths, values = starts[mask], lengths[mask], values[mask]

    return starts, lengths, values
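To make the three returned arrays concrete, here is a minimal sketch that applies the same steps by hand to column 1 from the question. It assumes integer data with no NaNs, so it uses a plain `!=` comparison instead of `np.isclose`; that substitution is a simplification for illustration, not part of the function above.

```python
import numpy as np

# Column '1' from the question.
col = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0])

# A run starts at position 0 and wherever a value differs from the one before it.
# Plain != is an assumption that only holds for exact (integer) data without NaNs.
starts = np.r_[0, np.flatnonzero(col[1:] != col[:-1]) + 1]

# Each run ends where the next one starts; the last run ends at len(col).
lengths = np.diff(np.r_[starts, len(col)])

# The value of each run is the value at its start position.
values = col[starts]

print(starts)   # [ 0  2  3  4  6  7  9 10 14]
print(lengths)  # [2 1 1 2 1 2 1 4 1]
print(values)   # [0 1 0 1 0 1 0 1 0]

# Lengths of the runs of ones only.
print(lengths[values == 1])  # [1 2 2 4]
```

So column 1 contains one run of length 1, two runs of length 2 and one run of length 4, which is what the frequency table has to report.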

With this function, your task becomes much easier:

import pandas as pd
from collections import Counter
from functools import partial

def get_frequency_of_runs(col, value=1, index=None):
    # Run-length encode the column, then count how often each run length occurs for `value`.
    _, lengths, values = rlencode(col)
    return pd.Series(Counter(lengths[np.where(values == value)]), index=index)

df = pd.DataFrame(data={'1': [0,0,1,0,1,1,0,1,1,0,1,1,1,1,0],
                        '2': [0,0,1,1,1,1,0,0,1,0,1,1,0,1,0]})
df.apply(partial(get_frequency_of_runs, index=df.index)).fillna(0)
#       1    2
# 0   0.0  0.0
# 1   1.0  2.0
# 2   2.0  1.0
# 3   0.0  0.0
# 4   1.0  1.0
# 5   0.0  0.0
# 6   0.0  0.0
# 7   0.0  0.0
# 8   0.0  0.0
# 9   0.0  0.0
# 10  0.0  0.0
# 11  0.0  0.0
# 12  0.0  0.0
# 13  0.0  0.0
# 14  0.0  0.0
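On the original question of comparing a value with the previous row: in pandas, `Series.shift()` plays roughly the role of an offset. The following is a sketch of a pandas-only alternative built on it, not the answer's method. The names `run_length_counts`, `freq` and the final reindex to 1..15 (to match the layout described in the question) are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(data={'1': [0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0],
                        '2': [0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0]})


def run_length_counts(col, value=1):
    """Count how many runs of `value` exist for each run length (hypothetical helper)."""
    # col.shift() is the column moved down one row, so this comparison is True
    # on the first row of every run; the cumulative sum labels each run with an id.
    run_id = (col != col.shift()).cumsum()
    # Keep only the rows equal to `value` and measure the size of each run.
    mask = col == value
    run_lengths = col[mask].groupby(run_id[mask]).size()
    # Frequency of each run length.
    return run_lengths.value_counts()


# One column of frequencies per input column; lengths that never occur become 0.
freq = pd.DataFrame({c: run_length_counts(df[c]) for c in df.columns})
freq = freq.fillna(0).astype(int).sort_index()

# Index 1..15, so the row label equals the number of repetitions.
dfoutput = freq.reindex(range(1, len(df) + 1), fill_value=0)
print(dfoutput.head(4))
```

For this data, rows 1, 2 and 4 of `dfoutput` hold 1, 2, 1 for column '1' and 2, 1, 1 for column '2'; every other row is 0, in agreement with the output above.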