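Restructuring a pandas DataFrame

Time: 2016-10-07 01:23:09

Tags: python pandas dataframe

I was advised to move from a class structure, where I defined my own classes, to pandas DataFrames, because I want to perform many operations on my data.

At the moment I have a DataFrame that looks like this: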

   ID   Name    Recording   Direction   Duration    Distance    Path Raw
    0   129 Houston Woodlands   X   12.3    8   HWX.txt
    1   129 Houston Woodlands   Y   12.3    8   HWY.txt
    2   129 Houston Woodlands   Z   12.3    8   HWZ.txt
    3   129 Houston Downtown    X   11.8    10  HDX.txt
    4   129 Houston Downtown    Y   11.8    10  HDY.txt
    5   129 Houston Downtown    Z   11.8    10  HDZ.txt
    ... ... ... ..  ..  ... ... ...
    2998    333 Chicago Downtown    X   3.4 50  CDX.txt
    2999    333 Chicago Downtown    Y   3.4 50  CDY.txt
    3000    333 Chicago Downtown    Z   3.4 50  CDZ.txt

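That was fine so far. However, after loading the files/arrays (adding columns), I want to group X, Y and Z together and, beyond that, add new columns holding the results of array operations (e.g. FFT).

In the end, I would like a DataFrame that looks like this: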

    ID  Name    Recording   Duration    Distance    Rawx    Rawy    Raxz    FFT-Rawx    FFT-Rawy    FFT-Raxz
0   129 Houston Woodlands   12.3    8   HWX.txt HWY.txt HWZ.txt FFT-HWX.txt FFT-HWY.txt FFT-HWZ.txt
1   129 Houston Downtown    11.8    10  HDX.txt HDY.txt HDZ.txt FFT-HDX.txt FFT-HDY.txt FFT-HDZ.txt
... ... ... ..  ... ... ... ... ... ... ... ...
1000    333 Chicago Downtown    3.4 50  CDX.txt CDY.txt CDZ.txt FFT-CDX.txt FFT-CDY.txt FFT-CDZ.txt

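Any idea how to do this?

Unfortunately, not all cells have this nice structure.

Instead of

HDX HDY HDZ

I may have "random names". However, I know they come in this order:

First Z, second Y, third X. Each recording has these three signals, and then the next recording follows.

I was thinking of something along these lines: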

k = 1
for _, row in df.iterrows():  # iterate over rows; "for row in df" yields column names
    if k % 3 == 0:
        pass  # Do something
    elif k % 3 == 2:
        pass  # Do something
    else:
        pass  # Do something
    k += 1

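However, I don't know whether there is a way to add an empty column to an existing DataFrame and fill it in a loop. If there is, please let me know.

2 Answers:

Answer 0: (score: 1)

I think I have a partial answer! I'm a little confused about what you want with the FFT (fast Fourier transform?) and where the data comes from.

However, I got everything else.

First, I'll make some sample data.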

import pandas as pd

df = pd.DataFrame({"ID": [0, 1, 2, 3, 4, 5], "Name":[129, 129, 129, 129, 129, 129], 
         "Recording":['Houston Woodlands', 'Houston Woodlands', 'Houston Woodlands', 
                     'Houston Downtown', 'Houston Downtown', 'Houston Downtown'], 
         "Direction": ["X", "Y", "Z", "X", "Y", "Z"], "Duration":[12.3, 12.3, 12.3, 11.8, 11.8, 11.8], 
         "Path_Raw":["HWX.txt", "HWY.txt", "HWZ.txt", 'HDX.txt', 'HDY.txt', 'HDZ.txt'], 
         "Distance": [8, 8, 8, 10, 10, 10]})

Now I'll define a few new functions. I kept them separate so they are easier to customize. Basically, I call .unique and save each Path_Raw value as a new variable.

def splitunique0(group):
    ulist = group.unique()
    return(ulist[0])


def splitunique1(group):
    ulist = group.unique()
    return(ulist[1])


def splitunique2(group):
    ulist = group.unique()
    return(ulist[2])


# Named aggregation (pandas 0.25+); the nested-dict form of .agg() used for
# renaming was removed in pandas 1.0.
new = df.groupby(["Name", "Recording"]).agg(
    Duration=("Duration", "first"),
    Distance=("Distance", "first"),
    Rawx=("Path_Raw", splitunique0),
    Rawy=("Path_Raw", splitunique1),
    Raxz=("Path_Raw", splitunique2),
)

Here is the finished DataFrame!

                        Duration  Distance     Rawx     Rawy     Raxz
Name Recording
129  Houston Downtown       11.8        10  HDX.txt  HDY.txt  HDZ.txt
     Houston Woodlands      12.3         8  HWX.txt  HWY.txt  HWZ.txt
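
The FFT step left open above could be sketched roughly as follows. This is only an illustration under stated assumptions: load_signal is a hypothetical stand-in that returns a synthetic sine wave instead of reading the file, and the small wide frame is made up to mirror the layout in the question. As for filling a new column in a loop, one option is to collect the values in a list and assign that list as the column, which avoids creating an empty column first.

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame: one row per recording, one path column per axis.
wide = pd.DataFrame({
    "Recording": ["Houston Woodlands", "Houston Downtown"],
    "Rawx": ["HWX.txt", "HDX.txt"],
    "Rawy": ["HWY.txt", "HDY.txt"],
    "Raxz": ["HWZ.txt", "HDZ.txt"],
})


def load_signal(path):
    """Stand-in loader (assumption): ignores `path` and returns a synthetic
    signal with 4 full sine cycles over 64 samples."""
    t = np.arange(64)
    return np.sin(2 * np.pi * 4 * t / 64)


for col in ["Rawx", "Rawy", "Raxz"]:
    spectra = []
    for path in wide[col]:
        signal = load_signal(path)
        # Magnitude spectrum of a real-valued signal (33 bins for 64 samples).
        spectra.append(np.abs(np.fft.rfft(signal)))
    # A list with one array per row becomes an object column of arrays.
    wide["FFT-" + col] = spectra
```

Each FFT- cell then holds a NumPy array of magnitudes. With real data, load_signal might wrap something like np.loadtxt, depending on the file format.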

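Answer 1: (score: 1)

Consider concatenating a list of pandas pivot_tables. Before concatenating, however, you need to group the rows by the common stem of their Raw values (HW.txt, HD.txt, CD.txt), which can be extracted with a regular expression: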

from io import StringIO
import pandas as pd
import re

df = pd.read_csv(StringIO('''
ID,Name,Recording,Direction,Duration,Distance,Path,Raw
0,129,Houston,Woodlands,X,12.3,8,HWX.txt
1,129,Houston,Woodlands,Y,12.3,8,HWY.txt
2,129,Houston,Woodlands,Z,12.3,8,HWZ.txt
3,129,Houston,Downtown,X,11.8,10,HDX.txt
4,129,Houston,Downtown,Y,11.8,10,HDY.txt
5,129,Houston,Downtown,Z,11.8,10,HDZ.txt
6,333,Chicago,Downtown,X,3.4,50,CDX.txt
7,333,Chicago,Downtown,Y,3.4,50,CDY.txt
8,333,Chicago,Downtown,Z,3.4,50,CDZ.txt'''))

# UNIQUE 'RAW' STEM GROUPINGS
grp = set([re.sub(r'X|Y|Z', '', i) for i in df['Raw'].tolist()])

dfList = []
for i in grp:    
    # FILTER FOR 'RAW' VALUES THAT CONTAIN STEMS 
    temp = df[df['Raw'].isin([i.replace('.txt', txt+'.txt') for txt in ['X','Y','Z']])]    
    # RUN PIVOT (LONG TO WIDE)
    # NOTE: THE CSV HEADER ABOVE HAS 8 NAMES, SO 'Duration' HOLDS THE X/Y/Z LABELS
    temp = temp.pivot_table(values='Raw', 
                            index=['Name', 'Recording', 'Direction','Distance', 'Path'],
                            columns=['Duration'], aggfunc='min')
    dfList.append(temp)

# CONCATENATE (STACK) DFS IN LIST 
finaldf = pd.concat(dfList).reset_index()

# RENAME AND CREATE FFT COLUMNS
finaldf = finaldf.rename(columns={'X': 'Rawx', 'Y': 'Rawy', 'Z': 'Rawz'})
finaldf[['FFT-Rawx', 'FFT-Rawy', 'FFT-Rawz']] = 'FFT-' + finaldf[['Rawx', 'Rawy', 'Rawz']]

Output

# Duration  Name Recording  Direction  Distance  Path     Rawx     Rawy     Rawz     FFT-Rawx     FFT-Rawy     FFT-Rawz
# 0          129   Houston   Downtown      11.8    10  HDX.txt  HDY.txt  HDZ.txt  FFT-HDX.txt  FFT-HDY.txt  FFT-HDZ.txt
# 1          129   Houston  Woodlands      12.3     8  HWX.txt  HWY.txt  HWZ.txt  FFT-HWX.txt  FFT-HWY.txt  FFT-HWZ.txt
# 2          333   Chicago   Downtown       3.4    50  CDX.txt  CDY.txt  CDZ.txt  FFT-CDX.txt  FFT-CDY.txt  FFT-CDZ.txt