How do I select the data whose datatype is "train" in a pandas DataFrame?

Time: 2017-11-12 13:33:04

Tags: python pandas dataframe

I have a DataFrame like this:

                             0                 1                 2
filename     CF02_B1_D1_M3.wav  F02_B2_D1_M2.wav  F02_B3_D6_M3.wav   
datatype                 train             train              test   
label                        1                 1                 6   
feature0               18.2796           17.8995           18.0531   
feature1              -3.92135          -15.5039          -31.0344   
feature2               13.6118         -0.741729           7.87929   
feature3              -7.25019        -0.0536188          -18.6119   
feature4              -11.7736          -6.73465          0.682173   
feature5               -18.265           4.39842           -5.3771   

Here is the code:

from python_speech_features import mfcc
import scipy.io.wavfile as wav
import numpy as np
import os
import pandas as pd

N_FEATURES = 2392  # number of flattened MFCC values kept per file

filenames, datatype, labels = [], [], []
fitur = [[] for _ in range(N_FEATURES)]  # one list per feature row
# Raw string so the backslashes are not treated as escape sequences
path = r"C:\Users\HEWLETT PACKARD"
for item in os.listdir(path):
    if item.endswith('.wav'):
        parts = item.split('_')
        # Read from the listed directory, not the current working directory
        (rate, sig) = wav.read(os.path.join(path, item))
        mfcc_feat = mfcc(sig, rate, nfilt=26, numcep=13)
        feature = np.asarray(mfcc_feat).ravel()  # flatten frames x coefficients

        # A '3' as the second character of the second name part (e.g. 'B3') marks test data
        data_type = 'test' if parts[1][1] == '3' else 'train'
        label = parts[2][1]  # digit after 'D', e.g. 'D6' -> '6'

        filenames.append(item)
        datatype.append(data_type)
        labels.append(label)
        for i in range(N_FEATURES):
            fitur[i].append(feature[i])

dataset = [filenames, datatype, labels]
dataset.extend(fitur)
column = ['filename', 'datatype', 'label']
column.extend(['feature' + str(i) for i in range(N_FEATURES)])
# DataFrame.from_items was removed in pandas 1.0; build from a dict instead
df = pd.DataFrame(dict(zip(column, dataset)), columns=column)

df = df.transpose()
print(df)

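I extracted features from some wav files and split them into train and test data, then put everything into a DataFrame. How can I select the data whose datatype is "train"?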

2 Answers:

Answer 0 (score: 2)

IIUC:

In [24]: df.loc['feature0':, df.columns[df.loc['datatype']=='train']]
Out[24]:
                  0           1
feature0    18.2796     17.8995
feature1   -3.92135    -15.5039
feature2    13.6118   -0.741729
feature3   -7.25019  -0.0536188
feature4   -11.7736    -6.73465
feature5    -18.265     4.39842
feature6   -18.1045    -1.88591
feature7   -10.3347    -12.4131
feature8   -15.5189     1.84178
feature9   -13.8793    -2.21513
feature10  -11.2372    -14.6925
feature11  -13.1699     7.65947
feature12  -13.2874     3.11805
feature13    18.529     17.9096
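
For anyone who wants to try this without the wav files, here is a minimal sketch of the same selection. The small frame is a hand-made stand-in built from the first rows shown in the question (not the real 2392-feature data), and the names `train_mask` and `train_features` are only illustrative.

```python
import pandas as pd

# Stand-in for the question's frame: samples are columns, fields are rows.
df = pd.DataFrame(
    {
        0: ["CF02_B1_D1_M3.wav", "train", "1", 18.2796, -3.92135],
        1: ["F02_B2_D1_M2.wav", "train", "1", 17.8995, -15.5039],
        2: ["F02_B3_D6_M3.wav", "test", "6", 18.0531, -31.0344],
    },
    index=["filename", "datatype", "label", "feature0", "feature1"],
)

# Boolean mask over the columns: True where the 'datatype' row equals 'train'.
train_mask = df.loc["datatype"] == "train"

# Label-based row slice ('feature0' to the end) combined with the column mask.
train_features = df.loc["feature0":, train_mask]
print(train_features)
```

Passing the boolean Series straight to `.loc` should behave the same as `df.columns[mask]` here, because the mask is indexed by the column labels.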

You may also want to transpose it so that it is better suited for machine learning:

In [36]: col_mask = df.loc['datatype']=='train'

In [37]: col_mask
Out[37]:
0     True
1     True
2    False
Name: datatype, dtype: bool

In [38]: df.loc['feature0':, df.columns[col_mask]].T.set_index(df.loc['filename'][col_mask])
Out[38]:
                  feature0  feature1   feature2    feature3  feature4    ...     feature9 feature10 feature11 feature12 feature13
filename                                                                 ...
CF02_B1_D1_M3.wav  18.2796  -3.92135    13.6118    -7.25019  -11.7736    ...     -13.8793  -11.2372  -13.1699  -13.2874    18.529
F02_B2_D1_M2.wav   17.8995  -15.5039  -0.741729  -0.0536188  -6.73465    ...     -2.21513  -14.6925   7.65947   3.11805   17.9096

[2 rows x 14 columns]
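
A related sketch, assuming the end goal is a numeric feature matrix plus a label vector: transpose first so each file is a row, filter the rows, then convert the dtypes. The frame is again a small hand-made stand-in, and `samples`, `X_train` and `y_train` are illustrative names. The conversion step is there because each original column mixes strings and numbers, so the transposed frame holds generic Python objects rather than floats.

```python
import pandas as pd

# Stand-in for the question's frame: samples are columns, fields are rows.
df = pd.DataFrame(
    {
        0: ["CF02_B1_D1_M3.wav", "train", "1", 18.2796, -3.92135],
        1: ["F02_B2_D1_M2.wav", "train", "1", 17.8995, -15.5039],
        2: ["F02_B3_D6_M3.wav", "test", "6", 18.0531, -31.0344],
    },
    index=["filename", "datatype", "label", "feature0", "feature1"],
)

# One row per wav file, one column per field.
samples = df.T

# Ordinary row-wise boolean indexing now works on the 'datatype' column.
train = samples[samples["datatype"] == "train"].set_index("filename")

# Split metadata from features and convert object columns to numeric types.
X_train = train.drop(columns=["datatype", "label"]).astype(float)
y_train = train["label"].astype(int)

print(X_train.dtypes)
print(y_train)
```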

Answer 1 (score: 1)

I believe you need:

df = df.loc[:, df.loc['datatype'] == 'train']
print(df)
                           1                 2
filename   CF02_B1_D1_M3.wav  F02_B2_D1_M2.wav
datatype               train             train
label                      1                 1
feature0             18.2796           17.8995
feature1            -3.92135          -15.5039
feature2             13.6118         -0.741729
feature3            -7.25019        -0.0536188
feature4            -11.7736          -6.73465
feature5             -18.265           4.39842
feature6            -18.1045          -1.88591
feature7            -10.3347          -12.4131
feature8            -15.5189           1.84178
feature9            -13.8793          -2.21513
feature10           -11.2372          -14.6925
feature11           -13.1699           7.65947
feature12           -13.2874           3.11805
feature13             18.529           17.9096

Then, if you need to remove the first 3 rows by name:

df = df.drop(['filename','datatype','label'])

Or by position:

df = df.iloc[3:]
print(df)
                  1           2
feature0    18.2796     17.8995
feature1   -3.92135    -15.5039
feature2    13.6118   -0.741729
feature3   -7.25019  -0.0536188
feature4   -11.7736    -6.73465
feature5    -18.265     4.39842
feature6   -18.1045    -1.88591
feature7   -10.3347    -12.4131
feature8   -15.5189     1.84178
feature9   -13.8793    -2.21513
feature10  -11.2372    -14.6925
feature11  -13.1699     7.65947
feature12  -13.2874     3.11805
feature13    18.529     17.9096
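
A small sketch comparing the two ways of removing the metadata rows, using a hand-made stand-in for the question's frame (the names `by_name` and `by_position` are only illustrative). Dropping by name does not depend on where the metadata rows sit, whereas `iloc[3:]` assumes they are exactly the first three rows.

```python
import pandas as pd

# Stand-in for the question's frame: samples are columns, fields are rows.
df = pd.DataFrame(
    {
        0: ["CF02_B1_D1_M3.wav", "train", "1", 18.2796, -3.92135],
        1: ["F02_B2_D1_M2.wav", "train", "1", 17.8995, -15.5039],
        2: ["F02_B3_D6_M3.wav", "test", "6", 18.0531, -31.0344],
    },
    index=["filename", "datatype", "label", "feature0", "feature1"],
)

# Keep only the columns whose 'datatype' row is 'train'.
train = df.loc[:, df.loc["datatype"] == "train"]

# Remove the metadata rows by label ...
by_name = train.drop(["filename", "datatype", "label"])

# ... or by position, assuming they are the first three rows.
by_position = train.iloc[3:]

# Both approaches should leave the same feature rows.
print(by_name.equals(by_position))
```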