Create a pandas DataFrame with unique labels derived from file names

Date: 2015-12-02 11:57:07

Tags: python numpy pandas

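I am new to pandas and am trying to use it to create a dataset for a convolutional neural network. What I want to end up with is a DataFrame in which each column represents the label of the data items.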

First, I find all the data items and read their paths into two lists:

import os

video_path='/home/richard/Documents/datasets/ucf_sports/mod'

all_videos_path = []
all_videos = []

for root, dirs, files in os.walk(video_path):
    for file in files:
        if file.endswith(".avi"):
            all_videos.append(os.path.join(root, file))
            all_videos_path.append(root)

So all_videos_path looks like this:

['/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/004',
'/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/001',
'/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/003',
'/home/richard/Documents/datasets/ucf_sports/mod/Lifting/004',
'/home/richard/Documents/datasets/ucf_sports/mod/Lifting/001']

Then I find the labels of the data items using:

# index 7 is the path component right after 'mod' in the paths shown above
all_labels = [x.split('/')[7] for x in all_videos_path]
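The fixed index passed to split('/') depends on how deep video_path sits in the file system, so it is easy to get it off by one. A minimal sketch of an alternative, assuming the label is always the first directory below video_path; label_from_path is a hypothetical helper name, not something from the code above:

```python
import os

video_path = '/home/richard/Documents/datasets/ucf_sports/mod'

def label_from_path(path, base=video_path):
    # Hypothetical helper: path relative to `base`,
    # e.g. 'GolfSwing/Golf-Swing-Side/004'
    rel = os.path.relpath(path, base)
    # Assumption: the first component of the relative path is the label
    return rel.split(os.sep)[0]

all_videos_path = [
    '/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/004',
    '/home/richard/Documents/datasets/ucf_sports/mod/Lifting/001',
]
all_labels = [label_from_path(p) for p in all_videos_path]
```

Written this way, the label extraction keeps working if video_path is moved to a directory at a different depth.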

Then I find the unique labels that are in use:

unique_labels = np.unique(all_labels)

Output:

array(['GolfSwing','Lifting'], 
  dtype='|S13')

Then I create a Series of the unique labels using:

label_dict = pd.Series(range(len(unique_labels)), index=unique_labels)

Output:

GolfSwing        0
Lifting          1
dtype: int64

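So now I want to create a DataFrame that has the unique labels as column headers, with every data item sorted into its respective column. As you can see, the categories contain different amounts of data, so each column needs a different number of rows. I have been trying to create such a DataFrame, but without luck. Is this actually achievable in pandas? If so, how would I do it?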

Thanks in advance.

1 Answer:

Answer 0: (score: 1)

IIUC, you want to reshape the DataFrame with pivot. The differing row counts are the problem, though: you get NaN values:

import pandas as pd

all_videos_path = ['/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/004',
'/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/001',
'/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/003',
'/home/richard/Documents/datasets/ucf_sports/mod/Lifting/004',
'/home/richard/Documents/datasets/ucf_sports/mod/Lifting/001']

#create dataframe with list all_videos_path
df =  pd.DataFrame({'links': all_videos_path})
#create new column with labels
df['labels'] = df['links'].str.split('/').str[7]
print(df)
                                               links     labels
0  /home/richard/Documents/datasets/ucf_sports/mo...  GolfSwing
1  /home/richard/Documents/datasets/ucf_sports/mo...  GolfSwing
2  /home/richard/Documents/datasets/ucf_sports/mo...  GolfSwing
3  /home/richard/Documents/datasets/ucf_sports/mo...    Lifting
4  /home/richard/Documents/datasets/ucf_sports/mo...    Lifting

#pivot so that each unique label becomes its own column
df = df.pivot(index='links', columns='labels', values='labels').reset_index()
print(df)
labels                                              links  GolfSwing  Lifting
0       /home/richard/Documents/datasets/ucf_sports/mo...  GolfSwing      NaN
1       /home/richard/Documents/datasets/ucf_sports/mo...  GolfSwing      NaN
2       /home/richard/Documents/datasets/ucf_sports/mo...  GolfSwing      NaN
3       /home/richard/Documents/datasets/ucf_sports/mo...        NaN  Lifting
4       /home/richard/Documents/datasets/ucf_sports/mo...        NaN  Lifting

df.loc[df['GolfSwing'].notnull() , 'GolfSwing'] = df['links']
df.loc[df['Lifting'].notnull() , 'Lifting'] = df['links']
del df['links']
print(df)
labels                                          GolfSwing  \
0       /home/richard/Documents/datasets/ucf_sports/mo...   
1       /home/richard/Documents/datasets/ucf_sports/mo...   
2       /home/richard/Documents/datasets/ucf_sports/mo...   
3                                                     NaN   
4                                                     NaN   

labels                                            Lifting  
0                                                     NaN  
1                                                     NaN  
2                                                     NaN  
3       /home/richard/Documents/datasets/ucf_sports/mo...  
4       /home/richard/Documents/datasets/ucf_sports/mo...
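If the interleaved NaN values are a nuisance, one possible alternative is to number the rows within each label with groupby().cumcount() and pivot on that number instead of on the links. This is only a sketch under the same assumptions as above (the label is path component 7); the helper column 'row' and the name compact are made up for illustration. The columns still have equal length, because a DataFrame is rectangular, but the padding NaN values end up at the bottom of the shorter columns:

```python
import pandas as pd

all_videos_path = [
    '/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/004',
    '/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/001',
    '/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/003',
    '/home/richard/Documents/datasets/ucf_sports/mod/Lifting/004',
    '/home/richard/Documents/datasets/ucf_sports/mod/Lifting/001',
]

df = pd.DataFrame({'links': all_videos_path})
df['labels'] = df['links'].str.split('/').str[7]

# Number the rows inside each label group: 0, 1, 2, ... per label
df['row'] = df.groupby('labels').cumcount()

# One column per label, indexed by position within the group;
# shorter groups are padded with NaN at the bottom
compact = df.pivot(index='row', columns='labels', values='links')
```

Calling compact['Lifting'].dropna() then gives just the paths of that label without the padding.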