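I'm new to pandas, and I'm trying to use it to build a dataset for a convolutional neural network. What I want to end up with is a DataFrame in which each column represents the label of the data items.
First, I find all the data items and read their paths into two lists: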
import os
import numpy as np
import pandas as pd

video_path = '/home/richard/Documents/datasets/ucf_sports/mod'
all_videos_path = []  # directory that contains each video
all_videos = []       # full path of each video file
for root, dirs, files in os.walk(video_path):
    for file in files:
        if file.endswith(".avi"):
            all_videos.append(os.path.join(root, file))
            all_videos_path.append(root)
So all_videos_path looks like this:
['/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/004',
'/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/001',
'/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/003',
'/home/richard/Documents/datasets/ucf_sports/mod/Lifting/004',
'/home/richard/Documents/datasets/ucf_sports/mod/Lifting/001']
I then get the label of each data item using:
# index 7 is the folder directly below mod/ in the paths shown above
all_labels = [p.split('/')[7] for p in all_videos_path]
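The label extraction above relies on a hard-coded position in the split path, which breaks if the dataset is moved to a directory at a different depth. As a rough sketch of one alternative, the label can be taken as the first path component below video_path. The helper name label_of is hypothetical (it is not part of the original code), and the sketch assumes POSIX-style paths like the ones shown.

```python
import os

video_path = '/home/richard/Documents/datasets/ucf_sports/mod'
all_videos_path = [
    '/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/004',
    '/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/001',
    '/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/003',
    '/home/richard/Documents/datasets/ucf_sports/mod/Lifting/004',
    '/home/richard/Documents/datasets/ucf_sports/mod/Lifting/001',
]


def label_of(path, root=video_path):
    """Hypothetical helper: return the first directory name below root."""
    relative = os.path.relpath(path, root)  # e.g. 'GolfSwing/Golf-Swing-Side/004'
    return relative.split(os.sep)[0]


all_labels = [label_of(p) for p in all_videos_path]
print(all_labels)
```

Because the label is computed relative to video_path, the same code keeps working if only video_path changes.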
Then I find the unique labels in use:
unique_labels = np.unique(all_labels)
Output:
array(['GolfSwing','Lifting'],
dtype='|S13')
Then I create a Series from the unique labels using:
label_dict = pd.Series(range(len(unique_labels)), index=unique_labels)
Output:
GolfSwing 0
Lifting 1
dtype: int64
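For context, a Series like label_dict can be used to turn each item's label into its integer class index, which is typically what a network's loss function expects. The following is only a sketch under that assumption: the name label_ids is introduced here for illustration, and the label list is copied from the example paths rather than computed.

```python
import numpy as np
import pandas as pd

# Labels of the five example items, in the same order as all_videos_path
all_labels = ['GolfSwing', 'GolfSwing', 'GolfSwing', 'Lifting', 'Lifting']

unique_labels = np.unique(all_labels)
label_dict = pd.Series(range(len(unique_labels)), index=unique_labels)

# Look up every label in label_dict (Series.map accepts a Series as the mapping)
label_ids = pd.Series(all_labels).map(label_dict)
print(label_ids.tolist())
```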
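So now I want to create a DataFrame that has the unique labels as column headers, with every data item sorted into its respective column. As you can see, the categories contain different numbers of items, so each column needs a different number of rows. I've been trying to create such a DataFrame, without any luck. Is this actually possible in pandas? If so, how would I do it?
Thanks in advance.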
Answer 0 (score: 1)
IIUC, you want to reshape the DataFrame with pivot. The differing numbers of rows are a problem, though: you get NaN values:
import pandas as pd
all_videos_path = ['/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/004',
'/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/001',
'/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/003',
'/home/richard/Documents/datasets/ucf_sports/mod/Lifting/004',
'/home/richard/Documents/datasets/ucf_sports/mod/Lifting/001']
# create a DataFrame from the list all_videos_path
df = pd.DataFrame({'links': all_videos_path})
# create a new column holding the label (element 7 of the split path)
df['labels'] = df['links'].str.split('/').str[7]
print(df)
links labels
0 /home/richard/Documents/datasets/ucf_sports/mo... GolfSwing
1 /home/richard/Documents/datasets/ucf_sports/mo... GolfSwing
2 /home/richard/Documents/datasets/ucf_sports/mo... GolfSwing
3 /home/richard/Documents/datasets/ucf_sports/mo... Lifting
4 /home/richard/Documents/datasets/ucf_sports/mo... Lifting
# pivot: one column per label; cells that do not match the label become NaN
df = df.pivot(index='links', columns='labels', values='labels').reset_index()
print(df)
labels links GolfSwing Lifting
0 /home/richard/Documents/datasets/ucf_sports/mo... GolfSwing NaN
1 /home/richard/Documents/datasets/ucf_sports/mo... GolfSwing NaN
2 /home/richard/Documents/datasets/ucf_sports/mo... GolfSwing NaN
3 /home/richard/Documents/datasets/ucf_sports/mo... NaN Lifting
4 /home/richard/Documents/datasets/ucf_sports/mo... NaN Lifting
# wherever a label column is filled, replace the label text with the link itself
df.loc[df['GolfSwing'].notnull(), 'GolfSwing'] = df['links']
df.loc[df['Lifting'].notnull(), 'Lifting'] = df['links']
del df['links']
print(df)
labels GolfSwing \
0 /home/richard/Documents/datasets/ucf_sports/mo...
1 /home/richard/Documents/datasets/ucf_sports/mo...
2 /home/richard/Documents/datasets/ucf_sports/mo...
3 NaN
4 NaN
labels Lifting
0 NaN
1 NaN
2 NaN
3 /home/richard/Documents/datasets/ucf_sports/mo...
4 /home/richard/Documents/datasets/ucf_sports/mo...
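The result above keeps one row per link, so every column is mostly NaN. If the goal is instead to stack each label's links from the top of its column, one possible variation (a different technique from the pivot on links shown above, offered only as a sketch) is to number the rows within each label using groupby(...).cumcount() and pivot on that counter. The names row and compact are introduced here for illustration. Shorter columns are still padded with NaN at the bottom, because all columns of a DataFrame share the same index.

```python
import pandas as pd

all_videos_path = [
    '/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/004',
    '/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/001',
    '/home/richard/Documents/datasets/ucf_sports/mod/GolfSwing/Golf-Swing-Side/003',
    '/home/richard/Documents/datasets/ucf_sports/mod/Lifting/004',
    '/home/richard/Documents/datasets/ucf_sports/mod/Lifting/001',
]

df = pd.DataFrame({'links': all_videos_path})
df['labels'] = df['links'].str.split('/').str[7]

# Position of each link within its own label group: 0, 1, 2, ...
df['row'] = df.groupby('labels').cumcount()

# One column per label, indexed by the within-group position
compact = df.pivot(index='row', columns='labels', values='links')
print(compact)
```

Here compact has three rows: GolfSwing fills all of them, while Lifting fills the first two and has NaN in the last. Calling compact['Lifting'].dropna() recovers just the links for that label.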