Manipulating a DataFrame created from a dict (tuple -> float)

Time: 2016-06-15 04:16:19

Tags: python python-3.x dictionary pandas dataframe

I have been trying to create a DataFrame from a dictionary with the following structure.

imgAlbmShrDict = {('10', 'photo_album_57'): 20.0,
 ('10', 'photo_album_8'): 20.0,
 ('1061', 'photo_album_29'): 100.0,
 ('1061', 'photo_album_90'): 90.0,
 ('1102', 'photo_album_29'): 80.0,
 ('1102', 'photo_album_90'): 60.0,
 ('1300', 'photo_album_15'): 100.0,
 ('1300', 'photo_album_89'): 60.0,
 ('1301', 'photo_album_15'): 88.88888888888889,
 ('1301', 'photo_album_89'): 60.0
}

import pandas as pd

# Tuple keys become a two-level MultiIndex after transposing.
df = pd.DataFrame(imgAlbmShrDict, index=['Proportion']).transpose()
print(df)

                    Proportion
10   photo_album_57   20.000000
     photo_album_8    20.000000
1061 photo_album_29  100.000000
     photo_album_90   90.000000
1102 photo_album_29   80.000000
     photo_album_90   60.000000
1300 photo_album_15  100.000000
     photo_album_89   60.000000
1301 photo_album_15   88.888889
     photo_album_89   60.000000

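The output is exactly what I need, but I can't extract just the first two columns from the DataFrame. The first column is actually the image ID, and the second is the album in which the image appears.

I need help with how to access those columns and how to add a new column while preserving the structure.

Required output: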

                     Proportion  URL
10   photo_album_57   20.000000  www.something.com/10.jpeg
     photo_album_8    20.000000
1061 photo_album_29  100.000000  www.something.com/1061.jpeg
     photo_album_90   90.000000
1102 photo_album_29   80.000000  www.something.com/1102.jpeg
     photo_album_90   60.000000
1300 photo_album_15  100.000000  www.something.com/1300.jpeg
     photo_album_89   60.000000
1301 photo_album_15   88.888889  www.something.com/1301.jpeg
     photo_album_89   60.000000

1 Answer:

Answer 0: (score: 2)

You can use get_level_values, because the first two "columns" are actually the levels of a MultiIndex:

print (df.index.get_level_values(0))
Index(['10', '10', '1061', '1061', '1102', '1102', '1300', '1300', '1301',
       '1301'],
      dtype='object')
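To make the index structure easier to see, here is a minimal self-contained sketch that rebuilds a shortened version of the question's frame and reads both levels. The shortened dictionary and the variable names `image_ids` and `albums` are only for illustration.

```python
import pandas as pd

# Shortened copy of the dictionary from the question (illustrative subset).
img_albm_shr = {
    ('10', 'photo_album_57'): 20.0,
    ('10', 'photo_album_8'): 20.0,
    ('1061', 'photo_album_29'): 100.0,
    ('1061', 'photo_album_90'): 90.0,
}

# Same construction as in the question: tuple keys end up as a MultiIndex.
df = pd.DataFrame(img_albm_shr, index=['Proportion']).transpose()

# Level 0 holds the image IDs, level 1 the album names.
image_ids = df.index.get_level_values(0)
albums = df.index.get_level_values(1)

print(list(image_ids))
print(list(albums))
```

Both results are plain Index objects aligned row by row with the frame, which is why they can be used directly to build a new column.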

df['URL'] = 'www.something.com/' + df.index.get_level_values(0) + '.jpg'
print (df) 
                     Proportion                         URL
10   photo_album_57   20.000000    www.something.com/10.jpg
     photo_album_8    20.000000    www.something.com/10.jpg
1061 photo_album_29  100.000000  www.something.com/1061.jpg
     photo_album_90   90.000000  www.something.com/1061.jpg
1102 photo_album_29   80.000000  www.something.com/1102.jpg
     photo_album_90   60.000000  www.something.com/1102.jpg
1300 photo_album_15  100.000000  www.something.com/1300.jpg
     photo_album_89   60.000000  www.something.com/1300.jpg
1301 photo_album_15   88.888889  www.something.com/1301.jpg
     photo_album_89   60.000000  www.something.com/1301.jpg

You may then need drop_duplicates:

df = df.drop_duplicates(subset='URL')
print (df) 
                     Proportion                         URL
10   photo_album_57   20.000000    www.something.com/10.jpg
1061 photo_album_29  100.000000  www.something.com/1061.jpg
1102 photo_album_29   80.000000  www.something.com/1102.jpg
1300 photo_album_15  100.000000  www.something.com/1300.jpg
1301 photo_album_15   88.888889  www.something.com/1301.jpg
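Note that drop_duplicates removes the repeated rows entirely, so the second album of each image is lost. The sketch below, built on a shortened illustrative frame, shows how the `keep` parameter decides which album row survives; the names `first_rows` and `last_rows` are hypothetical.

```python
import pandas as pd

# Small illustrative frame with the same two-level index as in the question.
index = pd.MultiIndex.from_tuples([
    ('10', 'photo_album_57'),
    ('10', 'photo_album_8'),
    ('1061', 'photo_album_29'),
    ('1061', 'photo_album_90'),
])
df = pd.DataFrame({'Proportion': [20.0, 20.0, 100.0, 90.0]}, index=index)
df['URL'] = 'www.something.com/' + df.index.get_level_values(0) + '.jpg'

# keep='first' is the default: the first album row per URL is retained.
first_rows = df.drop_duplicates(subset='URL')

# keep='last' retains the last album row per URL instead.
last_rows = df.drop_duplicates(subset='URL', keep='last')

print(first_rows)
print(last_rows)
```

If every (image, album) row has to stay in the result, this approach is not suitable and the masking approach in EDIT1 is closer to the required output.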

Another solution, using reset_index and setting the column names:

df.reset_index(inplace=True)
df.columns = ['ID','Album','Proportion']
df['URL'] = 'www.something.com/' + df['ID'] + '.jpg'
print (df)
     ID           Album  Proportion                         URL
0    10  photo_album_57   20.000000    www.something.com/10.jpg
1    10   photo_album_8   20.000000    www.something.com/10.jpg
2  1061  photo_album_29  100.000000  www.something.com/1061.jpg
3  1061  photo_album_90   90.000000  www.something.com/1061.jpg
4  1102  photo_album_29   80.000000  www.something.com/1102.jpg
5  1102  photo_album_90   60.000000  www.something.com/1102.jpg
6  1300  photo_album_15  100.000000  www.something.com/1300.jpg
7  1300  photo_album_89   60.000000  www.something.com/1300.jpg
8  1301  photo_album_15   88.888889  www.something.com/1301.jpg
9  1301  photo_album_89   60.000000  www.something.com/1301.jpg
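One caveat: assigning a three-element list to df.columns assumes the frame has exactly three columns after reset_index, so it fails if a URL column was already added. A possible variant, sketched here on a shortened illustrative frame, names the index levels with rename_axis before flattening; the variable name `flat` is hypothetical.

```python
import pandas as pd

# Small illustrative frame with an unnamed two-level index.
index = pd.MultiIndex.from_tuples([
    ('10', 'photo_album_57'),
    ('10', 'photo_album_8'),
    ('1061', 'photo_album_29'),
    ('1061', 'photo_album_90'),
])
df = pd.DataFrame({'Proportion': [20.0, 20.0, 100.0, 90.0]}, index=index)

# Name the index levels first, then turn them into ordinary columns.
# This does not depend on how many data columns the frame has.
flat = df.rename_axis(index=['ID', 'Album']).reset_index()
flat['URL'] = 'www.something.com/' + flat['ID'] + '.jpg'

print(flat)
```

This returns a new frame and leaves the original df with its MultiIndex intact.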

EDIT1:

Thanks to stephen for the solution.

I tried to improve it by using boolean indexing with Index.duplicated:

mask = ~df.index.get_level_values(0).duplicated()
print (mask)
[ True False  True False  True False  True False  True False]

subindex = df.index[mask]

df.loc[subindex, 'URL'] = 'www.something.com/' + subindex.get_level_values(0) + '.jpg'
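# Assign the result back; calling fillna(inplace=True) on df.URL may not update df in recent pandas.
df['URL'] = df['URL'].fillna('')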
print (df)
                     Proportion                         URL
10   photo_album_57   20.000000    www.something.com/10.jpg
     photo_album_8    20.000000                            
1061 photo_album_29  100.000000  www.something.com/1061.jpg
     photo_album_90   90.000000                            
1102 photo_album_29   80.000000  www.something.com/1102.jpg
     photo_album_90   60.000000                            
1300 photo_album_15  100.000000  www.something.com/1300.jpg
     photo_album_89   60.000000                            
1301 photo_album_15   88.888889  www.something.com/1301.jpg
     photo_album_89   60.000000
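The same result can probably be reached without the intermediate NaN values by building the full URL column first and blanking the repeats with Series.where. This is only a sketch on a shortened illustrative frame, and the names `mask` and `urls` are chosen for the example.

```python
import pandas as pd

# Small illustrative frame with the same two-level index as in the question.
index = pd.MultiIndex.from_tuples([
    ('10', 'photo_album_57'),
    ('10', 'photo_album_8'),
    ('1061', 'photo_album_29'),
    ('1061', 'photo_album_90'),
])
df = pd.DataFrame({'Proportion': [20.0, 20.0, 100.0, 90.0]}, index=index)

# True for the first row of each image ID, False for later repeats.
mask = ~df.index.get_level_values(0).duplicated()

# Build a URL for every row, aligned to the frame's index.
urls = pd.Series(
    'www.something.com/' + df.index.get_level_values(0) + '.jpg',
    index=df.index,
)

# Keep the URL where mask is True, use an empty string elsewhere.
df['URL'] = urls.where(mask, '')

print(df)
```

All rows and the MultiIndex are preserved; only the repeated URLs are replaced by empty strings.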