I am trying to take a nested DataFrame and convert it to a nested dictionary.
Here is my original DataFrame, with the following unique values:
Input: df.head(5)
Output:
reviewerName title reviewerRatings
0 Charles Harry Potter Book Seven News:... 3.0
1 Katherine Harry Potter Boxed Set, Books... 5.0
2 Lora Harry Potter and the Sorcerer... 5.0
3 Cait Harry Potter and the Half-Blo... 5.0
4 Diane Harry Potter and the Order of... 5.0
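Input: len(df['reviewerName'].unique())
Output: 66130
Since each of the 66130 unique values has multiple rows (e.g. "Charles" occurs 3 times), I took the 66130 unique 'reviewerName' values and assigned them all as keys in a new nested DataFrame, then assigned 'title' and 'reviewerRatings' as another layer of key:value pairs within the same nested DataFrame.
Input: df = df.set_index(['reviewerName', 'title']).sort_index()
Output: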
reviewerRatings
reviewerName title
Charles Harry Potter Book Seven News:... 3.0
Harry Potter and the Half-Blo... 3.5
Harry Potter and the Order of... 4.0
Katherine Harry Potter Boxed Set, Books... 5.0
Harry Potter and the Half-Blo... 2.5
Harry Potter and the Order of... 5.0
...
230898 rows x 1 columns
As a follow-up to my first question, I tried to convert the nested DataFrame to a nested dictionary.
In the new nested DataFrame above, the column header shows 'reviewerRatings' in the first row (3rd column) and 'reviewerName' and 'title' in the second row (1st and 2nd columns). When I run the df.to_dict() method, the output has the form {reviewerRatingsIndexName: {(reviewerName, title): reviewerRatings}}:
Input: df.to_dict()
Output:
{'reviewerRatings':
{
('Charles', 'Harry Potter Book Seven News:...'): 3.0,
('Charles', 'Harry Potter and the Half-Blo...'): 3.5,
('Charles', 'Harry Potter and the Order of...'): 4.0,
('Katherine', 'Harry Potter Boxed Set, Books...'): 5.0,
('Katherine', 'Harry Potter and the Half-Blo...'): 2.5,
('Katherine', 'Harry Potter and the Order of...'): 5.0,
...}
}
But my desired output has the form {reviewerName: {title: reviewerRating}}, which is exactly how I sorted the nested DataFrame:
{'Charles':
{'Harry Potter Book Seven News:...': 3.0,
'Harry Potter and the Half-Blo...': 3.5,
'Harry Potter and the Order of...': 4.0},
'Katherine':
{'Harry Potter Boxed Set, Books...': 5.0,
'Harry Potter and the Half-Blo...': 2.5,
'Harry Potter and the Order of...': 5.0},
...}
Is there a way to manipulate the nested DataFrame or the nested dictionary so that running df.to_dict() produces {reviewerName: {title: reviewerRating}}?
Thanks!
Answer 0 (score: 4)
Use groupby with a lambda function to build a dictionary per reviewerName, then convert the resulting Series with to_dict:
print (df)
reviewerName title reviewerRatings
0 Charles Harry Potter Book Seven News:... 3.0
1 Charles Harry Potter Boxed Set, Books... 5.0
2 Charles Harry Potter and the Sorcerer... 5.0
3 Katherine Harry Potter and the Half-Blo... 5.0
4 Katherine Harry otter and the Order of... 5.0
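# Select both columns with a list (double brackets); newer pandas versions reject the bare-tuple form.
d = (df.groupby('reviewerName')[['title', 'reviewerRatings']]
       .apply(lambda x: dict(x.values))
       .to_dict())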
print (d)
{
'Charles': {
'Harry Potter Book Seven News:...': 3.0,
'Harry Potter Boxed Set, Books...': 5.0,
'Harry Potter and the Sorcerer...': 5.0
},
'Katherine': {
'Harry Potter and the Half-Blo...': 5.0,
'Harry otter and the Order of...': 5.0
}
}
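The question starts from a frame that has already been indexed by reviewerName and title. A minimal sketch of building the same nested dictionary directly from that MultiIndex follows. The sample rows and the names `indexed` and `nested` are made up for illustration, and this is one possible variant, not the method used in this answer.

```python
import pandas as pd

# Hypothetical sample with the same column names as the question.
df = pd.DataFrame({
    'reviewerName': ['Charles', 'Charles', 'Katherine'],
    'title': ['Book A', 'Book B', 'Book A'],
    'reviewerRatings': [3.0, 3.5, 5.0],
})

# Same indexing step as in the question.
indexed = df.set_index(['reviewerName', 'title']).sort_index()

# Group the ratings by the outer index level, drop that level inside each
# group, and let Series.to_dict build the inner {title: rating} mapping.
nested = {
    name: ratings.droplevel('reviewerName').to_dict()
    for name, ratings in indexed['reviewerRatings'].groupby(level='reviewerName')
}
print(nested)
```

This keeps the set_index step from the question and only changes how the dictionary is assembled.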
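Answer 1 (score: 1)
There are two approaches. You can use groupby with to_dict, or use collections.defaultdict and iterate over the rows. Notably, the latter is not necessarily less efficient.
groupby + to_dict
Construct a Series from each groupby group and convert it to a dictionary, which gives a Series of dictionaries. Finally, convert that to a dictionary of dictionaries with another to_dict call.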
res = df.groupby('reviewerName')\
.apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict())\
.to_dict()
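To make the two stages visible, here is a small sketch that splits the chain into its intermediate Series and the final dictionary. The sample rows and the name `per_reviewer` are illustrative assumptions, and the explicit column selection before apply is an addition that is not part of the snippet above.

```python
import pandas as pd

# Hypothetical sample with the same column names as the question.
df = pd.DataFrame({
    'reviewerName': ['Charles', 'Charles', 'Katherine'],
    'title': ['Book A', 'Book B', 'Book A'],
    'reviewerRatings': [3.0, 3.5, 5.0],
})

# Stage 1: one {title: rating} dict per reviewer, held in a Series
# indexed by reviewerName.
per_reviewer = (
    df.groupby('reviewerName')[['title', 'reviewerRatings']]
      .apply(lambda g: g.set_index('title')['reviewerRatings'].to_dict())
)

# Stage 2: the outer to_dict turns that Series into a dict of dicts.
res = per_reviewer.to_dict()
print(res)
```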
collections.defaultdict
Define a defaultdict of dict objects and iterate over the dataframe row by row.
from collections import defaultdict
res = defaultdict(dict)
for row in df.itertuples(index=False):
    res[row.reviewerName][row.title] = row.reviewerRatings
Since defaultdict is a subclass of dict, there is generally no need to convert the resulting defaultdict back to a regular dict.
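A short sketch of what the subclass relationship means in practice, with one caveat that may matter for some callers: reading a missing key from a defaultdict inserts it. The keys and values here are made up for illustration.

```python
from collections import defaultdict

res = defaultdict(dict)
res['Charles']['Book A'] = 3.0

# A defaultdict passes isinstance checks and compares equal to a plain dict.
is_dict = isinstance(res, dict)
same_content = res == {'Charles': {'Book A': 3.0}}

# Optional shallow conversion; the inner dicts are shared, not copied.
plain = dict(res)

# Caveat: looking up a missing key on the defaultdict creates an empty entry,
# while the plain copy is unaffected and would raise KeyError instead.
_ = res['Nobody']
inserted = 'Nobody' in res
in_plain = 'Nobody' in plain
```

If downstream code relies on KeyError for unknown reviewers, converting with dict(res) once the loop has finished may be the safer choice.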
Benchmarking is setup- and data-dependent. You should test with your own data to see which approach works best.
# Python 3.6.5, Pandas 0.19.2
from collections import defaultdict
from random import sample
import numpy as np
import pandas as pd

# construct sample dataframe
np.random.seed(0)
n = 10**4  # number of rows
names = np.random.choice(['Charles', 'Lora', 'Katherine', 'Matthew',
                          'Mark', 'Luke', 'John'], n)
books = [f'Book_{i}' for i in sample(range(10**5), n)]
ratings = np.random.randint(0, 6, n)
df = pd.DataFrame({'reviewerName': names, 'title': books, 'reviewerRatings': ratings})

def jez(df):
    # columns are selected with a list; newer pandas versions reject the bare-tuple form
    return df.groupby('reviewerName')[['title', 'reviewerRatings']]\
             .apply(lambda x: dict(x.values))\
             .to_dict()

def jpp1(df):
    return df.groupby('reviewerName')\
             .apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict())\
             .to_dict()

def jpp2(df):
    dd = defaultdict(dict)
    for row in df.itertuples(index=False):
        dd[row.reviewerName][row.title] = row.reviewerRatings
    return dd

%timeit jez(df)   # 33.5 ms per loop
%timeit jpp1(df)  # 17 ms per loop
%timeit jpp2(df)  # 21.1 ms per loop
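Timings are only comparable if the approaches return the same mapping. A minimal sketch of such a check on a tiny made-up frame follows. The variable names are illustrative, and it assumes titles are unique per reviewer, as in the benchmark data.

```python
from collections import defaultdict
import pandas as pd

# Hypothetical sample with the same column names as the question.
df = pd.DataFrame({
    'reviewerName': ['Charles', 'Charles', 'Katherine'],
    'title': ['Book A', 'Book B', 'Book A'],
    'reviewerRatings': [3.0, 3.5, 5.0],
})
cols = ['title', 'reviewerRatings']

# Approach 1: dict() over each group's (title, rating) rows.
by_values = (df.groupby('reviewerName')[cols]
               .apply(lambda x: dict(x.values))
               .to_dict())

# Approach 2: per-group Series indexed by title, converted with to_dict.
by_series = (df.groupby('reviewerName')[cols]
               .apply(lambda x: x.set_index('title')['reviewerRatings'].to_dict())
               .to_dict())

# Approach 3: defaultdict filled row by row.
by_loop = defaultdict(dict)
for row in df.itertuples(index=False):
    by_loop[row.reviewerName][row.title] = row.reviewerRatings

# All three should describe the same {reviewerName: {title: rating}} mapping.
all_equal = by_values == by_series == by_loop
print(all_equal)
```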