I have a dictionary that maps product names to the unique email addresses of the customers who purchased each item:
customer_emails = {
'Backpack':['customer1@gmail.com','customer2@gmail.com','customer3@yahoo.com','customer4@msn.com'],
'Baseball Bat':['customer1@gmail.com','customer3@yahoo.com','customer5@gmail.com'],
'Gloves':['customer2@gmail.com','customer3@yahoo.com','customer4@msn.com']}
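I'm trying to iterate over the values for each key and count how many of its emails also appear under each of the other keys. I converted this dictionary to a DataFrame and got the answer I wanted for a single pair of columns using something like this:
customers[customers['Baseball Bat'].notna() == True]['Baseball Bat'].isin(customers['Gloves']).sum()
What I want to end up with is a DataFrame that looks essentially like this, so that I can easily use it for a correlation plot.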
              Backpack  Baseball Bat  Gloves
Backpack             4             2       3
Baseball Bat         2             3       1
Gloves               3             1       3
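I think the way to do this is to iterate over the customer_emails dictionary, but I'm not sure how to pick one key, compare its values against those of every other key, and then store the results.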
Answer 0 (score: 6)
Start with pd.DataFrame.from_dict:
df = pd.DataFrame.from_dict(customer_emails, orient='index').T
df
              Backpack         Baseball Bat               Gloves
0  customer1@gmail.com  customer1@gmail.com  customer2@gmail.com
1  customer2@gmail.com  customer3@yahoo.com  customer3@yahoo.com
2  customer3@yahoo.com  customer5@gmail.com    customer4@msn.com
3    customer4@msn.com                 None                 None
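Now, use stack + get_dummies + groupby + sum + dot:
# sum(level=1) was removed in pandas 2.0; groupby(level=1).sum() is its replacement
v = df.stack().str.get_dummies().groupby(level=1).sum()
v.dot(v.T)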
              Backpack  Baseball Bat  Gloves
Backpack             4             2       3
Baseball Bat         2             3       1
Gloves               3             1       3
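To see why the final dot works: v is a product-by-email indicator matrix (1 where that customer bought that product), so each entry of v.dot(v.T) counts the emails two products have in common, and the diagonal is each product's own customer count. Below is a minimal sketch that builds the same kind of indicator matrix by hand. The names all_emails, indicator and overlap are made up for illustration and are not part of the answer above.

```python
import pandas as pd

customer_emails = {
    'Backpack': ['customer1@gmail.com', 'customer2@gmail.com',
                 'customer3@yahoo.com', 'customer4@msn.com'],
    'Baseball Bat': ['customer1@gmail.com', 'customer3@yahoo.com',
                     'customer5@gmail.com'],
    'Gloves': ['customer2@gmail.com', 'customer3@yahoo.com',
               'customer4@msn.com'],
}

# Every distinct email across all products (illustrative helper name).
all_emails = sorted(set().union(*customer_emails.values()))

# One row per product, one column per email: 1 if that customer bought it.
indicator = pd.DataFrame(
    {product: [int(email in emails) for email in all_emails]
     for product, emails in customer_emails.items()},
    index=all_emails,
).T

# Row i dotted with row j counts the emails that products i and j share.
overlap = indicator.dot(indicator.T)
print(overlap)
```

Because the matrix only holds 0s and 1s, the product of two entries is 1 exactly when both products were bought by the same customer, which is why summing those products gives the overlap count.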
Alternatively, replace stack with melt for some added performance.
v = (df.melt()
.set_index('variable')['value']
.str.get_dummies()
.groupby(level=0).sum()  # sum(level=0) was removed in pandas 2.0
)
v.dot(v.T)
variable      Backpack  Baseball Bat  Gloves
variable
Backpack             4             2       3
Baseball Bat         2             3       1
Gloves               3             1       3
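The extra "variable" labels in that output are just the axis names that melt introduces. If they get in the way of plotting, one way to drop them is sketched below, assuming a recent pandas where the level-wise sum is written with groupby. The name out is illustrative only.

```python
import pandas as pd

customer_emails = {
    'Backpack': ['customer1@gmail.com', 'customer2@gmail.com',
                 'customer3@yahoo.com', 'customer4@msn.com'],
    'Baseball Bat': ['customer1@gmail.com', 'customer3@yahoo.com',
                     'customer5@gmail.com'],
    'Gloves': ['customer2@gmail.com', 'customer3@yahoo.com',
               'customer4@msn.com'],
}

# Ragged lists become columns padded with missing values.
df = pd.DataFrame.from_dict(customer_emails, orient='index').T

# Product-by-email indicator matrix, as in the melt variant.
v = (df.melt()
       .set_index('variable')['value']
       .str.get_dummies()
       .groupby(level=0).sum())

out = v.dot(v.T)

# Clear the axis names that melt's default 'variable' column left behind.
out.index.name = None
out.columns.name = None
print(out)
```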
Answer 1 (score: 2)
You can first compute, for each pair of products, how many emails they share, then pass the resulting nested dictionary to pd.DataFrame:
import pandas as pd
emails = {'Baseball Bat': ['customer1@gmail.com', 'customer3@yahoo.com', 'customer5@gmail.com'], 'Backpack': ['customer1@gmail.com', 'customer2@gmail.com', 'customer3@yahoo.com', 'customer4@msn.com'], 'Gloves': ['customer2@gmail.com', 'customer3@yahoo.com', 'customer4@msn.com']}
results = {a:{c:sum(h in j for h in b) for c, j in emails.items()} for a, b in emails.items()}
df = pd.DataFrame(results)
Output:
              Backpack  Baseball Bat  Gloves
Backpack             4             2       3
Baseball Bat         2             3       1
Gloves               3             1       3
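A variant of the same idea, as a sketch: converting each list to a set once makes the membership checks cheaper on longer lists, and reindexing pins down the row and column order, which otherwise depends on the dictionary's key order. The names email_sets and products are hypothetical, chosen only for this example.

```python
import pandas as pd

emails = {
    'Baseball Bat': ['customer1@gmail.com', 'customer3@yahoo.com',
                     'customer5@gmail.com'],
    'Backpack': ['customer1@gmail.com', 'customer2@gmail.com',
                 'customer3@yahoo.com', 'customer4@msn.com'],
    'Gloves': ['customer2@gmail.com', 'customer3@yahoo.com',
               'customer4@msn.com'],
}

# Build each set once so every pairwise comparison is a set intersection.
email_sets = {product: set(addresses) for product, addresses in emails.items()}

# Nested dict: outer keys become columns, inner keys become rows.
results = {
    a: {b: len(email_sets[a] & email_sets[b]) for b in email_sets}
    for a in email_sets
}

# Sort the labels so rows and columns appear in a predictable order.
products = sorted(email_sets)
df = pd.DataFrame(results).reindex(index=products, columns=products)
print(df)
```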
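Answer 2 (score: 1)
Using the same logic, create a Series, then build a list of the pairwise set intersection sizes and reshape it:
import numpy as np
import pandas as pd
s=pd.Series(customer_emails)
pd.DataFrame(np.reshape([len(set(x).intersection(set(y))) for x in s for y in s], (len(s), len(s))), index=s.index, columns=s.index)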
Out[299]:
              Backpack  Baseball Bat  Gloves
Backpack             4             2       3
Baseball Bat         2             3       1
Gloves               3             1       3