Question

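There are two tables, Table A and Table B:

Table A: Product Attributes. This table contains two columns. The first is a unique product ID represented by an integer; the second is a string containing the set of attributes assigned to that product.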

|---------------------|-----------------------|
|      product        |       tags            |
|---------------------|-----------------------|
|          100        | chocolate, sprinkles  |
|---------------------|-----------------------|
|          101        | chocolate, filled     |
|---------------------|-----------------------|
|          102        | glazed                |
|---------------------|-----------------------|

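Table B: Customer Attributes. The second table also contains two columns. The first is a string containing the customer name; the second is an integer containing the product number. The product IDs in the second column are the same as the product IDs in the first column of Table A.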

customer    product
A           100
A           101
B           101
C           100
C           102
B           101
A           100
C           102

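You are asked to create a table matching this format, where each cell holds the number of times a product attribute occurs among that customer's purchases.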

customer    chocolate   sprinkles   filled  glazed
A               ?           ?         ?        ?
B               ?           ?         ?        ?
C               ?           ?         ?        ?

Can someone help me solve this in R or Python?

Answer 1

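We join on the 'product' column, split 'tags' at the delimiter to expand the rows, and then use count and spread to get the frequency of each 'tags'/'customer' pair in 'wide' format.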

library(tidyverse)
df1 %>% 
   right_join(df2) %>% 
   separate_rows(tags) %>%
   count(tags, customer) %>% 
   spread(tags, n, fill = 0)
# A tibble: 3 x 5
#  customer chocolate filled glazed sprinkles
#  <chr>        <dbl>  <dbl>  <dbl>     <dbl>
#1 A                3      1      0         2
#2 B                2      2      0         0
#3 C                1      0      2         1

Data

df1 <- structure(list(product = 100:102, tags = c("chocolate, sprinkles", 
"chocolate, filled", "glazed")), class = "data.frame", row.names = c(NA, 
 -3L))

df2 <- structure(list(customer = c("A", "A", "B", "C", "C", "B", "A", 
 "C"), product = c(100L, 101L, 101L, 100L, 102L, 101L, 100L, 102L
 )), class = "data.frame", row.names = c(NA, -8L))
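For readers who prefer pandas, the same join, split, count, and widen steps can be sketched as follows. This is only an illustration, not part of the original answer: it assumes pandas 0.25 or later (for `DataFrame.explode`), rebuilds `df1` and `df2` from the data above, and the names `joined`, `long`, and `wide` are made up for this example.

```python
import pandas as pd

# Table A and Table B, rebuilt from the question's data
df1 = pd.DataFrame({
    "product": [100, 101, 102],
    "tags": ["chocolate, sprinkles", "chocolate, filled", "glazed"],
})
df2 = pd.DataFrame({
    "customer": ["A", "A", "B", "C", "C", "B", "A", "C"],
    "product": [100, 101, 101, 100, 102, 101, 100, 102],
})

# Join on 'product', keeping every purchase row from df2
joined = df2.merge(df1, on="product", how="left")

# Split 'tags' at the delimiter and expand to one row per (purchase, tag)
long = joined.assign(tags=joined["tags"].str.split(", ")).explode("tags")

# Count each (customer, tag) pair and spread the tags into columns
wide = pd.crosstab(long["customer"], long["tags"])
print(wide)
```

`pd.crosstab` fills missing customer/tag combinations with 0, which plays the role of `fill = 0` in the `spread` call.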

Answer 2

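The Python approach can be simplified considerably by using the built-in method for getting dummy variables. Then merge, followed by groupby + sum. Starting with the data provided by @SuryaMurali: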

import pandas as pd

df_A = pd.concat([df_A, df_A.tags.str.get_dummies(sep=', ')], axis=1).drop(columns='tags')
df_B.merge(df_A).drop(columns='product').groupby('customer').sum()

Output:

           filled   sprinkles  chocolate  glazed
customer                                        
A               1           2          3       0
B               2           0          2       0
C               0           1          1       2
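To make the intermediate step visible, here is a self-contained sketch of what the dummy-variable encoding produces before the merge. It rebuilds `df_A` and `df_B` from the question's data; the names `dummies`, `indicators`, and `counts` are illustrative and do not appear in the answer.

```python
import pandas as pd

df_A = pd.DataFrame({
    "product": [100, 101, 102],
    "tags": ["chocolate, sprinkles", "chocolate, filled", "glazed"],
})
df_B = pd.DataFrame({
    "customer": ["A", "A", "B", "C", "C", "B", "A", "C"],
    "product": [100, 101, 101, 100, 102, 101, 100, 102],
})

# One 0/1 indicator column per distinct tag, one row per product
dummies = df_A["tags"].str.get_dummies(sep=", ")
print(dummies)

# Attach the indicators to the product IDs, merge them onto the purchases,
# then sum the indicators per customer to get the counts
indicators = pd.concat([df_A[["product"]], dummies], axis=1)
counts = (
    df_B.merge(indicators, on="product")
        .drop(columns="product")
        .groupby("customer")
        .sum()
)
print(counts)
```

Because each purchase row carries a 1 for every tag its product has, summing per customer gives the number of times each tag occurs among that customer's purchases.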

Answer 3

In Python:

import pandas as pd

# Creating dataframe for Table A
tableA = [(100, 'chocolate, sprinkles'), (101, 'chocolate, filled'), (102, 'glazed')]
labels = ['product', 'tags']
df_A = pd.DataFrame.from_records(tableA, columns=labels)

# Creating dataframe for Table B
tableB = [('A', 100), ('A', 101), ('B', 101),  ('C', 100), ('C', 102), ('B', 101), ('A', 100), ('C', 102)]
labels = ['customer', 'product']
df_B = pd.DataFrame.from_records(tableB, columns=labels)

new_df = pd.merge(df_A, df_B, how='inner', on='product')
new_df = (new_df.set_index(new_df.columns.drop('tags')
                        .tolist()).tags.str.split(', ', expand=True).stack().reset_index()
           .rename(columns={0: 'tags'}).loc[:, new_df.columns])

final_df = new_df.pivot_table(values='tags', index=['customer'], columns=['tags'],
                      aggfunc='size')
final_df.fillna(0, inplace=True)
final_df = final_df.astype(int)

print(final_df)

Output:

tags      chocolate  filled  glazed  sprinkles
customer                                      
   A          3       1       0          2
   B          2       2       0          0
   C          1       0       2          1

Using R:

library(tidyr)
library(dplyr)
library(reshape2)
library(data.table) ## or library(reshape2)

#Creating the tables
tableA <- data.frame("product" = c(100, 101, 102),
                 "tags" = c("chocolate, sprinkles", "chocolate, filled", "glazed"))
newA = separate_rows(tableA, "tags")

tableB <- data.frame("customer" = c('A', 'A', 'B', 'C', 'C', 'B', 'A', 'C'),
                 "product" = c(100, 101, 101, 100, 102, 101, 100, 102))

joinData = merge(newA, tableB, by=c('product'))

final_df = dcast(melt(as.data.table(joinData), id.vars = c("tags", "customer")), 
             customer ~ tags, value.var = "value")
final_df

Output:

> final_df
   customer chocolate filled glazed sprinkles
1:        A         3      1      0         2
2:        B         2      2      0         0
3:        C         1      0      2         1
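As a sanity check on the tables above, the counts can also be recomputed without any dataframe library. The sketch below uses only the Python standard library; `table_a`, `table_b`, and `counts` are illustrative names, and the data is copied from the question.

```python
from collections import Counter

# Table A as a product -> tags mapping, Table B as (customer, product) pairs
table_a = {100: "chocolate, sprinkles", 101: "chocolate, filled", 102: "glazed"}
table_b = [("A", 100), ("A", 101), ("B", 101), ("C", 100),
           ("C", 102), ("B", 101), ("A", 100), ("C", 102)]

# For every purchase, add one to each (customer, tag) pair of that product
counts = Counter()
for customer, product in table_b:
    for tag in table_a[product].split(", "):
        counts[(customer, tag)] += 1

# Print one row per customer, with tags in alphabetical order
tags = sorted({tag for _, tag in counts})
customers = sorted({customer for customer, _ in counts})
print("customer", tags)
for customer in customers:
    print(customer, [counts[(customer, tag)] for tag in tags])
```

A `Counter` returns 0 for pairs it has never seen, so customers who never bought a product with a given tag show 0 without any explicit fill step.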

Merging and reshaping rows and columns of 2 data frames in R or Python
