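There are two tables, Table A and Table B:
Table A: Product attributes. This table contains two columns. The first is a unique product ID represented as an integer; the second is a string containing the set of attributes assigned to that product.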
|---------|----------------------|
| product | tags                 |
|---------|----------------------|
| 100     | chocolate, sprinkles |
| 101     | chocolate, filled    |
| 102     | glazed               |
|---------|----------------------|
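Table B: Customer attributes. The second table also contains two columns. The first is a string containing the customer name; the second is an integer containing the product number. The product IDs in the second column are the same as the product IDs in the first column of Table A.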
customer product
A 100
A 101
B 101
C 100
C 102
B 101
A 100
C 102
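You are asked to create a table matching the format below, where each cell holds the number of times that product attribute occurs for that customer.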
customer chocolate sprinkles filled glazed
A ? ? ? ?
B ? ? ? ?
C ? ? ? ?
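Can someone help me solve this in R or Python?
Answer 0 (score: 1)
We join on the "product" column, split "tags" at the delimiter to expand it into one row per tag, then use count and spread to get the frequency of each "tags"/"customer" pair in "wide" format.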
library(tidyverse)
df1 %>%
right_join(df2) %>%
separate_rows(tags) %>%
count(tags, customer) %>%
spread(tags, n, fill = 0)
# A tibble: 3 x 5
# customer chocolate filled glazed sprinkles
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 3 1 0 2
#2 B 2 2 0 0
#3 C 1 0 2 1
# Sample data used above
df1 <- structure(list(product = 100:102, tags = c("chocolate, sprinkles",
"chocolate, filled", "glazed")), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(customer = c("A", "A", "B", "C", "C", "B", "A",
"C"), product = c(100L, 101L, 101L, 100L, 102L, 101L, 100L, 102L
)), class = "data.frame", row.names = c(NA, -8L))
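Answer 1 (score: 1)
The Python approach can be simplified considerably by using the built-in method for getting dummy variables, followed by merge, then groupby + sum. Starting from the data provided by @SuryaMurali: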
import pandas as pd
# Pass axis by keyword: recent pandas versions no longer accept it positionally in pd.concat
df_A = pd.concat([df_A, df_A.tags.str.get_dummies(sep=', ')], axis=1).drop(columns='tags')
df_B.merge(df_A).drop(columns='product').groupby('customer').sum()
filled sprinkles chocolate glazed
customer
A 1 2 3 0
B 2 0 2 0
C 0 1 1 2
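For anyone who wants to run this without pulling the data from the other answer, here is a minimal self-contained sketch of the same get_dummies + merge + groupby idea. The sample frames are copied from the question; the names indicators, wide_A and counts are just illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
df_A = pd.DataFrame({
    "product": [100, 101, 102],
    "tags": ["chocolate, sprinkles", "chocolate, filled", "glazed"],
})
df_B = pd.DataFrame({
    "customer": ["A", "A", "B", "C", "C", "B", "A", "C"],
    "product": [100, 101, 101, 100, 102, 101, 100, 102],
})

# One 0/1 indicator column per tag, aligned row-for-row with df_A
indicators = df_A["tags"].str.get_dummies(sep=", ")
wide_A = pd.concat([df_A[["product"]], indicators], axis=1)

# Each purchase row picks up its product's indicators;
# summing them per customer yields the tag counts
counts = (
    df_B.merge(wide_A, on="product")
        .drop(columns="product")
        .groupby("customer")
        .sum()
)

print(wide_A)
print(counts)
```

Printing wide_A first makes it easier to see why the final sum works: every purchase contributes a 1 to each tag its product carries.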
Answer 2 (score: 0)
In Python:
import pandas as pd
# Creating dataframe for Table A
tableA = [(100, 'chocolate, sprinkles'), (101, 'chocolate, filled'), (102, 'glazed')]
labels = ['product', 'tags']
df_A = pd.DataFrame.from_records(tableA, columns=labels)
# Creating dataframe for Table B
tableB = [('A', 100), ('A', 101), ('B', 101), ('C', 100), ('C', 102), ('B', 101), ('A', 100), ('C', 102)]
labels = ['customer', 'product']
df_B = pd.DataFrame.from_records(tableB, columns=labels)
new_df = pd.merge(df_A, df_B, how='inner', on='product')
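# Move every column except 'tags' into the index, split the tag string,
# then stack so that each (product, customer, tag) gets its own row
new_df = (new_df.set_index(new_df.columns.drop('tags')
          .tolist()).tags.str.split(', ', expand=True).stack().reset_index()
          .rename(columns={0: 'tags'}).loc[:, new_df.columns])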
final_df = new_df.pivot_table(values='tags', index=['customer'], columns=['tags'],
aggfunc='size')
final_df.fillna(0, inplace=True)
final_df = final_df.astype(int)
print(final_df)
Output:
tags chocolate filled glazed sprinkles
customer
A 3 1 0 2
B 2 2 0 0
C 1 0 2 1
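As a side note, the split/stack/pivot steps above could probably be written more compactly with explode and pd.crosstab. This is an alternative sketch rather than the code that produced the output above; it assumes a pandas version that has DataFrame.explode (0.25 or later), and the names merged and long_df are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df_A = pd.DataFrame({
    "product": [100, 101, 102],
    "tags": ["chocolate, sprinkles", "chocolate, filled", "glazed"],
})
df_B = pd.DataFrame({
    "customer": ["A", "A", "B", "C", "C", "B", "A", "C"],
    "product": [100, 101, 101, 100, 102, 101, 100, 102],
})

# Attach each purchase to its product's tag string
merged = df_B.merge(df_A, on="product", how="inner")

# Turn "chocolate, filled" into ["chocolate", "filled"], then one row per tag
long_df = merged.assign(tags=merged["tags"].str.split(", ")).explode("tags")

# crosstab counts customer/tag co-occurrences and puts 0 where a pair never occurs
final_df = pd.crosstab(long_df["customer"], long_df["tags"])

print(final_df)
```

Because crosstab already returns integer counts with zeros filled in, the fillna and astype steps are not needed in this variant.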
Using R:
library(tidyr)
library(dplyr)
library(data.table) ## provides melt() and dcast()
#Creating the tables
tableA <- data.frame("product" = c(100, 101, 102),
"tags" = c("chocolate, sprinkles", "chocolate, filled", "glazed"))
newA = separate_rows(tableA, "tags")
tableB <- data.frame("customer" = c('A', 'A', 'B', 'C', 'C', 'B', 'A', 'C'),
"product" = c(100, 101, 101, 100, 102, 101, 100, 102))
joinData = merge(newA, tableB, by=c('product'))
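# Several rows fall into each (customer, tag) cell, so count them explicitly
final_df = dcast(melt(as.data.table(joinData), id.vars = c("tags", "customer")),
                 customer ~ tags, fun.aggregate = length, value.var = "value")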
final_df
Output:
> final_df
customer chocolate filled glazed sprinkles
1: A 3 1 0 2
2: B 2 2 0 0
3: C 1 0 2 1