I am trying to read two CSV files (customer purchase data and product data) as DataFrames, then join them and pivot the result.
Example:
Customer Purchase Data:
CustomerID ProductId
1 39
1 6
2 8
3 39
3 40
Product Data:
ProductId Name
6 Car
8 House
39 Plane
40 Boat
Desired Pivot Table:
ProductId Name Cust_1 Cust_2 Cust_3
6 Car 1 0 0
8 House 0 1 0
39 Plane 1 0 1
40 Boat 0 0 1
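My questions are:
Can this be done?
Should it be done this way? Alternatively, I could do the conversion in Excel and save the result as CSV.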
Answer 0 (score: 6)
This can be done in two steps.
Step 1: Join the two tables
using DataFrames
### Create the DataFrame
customer = DataFrame(customerid = [1, 1, 2, 3, 3],
                     productid = [39, 6, 8, 39, 40])
product = DataFrame(productid = [6, 8, 39, 40],
                    name = ["Car", "House", "Plane", "Boat"])
res = join(customer, product, on = :productid)
# 5x3 DataFrames.DataFrame
# | Row | customerid | productid | name |
# |-----|------------|-----------|---------|
# | 1 | 1 | 6 | "Car" |
# | 2 | 2 | 8 | "House" |
# | 3 | 1 | 39 | "Plane" |
# | 4 | 3 | 39 | "Plane" |
# | 5 | 3 | 40 | "Boat" |
Step 2: Add a dummy column filled with 1 and unstack the DataFrame
(moving from long format to wide format)
### Add dummy column
res[:tmp] = 1
res
# 5x4 DataFrames.DataFrame
# | Row | customerid | productid | name | tmp |
# |-----|------------|-----------|---------|-----|
# | 1 | 1 | 6 | "Car" | 1 |
# | 2 | 2 | 8 | "House" | 1 |
# | 3 | 1 | 39 | "Plane" | 1 |
# | 4 | 3 | 39 | "Plane" | 1 |
# | 5 | 3 | 40 | "Boat" | 1 |
### Pivot from long to Wide
res = unstack(res, :customerid, :tmp)
# 4x5 DataFrames.DataFrame
# | Row | productid | name | 1 | 2 | 3 |
# |-----|-----------|---------|----|----|----|
# | 1 | 6 | "Car" | 1 | NA | NA |
# | 2 | 8 | "House" | NA | 1 | NA |
# | 3 | 39 | "Plane" | 1 | NA | 1 |
# | 4 | 40 | "Boat" | NA | NA | 1 |
### Finally we can replace NA by 0
[res[isna(res[col]), col] = 0 for col in [symbol("1"),
symbol("2"),
symbol("3")]]
res
# 4x5 DataFrames.DataFrame
# | Row | productid | name | 1 | 2 | 3 |
# |-----|-----------|---------|---|---|---|
# | 1 | 6 | "Car" | 1 | 0 | 0 |
# | 2 | 8 | "House" | 0 | 1 | 0 |
# | 3 | 39 | "Plane" | 1 | 0 | 1 |
# | 4 | 40 | "Boat" | 0 | 0 | 1 |
If you want to change the column names, you can do it manually:
names!(res, [:productid, :name, :cust_1, :cust_2, :cust_3])
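The code above uses an older DataFrames.jl API (`join`, `isna`, `names!`, and `NA` values). As a rough sketch only, assuming a DataFrames.jl 1.x release where `innerjoin` and the `fill` and `renamecols` keywords of `unstack` are available, the same two steps might look like this. The variable name `wide` and the `Cust_` prefix are illustrative choices, and the explicit sorts are there only to make the row and column order predictable.

```julia
using DataFrames

# Same sample data as in the answer above.
customer = DataFrame(customerid = [1, 1, 2, 3, 3],
                     productid  = [39, 6, 8, 39, 40])
product  = DataFrame(productid = [6, 8, 39, 40],
                     name      = ["Car", "House", "Plane", "Boat"])

# Step 1: inner join on the shared key (assumed replacement for the old `join`).
res = innerjoin(customer, product, on = :productid)

# Sort so that customer ids are first seen in ascending order.
sort!(res, [:customerid, :productid])

# Step 2: add the dummy column, then pivot from long to wide.
res.tmp = ones(Int, nrow(res))
wide = unstack(res, [:productid, :name], :customerid, :tmp;
               renamecols = id -> "Cust_$(id)",  # 1 becomes "Cust_1", and so on
               fill = 0)                         # 0 where a customer has no purchase

sort!(wide, :productid)
```

With `fill = 0` and `renamecols`, the separate NA-replacement and renaming steps should not be needed; on versions without these keywords, the manual steps from the answer remain the fallback.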
Answer 1 (score: 3)
Yes, you can. Use join from DataFrames.jl:
using DataFrames
cp = readtable("data/Customer_Purchase_Data.csv", separator = ' ')
p = readtable("data/Product_Data.csv", separator = ' ')
f = join(cp, p, on = :ProductId)
5x3 DataFrames.DataFrame
| Row | CustomerID | ProductId | Name |
|-----|------------|-----------|---------|
| 1 | 1 | 6 | "Car" |
| 2 | 2 | 8 | "House" |
| 3 | 1 | 39 | "Plane" |
| 4 | 3 | 39 | "Plane" |
| 5 | 3 | 40 | "Boat" |
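`readtable` belongs to older DataFrames.jl releases. A minimal sketch of the same read-and-join step, assuming the CSV.jl package is installed alongside DataFrames.jl 1.x, is shown below. The temporary files, their names, and the variable names `purchases` and `products` are hypothetical and exist only to keep the example self-contained.

```julia
using CSV, DataFrames

# Write the sample data from the question to temporary files
# (hypothetical file names, space-delimited as in the answer above).
dir = mktempdir()
purchases_path = joinpath(dir, "Customer_Purchase_Data.csv")
products_path  = joinpath(dir, "Product_Data.csv")
write(purchases_path, "CustomerID ProductId\n1 39\n1 6\n2 8\n3 39\n3 40\n")
write(products_path,  "ProductId Name\n6 Car\n8 House\n39 Plane\n40 Boat\n")

# `delim = ' '` plays the role of `separator = ' '` in `readtable`.
purchases = CSV.read(purchases_path, DataFrame; delim = ' ')
products  = CSV.read(products_path,  DataFrame; delim = ' ')

# Inner join on the shared key, then sort for a predictable row order.
f = innerjoin(purchases, products, on = :ProductId)
sort!(f, [:ProductId, :CustomerID])
```

From here, `f` can be pivoted with `unstack` in the same way as in the first answer.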