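Big Cartel has an option to export orders to a CSV file. However, the structure is not well suited to the analysis I need to do.
Below is a subset of the columns and rows from a Big Cartel CSV order download (there are other columns, but they are not relevant to the problem at hand).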
Number,Buyer name,Items,Item count,Item total,Total price,Total shipping,Total tax,Total discount
1,jim,product_name:Plate|product_option_name:Red|quantity:1|price:9.99|total:9.99,1,9.99,11.98,1.99,0,0
2,bill,product_name:Plate|product_option_name:Green|quantity:1|price:9.99|total:9.99;product_name:Plate|product_option_name:Blue|quantity:1|price:9.99|total:9.99,2,19.98,22.98,3,0,0
3,jane,product_name:Plate|product_option_name:Red|quantity:1|price:6.99|total:6.99;product_name:Thingy|product_option_name:|quantity:1|price:9.99|total:9.99;product_name:Mug|product_option_name:Grey|quantity:1|price:10.99|total:10.99;product_name:Cup|product_option_name:Grey|quantity:1|price:9.99|total:9.99;product_name:Saucer|product_option_name:Grey|quantity:1|price:9.99|total:9.99;product_name:Stopper|product_option_name:|quantity:1|price:9.99|total:9.99,6,57.94,64.94,7,0,0
4,dale,product_name:Plate|product_option_name:Green|quantity:1|price:10.99|total:10.99,1,10.99,13.99,4.99,0,1.99
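The Items column can contain multiple "line-items", with a semicolon (;) as the separator. Each line-item has five attributes separated by pipes (|): product_name, product_option_name, quantity, price and total (i.e. the total for that line). There is an "Item count" column that gives the number of line-items, as well as columns for the (order) total price, shipping, tax and discount. For the analysis I would like the data in the following long format, where shipping, tax and discount are also treated as "product items".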
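To make the nesting concrete, here is a minimal base R sketch that parses a single Items cell into one row per line-item. It assumes every attribute has the form key:value; `parse_items` is a hypothetical helper name, not part of any package, and it only illustrates the structure rather than producing the full long format.

```r
# Hypothetical helper: parse one Items cell into a data frame of line-items.
# Assumes each attribute looks like "key:value" (the value may be empty).
parse_items <- function(items) {
  # Line-items are separated by ";"
  line_items <- strsplit(items, ";", fixed = TRUE)[[1]]
  rows <- lapply(line_items, function(li) {
    # Attributes within a line-item are separated by "|"
    fields <- strsplit(li, "|", fixed = TRUE)[[1]]
    keys   <- sub(":.*$", "", fields)     # text before the first colon
    values <- sub("^[^:]*:", "", fields)  # text after the first colon
    as.data.frame(as.list(setNames(values, keys)), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# The Items cell of order 2 from the sample above
items <- paste0(
  "product_name:Plate|product_option_name:Green|quantity:1|price:9.99|total:9.99;",
  "product_name:Plate|product_option_name:Blue|quantity:1|price:9.99|total:9.99"
)
parsed <- parse_items(items)
print(parsed)
```

All columns come back as character, so quantity, price and total would still need as.numeric() before any arithmetic.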
Number Buyer name line-item product_option_name quantity price total
1 jim Plate Red 1 9.99 9.99
1 jim shipping 1 1.99 1.99
1 jim tax 0 0 0
1 jim discount 0 0 0
2 bill Plate Green 1 9.99 9.99
2 bill Plate Blue 1 9.99 9.99
2 bill shipping 1 3 3
2 bill tax 0 0 0
2 bill discount 0 0 0
3 jane Plate Red 1 6.99 6.99
3 jane Thingy 1 9.99 9.99
3 jane Mug Grey 1 10.99 10.99
3 jane Cup Grey 1 9.99 9.99
3 jane Saucer Grey 1 9.99 9.99
3 jane Stopper 1 9.99 9.99
3 jane shipping 1 7 7
3 jane tax 0 0 0
3 jane discount 0 0 0
4 dale Plate Green 1 10.99 10.99
4 dale shipping 1 4.99 4.99
4 dale tax 0 0 0
4 dale discount 0 -1.99 -1.99
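Using tstrsplit() from r:data.table and cSplit() from r:splitstackshape seems to be the solution, but I can't get the syntax right. I have also tried the tidyverse / dplyr functions separate / spread etc., but I can't get the output I need.
I have been googling and searching through all the SO questions. There are some solutions that come close (this one: R: Split Variable Column into multiple (unbalanced) columns by comma), but none gets me there, because most produce a 'wide' format rather than a 'long' one.
Answer 0 (score: 0)
Something like this might get you what you are looking for.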
library(dplyr)
library(tidyr)
library(stringr)

# Placeholder: replace with the path to your data file
filepath <- "orders.csv"
df <- read.csv(filepath, stringsAsFactors = FALSE)

# One temporary column per line-item in the longest order
cols <- paste0("col", 1:(max(str_count(df$Items, ";")) + 1))

df <- df %>%
  # Split Items into one column per line-item, then stack them into rows
  separate(col = Items, into = cols, sep = ";", fill = "right") %>%
  gather("column", "details", all_of(cols), na.rm = TRUE) %>%  # replaces the deprecated gather_()
  select(-column) %>%
  # Split each line-item into its five attributes and strip the "key:" prefixes
  separate(col = details, into = c("product_name", "product_option_name", "quantity", "price", "total"), sep = "\\|", fill = "right") %>%
  mutate(product_name = sub("^.*\\:", "", product_name),
         product_option_name = sub("^.*\\:", "", product_option_name),
         quantity = sub("^.*\\:", "", quantity),
         price = sub("^.*\\:", "", price),
         total = sub("^.*\\:", "", total)) %>%
  # Stack shipping, discount and tax alongside the products as extra line-items
  gather("line", "item", c(Total.shipping, Total.discount, Total.tax, product_name)) %>%
  mutate(product_option_name = ifelse(line == "product_name" & product_option_name != "", product_option_name, NA),
         line_item = ifelse(line == "product_name", item, sub("^.*\\.", "", line)),
         price = ifelse(line == "product_name", price, item),
         # Discounts are reported as negative amounts
         price = ifelse(line_item == "discount", as.numeric(price) * (-1), price),
         quantity = ifelse(line_item %in% c("shipping", "discount", "tax") & price == "0", 0, quantity),
         total = as.numeric(price) * as.numeric(quantity)) %>%
  # The charge rows are repeated once per product in an order; drop the duplicates
  distinct() %>%
  select(Number, Buyer.name, line_item, product_option_name, quantity, price, total) %>%
  arrange(Number)
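As a possible alternative, here is a sketch that builds the product rows and the order-level charge rows separately and then stacks them, so it does not rely on distinct() to remove repeated charge rows. It is not the approach above, only a variant under stated assumptions: three of the sample orders are inlined so it runs on its own, the column names are those read.csv would produce, the object names (`orders`, `items_long`, `charges_long`, `result`) are made up for illustration, a charge gets quantity 1 when it is non-zero and 0 otherwise, and its total equals its price. It needs dplyr 1.0 or later for across().

```r
library(dplyr)
library(tidyr)

# Three of the sample orders, inlined so the sketch is self-contained
orders <- data.frame(
  Number = c(1, 2, 4),
  Buyer.name = c("jim", "bill", "dale"),
  Items = c(
    "product_name:Plate|product_option_name:Red|quantity:1|price:9.99|total:9.99",
    paste0("product_name:Plate|product_option_name:Green|quantity:1|price:9.99|total:9.99;",
           "product_name:Plate|product_option_name:Blue|quantity:1|price:9.99|total:9.99"),
    "product_name:Plate|product_option_name:Green|quantity:1|price:10.99|total:10.99"
  ),
  Total.shipping = c(1.99, 3, 4.99),
  Total.tax = c(0, 0, 0),
  Total.discount = c(0, 0, 1.99),
  stringsAsFactors = FALSE
)

fields <- c("line_item", "product_option_name", "quantity", "price", "total")

# One row per purchased product
items_long <- orders %>%
  select(Number, Buyer.name, Items) %>%
  separate_rows(Items, sep = ";") %>%
  separate(Items, into = fields, sep = "\\|") %>%
  mutate(across(all_of(fields), ~ sub("^[^:]*:", "", .x)),  # strip "key:" prefixes
         across(c(quantity, price, total), as.numeric))

# One row per order-level charge (shipping, tax, discount)
charges_long <- orders %>%
  select(Number, Buyer.name, shipping = Total.shipping,
         tax = Total.tax, discount = Total.discount) %>%
  pivot_longer(c(shipping, tax, discount),
               names_to = "line_item", values_to = "price") %>%
  mutate(price = ifelse(line_item == "discount", -price, price),  # discounts are negative
         quantity = as.numeric(price != 0),  # assumption: 1 if charged, else 0
         total = price,                      # assumption: a charge's total is its price
         product_option_name = "")

result <- bind_rows(items_long, charges_long) %>%
  arrange(Number) %>%
  select(Number, Buyer.name, line_item, product_option_name, quantity, price, total)

print(result)
```

A quick sanity check on the result is that the total column summed per order should match the Total price column of the export; for dale, 10.99 + 4.99 - 1.99 = 13.99.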