A snapshot of my data:
df <-data.frame(product_path = c("https://mycommerece.com/product/book/miracle", "https://mycommerece.com/product/book/miracle2", "https://mycommerece.com/product/gadget/airplane"), var1 = c(1,1,1), commereceurl = c("https://mycommerece.com/product/","https://mycommerece.com/product/","https://mycommerece.com/product2/"), var2 = c(1,0,0))
> df
                                     product_path var1                      commereceurl var2
1    https://mycommerece.com/product/book/miracle    1  https://mycommerece.com/product/    1
2   https://mycommerece.com/product/book/miracle2    1  https://mycommerece.com/product/    0
3 https://mycommerece.com/product/gadget/airplane    1 https://mycommerece.com/product2/    0
The data frame I am trying to produce looks like this:
df <-data.frame(product_path = c("https://mycommerece.com/product/book/miracle;https://mycommerece.com/product/book/miracle2", "https://mycommerece.com/product/gadget/airplane"), var1 = c(1,1), commereceurl = c("https://mycommerece.com/product/","https://mycommerece.com/product2/"), var2 = c(1,0), count_product_path = c(2,1))
> df
                                                                                product_path var1
1 https://mycommerece.com/product/book/miracle;https://mycommerece.com/product/book/miracle2    1
2                                            https://mycommerece.com/product/gadget/airplane    1
                       commereceurl var2 count_product_path
1  https://mycommerece.com/product/    1                  2
2 https://mycommerece.com/product2/    0                  1
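Some explanation of what I am trying to build. The product_path column contains unique URLs, but several product_path values can belong to the same group, as determined by the value in the commereceurl column. I want to merge each such group into a single row, and in the 0/1 columns keep a 1 if any row in the group has one. The count_product_path column is the number of product_path values that were merged.
Is there a way to do this?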
Answer 0 (score: 1)
You can use dplyr, stringr and data.table:
# data
df <- data.frame(product_path = c("https://mycommerece.com/product/book/miracle",
                                  "https://mycommerece.com/product/book/miracle2",
                                  "https://mycommerece.com/product/gadget/airplane"),
                 var1 = c(1,1,1),
                 commereceurl = c("https://mycommerece.com/product/",
                                  "https://mycommerece.com/product/",
                                  "https://mycommerece.com/product2/"),
                 var2 = c(1,0,0))
library(dplyr); library(stringr)
# step 1: group df by commereceurl, summarise product_path and create count_product_path
df2 <- df %>%
  group_by(commereceurl) %>%
  summarise(product_path = paste(product_path, collapse = ";")) %>%
  # count the pattern "https:"; it should appear once for each url
  mutate(count_product_path = str_count(product_path, pattern = "https:"))
# step 2: merge df and df2 based on commereceurl
df3 <- left_join(df2, df[, -1], by = "commereceurl")
# step3: delete some rows with duplicated values on commereceurl and
# keep rows with the higher var2
library(data.table)
df.final <- setDT(df3)[df3[, .I[which.max(var2)], by = commereceurl]$V1] # final output
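
As an alternative, the three steps above can probably be collapsed into a single dplyr pipeline. The following is only a sketch, under a few assumptions that are not part of the answer above: it takes the maximum of each 0/1 column independently (rather than keeping the whole row with the highest var2), it counts merged paths with n() instead of matching "https:", it uses the name `result` for the output, and it relies on the `.groups` argument of summarise(), which needs dplyr 1.0.0 or later.

```r
library(dplyr)

# same example data as above
df <- data.frame(product_path = c("https://mycommerece.com/product/book/miracle",
                                  "https://mycommerece.com/product/book/miracle2",
                                  "https://mycommerece.com/product/gadget/airplane"),
                 var1 = c(1, 1, 1),
                 commereceurl = c("https://mycommerece.com/product/",
                                  "https://mycommerece.com/product/",
                                  "https://mycommerece.com/product2/"),
                 var2 = c(1, 0, 0),
                 stringsAsFactors = FALSE)

result <- df %>%
  group_by(commereceurl) %>%
  summarise(
    count_product_path = n(),                            # rows merged per group
    product_path = paste(product_path, collapse = ";"),  # join the URLs
    var1 = max(var1),                                    # 1 if any row has 1
    var2 = max(var2),
    .groups = "drop"
  ) %>%
  # restore the column order used in the question
  select(product_path, var1, commereceurl, var2, count_product_path)

print(result)
```

With more 0/1 columns, listing each max() by hand gets tedious; across() inside summarise() may be a tidier way to apply max to all of them at once.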