Question

A snapshot of my data:

df  <-data.frame(product_path = c("https://mycommerece.com/product/book/miracle", "https://mycommerece.com/product/book/miracle2", "https://mycommerece.com/product/gadget/airplane"), var1 = c(1,1,1), commereceurl = c("https://mycommerece.com/product/","https://mycommerece.com/product/","https://mycommerece.com/product2/"), var2 = c(1,0,0))
> df
                                     product_path var1                      commereceurl var2
1    https://mycommerece.com/product/book/miracle    1  https://mycommerece.com/product/    1
2   https://mycommerece.com/product/book/miracle2    1  https://mycommerece.com/product/    0
3 https://mycommerece.com/product/gadget/airplane    1 https://mycommerece.com/product2/    0

The data frame I am trying to produce looks like this:

df  <-data.frame(product_path = c("https://mycommerece.com/product/book/miracle;https://mycommerece.com/product/book/miracle2", "https://mycommerece.com/product/gadget/airplane"), var1 = c(1,1), commereceurl = c("https://mycommerece.com/product/","https://mycommerece.com/product2/"), var2 = c(1,0), count_product_path = c(2,1))
> df
                                                                                product_path var1
1 https://mycommerece.com/product/book/miracle;https://mycommerece.com/product/book/miracle2    1
2                                            https://mycommerece.com/product/gadget/airplane    1
                       commereceurl var2 count_product_path
1  https://mycommerece.com/product/    1                  2
2 https://mycommerece.com/product2/    0                  1

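Some explanation of what I am trying to produce. The product_path column contains unique URLs, but several product_path values can belong to the same group, determined by the value of the commereceurl column. I want to merge each group into one row, and in the 0/1 columns keep 1 if any row in the group has a 1. The count_product_path column is the number of product_path values that were merged.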

Is there a way to do this?

Answer 1

You can use dplyr, stringr, and data.table.

Try this three-step solution:

# data
df  <-data.frame(product_path = c("https://mycommerece.com/product/book/miracle",
                                  "https://mycommerece.com/product/book/miracle2",
                                  "https://mycommerece.com/product/gadget/airplane"),
                 var1 = c(1,1,1),
                 commereceurl = c("https://mycommerece.com/product/",
                                  "https://mycommerece.com/product/",
                                  "https://mycommerece.com/product2/"),
                 var2 = c(1,0,0))

library(dplyr); library(stringr)
# step 1: group df by commereceurl, summarise product_path and create count_product_path 
df2 <- df %>%
  group_by(commereceurl) %>%
  summarise(product_path = paste(product_path, collapse = ";")) %>%
  mutate(count_product_path = str_count(product_path, pattern = "https:")) # count the pattern "https:"
# this pattern should appear once for each url

# step 2: merge df and df2 based on commereceurl
df3 <- left_join(df2, df[, -1], by = "commereceurl")

# step 3: for each commereceurl, drop the duplicated rows and
# keep only the row with the highest var2
library(data.table)
df.final <- setDT(df3)[df3[, .I[which.max(var2)], by = commereceurl]$V1] # final output
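
As a possible alternative, the same result can probably be reached in a single dplyr step. Note that step 3 above keeps the row with the highest var2, so var1 is taken from that same row rather than being the group maximum. The sketch below is not part of the original answer; it assumes that every 0/1 column should become 1 when any row in the group has a 1, and it uses n() for the count instead of matching the string "https:". The name result is only illustrative.

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(
  product_path = c("https://mycommerece.com/product/book/miracle",
                   "https://mycommerece.com/product/book/miracle2",
                   "https://mycommerece.com/product/gadget/airplane"),
  var1 = c(1, 1, 1),
  commereceurl = c("https://mycommerece.com/product/",
                   "https://mycommerece.com/product/",
                   "https://mycommerece.com/product2/"),
  var2 = c(1, 0, 0)
)

# One row per commereceurl (assumption: max() is the intended rule for 0/1 columns)
result <- df %>%
  group_by(commereceurl) %>%
  summarise(
    count_product_path = n(),                           # number of rows merged
    product_path = paste(product_path, collapse = ";"), # join the URLs with ";"
    var1 = max(var1),                                   # 1 if any row in the group has 1
    var2 = max(var2)
  )

print(result)
```

For the example data this yields two rows, with count_product_path equal to 2 for https://mycommerece.com/product/ and 1 for https://mycommerece.com/product2/. Counting with n() does not depend on the URL scheme, so it would still work if some paths started with http: instead of https:.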
