Question

I am trying to write a script that groups similar entries into a common category.

I have the dataset:

product <- c('Laptops','13" Laptops','Apple Laptops', '10 inch laptop','Laptop 13','TV','Big TV')
volume <- c(100,10,20,2,1,200,10)
dataset <- data.frame(product,volume)

which looks like:

         product volume
1        Laptops    100
2    13" Laptops     10
3  Apple Laptops     20
4 10 inch laptop      2
5      Laptop 13      1
6             TV    200
7         Big TV     10

What I want to do is group all the entries by category, so that after running the script I get the dataset:

        product volume
1       Laptops    113
2  Apple Laptop     20
3            TV    210

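Since Apple is a brand, I want it kept separate from the category. I don't know how to start, but I think I need a for loop that goes through every row and checks whether a brand name appears in the product name. For example: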

brandlist <- 'Apple|Samsung'
for (i in seq_len(nrow(dataset))) {
  if (grepl(brandlist, dataset$product[i])) next  # skip rows that contain a brand name
}

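Now I need to define the category names. I do this by looking at the most-searched products, since people tend to search for categories. We can say a row is a category if its volume is > 100.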

categories <- c()
for (i in seq_len(nrow(dataset))) {
  if (dataset$volume[i] > 100) categories <- c(categories, as.character(dataset$product[i]))
}

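Now I need to check whether each row name matches a category to some degree. I am thinking of some kind of regular expression along the lines of number + " + category, or something similar. I am also considering an algorithm that checks how many letters differ, for example allowing 4 characters to differ while at least 5 characters must match the category exactly, so that Laptops and 13" Laptops would be grouped together because they share 7 characters and differ by 4.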
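The "how many characters differ" idea can be sketched with edit distance. The following is only an illustration, assuming base R's `adist()` (generalized Levenshtein distance, from the utils package) is an acceptable stand-in for the rule described above; the threshold of 4 is taken from the example in the text, and the variable names `category`, `d` and `close_enough` are hypothetical.

```r
product <- c('Laptops', '13" Laptops', 'Apple Laptops', '10 inch laptop',
             'Laptop 13', 'TV', 'Big TV')
category <- "Laptops"

# Edit distance between each product name and the category, ignoring case.
# adist() returns a matrix; drop() turns the single column into a vector.
d <- drop(adist(tolower(product), tolower(category)))
names(d) <- product

# 'Laptops' vs '13" Laptops' needs four insertions (1, 3, ", space),
# which matches the "4 characters differ" example in the question.
d[['13" Laptops']]

# Candidate members of the category: names within 4 edits (assumed threshold).
close_enough <- product[d <= 4]
```

A pure distance threshold knows nothing about brands, so a brand check such as `grepl(brandlist, ...)` would still have to run before or after this step.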

Edit

I am currently considering the following solution:

I made a list of categories and created a new data frame, for example:

category <- c ('other', 'category 1', 'category 2')
volume <- c(0,0,0)
df <- data.frame(category,volume)

    category volume
1      other      0
2 category 1      0
3 category 2      0

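Now I want to loop over the rows of the table above and match every product to a category (subject to the brand restriction and the matching restriction: a product must have one thing in common with the category and may differ from it in some respects), then put the results together in a new data frame.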
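One way this plan could look in base R is sketched below. It rests on several assumptions that are not in the question: the category list `c("laptop", "tv")` is written by hand, "having one thing in common" is simplified to "contains the category name", and branded rows are sent to `other`. The helper `assign_category` is a hypothetical name.

```r
dataset <- data.frame(
  product = c('Laptops', '13" Laptops', 'Apple Laptops', '10 inch laptop',
              'Laptop 13', 'TV', 'Big TV'),
  volume = c(100, 10, 20, 2, 1, 200, 10),
  stringsAsFactors = FALSE
)
categories <- c("laptop", "tv")   # assumed, hand-written category list
brandlist  <- "apple|samsung"     # lower case, because names are lower-cased below

# Return the first category whose name occurs in the product name,
# or "other" for branded rows and for rows that match no category.
assign_category <- function(name) {
  name <- tolower(name)
  if (grepl(brandlist, name)) return("other")
  hit <- categories[vapply(categories, grepl, logical(1), x = name, fixed = TRUE)]
  if (length(hit) == 0) "other" else hit[1]
}

dataset$category <- vapply(dataset$product, assign_category, character(1),
                           USE.NAMES = FALSE)

# Sum the volume per category into a new data frame.
df <- aggregate(volume ~ category, data = dataset, FUN = sum)
```

Here `vapply()` replaces the explicit for loop; the per-row logic is the same, and the substring test could be swapped for a fuzzier rule later.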

Answer 1

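For the first part, you can define a list of categories and then exclude the branded rows by subtracting their volume from the category total: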

Categories <- c("Laptop","TV")
Brands <- c("Apple")
Aggregated.df <- do.call(rbind,lapply(seq_along(Categories),function(x){
    # rows whose product name contains the category (case-insensitive)
    InCategory <- grepl(Categories[x],dataset$product,ignore.case=TRUE)
    SumRow <- sum(dataset[InCategory,"volume"])
    # volume of each brand inside this category
    Excluded <- sapply(seq_along(Brands),function(y){
        sum(dataset[InCategory & grepl(Brands[y],dataset$product,ignore.case=TRUE),"volume"])
    })
    # category total without the branded rows, floored at 0
    SumRow <- max(SumRow - sum(Excluded), 0)
    Excluded.df <- NULL
    if(any(Excluded>0)){
        Which <- which(Excluded>0)
        Excluded.df <- data.frame(Product=paste(Brands[Which],Categories[x],sep=" "), volume = Excluded[Which])
    }
    Row.df <- data.frame(Product=Categories[x], volume = SumRow)
    # one row for the category, followed by one row per brand found in it
    rbind(Row.df,Excluded.df)
}))

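"Now I need to define the category names. I do this by looking at the most-searched products, since people tend to search for categories. We can say a row is a category if its volume is greater than 100."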

Min.volume <- 100
Categories <- unique(Aggregated.df$Product[Aggregated.df$volume > Min.volume])

Answer 2

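You can try the following. First remove all digits and symbols such as ". Then search for the brand and extract the last word, replace it with the full name if a brand was found, and convert everything to lower case. Finally, strip the plural s. The last step does the grouping and summarising. Of course, this is a hard-coded solution for the data.frame provided, but I could not think of another way.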

library(stringi)
library(tidyverse)
brandlist <- 'Apple|Samsung'  # as defined in the question
dataset %>% 
  mutate(p2=gsub("[[:digit:]]|\"","",product),  # drop digits and inch marks
         p2=stri_trim(p2)) %>% 
  mutate(p3=grepl(brandlist, p2)) %>%           # TRUE where a brand is present
  mutate(p4=stri_extract_last_words(p2),
         p4=ifelse(p3, p2, p4),                 # keep the full name for branded rows
         p4=tolower(p4), 
         p4=sub("s$", "", p4)) %>%              # strip a trailing plural s only
  group_by(p4) %>% 
  summarise(volume=sum(volume)) %>% 
  select(product=p4, volume)
# A tibble: 3 x 2
       product volume
         <chr>  <dbl>
1       laptop    113
2           tv    210
3 apple laptop     20

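Edit: You can also wrap this in a function, but then you have to create the categories yourself. Take care to write them in the singular and in lower case.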

library(tidyverse)  # dplyr, tidyr (unite) and stringr
foo <- function(data, brandlist, categories){
  data %>% 
  mutate(p1=tolower(product)) %>%               # match case-insensitively
  mutate(p2=str_extract(p1, brandlist),         # brand, or "" if none
         p2=ifelse(is.na(p2),"",p2)) %>% 
  mutate(p3=str_extract(p1, categories)) %>%    # first matching category
  unite(Product, p2, p3, sep = " ") %>%  
  mutate(Product=str_trim(Product)) %>% 
  group_by(Product) %>% 
  summarise(volume=sum(volume))
}

foo(dataset, brandlist = 'apple|samsung',categories = "laptop|tv")
# A tibble: 3 x 2
       Product volume
         <chr>  <dbl>
1 apple laptop     20
2       laptop    113
3           tv    210  

foo(dataset, brandlist = 'apple|samsung',categories = "laptop|tv|big tv")
# A tibble: 4 x 2
       Product volume
         <chr>  <dbl>
1 apple laptop     20
2       big tv     10
3       laptop    113
4           tv    200
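The second call separates "big tv" even though "tv" is listed before it in the pattern. A regular expression match is reported at the earliest position where any alternative succeeds, and in "big tv" the alternative "big tv" starts at the first character while "tv" only starts later. The sketch below illustrates this with base R's `regexpr()` and `regmatches()` instead of `str_extract()`, purely to stay dependency-free; for these inputs both are expected to behave the same. The names `pattern`, `x` and `x_match` are hypothetical.

```r
pattern <- "laptop|tv|big tv"
x <- c("big tv", "tv", "apple laptops")

# regexpr() finds the first match in each string;
# regmatches() extracts the matched text.
x_match <- regmatches(x, regexpr(pattern, x))
x_match
# [1] "big tv" "tv"     "laptop"
```

The same reasoning suggests that any more specific category (such as "big tv") only needs to be present in the pattern when it begins earlier in the product name than the generic one.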

Combining rows of a dataset into categories in R
