Merging rows with similar, but not identical, column values

Asked: 2018-10-13 00:34:46

Tags: r regex dataframe

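I've been working with a financial data set covering the past 10 weeks. I'm trying to total the amount spent/deposited for each store description, and I was able to do that with the following: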

 totalofeachstore <- FullStatement %>% group_by( Description) %>% 
 summarise_at(vars(Amount), funs(sum(., na.rm = TRUE)))

 totalofeachstore <- totalofeachstore %>%
 group_by(Description) %>%
 summarize(Amount = sum(Amount))  

The problem I've run into is that many stores include their store number or some extra description on my statement. For example:

 Arco Gas #345   -$45.54
 Arco Gas #678   -$52.72

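Because of the store numbers, the totals don't collapse the way I expected. Is there a way to collapse/sum rows whose names are not identical? For example, with the store names below, could I collapse on the keyword AMAZON, or better yet collapse all Amazon stores, including the oddballs AMZ and AMZN (4th and 5th in the list)?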

 AMAZON.COM*MT2M03AW1 AM PURCHASE AMZN.COM/BILL WA -8.08
 AMAZON.COM*MT80Z2EC0 AM PURCHASE AMZN.COM/BILL WA -13.28
 AMAZON.COM*MT8G19G51 AM PURCHASE AMZN.COM/BILL WA -31.03
 AMZ*Stride Rite PURCHASE Customerservi NY         -35.20
 AMZN MKTP US AMZN.COM/B PURCHASE AMZN.COM/BILL WA -181.08
 ARBYS 0154 PURCHASE                              -13.90
 ARCO #42472 AM PURCHASE                          -30.73
 ARCO #42493 AM PURCHASE                          -29.35
 AUNT CHILADA'S PURCHASE                          -15.98

I found similar questions about collapsing similar rows, but they weren't trying to sum at the same time. Those questions are:

R combine rows with similar values
R: combine rows with common information

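EDIT1: After some more Googling, I found a few "regular expression" suggestions that might do what I need. However, I don't understand how they work, and running ?grep didn't help much. It looks far more complicated than what I currently understand. Could someone break it down for me?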

From ?grep in R:

 grep, grepl, regexpr, gregexpr and regexec search for matches to argument 
 pattern within each element of a character vector: they differ in the 
 format of and amount of detail in the results.

 sub and gsub perform replacement of the first and all matches respectively.

 grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
 fixed = FALSE, useBytes = FALSE, invert = FALSE)

 grep("[a-z]", letters)

 txt <- c("arm","foot","lefroo", "bafoobar")
 if(length(i <- grep("foo", txt)))
 cat("'foo' appears at least once in\n\t", txt, "\n")
 i # 2 and 4
 txt[i]
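To make the help text above a bit more concrete, here is a minimal sketch of how the main functions differ, using base R only. The `desc` vector is just three descriptions copied from the list in the question; nothing else is assumed.

```r
# Three descriptions copied from the statement lines in the question
desc <- c("AMAZON.COM*MT2M03AW1 AM PURCHASE AMZN.COM/BILL WA",
          "AMZ*Stride Rite PURCHASE Customerservi NY",
          "ARCO #42472 AM PURCHASE")

# grep() returns the positions of the elements that match
arco_pos <- grep("ARCO", desc)                  # 3

# grep(value = TRUE) returns the matching elements themselves
arco_val <- grep("ARCO", desc, value = TRUE)

# grepl() returns one TRUE/FALSE per element, which is handy inside ifelse()
is_amazon <- grepl("AMAZON|AMZ", desc)          # TRUE TRUE FALSE

# sub() replaces only the first match in each element ...
first_only <- sub("A", "_", desc[3])            # "_RCO #42472 AM PURCHASE"

# ... while gsub() replaces every match
every_one <- gsub("A", "_", desc[3])            # "_RCO #42472 _M PURCH_SE"
```

For cleaning store names, `grepl()` (is the keyword there?) and `gsub()` (strip the part I don't want) are probably the two you need most.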

EDIT2: Following the suggestion below, I tried this code:

  Totals2 <- totalofeachstore %>%
    #remove everything after a *
    mutate(store_name = gsub("\\*.*","",Description),
           #remove everything after a space and a #
           store_name = gsub("\\ #.*","",store_name),
           #remove everything after a space and a number sequence
           store_name = gsub("\\ [0-9].*","",store_name),
           #assign the other Amazon purchases to Amazon
           store_name = 
           ifelse(str_detect(store_name,'AMZ')==TRUE,'AMAZON.COM',store_name))

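However, the following error keeps popping up. I don't think gsub belongs to any package other than base, but this feels like I haven't loaded the package that contains "str_detect" or something.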

 Error in mutate_impl(.data, dots) : 
 Evaluation error: could not find function "str_detect".

EDIT3: Perfect!

Loading the "tidyverse" package fixed the error I was getting, and everything worked as described, which is exactly what I was looking for.
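For anyone hitting the same error: `str_detect()` comes from the stringr package, which `library(tidyverse)` attaches, so `library(stringr)` on its own should also be enough. If you would rather avoid the extra dependency, the same step can be sketched with base R's `grepl()`. The `store_name` vector below is made up for illustration and stands in for the names left after the `gsub()` steps.

```r
# Hypothetical store names, as they might look after the gsub() clean-up
store_name <- c("AMAZON.COM",
                "AMZ",
                "AMZN MKTP US AMZN.COM/B PURCHASE AMZN.COM/BILL WA",
                "ARCO")

# grepl() is base R; fixed = TRUE treats "AMZ" as a literal string, not a regex
store_name <- ifelse(grepl("AMZ", store_name, fixed = TRUE),
                     "AMAZON.COM",
                     store_name)
store_name
```

Note that "AMAZON.COM" itself does not contain "AMZ", so it is left alone by the `ifelse()` and only the AMZ/AMZN variants are rewritten.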

1 Answer:

Answer 0: (score: 0)

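Is there a reasonably consistent pattern you can rely on? From the examples you gave, it looks like # and * can be used to separate the business name from the subcategory.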

So you could do something like this with dplyr:

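library(dplyr)    # tibble(), mutate(), group_by(), summarize(), arrange()
library(stringr)  # str_detect()

df <- tibble(payment_amt = c(-8.08,-13.28,-31.03,-35.20,-181.08,-13.90,-30.73,-29.35,-15.98),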
               description = c('AMAZON.COM*MT2M03AW1 AM PURCHASE AMZN.COM/BILL WA',
                           'AMAZON.COM*MT80Z2EC0 AM PURCHASE AMZN.COM/BILL WA',
                           'AMAZON.COM*MT8G19G51 AM PURCHASE AMZN.COM/BILL WA',
                           'AMZ*Stride Rite PURCHASE Customerservi NY',
                           'AMZN MKTP US AMZN.COM/B PURCHASE AMZN.COM/BILL WA',
                           'ARBYS 0154 PURCHASE',
                           'ARCO #42472 AM PURCHASE', 
                           'ARCO #42493 AM PURCHASE',
                           'AUNT CHILADAS PURCHASE'))

df <- df %>%
  # remove everything after a *
  mutate(store_name = gsub("\\*.*", "", description),
         # remove everything after a space and a #
         store_name = gsub("\\ #.*", "", store_name),
         # remove everything after a space and a number
         store_name = gsub("\\ [0-9].*", "", store_name),
         # assign the other Amazon purchases to Amazon
         store_name = ifelse(str_detect(store_name, 'AMZ'), 'AMAZON.COM', store_name))

df_sums <- df %>%
  group_by(store_name) %>%
  summarize(payment_amt = sum(payment_amt)) %>%
  ungroup() %>%
  arrange(payment_amt)
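If the patterns look opaque, it may help to run the three `gsub()` steps one at a time on a few of the descriptions. This is only a sketch in base R; the names `x`, `step1`, `step2` and `step3` are made up for illustration, and the space is written without the backslash on the assumption that a plain space needs no escaping.

```r
x <- c("AMAZON.COM*MT2M03AW1 AM PURCHASE AMZN.COM/BILL WA",
       "ARBYS 0154 PURCHASE",
       "ARCO #42472 AM PURCHASE")

# "\\*.*" : a literal * (escaped, because * is a regex quantifier)
#           followed by .* = any characters up to the end of the string
step1 <- gsub("\\*.*", "", x)
# "AMAZON.COM"  "ARBYS 0154 PURCHASE"  "ARCO #42472 AM PURCHASE"

# " #.*" : a space, a #, and everything after them
step2 <- gsub(" #.*", "", step1)
# "AMAZON.COM"  "ARBYS 0154 PURCHASE"  "ARCO"

# " [0-9].*" : a space, one digit, and everything after them
step3 <- gsub(" [0-9].*", "", step2)
# "AMAZON.COM"  "ARBYS"  "ARCO"
```

Because each pattern ends in `.*`, it can match at most once per string, so `sub()` would give the same result as `gsub()` here.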

Here is the result:

# A tibble: 4 x 2
  store_name             payment_amt
  <chr>                        <dbl>
1 AMAZON.COM                  -269. 
2 ARCO                         -60.1
3 AUNT CHILADAS PURCHASE       -16.0
4 ARBYS                        -13.9
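If the separators turn out not to be consistent enough, another option is to collapse on keywords directly, which is closer to what the question asks about AMAZON. The following is a rough base R sketch, not a drop-in solution: the lookup vector `store_patterns` and its regexes are assumptions you would need to adapt to your own statement, and the data frame is only a subset of the rows from the question.

```r
# A subset of the rows from the question, as a plain data frame
df <- data.frame(
  payment_amt = c(-8.08, -35.20, -181.08, -30.73, -29.35),
  description = c("AMAZON.COM*MT2M03AW1 AM PURCHASE AMZN.COM/BILL WA",
                  "AMZ*Stride Rite PURCHASE Customerservi NY",
                  "AMZN MKTP US AMZN.COM/B PURCHASE AMZN.COM/BILL WA",
                  "ARCO #42472 AM PURCHASE",
                  "ARCO #42493 AM PURCHASE"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: regex keyword (name) -> canonical store name (value)
store_patterns <- c("^(AMAZON|AMZN?)" = "AMAZON.COM",
                    "^ARCO"           = "ARCO")

# Start from the raw description, then overwrite wherever a keyword matches
df$store_name <- df$description
for (p in names(store_patterns)) {
  hit <- grepl(p, df$description)
  df$store_name[hit] <- store_patterns[[p]]
}

# Sum the amounts per canonical store name
totals <- aggregate(payment_amt ~ store_name, data = df, FUN = sum)
totals
```

Descriptions that match no keyword keep their original text, so new stores still show up in the totals and can be added to the lookup later.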