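I'm messing around with a data set of my finances for the last 10 weeks. I was trying to get the total amount spent/deposited per store Description. I was able to do this with: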
totalofeachstore <- FullStatement %>% group_by( Description) %>%
summarise_at(vars(Amount), funs(sum(., na.rm = TRUE)))
or
totalofeachstore <- FullStatement %>%
  group_by(Description) %>%
  summarize(Amount = sum(Amount, na.rm = TRUE))
The problem I've run into is that many stores include their store number or an extra description on my statement. For example:
Arco Gas #345 -$45.54
Arco Gas #678 -$52.72
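Because of the store #, the totals don't collapse the way I expected. Is there any way to collapse/sum rows whose names are not identical? For example, in the store names below, could I collapse all of the Amazon stores based on the keyword AMAZON or, better yet, AMZ, since the oddballs AMZ and AMZN are the 4th and 5th entries on the list?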
AMAZON.COM*MT2M03AW1 AM PURCHASE AMZN.COM/BILL WA -8.08
AMAZON.COM*MT80Z2EC0 AM PURCHASE AMZN.COM/BILL WA -13.28
AMAZON.COM*MT8G19G51 AM PURCHASE AMZN.COM/BILL WA -31.03
AMZ*Stride Rite PURCHASE Customerservi NY -35.20
AMZN MKTP US AMZN.COM/B PURCHASE AMZN.COM/BILL WA -181.08
ARBYS 0154 PURCHASE -13.90
ARCO #42472 AM PURCHASE -30.73
ARCO #42493 AM PURCHASE -29.35
AUNT CHILADA'S PURCHASE -15.98
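I found similar questions about collapsing similar rows, but they weren't trying to sum at the same time. Those questions are:
R combine rows with similar values
R: combine rows with common information
EDIT1: After some additional Googling, I found a few "regular expression" suggestions that might do what I need. However, I don't understand how these work, and running ?grep didn't help much. It looks far more complicated than what I currently understand. Can anyone help break it down for me?
From ?grep in R: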
grep, grepl, regexpr, gregexpr and regexec search for matches to argument
pattern within each element of a character vector: they differ in the
format of and amount of detail in the results.
sub and gsub perform replacement of the first and all matches respectively.
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
fixed = FALSE, useBytes = FALSE, invert = FALSE)
grep("[a-z]", letters)
txt <- c("arm","foot","lefroo", "bafoobar")
if(length(i <- grep("foo", txt)))
cat("'foo' appears at least once in\n\t", txt, "\n")
i # 2 and 4
txt[i]
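As a rough breakdown of what these functions do on statement-like text, here is a minimal base R sketch. The vector `desc` is a made-up sample in the style of the descriptions above, not the real statement, and `fixed = TRUE` is used where the pattern is a plain keyword rather than a regular expression.

```r
# Hypothetical sample of descriptions (an assumption, not the real data)
desc <- c("ARCO #42472 AM PURCHASE",
          "AMZN MKTP US AMZN.COM/B PURCHASE",
          "ARBYS 0154 PURCHASE")

# grepl() returns TRUE/FALSE for each element, telling whether it contains the pattern
is_amazon <- grepl("AMZ", desc, fixed = TRUE)

# grep() returns the positions of the matching elements instead
amazon_idx <- grep("AMZ", desc, fixed = TRUE)

# sub() replaces the first match; the pattern " #.*" reads as
# "a space, a #, then any characters up to the end of the string"
no_store_number <- sub(" #.*", "", desc)

print(is_amazon)
print(amazon_idx)
print(no_store_number)
```

Only the first element contains " #", so only that one is shortened by `sub()`; the other two come back unchanged.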
EDIT2: Following the suggestion below, I tried this code:
Totals2 <- totalofeachstore %>%
  #remove everything after a *
  mutate(store_name = gsub("\\*.*","",Description),
         #remove everything after a space and a #
         store_name = gsub("\\ #.*","",store_name),
         #remove everything after a space and a number sequence
         store_name = gsub("\\ [0-9].*","",store_name),
         #assign the other Amazon purchases to Amazon
         store_name =
           ifelse(str_detect(store_name,'AMZ')==TRUE,'AMAZON.COM',store_name))
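However, the following error keeps popping up. I don't think gsub belongs to any package other than base, but it feels like I haven't loaded the package that contains "str_detect" or something.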
Error in mutate_impl(.data, dots) :
Evaluation error: could not find function "str_detect".
EDIT3: Perfect!
Loading the "tidyverse" package fixed the error I was getting, and everything worked as described, which is exactly what I was looking for.
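For reference, `str_detect` lives in the stringr package, which `library(tidyverse)` attaches. A minimal sketch, assuming you would rather not load an extra package for this one step: base R's `grepl` can do the same keyword check. The vector `store_name` below is an illustrative sample, not the real data.

```r
# Illustrative sample of partially cleaned store names (an assumption)
store_name <- c("AMZ", "AMZN MKTP US AMZN.COM/B PURCHASE", "ARCO")

# grepl(pattern, x) is the base R counterpart of stringr::str_detect(x, pattern);
# note that the argument order is reversed
store_name <- ifelse(grepl("AMZ", store_name, fixed = TRUE),
                     "AMAZON.COM", store_name)

print(store_name)
```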
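Answer 0 (score: 0)
Is there a reasonably consistent pattern you can use? From the examples you gave, it looks like # and * can be used to separate the business from the subcategory.
So you could do something like this in dplyr: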
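library(dplyr)    # tibble(), mutate(), group_by(), summarize(), arrange(), %>%
library(stringr)  # str_detect()
df <- tibble(payment_amt = c(-8.08,-13.28,-31.03,-35.20,-181.08,-13.90,-30.73,-29.35,-15.98),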
description = c('AMAZON.COM*MT2M03AW1 AM PURCHASE AMZN.COM/BILL WA',
'AMAZON.COM*MT80Z2EC0 AM PURCHASE AMZN.COM/BILL WA',
'AMAZON.COM*MT8G19G51 AM PURCHASE AMZN.COM/BILL WA',
'AMZ*Stride Rite PURCHASE Customerservi NY',
'AMZN MKTP US AMZN.COM/B PURCHASE AMZN.COM/BILL WA',
'ARBYS 0154 PURCHASE',
'ARCO #42472 AM PURCHASE',
'ARCO #42493 AM PURCHASE',
'AUNT CHILADAS PURCHASE'))
df <- df %>%
#remove everything after a *
mutate(store_name = gsub("\\*.*","",description),
#remove everything after a space and a #
store_name = gsub("\\ #.*","",store_name),
#remove everything after a space and a number sequence
store_name = gsub("\\ [0-9].*","",store_name),
#assign the other Amazon purchases to Amazon
store_name = ifelse(str_detect(store_name,'AMZ')==TRUE,'AMAZON.COM',store_name))
df_sums <- df %>%
group_by(store_name) %>%
summarize(payment_amt = sum(payment_amt)) %>%
ungroup() %>%
arrange(payment_amt)
Here are the results:
# A tibble: 4 x 2
  store_name             payment_amt
  <chr>                        <dbl>
1 AMAZON.COM                  -269.
2 ARCO                         -60.1
3 AUNT CHILADAS PURCHASE       -16.0
4 ARBYS                        -13.9
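Note that "AUNT CHILADAS PURCHASE" still carries the trailing word PURCHASE, because none of the three rules match it. One possible extension, sketched in base R only: wrap the rules in a helper and add a rule that drops " PURCHASE" and whatever follows it. The helper name `clean_store`, the data frame `statement`, and the extra rule are assumptions for illustration, and the sample uses only a few of the rows above.

```r
# Hypothetical helper: applies the three rules from the answer plus one extra rule
clean_store <- function(description) {
  name <- sub("\\*.*", "", description)   # drop everything from a * onward
  name <- sub(" #.*", "", name)           # drop a space, a #, and the rest
  name <- sub(" [0-9].*", "", name)       # drop a space, a digit, and the rest
  name <- sub(" PURCHASE.*", "", name)    # extra rule (assumption): drop " PURCHASE" and the rest
  # map the remaining Amazon variants (AMZ, AMZN ...) to a single name
  ifelse(grepl("AMZ", name, fixed = TRUE), "AMAZON.COM", name)
}

# A few rows taken from the sample data in the question
statement <- data.frame(
  description = c("AMZ*Stride Rite PURCHASE Customerservi NY",
                  "AMZN MKTP US AMZN.COM/B PURCHASE AMZN.COM/BILL WA",
                  "ARCO #42472 AM PURCHASE",
                  "ARCO #42493 AM PURCHASE",
                  "AUNT CHILADAS PURCHASE"),
  payment_amt = c(-35.20, -181.08, -30.73, -29.35, -15.98),
  stringsAsFactors = FALSE
)

statement$store_name <- clean_store(statement$description)

# aggregate() is the base R counterpart of group_by() + summarize()
totals <- aggregate(payment_amt ~ store_name, data = statement, FUN = sum)
print(totals)
```

Keeping the rules in one function makes it easier to add further special cases as new description formats show up on the statement.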