My df:
> df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"), sold = rnorm(5, 100))
> df
food sold
1 fruit banana 99.47171
2 fruit apple 99.40878
3 fruit grape 99.28727
4 bread 99.15934
5 meat 100.53438
Now I want to replace every value in food that starts with "fruit" with "fruit", then group by food and summarise the total of sold.
> df %>%
+ mutate(food = replace(food, str_detect(food, "fruit"), "fruit")) %>%
+ group_by(food) %>%
+ summarise(sold = sum(sold))
Source: local data frame [3 x 2]
food sold
(fctr) (dbl)
1 bread 99.15934
2 meat 100.53438
3 NA 298.16776
Why doesn't this command work? It gives me NA instead of fruit.
Answer 0 (score: 8)
This works for me. I think your food column is a factor.
Either pass stringsAsFactors = FALSE when creating the data, as in the code below, or run
options(stringsAsFactors=FALSE)
in your R session to avoid the same problem:
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"), sold = rnorm(5, 100),stringsAsFactors = FALSE)
df %>%
mutate(food = replace(food, str_detect(food, "fruit"), "fruit")) %>%
group_by(food) %>%
summarise(sold = sum(sold))
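To make the cause concrete, here is a minimal base R sketch of what happens to a factor. It swaps in grepl() for str_detect() so that it needs no packages, and the vector name food and the helper is_fruit are only illustrative. It assumes the column was a factor, which was the data.frame() default before R 4.0.0.

```r
# Assumption: food is a factor (the data.frame() default before R 4.0.0;
# from R 4.0.0 on, stringsAsFactors defaults to FALSE).
food <- factor(c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"))
is_fruit <- grepl("fruit", as.character(food))

# "fruit" is not an existing level, so the assignment produces NA
# (R also emits an "invalid factor level" warning, suppressed here).
as_factor <- suppressWarnings(replace(food, is_fruit, "fruit"))

# Converting to character first lets replace() behave as intended.
as_text <- replace(as.character(food), is_fruit, "fruit")

print(as_factor)
print(as_text)
```

If the data frame already exists with a factor column, converting that column with as.character() before the replace() step should have the same effect as recreating it with stringsAsFactors = FALSE.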
Answer 1 (score: 2)
We can do this in base R without converting to character class, by assigning "fruit" to the levels that contain "fruit" and then using aggregate to get the sum:
levels(df$food)[grepl("fruit", levels(df$food))] <- "fruit"
aggregate(sold~food, df, sum)
# food sold
#1 bread 99.41637
#2 fruit 300.41033
#3 meat 100.84746
Data:
set.seed(24)
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape",
"bread", "meat"), sold = rnorm(5, 100))
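The level-renaming approach relies on R merging levels that end up with the same name. The sketch below illustrates that on a slightly larger data set. Two things here are assumptions rather than part of the answer: the extra rows "egg fruits" and "fruity wine", and the anchored pattern "^fruit ", which is used so that those rows are not collapsed. The column is wrapped in factor() explicitly so the sketch behaves the same on any R version.

```r
set.seed(24)
# Assumption: two edge-case rows added for illustration.
df <- data.frame(
  food = factor(c("fruit banana", "fruit apple", "fruit grape", "bread",
                  "meat", "egg fruits", "fruity wine")),
  sold = rnorm(7, 100)
)

# Renaming several levels to the same label merges them into one level.
levels(df$food)[grepl("^fruit ", levels(df$food))] <- "fruit"

# Sum sold within each remaining level.
totals <- aggregate(sold ~ food, df, sum)
print(levels(df$food))
print(totals)
```

With the unanchored pattern "fruit" from the answer, "egg fruits" and "fruity wine" would also be merged into "fruit", which may or may not be what is wanted.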
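Answer 2 (score: 1)
Here are two alternative solutions that use forcats, stringr, and regular expressions to manipulate the factor levels directly.
If I understand correctly, the problem arises because food is a factor, which replace() cannot handle appropriately.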
fct_collapse()
The fct_collapse() function is used to collapse all factor levels that start with "fruit " (note the trailing blank) into the single factor level "fruit":
library(dplyr)
library(stringr)
library(forcats)
df %>%
group_by(food = fct_collapse(food, fruit = levels(food) %>% str_subset("^fruit "))) %>%
summarise(sold = sum(sold))
  food         sold
  <fct>       <dbl>
1 bread        99.4
2 egg fruits  100.
3 fruit       300.
4 fruity wine 100.
5 meat        101.
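Note that an enhanced sample data set is used, which includes edge cases to better test the regular expression. In addition, the grouping variable is computed directly in group_by(), which saves a prior call to mutate().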
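str_replace() with look-behind
There is an even shorter solution, which uses str_replace() instead of replace() together with a more sophisticated regular expression. The regular expression uses a look-behind to delete all characters after the leading "fruit" (including the blank after "fruit"):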
df %>%
group_by(food = str_replace(food, "(?<=^fruit)( .*)", "")) %>%
summarise(sold = sum(sold))
The result is the same as above.
Data:
set.seed(24)
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread",
"meat", "egg fruits", "fruity wine"),
sold = rnorm(7, 100))
df
          food      sold
1 fruit banana  99.45412
2  fruit apple 100.53659
3  fruit grape 100.41962
4        bread  99.41637
5         meat 100.84746
6   egg fruits 100.26602
7  fruity wine 100.44459
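For readers who want to try the look-behind pattern in isolation, here is a small sketch that applies the same regular expression with base R's sub() instead of str_replace(). That substitution is mine, not the answer's; base R needs perl = TRUE for look-behind to be supported. The vector food is illustrative.

```r
food <- c("fruit banana", "fruit apple", "egg fruits", "fruity wine", "bread")

# Delete a blank and everything after it, but only when it directly
# follows "fruit" at the start of the string.
collapsed <- sub("(?<=^fruit)( .*)", "", food, perl = TRUE)

# For contrast: an unanchored match also flags the two edge cases.
unanchored <- grepl("fruit", food)

print(collapsed)
print(unanchored)
```

"egg fruits" is left alone because "fruit" is not at the start, and "fruity wine" is left alone because the character after the leading "fruit" is "y" rather than a blank.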
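Answer 3 (score: 0)
replace does not work because the column food is a factor variable and "fruit" is an unknown level.
One possible solution is to define the data frame column food with the correct factor levels: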
df <- data.frame(food =
factor(c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"),
levels =c("fruit banana", "fruit apple", "fruit grape", "bread", "meat", "fruit") ),
sold = rnorm(5, 100))
It is easier to set stringsAsFactors = FALSE:
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"),
sold = rnorm(5, 100),
stringsAsFactors = FALSE)
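If the factor already exists and you would rather not rebuild the data frame, the same idea can be applied after the fact by appending the missing level. This is a sketch under that assumption, in base R only. The anchored pattern "^fruit " and the final droplevels() call are my additions, not part of the answer.

```r
food <- factor(c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"))

# Register "fruit" as a valid level so the assignment no longer yields NA.
levels(food) <- c(levels(food), "fruit")
food <- replace(food, grepl("^fruit ", as.character(food)), "fruit")

# Remove the now-unused "fruit banana", "fruit apple", "fruit grape" levels.
food <- droplevels(food)

print(food)
```

Without droplevels(), the old levels stay attached to the factor even though no value uses them; whether they then show up as empty groups depends on the summarising function and its options.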
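Answer 4 (score: 0)
Although the question is tagged with dplyr and stringr, I would like to propose an alternative solution using data.table, because data.table handles factors in a convenient, straightforward way: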
library(data.table)
setDT(df)[food %like% "^fruit", food := "fruit"][, .(sold = sum(sold)), by = food]
# food sold
#1: fruit 300.41033
#2: bread 99.41637
#3: meat 100.84746
Data:
set.seed(24)
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"),
sold = rnorm(5, 100))