Using dplyr and stringr to replace all values that start with

Time: 2017-05-04 09:15:47

Tags: r dplyr stringr

My df:

> df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"), sold = rnorm(5, 100))
>   df
          food      sold
1 fruit banana  99.47171
2  fruit apple  99.40878
3  fruit grape  99.28727
4        bread  99.15934
5         meat 100.53438

Now I want to replace all values of food that start with "fruit", then group by food and summarise sold as a sum.

> df %>%
+     mutate(food = replace(food, str_detect(food, "fruit"), "fruit")) %>% 
+     group_by(food) %>% 
+     summarise(sold = sum(sold))
Source: local data frame [3 x 2]

    food      sold
  (fctr)     (dbl)
1  bread  99.15934
2   meat 100.53438
3     NA 298.16776

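Why doesn't this command work? Why does it give me NA instead of fruit?

5 answers:

Answer 0 (score: 8)

This works for me; I think your food column is stored as a factor: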

Use stringsAsFactors = FALSE when creating the data, as shown below, or run options(stringsAsFactors = FALSE) in your R session to avoid the same problem:

options(stringsAsFactors=FALSE)

Output:

df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"), sold = rnorm(5, 100),stringsAsFactors = FALSE)

df %>%
mutate(food = replace(food, str_detect(food, "fruit"), "fruit")) %>% 
group_by(food) %>% 
summarise(sold = sum(sold))
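
If the data frame already exists with food stored as a factor, recreating it is not always an option. The following is a minimal sketch, assuming it is acceptable to convert the column to character inside the pipeline. The explicit factor() call and the anchored pattern "^fruit" are my own choices for illustration, not part of the answer above. Note also that since R 4.0.0, data.frame() no longer converts strings to factors by default.

```r
library(dplyr)
library(stringr)

set.seed(24)
# Build food as an explicit factor so the example behaves the same on any R version
df <- data.frame(
  food = factor(c("fruit banana", "fruit apple", "fruit grape", "bread", "meat")),
  sold = rnorm(5, 100)
)

result <- df %>%
  # Convert to character first, so that "fruit" is an ordinary string value
  mutate(food = as.character(food),
         food = replace(food, str_detect(food, "^fruit"), "fruit")) %>%
  group_by(food) %>%
  summarise(sold = sum(sold))

result
```

The result should contain three groups (bread, fruit, meat) and no NA group.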

Answer 1 (score: 2)

We can do this in base R without converting to character class, by assigning 'fruit' to every level that contains 'fruit' and then using aggregate to get the sum.

levels(df$food)[grepl("fruit", levels(df$food))] <- "fruit"
aggregate(sold~food, df, sum)
#   food      sold
#1 bread  99.41637
#2 fruit 300.41033
#3  meat 100.84746
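
The level assignment works because assigning the same name to several levels merges them into one. A small sketch on a bare factor, using a hypothetical vector food rather than the data frame column, may make this easier to see:

```r
# A standalone factor with the same values as the food column
food <- factor(c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"))

# Levels before: one level per distinct string, in sorted order
levels(food)

# Rename every level containing "fruit" to "fruit"; duplicated names are merged
levels(food)[grepl("fruit", levels(food))] <- "fruit"

# Levels after: only three remain
levels(food)

# The underlying values now read "fruit" for the first three elements
as.character(food)
```

No conversion to character is needed, and no NA is produced, because "fruit" becomes a real level of the factor.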

Data

set.seed(24)
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", 
                 "bread", "meat"), sold = rnorm(5, 100))

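Answer 2 (score: 1)

Here are two alternative solutions that manipulate the factor levels directly, using forcats, stringr, and regular expressions.

If I understand correctly, the issue is caused by food being a factor, which replace() does not handle appropriately.

1. fct_collapse()

The fct_collapse() function is used to collapse all factor levels that start with "fruit " (note the trailing blank) into the single factor level "fruit":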

library(dplyr)
library(stringr)
library(forcats)
df %>%
  group_by(food = fct_collapse(food, fruit = levels(food) %>% str_subset("^fruit "))) %>% 
  summarise(sold = sum(sold))
  food         sold
  <fct>       <dbl>
1 bread        99.4
2 egg fruits  100. 
3 fruit       300. 
4 fruity wine 100. 
5 meat        101.

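Note that an enhanced sample data set is used, which includes edge cases to test the regular expression more thoroughly. In addition, the grouping variable is computed directly in group_by(), which saves a prior call to mutate().

2. str_replace() with look-behind

There is an even shorter solution, which uses str_replace() instead of replace() together with a more sophisticated regular expression. The regular expression uses a look-behind to remove all characters after the leading "fruit" (including the blank that follows "fruit"):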

df %>%
  group_by(food = str_replace(food, "(?<=^fruit)( .*)", "")) %>% 
  summarise(sold = sum(sold))

The result is the same as above.

Enhanced sample data set
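
To check how the look-behind treats the edge cases, the pattern can be applied to a plain character vector. This is a minimal sketch; the vector food below is a hypothetical subset of the sample values:

```r
library(stringr)

# Values chosen to include the edge cases "egg fruits" and "fruity wine"
food <- c("fruit banana", "fruit apple", "bread", "egg fruits", "fruity wine")

# Remove a blank and everything after it, but only directly after a leading "fruit"
cleaned <- str_replace(food, "(?<=^fruit)( .*)", "")
cleaned
```

Only the values that begin with "fruit" followed by a blank are shortened to "fruit". "egg fruits" does not start with "fruit", and in "fruity wine" the character after "fruit" is not a blank, so both are left unchanged.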

set.seed(24)
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", 
                          "meat", "egg fruits", "fruity wine"), 
                 sold = rnorm(7, 100))
df
          food      sold
1 fruit banana  99.45412
2  fruit apple 100.53659
3  fruit grape 100.41962
4        bread  99.41637
5         meat 100.84746
6   egg fruits 100.26602
7  fruity wine 100.44459

Answer 3 (score: 0)

replace does not work because the column food is a factor variable and "fruit" is not one of its levels.
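
The effect can be reproduced on a small factor. In this sketch the vector food is hypothetical, and suppressWarnings() is used only to keep the output quiet; without it R warns that an invalid factor level generated NA:

```r
# A factor whose levels are "bread", "fruit apple", "fruit banana"
food <- factor(c("fruit banana", "fruit apple", "bread"))

# "fruit" is not an existing level, so the replaced elements become NA
res <- suppressWarnings(replace(food, grepl("fruit", food), "fruit"))
res

# The set of levels is unchanged; "fruit" was never added
levels(res)
```

This is the same NA that later shows up as a group in the summarised output of the question.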

One possible solution is to define the data frame column food with the correct factor levels:

df <- data.frame(food = 
  factor(c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"), 
    levels =c("fruit banana", "fruit apple", "fruit grape", "bread", "meat", "fruit") ), 
    sold = rnorm(5, 100))

It is easier to set stringsAsFactors = FALSE:

df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"),
             sold = rnorm(5, 100), 
             stringsAsFactors = FALSE)

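Answer 4 (score: 0)

Although the question is tagged with dplyr and stringr, I would like to propose an alternative solution using data.table, because data.table handles factors in a convenient, straightforward way: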

library(data.table)
setDT(df)[food %like% "^fruit", food := "fruit"][, .(sold = sum(sold)), by = food]
#    food      sold
#1: fruit 300.41033
#2: bread  99.41637
#3:  meat 100.84746

Data

set.seed(24)
df <- data.frame(food = c("fruit banana", "fruit apple", "fruit grape", "bread", "meat"), 
                 sold = rnorm(5, 100))