Split a string and assign values to variables in R

Time: 2015-09-25 01:19:00

Tags: r

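I'm hoping to get an answer in R to a question similar to this.

I'm working with a dataset that contains a variable joining 30 string values with numeric values in parentheses. The individual string-and-parenthesis combinations are separated by commas.

Important: a string value may sometimes be repeated.

For example, in df, var might be: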

id    var
1     Videos (10.1), Music (9.5), Games (8.3), Videos (1)
2     Videos (11.1), Dogs (10.5), Cats (8.4), Dogs (1)
3     Cars (12.1), Music (9.5), Games (8.5), Games (2)
4     Cars (14.1), Music (9.5), Dogs (8.6)
5     Horses (10.1), Antelope (9.5), Music (8.7)
6     Music (10.1), Videos (9.0), Games (8.9)

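What I want to generate is additional columns, where each unique string value in var gets its own column, and the value in that column is the number in parentheses (if available). The tricky part for me is that when a string value is repeated (e.g. Videos), I want to sum the numeric values.

So, in my resulting dataset, the ideal output would be: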

id   Videos   Music   Games   Dogs   Cats   Cars   Horses   Antelope
1    11.1     9.5     8.3     NA     NA     NA     NA       NA
2    11.1     NA      NA      11.5   8.4    NA     NA       NA
3    NA       9.5     10.5    NA     NA     12.1   NA       NA
4    NA       9.5     NA      8.6    NA     14.1   NA       NA
5    NA       8.7     NA      NA     NA     NA     10.1     9.5
6    9.0      10.1    8.9     NA     NA     NA     NA       NA

Any ideas on how to do this in R?

Edit: actual data included below:

 my_df<-data.frame(id=1:20, var= c("PeopleBlogs(2.88)", "Music(3.90)", "Entertainment(3.05),Music(5.10),Music(2.28)", 
"Sports(1.02)", "NonprofitsActivism(0.20),FilmAnimation(0.58)", 
"Music(3.60),Music(1.42),Music(7.60)", "GadgetsGames(0.52)", 
"Music(9.17),PeopleBlogs(0.33),PeopleBlogs(1.58),Music(8.82),Entertainment(1.38),PeopleBlogs(0.45),PeopleBlogs(0.58),Entertainment(0.92),FilmAnimation(1.60),FilmAnimation(7.57),Music(2.28),Entertainment(3.18),Entertainment(4.98),Music(0.48),FilmAnimation(0.28),FilmAnimation(0.18),Entertainment(5.97),Entertainment(1.35)", 
"FilmAnimation(2.42),GadgetsGames(3.92)", "PeopleBlogs(4.38),GadgetsGames(15.47)", 
"Entertainment(3.52)", "PeopleBlogs(0.22),Music(1.15),PetsAnimals(3.50),PeopleBlogs(2.78),PeopleBlogs(3.27)", 
"Music(2.05),PeopleBlogs(0.20)", "Music(3.48),Music(4.65),Music(0.55)", 
"Entertainment(0.78)", "Entertainment(4.35),PeopleBlogs(2.33),Comedy(7.05),PeopleBlogs(7.27)", 
"Entertainment(0.50)", "Education(1.73)", "Education(0.67)", 
"GadgetsGames(17.35),Education(7.40),NewsPolitics(0.35)"))

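3 answers:

Answer 0 (score: 4)

Here is one way. First, use cSplit() from the splitstackshape package. The first call splits the var column on "," and reshapes the data to long format. The second call splits var again, this time on the space. At this point you have a data.table, not a data.frame. With the data.table package you then do two things: remove the parentheses and convert the characters to numeric, then sum the numbers by id and var_1, as you requested. Finally, use dcast() from the data.table package to get the desired output. I hope this helps.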

mydf <- data.frame(id = 1:6,
                   var = c("Videos (10.1), Music (9.5), Games (8.3), Videos (1)",
                         "Videos (11.1), Dogs (10.5), Cats (8.4), Dogs (1)",
                         "Cars (12.1), Music (9.5), Games (8.5), Games (2)",
                         "Cars (14.1), Music (9.5), Dogs (8.6)",
                         "Horses (10.1), Antelope (9.5), Music (8.7)",
                         "Music (10.1), Videos (9.0), Games (8.9)"),
                   stringsAsFactors = FALSE)

library(splitstackshape)
library(data.table)
library(magrittr)

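# Split on "," into long format (one row per "Name (value)" pair),
# then split each pair on the space into var_1 (name) and var_2 ("(value)").
cSplit(mydf, "var", sep = ",", direction = "long") %>%
  cSplit("var", sep = " ", direction = "wide") -> foo

# Strip the parentheses, convert to numeric, sum by id and name,
# then cast to wide format with one column per name.
foo[, var_2 := as.numeric(gsub(pattern = "\\(|\\)", replacement = "", x = var_2))][,
    list(total = sum(var_2)), by = list(id, var_1)] %>%
  dcast(id ~ var_1, value.var = "total")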


#  id Antelope Cars Cats Dogs Games Horses Music Videos
#1:  1       NA   NA   NA   NA   8.3     NA   9.5   11.1
#2:  2       NA   NA  8.4 11.5    NA     NA    NA   11.1
#3:  3       NA 12.1   NA   NA  10.5     NA   9.5     NA
#4:  4       NA 14.1   NA  8.6    NA     NA   9.5     NA
#5:  5      9.5   NA   NA   NA    NA   10.1   8.7     NA
#6:  6       NA   NA   NA   NA   8.9     NA  10.1    9.0

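Edit

With your real data, neither Ananda's code nor mine works. That is because there is no space between the name and the number (e.g. Videos(10.1)), whereas the original sample data did have one (e.g. Videos (10.1)). The following modification of my original answer does the job: it splits on "(" instead of the space. I've included part of the result.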

cSplit(my_df, "var", sep = ",", direction = "long") %>%
  cSplit("var", sep = "(", direction = "wide") -> foo

# Only the closing parenthesis is left to strip here.
foo[, var_2 := as.numeric(gsub(pattern = "\\)", replacement = "", x = var_2))][,
    list(total = sum(var_2)), by = list(id, var_1)] %>%
  dcast(id ~ var_1, value.var = "total")

#    id Comedy Education Entertainment FilmAnimation GadgetsGames Music NewsPolitics
#1:  1     NA        NA            NA            NA           NA    NA           NA
#2:  2     NA        NA            NA            NA           NA  3.90           NA
#3:  3     NA        NA          3.05            NA           NA  7.38           NA
#4:  4     NA        NA            NA            NA           NA    NA           NA
#5:  5     NA        NA            NA          0.58           NA    NA           NA

Answer 1 (score: 0)

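library(dplyr)
library(stringi)
library(tidyr)

mydf %>%
  # split each string into its "Name (value)" pieces, one piece per row
  mutate(both = var %>% stri_split_fixed(", ")) %>%
  unnest(both) %>%
  # split each piece on the space into the name and the "(value)" part
  separate(both, c("category", "value.string"), sep = " ") %>%
  # extract_numeric() is deprecated in newer tidyr; its documentation
  # points to readr::parse_number() as the replacement
  mutate(value = value.string %>% extract_numeric) %>%
  # sum repeated categories within each id, then reshape to wide format
  group_by(id, category) %>%
  summarize(value = sum(value)) %>%
  spread(category, value)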

Answer 2 (score: 0)

# Another method without additional libraries
# vector with data to split
X <- c("Videos (10.1), Music (9.5), Games (8.3), Videos (1)",
       "Videos (11.1), Dogs (10.5), Cats (8.4), Dogs (1)",
       "Cars (12.1), Music (9.5), Games (8.5), Games (2)",
       "Cars (14.1), Music (9.5), Dogs (8.6)",
       "Horses (10.1), Antelope (9.5), Music (8.7)")

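# custom split function: returns a named vector of per-category sums for one string
g <- function(x, ...) {
  x <- chartr(")", " ", x)   # drop the closing parentheses
  x <- chartr(",", "\n", x)  # one "Name (value" pair per line
  # stringsAsFactors=TRUE is needed for levels() below;
  # since R 4.0.0 read.table() no longer creates factors by default
  x <- read.table(text=x, sep="(", strip.white=TRUE, stringsAsFactors=TRUE)
  L <- levels(x$V1)
  V <- numeric(0)
  for (l in L) {
    V <- c(V, sum(x$V2[x$V1==l]))
  }
  names(V) <- L
  return(V)
}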

# making a data.frame element by element
Y <- data.frame(case=seq_along(X))
for (i in seq_along(X)) {
  rw <- g(X[i])
  for (n in names(rw)) {
    Y[i,n] <- rw[n]   # creates column n on first use, NA in the other rows
  }
}

Y

  case Games Music Videos Cats Dogs Cars Antelope Horses
1    1   8.3   9.5   11.1   NA   NA   NA       NA     NA
2    2    NA    NA   11.1  8.4 11.5   NA       NA     NA
3    3  10.5   9.5     NA   NA   NA 12.1       NA     NA
4    4    NA   9.5     NA   NA  8.6 14.1       NA     NA
5    5    NA   8.7     NA   NA   NA   NA      9.5   10.1
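
The element-by-element loop above could also be replaced by a single tapply() call over a long-format table. The following is only a rough base R sketch under stated assumptions, not the answerer's code: the names pieces, long and wide are hypothetical, and the three input strings are copied from the question (two with spaces, one without).

```r
# Input strings copied from the question; the third has no spaces.
X <- c("Videos (10.1), Music (9.5), Games (8.3), Videos (1)",
       "Cars (14.1), Music (9.5), Dogs (8.6)",
       "Entertainment(3.05),Music(5.10),Music(2.28)")

# Split each string into its "Name (number)" pieces.
pieces <- strsplit(X, ",", fixed = TRUE)

# Long format: one row per piece, tagged with the index of its source string.
long <- data.frame(
  id    = rep(seq_along(X), lengths(pieces)),
  piece = trimws(unlist(pieces)),
  stringsAsFactors = FALSE
)
# Name: everything before "(". Value: whatever sits inside the parentheses.
long$name  <- trimws(sub("\\(.*$", "", long$piece))
long$value <- as.numeric(sub("^.*\\(([^)]*)\\).*$", "\\1", long$piece))

# Sum per (id, name); combinations that never occur come back as NA.
wide <- tapply(long$value, list(id = long$id, name = long$name), sum)
```

Here wide is a numeric matrix with one row per input string and one column per name; as.data.frame(wide) would turn it into a data.frame if that is preferred.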