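This is my data frame, which consists of a single observation. The observation is one long string in which four distinct parts, separated by runs of whitespace, can be identified: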
example <- "4.6 (19 ratings) Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately. 151 students enrolled "
df <- data.frame(example)
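As you can see, the first observation is a string containing 4 distinct parts: the rating (4.6), the number of ratings (19 ratings), a sentence (Course ... accurately.) and the number of enrolled students (151).
I used the separate() function to split this column into 4: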
df1 <- separate(df, example, c("Rating", "Number of rating", "Sentence", "Students"), sep = " ")
However, this does not work as expected.
Any ideas?
Update:
@nicola, this is what I get with the suggestion from your comment:
> df1 <- separate(df, example, c("Rating", "Number of rating", "Sentence", "Students"), sep=" {4,}")
Warning message:
Expected 4 pieces. Additional pieces discarded in 1 rows [1].
Answer 0 (score: 1)
How about this?
library(stringr)
library(tibble)
x <- str_split(example, " ") %>%
  unlist()
x <- x[x != ""]
df <- tibble("a", "b", "c", "d")
df[1, ] <- x
colnames(df) <- c("Rating", "Number of rating", "Sentence", "Students")
> str(df)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 4 variables:
$ Rating : chr "4.6"
$ Number of rating: chr " (19 ratings)"
$ Sentence : chr " Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of ra"| __truncated__
$ Students : chr "151 students enrolled"
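Answer 1 (score: 0)
There are two keys to the answer. The first is to use the right regular expression as the separator: sep = "[[:space:]]{2,}", which means two or more whitespace characters (\\s{2,} is a more common alternative form). The second is that your example actually has a lot of trailing whitespace, which separate() tries to put into an additional column. Just use trimws() to remove it. So the solution looks like this: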
library(tidyr)
library(dplyr)
example <- "4.6 (19 ratings) Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately. 151 students enrolled "
df <- data.frame(example)
df_new <- df %>%
  mutate(example = trimws(example)) %>%
  separate(col = "example",
           into = c("rating", "number_of_ratings", "sentence", "students_enrolled"),
           sep = "[[:space:]]{2,}")
as_tibble(df_new)
# A tibble: 1 x 4
rating number_of_ratings sentence students_enrolled
<chr> <chr> <chr> <chr>
1 4.6 (19 ratings) Course Ratings are calculated from individual students’ ratings and a vari~ 151 students enr~
The conversion to a tibble is only there to format the output.
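The same two steps (trim, then split on runs of two or more whitespace characters) can also be sketched in base R. The string below is a shortened, hypothetical stand-in: it assumes the four parts are joined by runs of four spaces and followed by six trailing spaces, which may not match the original data exactly.

```r
# Hypothetical stand-in for the original string: the four parts are joined by
# runs of four spaces and followed by trailing whitespace (both assumptions).
parts_in <- c("4.6", "(19 ratings)",
              "Course Ratings reflect course quality fairly and accurately.",
              "151 students enrolled")
example <- paste0(paste(parts_in, collapse = strrep(" ", 4)), strrep(" ", 6))

# Step 1: strip leading and trailing whitespace, as in the answer.
trimmed <- trimws(example)

# Step 2: split on runs of two or more whitespace characters, so the single
# spaces inside each part are left alone.
pieces <- strsplit(trimmed, "[[:space:]]{2,}")[[1]]

# Assemble a one-row data frame with the same column names as the answer.
df_new <- data.frame(
  rating            = pieces[1],
  number_of_ratings = pieces[2],
  sentence          = pieces[3],
  students_enrolled = pieces[4],
  stringsAsFactors  = FALSE
)
str(df_new)
```

Because the separator requires at least two whitespace characters, the sentence stays in one piece even though it contains many single spaces.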
Answer 2 (score: 0)
This can certainly be done with the stringr package and a few regular expressions. The result:
rating_mean n_ratings n_students descr
1 4.65 19 151 "Course (...) accurately."
library(stringr)
# example data: a data frame with one row and one character column
example <- "4.65 (19 ratings) Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately. 151 students enrolled "
example <- data.frame(example, stringsAsFactors = FALSE)
# create result data frame
result <- data.frame(cbind(rating_mean = 0, n_ratings = 0, n_students = 0, descr = 0))
# loop through rows of example data frame
for (i in seq_len(nrow(example))) {
  # collapse each run of whitespace into a single space
  example[i, 1] <- gsub("\\s+", " ", example[i, 1])
  # match and extract mean rating
  result[i, 1] <- as.numeric(str_match(example[i, 1], "^[0-9]+\\.[0-9]+"))
  # match and extract number of ratings
  result[i, 2] <- as.numeric(str_match(str_match(example[i, 1], "\\(.+\\)"), "[0-9]+"))
  # match and extract number of enrolled students
  result[i, 3] <- as.numeric(str_match(str_match(example[i, 1], "\\s[0-9].+$"), "[0-9]+"))
  # match and extract sentence
  result[i, 4] <- str_match(example[i, 1], "[A-Z].+\\.")
}
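As a possible alternative to the loop, the four fields could be pulled out in one pass with a single regular expression and base R's regexec() and regmatches(). This is only a sketch, not the answer's method: the input is a shortened, hypothetical stand-in for the string in the question, and the pattern assumes the literal wording "ratings" and "students enrolled".

```r
# Shortened, hypothetical stand-in for the string in the question.
x <- "4.65 (19 ratings) Course Ratings are fair and accurate. 151 students enrolled "

# One pattern with four capture groups:
# mean rating, number of ratings, sentence, number of students.
pattern <- paste0(
  "^\\s*([0-9]+\\.[0-9]+)",            # group 1: mean rating
  "\\s*\\(([0-9]+) ratings?\\)",       # group 2: number of ratings
  "\\s*(.*?)",                         # group 3: sentence (lazy match)
  "\\s*([0-9]+) students? enrolled",   # group 4: number of students
  "\\s*$"
)

# For each input string, regmatches() returns the full match followed by the
# capture groups. Strings that do not match yield character(0), so this
# sketch assumes every row matches the pattern.
m <- regmatches(x, regexec(pattern, x, perl = TRUE))
m <- do.call(rbind, m)

result <- data.frame(
  rating_mean = as.numeric(m[, 2]),
  n_ratings   = as.numeric(m[, 3]),
  n_students  = as.numeric(m[, 4]),
  descr       = m[, 5],
  stringsAsFactors = FALSE
)
str(result)
```

The lazy group `(.*?)` stops at the shortest sentence that still lets the student count match, so the number before "students enrolled" is not swallowed into the description.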