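I have a messy table with a column that contains multiple category labels, separated by several different delimiters. I want to use R to split that column at every delimiter and create a new column for each category label. The approaches I have seen can only split on one delimiter at a time.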
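My current table looks like this (the TYPE column mixes ",", ";" and "and" as delimiters):
#   ID  TYPE        TEXT
# 1  1  a           blue water
# 2  2  a,b,c       fresh water
# 3  3  a;b,f       cold stream
# 4  4  f, b and c  lovely sunset
# 5  5  b;c         up there
I want a table that looks like this:
#   ID  A  B  C  D  TEXT
# 1  1  a           blue water
# 2  2  a  b  c     fresh water
# 3  3  a  b     d  cold stream
# 4  4     b  c  d  lovely sunset
# 5  5     b  c     up there
Here is what I tried, splitting on one delimiter at a time: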
library(dplyr)  # provides %>%
library(tidyr)  # provides separate()

# Step 1: split TYPE on commas
my_table1 <- my_table %>%
  separate(TYPE, c('A', 'B'), ",")
my_table1
# > docs1
#   ID  A    B         TEXT
# 1  1  a    <NA>      blue water
# 2  2  a    b         fresh water
# 3  3  a;b  f         cold stream
# 4  4  f    b and c   lovely sunset
# 5  5  b;c  <NA>      up there
# Step 2: split the new A column on semicolons
my_table2 <- my_table1 %>%
  separate(A, c('A', 'C'), ";")
# > docs2
#   ID  A  C     B         TEXT
# 1  1  a  <NA>  <NA>      blue water
# 2  2  a  <NA>  b         fresh water
# 3  3  a  b     f         cold stream
# 4  4  f  <NA>  b and c   lovely sunset
# 5  5  b  c     <NA>      up there
# Step 3: split on "and"
my_table3 <- my_table2 %>%
  separate(A, c('A', 'D'), "and")
# > docs3
#   ID  A  D     C     B         TEXT
# 1  1  a  <NA>  <NA>  <NA>      blue water
# 2  2  a  <NA>  <NA>  b         fresh water
# 3  3  a  <NA>  b     f         cold stream
# 4  4  f  <NA>  <NA>  b and c   lovely sunset
# 5  5  b  <NA>  c     <NA>      up there
My attempt above, which chains one separate() call per delimiter, gets me close, but the column names are off. Also, I don't want to have to guess where the string "b and c" will end up after several iterations. I have thousands of rows and perhaps five or six categories. My guess is that there is a simpler way to do this.
Answer 0 (score: 2)
As an alternative that extends your tidyverse attempt, here is a solution using strsplit and unnest:
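library(dplyr)
library(tidyr)

df %>%
  mutate(
    # split on ";", on "," plus optional spaces, or on "and" with optional spaces
    val = strsplit(as.character(TYPE), "(;|,\\s*|\\s*and\\s*)")) %>%
  unnest(val) %>%        # name the list column; a bare unnest() is deprecated in tidyr >= 1.0
  select(-TYPE) %>%
  group_by(ID, TEXT) %>%
  mutate(n = 1:n()) %>%  # position of each label within its row
  spread(n, val)
# A tibble: 5 x 5
# Groups:   ID, TEXT [5]
#      ID TEXT          `1`   `2`   `3`
#   <int> <fct>         <chr> <chr> <chr>
# 1     1 blue water    a     NA    NA
# 2     2 fresh water   a     b     c
# 3     3 cold stream   a     b     f
# 4     4 lovely sunset f     b     c
# 5     5 up there      b     c     NA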
Note that this is not exactly the same as your expected output, but it does match @MKR's output.
Sample data:
df <- read.table(text =
"ID TYPE TEXT
1 1 'a' 'blue water'
2 2 'a,b,c' 'fresh water'
3 3 'a;b,f' 'cold stream'
4 4 'f, b and c' 'lovely sunset'
5 5 'b;c' 'up there'")
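The regular expression in the answer above does the real work, so it may help to see what it returns on the sample strings. The following is a minimal base-R sketch, not part of the original answer; the label "sandy" and the stricter pattern `strict` are hypothetical additions, used only to illustrate that `\\s*and\\s*` also matches "and" inside a longer label.

```r
# Pattern copied from the answer: ";", "," plus optional spaces,
# or "and" with optional surrounding spaces.
pattern <- "(;|,\\s*|\\s*and\\s*)"

type <- c("a", "a,b,c", "a;b,f", "f, b and c", "b;c")
parts <- strsplit(type, pattern)
print(parts[[4]])  # "f" "b" "c"

# Hypothetical hazard: a label that merely contains "and" is split too.
loose <- strsplit("sandy,b", pattern)[[1]]
print(loose)

# Assumed stricter variant: require whitespace on both sides of "and".
strict <- ";|,\\s*|\\s+and\\s+"
safe <- strsplit("sandy,b", strict)[[1]]
print(safe)  # "sandy" "b"
```

If the real category labels are single letters, as in the sample data, the original pattern should be sufficient; the stricter variant only matters when labels can contain the letters "and".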
Answer 1 (score: 1)
The cSplit function from the splitstackshape package makes this problem easier to solve. One approach is:
library(splitstackshape)
# First use `gsub` to replace the other delimiters so that ',' is the only delimiter.
my_table$TYPE <- gsub("and|;", ",", my_table$TYPE)
Mod_df <- cSplit(my_table, "TYPE", sep = ",")
Mod_df
#    ID          TEXT TYPE_1 TYPE_2 TYPE_3
# 1:  1    blue water      a     NA     NA
# 2:  2   fresh water      a      b      c
# 3:  3   cold stream      a      b      f
# 4:  4 lovely sunset      f      b      c
# 5:  5      up there      b      c     NA
tidyr::gather and spread can then be used to get the format mentioned by the OP:
library(dplyr)  # provides %>%, mutate_if(), mutate(), select(), filter()
library(tidyr)
gather(Mod_df, key, value, -ID, -TEXT) %>%
  mutate_if(is.factor, as.character) %>%
  mutate(K = toupper(value)) %>%  # column name = upper-cased label
  select(-key) %>%
  filter(!is.na(K)) %>%
  spread(K, value)
#   ID          TEXT    A    B    C    F
# 1  1    blue water    a <NA> <NA> <NA>
# 2  2   fresh water    a    b    c <NA>
# 3  3   cold stream    a    b <NA>    f
# 4  4 lovely sunset <NA>    b    c    f
# 5  5      up there <NA>    b    c <NA>
Data
my_table <- read.table(text =
" ID TYPE TEXT
1 1 a 'blue water'
2 2 'a,b,c' 'fresh water'
3 3 'a;b,f' 'cold stream'
4 4 'f, b and c' 'lovely sunset'
5 5 'b;c' 'up there'",
header = TRUE, stringsAsFactors = FALSE)
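For readers on a recent tidyr, where gather() and spread() are superseded, the same label-named layout can be sketched with separate_rows() and pivot_wider(). This is a sketch under stated assumptions rather than part of either answer: the delimiter pattern is an assumed variant that requires whitespace around "and", the helper column `label` is a hypothetical name, and the data frame is rebuilt inline so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Sample data rebuilt inline (same values as the Data block above).
my_table <- data.frame(
  ID   = 1:5,
  TYPE = c("a", "a,b,c", "a;b,f", "f, b and c", "b;c"),
  TEXT = c("blue water", "fresh water", "cold stream",
           "lovely sunset", "up there"),
  stringsAsFactors = FALSE
)

wide <- my_table %>%
  # One row per label; `sep` is treated as a regular expression.
  separate_rows(TYPE, sep = ";|,\\s*|\\s+and\\s+") %>%
  # `label` (hypothetical helper column) supplies the new column names.
  mutate(label = toupper(TYPE)) %>%
  # One column per distinct label; missing combinations become NA.
  pivot_wider(names_from = label, values_from = TYPE)

print(wide)
```

With the sample data this should give one row per ID with the columns ID, TEXT, A, B, C and F. Because every delimiter is handled in a single pattern, there is no need to chain one split per delimiter or to track where "b and c" ends up.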