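I have a vector of tagged strings, for example c('#142#856#856.2#745', NA, '#856#855', NA, '#685', '#663', '#965.23', '#855#658#744#122').
The codes are separated by '#' characters. I want to create a data frame with one column for each distinct code, and then write 1 or 0 (or NA) depending on which codes appear in that row.
The idea is that each element becomes a row and each code becomes a column. A column is set to 1 if the code is in that element, and to 0 if it is not.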
ID | 142 | 856 | 856.2 | ... | 122 |
1  | 1   | 1   | 1     | ... | 0   |
2  | 0   | 0   | 0     | ... | 0   |
...
I know how to do this with a complicated algorithm that uses a lot of loops. But is there a simpler way to do it?
Answer 0 (score: 2)
You can use stringr:
# First we load the package
library(stringr)
# Then we create your example data vector
tagged_vector <- c('#142#856#856.2#745', NA, '#856#855', NA, '#685', '#663',
                   '#965.23', '#855#658#744#122')
# Next we need to get all the unique codes
# stringr's str_extract_all() can do this:
all_codes <- str_extract_all(string=tagged_vector, pattern='(?<=#)[0-9\\.]+')
# We just looked for one or more numbers and/or dots following a '#' character
# Now we just want the unique ones:
unique_codes <- unique(na.omit(unlist(all_codes)))
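# Then we check, for each code, whether it is among the codes extracted from each element
# (%in% gives an exact match; grepl(x, tagged_vector) would treat x as a regex,
# so '856' would also match an element that only contains '#856.2')
# I've also used as.numeric() since you want 0/1 instead of TRUE/FALSE
result <- data.frame(sapply(unique_codes, function(x){
  as.numeric(sapply(all_codes, function(codes) x %in% codes))
}))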
# Then we add in your ID column and move it to the front:
result$ID <- 1:nrow(result)
result <- result[ , c(ncol(result), 1:(ncol(result)-1))]
The result is:
  ID X142 X856 X856.2 X745 X855 X685 X663 X965.23 X658 X744 X122
1  1    1    1      1    1    0    0    0       0    0    0    0
2  2    0    0      0    0    0    0    0       0    0    0    0
3  3    0    1      0    0    1    0    0       0    0    0    0
4  4    0    0      0    0    0    0    0       0    0    0    0
5  5    0    0      0    0    0    1    0       0    0    0    0
6  6    0    0      0    0    0    0    1       0    0    0    0
7  7    0    0      0    0    0    0    0       1    0    0    0
8  8    0    0      0    0    1    0    0       0    1    1    1
You may notice that each code in the column names is prefixed with an "X". That's because in R a variable name may not begin with a number.
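If you would rather keep the bare codes as column names, one option is to pass check.names = FALSE to data.frame(), which stops it from rewriting names into syntactically valid ones. The sketch below is only an illustration under a few assumptions: it uses a shortened version of the example vector, splits on '#' with base R's strsplit() instead of stringr, and the names code_list and indicators are made up for this example.

```r
# Shortened illustrative input (a subset of the example vector above)
tagged_vector <- c("#142#856#856.2", NA, "#856#855")

# Split on '#' with base R; the leading '#' yields an empty first piece,
# and NA elements yield NA, so both are filtered out here
code_list <- lapply(strsplit(tagged_vector, "#", fixed = TRUE),
                    function(codes) codes[!is.na(codes) & codes != ""])
unique_codes <- unique(unlist(code_list))

# One 0/1 column per code, matched exactly with %in%
indicators <- sapply(unique_codes, function(code) {
  as.numeric(vapply(code_list, function(codes) code %in% codes, logical(1)))
})

# check.names = FALSE keeps "142" as "142" instead of "X142"
result <- data.frame(ID = seq_along(tagged_vector), indicators,
                     check.names = FALSE)

names(result)
result$`856.2`  # non-syntactic names have to be quoted with backticks
```

The trade-off is that such columns must always be referenced with backticks or as strings (for example result[["856.2"]]), so the "X" prefix may still be the more convenient choice for interactive work.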