Question

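I have a vector of tagged words, for example c('#142#856#856.2#745', NA, '#856#855', NA, '#685', '#663', '#965.23', '#855#658#744#122').

The words are separated by hash signs (#). I want to create a data frame with one column for each distinct code, and then write 1 or 0 (or NA) depending on the codes in that row.

The idea is that each element becomes a row and each code becomes a column; a column is marked 1 if the code is in that element and 0 if it is not.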

ID | 142 | 856 |856.2 | ... | 122 |
1  |  1  |  1  |  1   | ... |  0  |
2  |  0  |  0  |  0   | ... |  0  |
...

I know how to do this with a complicated algorithm and a lot of loops. But is there a simple way to do it?

Answer 1

You can do this easily using stringr:

# First we load the package
library(stringr)
# Then we create your example data vector
tagged_vector <- c('#142#856#856.2#745', NA, '#856#855', NA, '#685', '#663',
                   '#965.23', '#855#658#744#122')
# Next we need to get all the unique codes
# stringr's str_extract_all() can do this:
all_codes <- str_extract_all(string=tagged_vector, pattern='(?<=#)[0-9\\.]+')
# We just looked for one or more numbers and/or dots following a '#' character
# Now we just want the unique ones:
unique_codes <- unique(na.omit(unlist(all_codes)))
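# Then we check, for each code, whether it is among the codes extracted
# from each element. Exact matching with %in% is safer than grepl() here:
# the pattern '856' would also match '856.2', and '.' is a regex wildcard.
# NA elements yield NA in all_codes, so they simply match nothing.
# I've also used as.numeric() since you want 0/1 instead of TRUE/FALSE
result <- data.frame(sapply(unique_codes, function(x){
    as.numeric(sapply(all_codes, function(codes) x %in% codes))
}))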
# Then we add in your ID column and move it to the front:
result$ID <- 1:nrow(result)
result <- result[ , c(ncol(result), 1:(ncol(result)-1))]

The result is:

  ID X142 X856 X856.2 X745 X855 X685 X663 X965.23 X658 X744 X122
1  1    1    1      1    1    0    0    0       0    0    0    0
2  2    0    0      0    0    0    0    0       0    0    0    0
3  3    0    1      0    0    1    0    0       0    0    0    0
4  4    0    0      0    0    0    0    0       0    0    0    0
5  5    0    0      0    0    0    1    0       0    0    0    0
6  6    0    0      0    0    0    0    1       0    0    0    0
7  7    0    0      0    0    0    0    0       1    0    0    0
8  8    0    0      0    0    1    0    0       0    1    1    1

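You may notice that each code in the column names is prefixed with an "X". That is because in R a variable name may not begin with a number, so data.frame() prepends "X" to make the names syntactically valid.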
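
If you would rather keep the bare codes as column names, data.frame() accepts check.names = FALSE. A minimal sketch, using a small hypothetical matrix `m` (not part of the code above) in place of the sapply() result:

```r
# Hypothetical 0/1 matrix standing in for the sapply() output
m <- matrix(c(1, 0, 1, 1), nrow = 2,
            dimnames = list(NULL, c("142", "856.2")))

# Default: column names go through make.names(), which prepends "X"
default_df <- data.frame(m)
names(default_df)

# check.names = FALSE keeps the codes unchanged as column names
raw_df <- data.frame(m, check.names = FALSE)
names(raw_df)

# Non-syntactic names then need [[ ]] or backticks to access
raw_df[["856.2"]]
raw_df$`142`
```

The trade-off is that such columns can no longer be written as plain `raw_df$142`, so the "X" prefix may still be the more convenient choice.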
