Question

I have this in one column of a table:

paragemcard-resp+insufcardioresp
dpco+pneumonia
posopperfulceragastrica+ards
pos op hematoma #rim direito expontanea
miopatiaduchenne-erb+insuf.resp
dpco+dhca+#femur
posde#subtroncantéricaesqª+complicepidural
dpco+asma

I want to split them like this:

paragemcard-resp                            insufcardioresp
dpco                                        pneumonia
posopperfulceragastrica                     ards
pos op hematoma #rim direito expontanea
miopatiaduchenne-erb                        insuf.resp
dpco                                        dhca                   #femur
posde#subtroncantéricaesqª                  complicepidural
dpco                                        asma

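The problem is that the rows do not all have the same number of parts. As you can see, row 3 has 2 variables, while row 6 has 3.

I want to create these strings in the same column for further analysis.

Thanks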

Answer 1

You can use strsplit:

text <- c("paragemcard-resp+insufcardioresp", "dpco+pneumonia", "posopperfulceragastrica+ards", "pos op hematoma #rim direito expontanea", "miopatiaduchenne-erb+insuf.resp", "dpco+dhca+#femur", "posde#subtroncantéricaesqª+complicepidural", "dpco+asma")

# Split each string on a literal "+" (fixed = TRUE disables regex matching)
strings <- strsplit(text, "+", fixed = TRUE)
# Largest number of pieces found in any row
maxlen <- max(sapply(strings, length))
# Pad the shorter vectors with NA so every row has maxlen pieces
strings <- lapply(strings, function(s) { length(s) <- maxlen; s })
# Bind the padded vectors row-wise into a data frame
strings <- data.frame(matrix(unlist(strings), ncol = maxlen, byrow = TRUE))

The result looks like this:

                                          X1              X2     X3
   1                        paragemcard-resp insufcardioresp   <NA>
   2                                    dpco       pneumonia   <NA>
   3                 posopperfulceragastrica            ards   <NA>
   4 pos op hematoma #rim direito expontanea            <NA>   <NA>
   5                    miopatiaduchenne-erb      insuf.resp   <NA>
   6                                    dpco            dhca #femur
   7              posde#subtroncantéricaesqª complicepidural   <NA>
   8                                    dpco            asma   <NA>
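If the goal is to have every piece in a single column, as the question mentions, one possible variation is to stack the pieces into a "long" data frame instead of padding them into a wide one. This is only a sketch and not part of the original answer: the names `pieces`, `long`, `row` and `part` are made up for illustration, and a shortened copy of the sample data is used.

```r
# Shortened sample data (three of the rows from the question)
text <- c("paragemcard-resp+insufcardioresp", "dpco+pneumonia", "dpco+dhca+#femur")

# Split on a literal "+"; each element of the list holds the pieces of one row
pieces <- strsplit(text, "+", fixed = TRUE)

# One row per piece; `row` records which input string each piece came from
long <- data.frame(
  row  = rep(seq_along(pieces), lengths(pieces)),
  part = unlist(pieces),
  stringsAsFactors = FALSE
)
```

With the data in this shape, something like `table(long$part)` counts how often each piece occurs, without having to deal with NA padding.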

Answer 2

You can use read.table, but you should first use count.fields or some kind of regular expression to determine the correct number of columns. Using Robert's "text" sample data:

Cols <- max(sapply(gregexpr("+", text, fixed = TRUE), length))+1
## Cols <- max(count.fields(textConnection(text), sep = "+"))

read.table(text = text, comment.char="", header = FALSE, 
           col.names=paste0("V", sequence(Cols)), 
           fill = TRUE, sep = "+")
#                                        V1              V2     V3
# 1                        paragemcard-resp insufcardioresp       
# 2                                    dpco       pneumonia       
# 3                 posopperfulceragastrica            ards       
# 4 pos op hematoma #rim direito expontanea                       
# 5                    miopatiaduchenne-erb      insuf.resp       
# 6                                    dpco            dhca #femur
# 7              posde#subtroncantéricaesqª complicepidural       
# 8                                    dpco            asma

Also possibly useful: the "stringi" library makes counting the elements easy (as an alternative to the gregexpr step above).

library(stringi)
# Number of "+" separators per string, plus one, gives the number of fields
Cols <- max(stri_count_fixed(text, "+") + 1)

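Why is the "Cols" step needed? read.table and family decide how many columns to use from either (1) the maximum number of fields detected in the first 5 lines of data, or (2) the length of the col.names argument. In your example, the line with the most fields is the sixth, so using read.csv or read.table directly would wrap the data incorrectly.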
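A small sketch of that behaviour, using the first six rows of the sample data (the object names `naive` and `fixed` are only illustrative, and the field count here is computed with strsplit rather than gregexpr):

```r
# First six rows of the sample data; only the sixth has three fields
text <- c("paragemcard-resp+insufcardioresp", "dpco+pneumonia",
          "posopperfulceragastrica+ards",
          "pos op hematoma #rim direito expontanea",
          "miopatiaduchenne-erb+insuf.resp", "dpco+dhca+#femur")

# No col.names: the column count is guessed from the first five lines,
# so the third field of the sixth line does not get a column of its own
naive <- read.table(text = text, sep = "+", fill = TRUE,
                    comment.char = "", header = FALSE)

# Count the fields over all lines first, then pass that many column names
Cols <- max(lengths(strsplit(text, "+", fixed = TRUE)))
fixed <- read.table(text = text, sep = "+", fill = TRUE,
                    comment.char = "", header = FALSE,
                    col.names = paste0("V", seq_len(Cols)))

ncol(naive)  # two columns guessed from the first five lines
ncol(fixed)  # three columns, taken from col.names
```

Comparing `naive` and `fixed` side by side should make the wrapping visible: in the naive result the extra field typically ends up on a row of its own.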

Separating text into variables in R
