Separating fields in R when no delimiter exists

Time: 2014-05-29 22:56:31

Tags: r

I have a dataset like the following:

structure(list(Info = c("Acacia melanoceras 0.0369 0.0427 0.0267 0.0298 0.0501 0.0042 ", 
"Acalypha diversifolia van 0.0670 0.0439 0.0281 0.0427 0.0464 -0.0148 ", 
"Acalypha macrostachya vin 0.0657 0.0621 0.0441 0.0522 0.0473 -0.0173 ", 
"Adelia triloba 0.0481 0.0350 0.0202 0.0174 0.0286 -0.0349 ", 
"Aegiphila panamensis 0.0437 0.0312 0.0166 0.0148 0.0194 -0.0497 ", 
"Alchornea costaricensis 0.0568 0.0781 0.0502 0.0221 0.0734 -0.0153 "
)), .Names = "Info", row.names = c(NA, 6L), class = "data.frame")

It currently has only one column and looks like this:

                                                                   Info
1         Acacia melanoceras 0.0369 0.0427 0.0267 0.0298 0.0501 0.0042 
2 Acalypha diversifolia van 0.0670 0.0439 0.0281 0.0427 0.0464 -0.0148 
3 Acalypha macrostachya vin 0.0657 0.0621 0.0441 0.0522 0.0473 -0.0173 
4            Adelia triloba 0.0481 0.0350 0.0202 0.0174 0.0286 -0.0349 
5      Aegiphila panamensis 0.0437 0.0312 0.0166 0.0148 0.0194 -0.0497 
6   Alchornea costaricensis 0.0568 0.0781 0.0502 0.0221 0.0734 -0.0153

I would like it to have 7 columns and look like this:

                    Species     V1     V2     V3     V4     V5      V6
1        Acacia melanoceras 0.0369 0.0427 0.0267 0.0298 0.0501  0.0042
2 Acalypha diversifolia van 0.0670 0.0439 0.0281 0.0427 0.0464 -0.0148
3 Acalypha macrostachya vin 0.0657 0.0621 0.0441 0.0522 0.0473 -0.0173
4            Adelia triloba 0.0481 0.0350 0.0202 0.0174 0.0286 -0.0349
5      Aegiphila panamensis 0.0437 0.0312 0.0166 0.0148 0.0194 -0.0497
6   Alchornea costaricensis 0.0568 0.0781 0.0502 0.0221 0.0734 -0.0153

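This is giving me trouble because the species names are not always two words long. The original text file has no delimiters, so I cannot read it in as delimited data; I can only get it as a single column of strings. Does anyone have a suggestion?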

3 answers:

Answer 0 (score: 6)

Try using gsub to insert a comma before each number in the "Info" column, then re-read the result with read.csv. This assumes the data frame is named "dat":

> read.csv(text=gsub("( [-[:digit:].])", ",\\1", dat$Info), header=FALSE)
                         V1     V2     V3     V4     V5     V6      V7
1        Acacia melanoceras 0.0369 0.0427 0.0267 0.0298 0.0501  0.0042
2 Acalypha diversifolia van 0.0670 0.0439 0.0281 0.0427 0.0464 -0.0148
3 Acalypha macrostachya vin 0.0657 0.0621 0.0441 0.0522 0.0473 -0.0173
4            Adelia triloba 0.0481 0.0350 0.0202 0.0174 0.0286 -0.0349
5      Aegiphila panamensis 0.0437 0.0312 0.0166 0.0148 0.0194 -0.0497
6   Alchornea costaricensis 0.0568 0.0781 0.0502 0.0221 0.0734 -0.0153

Thanks for describing your use case. I may well use this myself in the future.
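
To get the column names asked for in the question in the same step, the call can be extended roughly as follows. This is only a sketch: the names `csv_lines` and `res` are made up for illustration, only two of the six rows are used, and `strip.white = TRUE` is added so that the padding spaces do not end up in the values. Note that the regular expression treats any space followed by a digit, "-" or "." as the start of a number, so it assumes no word in a species name begins with one of those characters.

```r
# Two rows of the example data; `dat` is the name assumed in the answer above.
dat <- data.frame(
  Info = c("Acacia melanoceras 0.0369 0.0427 0.0267 0.0298 0.0501 0.0042 ",
           "Acalypha diversifolia van 0.0670 0.0439 0.0281 0.0427 0.0464 -0.0148 "),
  stringsAsFactors = FALSE
)

# Insert a comma before every space that is followed by a digit, "-" or ".".
csv_lines <- gsub("( [-[:digit:].])", ",\\1", dat$Info)

# Re-read the strings as CSV, naming the columns as in the question.
res <- read.csv(text = csv_lines, header = FALSE,
                col.names = c("Species", paste0("V", 1:6)),
                strip.white = TRUE, stringsAsFactors = FALSE)
str(res)
```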

Answer 1 (score: 4)

Assuming ds is your data:

ds <- 
  structure(list(Info = c("Acacia melanoceras 0.0369 0.0427 0.0267 0.0298 0.0501 0.0042 ", 
                          "Acalypha diversifolia van 0.0670 0.0439 0.0281 0.0427 0.0464 -0.0148 ", 
                          "Acalypha macrostachya vin 0.0657 0.0621 0.0441 0.0522 0.0473 -0.0173 ", 
                          "Adelia triloba 0.0481 0.0350 0.0202 0.0174 0.0286 -0.0349 ", 
                          "Aegiphila panamensis 0.0437 0.0312 0.0166 0.0148 0.0194 -0.0497 ", 
                          "Alchornea costaricensis 0.0568 0.0781 0.0502 0.0221 0.0734 -0.0153 "
  )), .Names = "Info", row.names = c(NA, 6L), class = "data.frame")

Then you can do something like:

ds$Info <- gsub(" (-?[0-9])", ", \\1", ds$Info)
do.call(rbind, strsplit(ds$Info, ", "))
#     [,1]                        [,2]     [,3]     [,4]     [,5]     [,6]     [,7]      
#[1,] "Acacia melanoceras"        "0.0369" "0.0427" "0.0267" "0.0298" "0.0501" "0.0042 " 
#[2,] "Acalypha diversifolia van" "0.0670" "0.0439" "0.0281" "0.0427" "0.0464" "-0.0148 "
#[3,] "Acalypha macrostachya vin" "0.0657" "0.0621" "0.0441" "0.0522" "0.0473" "-0.0173 "
#[4,] "Adelia triloba"            "0.0481" "0.0350" "0.0202" "0.0174" "0.0286" "-0.0349 "
#[5,] "Aegiphila panamensis"      "0.0437" "0.0312" "0.0166" "0.0148" "0.0194" "-0.0497 "
#[6,] "Alchornea costaricensis"   "0.0568" "0.0781" "0.0502" "0.0221" "0.0734" "-0.0153 "

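where ds is your data from above, and you are almost done. First we look for a space followed by a number (optionally with a leading minus sign) and insert a comma in front of it. Then we split the strings on ", " and combine the resulting vectors with rbind. After that you can convert the object to a data.frame, convert the relevant columns to numeric, and add colnames.

Edit: As BondedDust's answer shows, using read.csv on the comma-separated ds$Info is more elegant.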
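
The remaining steps (data.frame, numeric conversion, column names) might look roughly like this. It is a minimal sketch rather than the answerer's own code: `parts` and `out` are illustrative names, only two rows of the data are used, and the comma-separated strings are built on the fly instead of overwriting ds$Info.

```r
# Two rows of the example data, named `ds` as in the answer above.
ds <- data.frame(
  Info = c("Acacia melanoceras 0.0369 0.0427 0.0267 0.0298 0.0501 0.0042 ",
           "Acalypha diversifolia van 0.0670 0.0439 0.0281 0.0427 0.0464 -0.0148 "),
  stringsAsFactors = FALSE
)

# Put ", " before each number, split on it, and stack the pieces row-wise.
parts <- do.call(rbind, strsplit(gsub(" (-?[0-9])", ", \\1", ds$Info), ", "))

# Character matrix -> data.frame, then convert the six value columns.
out <- as.data.frame(parts, stringsAsFactors = FALSE)
out[-1] <- lapply(out[-1], as.numeric)  # as.numeric() tolerates the trailing space
names(out) <- c("Species", paste0("V", 1:6))
str(out)
```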

read.csv(text = ds$Info, header = FALSE)

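Answer 2 (score: 1)

Here is my suggestion:

1) Split on ' '.
2) Paste the genus and species name parts back together (I assume you have 6 numeric columns).
3) Make a (character) data.frame.
4) Convert the numeric columns to numeric.
5) Set Species as the name of the first column.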

    df <- structure(list(Info = c("Acacia melanoceras 0.0369 0.0427 0.0267 0.0298 0.0501 0.0042 ", 
                              "Acalypha diversifolia van 0.0670 0.0439 0.0281 0.0427 0.0464 -0.0148 ", 
                              "Acalypha macrostachya vin 0.0657 0.0621 0.0441 0.0522 0.0473 -0.0173 ", 
                              "Adelia triloba 0.0481 0.0350 0.0202 0.0174 0.0286 -0.0349 ", 
                              "Aegiphila panamensis 0.0437 0.0312 0.0166 0.0148 0.0194 -0.0497 ", 
                              "Alchornea costaricensis 0.0568 0.0781 0.0502 0.0221 0.0734 -0.0153 "
)), .Names = "Info", row.names = c(NA, 6L), class = "data.frame")
df

# split
sp <- strsplit(df$Info, ' ')
sp

# make (character) data.frame
require(plyr)
newdf <- ldply(sp, function(x) {
  l <- length(x)
  dta <- x[(l-5):l]
  spec <- paste(x[1:(l-6)], collapse = ' ')
  out <- c(spec, dta)
  return(out)
})

# make numeric cols
newdf[ , 2:7] <- apply(newdf[ , 2:7], 2, function(x) as.numeric(x))
names(newdf)[1] <- 'Species'
str(newdf)
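
If you would rather not depend on plyr, the same idea (the last six tokens are the numbers, everything before them is the name) can be sketched in base R. The names `n_num`, `tokens` and `rows` are hypothetical, only two rows of the data are used, and `trimws()` requires R 3.2.0 or later.

```r
# Two rows of the example data, named `df` as in the answer above.
df <- data.frame(
  Info = c("Acacia melanoceras 0.0369 0.0427 0.0267 0.0298 0.0501 0.0042 ",
           "Acalypha diversifolia van 0.0670 0.0439 0.0281 0.0427 0.0464 -0.0148 "),
  stringsAsFactors = FALSE
)
n_num <- 6  # assumed number of numeric fields at the end of each line

# Split each line on runs of spaces after trimming the padding.
tokens <- strsplit(trimws(df$Info), " +")

# For each line: join the leading name tokens, keep the last n_num tokens as-is.
rows <- lapply(tokens, function(x) {
  k <- length(x)
  c(paste(x[seq_len(k - n_num)], collapse = " "), x[(k - n_num + 1):k])
})

# Stack into a data.frame, convert the value columns, and name the columns.
newdf <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
newdf[-1] <- lapply(newdf[-1], as.numeric)
names(newdf) <- c("Species", paste0("V", seq_len(n_num)))
str(newdf)
```

Unlike the regex-based answers, this version does not care what characters the name contains, but it does rely on every line having exactly `n_num` numeric fields.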