Question

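I am new to R (and to stackoverflow, so the bullets just represent new lines) and have been assigned to a project where I need to clean MEDLINE data into a tidy data frame. A sample of the raw .txt file looks like this: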

 PMID- 28152974 
 OWN - NLM 
 IS  - 1471-230X (Electronic) 
 IS  - 1471-230X (Linking) 
 PMID- 28098115 
 OWN - NLM 
 IP  - 1 
 VI  - 28

etc.

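Each new observation begins with PMID, not every observation contains all variables, and some cells that share a column name within the same observation (i.e. IS) need to be merged. The final data frame should look like this: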

 PMID      OWN  IS                                          VI
 28152974  NLM  1471-230X (Electronic) 1471-230X (Linking)  N/A 
 28098115  NLM  N/A                                         28

etc.

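So far I have manipulated my data in several ways. The first is in the format of the original data file, but in two columns and without the " - ". For example: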

 PMID 28152974
 OWN  NLM
 IS   1471-230X (Electronic) 
 IS   1471-230X (Linking) 
 PMID 28098115 
 OWN  NLM 
 IP   1 
 VI   28

etc.

The second has all observations in a single row, with thousands of columns for each variable. For example:

 PMID      OWN   IS                      IS                   PMID      OWN
 28152974  NLM   1471-230X (Electronic)  1471-230X (Linking)  28098115  NLM

etc.

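The third is similar to the second, but instead of thousands of columns it has only the distinct column types that occur for the first PMID value. For example: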

 PMID               OWN      IS 
 28152974 28098115  NLM NLM  1471-230X (Electronic) 1471-230X (Linking)

etc.

Please help. I have no idea how to piece my data together, or which of these manipulations I should be using.

Answer 1

Reproducible data:

d <- c("PMID- 28152974", "OWN - NLM", "IS  - 1471-230X (Electronic)", 
       "IS  - 1471-230X (Linking)", "PMID- 28098115", "OWN - NLM", "IP  - 1", 
       "VI  - 28")

Input from a file:

d <- readLines('/path/to/file')

An idea:

# split into records
i <- grepl("^PMID", d)
i <- cumsum(i)
d <- split(d, i)

# split into key-value pairs
d <- lapply(d, strsplit, "\\ {0,2}-\\ ")
d <- lapply(d, function (x) setNames(sapply(x, '[[', 2), sapply(x, '[[', 1)))

# merge IS variables
d <- lapply(d, function (x) {
  i <- names(x) == "IS"
  if (any(i))
     x <- c(x[!i], IS = paste(x[i], collapse = " "))
  return(x)
})

# merge records to data.frame
library(data.table)
d <- lapply(d, as.list)
d <- lapply(d, as.data.table)
d <- rbindlist(d, fill = TRUE)
d <- as.data.frame(d)
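
If data.table is not available, the last step (merging records with different sets of fields) can also be done in base R. The following is only a sketch of that one step, not part of the original answer: the `records` list is typed in by hand to mimic what the code above produces, and the names `cols`, `rows`, `m` and `result` are made up for illustration.

```r
# Hand-written stand-in for the list of named character vectors built above
records <- list(
  c(PMID = "28152974", OWN = "NLM",
    IS = "1471-230X (Electronic) 1471-230X (Linking)"),
  c(PMID = "28098115", OWN = "NLM", IP = "1", VI = "28")
)

# Union of all field names, in order of first appearance
cols <- unique(unlist(lapply(records, names)))

# Indexing by name yields NA for fields a record does not have
rows <- lapply(records, function(x) unname(x[cols]))

# Stack the equally long vectors into a matrix, then convert
m <- do.call(rbind, rows)
colnames(m) <- cols
result <- as.data.frame(m, stringsAsFactors = FALSE)
result
```

Missing fields come out as NA here, which matches the N/A cells in the desired result.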

Answer 2

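It is not unusual for different kinds of data to be mixed in one or two columns of a data file. As long as the different kinds of data can be identified somehow, e.g. by a regular expression, the contents of the rows can be moved to different columns.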
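
As a small illustration of the regular-expression idea, the tag and the value of each line can be pulled apart with `sub()`. This is a sketch that assumes every line has the shape "TAG - value" with an upper-case tag, as in the sample; the names `raw`, `clean` and `pattern` are made up here.

```r
# A few sample lines, including the surrounding whitespace from the file
raw <- c(" PMID- 28152974 ", " IS  - 1471-230X (Electronic) ", " VI  - 28")
clean <- trimws(raw)

# Upper-case tag, optional padding, a dash, then the value
pattern <- "^([A-Z]+) *- +(.*)$"

# Lines that do not match could be inspected separately
stopifnot(all(grepl(pattern, clean)))

variable <- sub(pattern, "\\1", clean)
value <- sub(pattern, "\\2", clean)
data.frame(variable, value, stringsAsFactors = FALSE)
```

Because the pattern is anchored at the start of the line, the dash inside a value such as 1471-230X is left alone.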

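The following solution uses read_fwf() from the readr package to read the fixed-width data from the text file (simulated here by reading from a string). dcast() from the data.table package is then used to reshape from long to wide format, which yields a data.frame with one record per row: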

Read data

library(data.table)
# read data 
dt <- readr::read_fwf(
  " PMID- 28152974 
 OWN - NLM 
 IS  - 1471-230X (Electronic) 
 IS  - 1471-230X (Linking) 
 PMID- 28098115 
 OWN - NLM 
 IP  - 1 
 VI  - 28 ", 
  readr::fwf_positions(c(2, 8), c(5, Inf), c("variable", "value")),
  col_types = "cc")
# coerce tibble to data.table
setDT(dt)

Reshape from long to wide format

# create new column PMID with the record id
dt[variable == "PMID", PMID := value]
# fill missing values in subsequent rows to mark all rows belonging to one record
dt[, PMID := zoo::na.locf(PMID)]
dt
#   variable                  value     PMID
#1:     PMID               28152974 28152974
#2:      OWN                    NLM 28152974
#3:       IS 1471-230X (Electronic) 28152974
#4:       IS    1471-230X (Linking) 28152974
#5:     PMID               28098115 28098115
#6:      OWN                    NLM 28098115
#7:       IP                      1 28098115
#8:       VI                     28 28098115

# reshape from long to wide, thereby collapsing strings if necessary
dcast(dt[variable != "PMID"], PMID ~ ..., fun.aggregate = paste, collapse = " ")
#       PMID IP                                         IS OWN VI
#1: 28098115  1                                            NLM 28
#2: 28152974    1471-230X (Electronic) 1471-230X (Linking) NLM

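Note that this approach is quite flexible: it collapses every data field that occurs more than once within a record, whatever it is named, not just the IS column.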
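
The same collapse-any-repeated-field behaviour can be sketched in base R with `tapply()`. This is not the code from the answer, only an illustration; the `long` data frame is typed in by hand to mirror the long-format table shown above, without the PMID rows.

```r
# Long format: one row per (record, field, value)
long <- data.frame(
  PMID = c("28152974", "28152974", "28152974",
           "28098115", "28098115", "28098115"),
  variable = c("OWN", "IS", "IS", "OWN", "IP", "VI"),
  value = c("NLM", "1471-230X (Electronic)", "1471-230X (Linking)",
            "NLM", "1", "28"),
  stringsAsFactors = FALSE
)

# One cell per PMID/field combination; repeated fields are pasted together
wide <- tapply(long$value, list(long$PMID, long$variable),
               paste, collapse = " ")
wide
```

Unlike the dcast() output above, which leaves empty strings, combinations that never occur end up as NA in this matrix.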

Answer 3

Thanks to G. Grothendieck, I learned about the read.dcf() function from base R, which simplifies this task considerably. Only a few minor adjustments are needed.

# use connection to avoid warnings
con <- file("test.dat")
# read file row-wise and adjust for dcf format
dat <- sub("PMID", "\nPMID", sub("- ", ": ", trimws(readLines(con))))
# close connection to avoid warnings
close(con)

# re-read from variable using dcf format, collapse multiple entries
result <- read.dcf(textConnection(dat), all = TRUE)
result

      PMID OWN                                         IS   IP   VI  
1 28152974  NLM 1471-230X (Electronic), 1471-230X (Linking) <NA> <NA>
2 28098115  NLM                                          NA    1   28
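
With all = TRUE, a field that occurs several times in a record is returned as a list column, which is why IS is printed with a comma. If a single string per record is wanted, as in the question, the column can be collapsed afterwards. The sketch below assumes the lines are already in DCF form ("tag: value", records separated by a blank line) and builds them by hand instead of reading test.dat; `dcf_lines` is a made-up name.

```r
# Hand-written DCF input: a blank line separates the two records
dcf_lines <- c("PMID: 28152974", "OWN: NLM",
               "IS: 1471-230X (Electronic)", "IS: 1471-230X (Linking)",
               "",
               "PMID: 28098115", "OWN: NLM", "IP: 1", "VI: 28")

con <- textConnection(dcf_lines)
result <- read.dcf(con, all = TRUE)
close(con)

# Collapse the multi-valued IS entries into one string, keeping NA as NA
result$IS <- vapply(result$IS, function(v) {
  if (all(is.na(v))) NA_character_ else paste(v, collapse = " ")
}, character(1))
result
```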
