Create a table from plain text with multi-line entries in R

Time: 2018-03-15 09:16:58

Tags: r read.table readr

I have a PDF file with a table. I extract the text with a PDF text-extraction function and get a character vector with several lines that represent the table.

My problem is that the columns are separated only by spaces, and some cells span multiple lines. I am looking for a strategy to get this into a data frame. I played around with readr::read_fwf(), but it complains that the number of elements per line does not match.

Here is the data:

my_lines <- c(
  "     Name                             Separator Description",
  "    Protein IDs                                Identifier(s) of protein(s) contained in the protein group. They",
  "                                               are sorted by number of identified peptides in descending",
  "                                               order.",
  "    Majority protein IDs                       These are the IDs of those proteins that have at least half of",
  "                                               the peptides that the leading protein has.",
  "    Peptide counts (all)                       Number of peptides associated with each protein in protein",
  "                                               group, occuring in the order as the protein IDs occur in the",
  "                                               'Protein IDs' column. Here distinct peptide sequences are",
  "                                               counted. Modified forms or different charges are counted as",
  "                                               one peptide."
)

EDIT
My expected output would be a data frame like this:

                  Name Separator
1          Protein IDs          
2 Majority protein IDs          
3 Peptide counts (all)          
                                                                                                                                                                                                                                                 Description
1                                                                                                                          Identifier(s) of protein(s) contained in the protein group. They are sorted by number of identified peptides in descending order.
2                                                                                                                                                  These are the IDs of those proteins that have at least half of the peptides that the leading protein has.
3 Number of peptides associated with each protein in protein group, occuring in the order as the protein IDs occur in the 'Protein IDs' column. Here distinct peptide sequences are counted. Modified forms or different charges are counted as one peptide.

EDIT2:
Playing with read_fwf, I got closer with the following code (I did not pay much attention to where the columns start and end, I just tested it).

writeLines(my_lines, 'test.txt')
readr::read_fwf('test.txt', 
                readr::fwf_positions(c(1, 30, 45), c(29, 42, 300), 
                                     c("Name", "Separator", "Description")),
                skip=1)

The problem here is that I get NAs in the Name column for the lines where it is empty. Because the Description column spans multiple lines, each of those lines would also need the value of Name, but it has none.

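2 Answers:

Answer 0: (score: 0)

A solution can be achieved using dplyr::summarise and tidyr::fill.

Approach: The positions of the header words in the first line x[1] (i.e. Separator, Description) can be treated as a guide for dividing the text in the subsequent lines. This rule is useful when extracting data from a table in a PDF. Use those positions to split each line into 3 columns and prepare a data.frame. Finally, apply a fill/summarise step to merge the continuation lines and get the desired result.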

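# Column boundaries taken from the header line x[1]
sep_start  <- regexpr("Separator", x[1])[1]
desc_start <- regexpr("Description", x[1])[1] - 1  # cell text starts one column before the header word

# Split every line into three columns at those positions
df <- rbind.data.frame(cbind(substr(x, 1, sep_start - 1), 
           substr(x, sep_start, desc_start - 1), 
           substr(x, desc_start, nchar(x))),
            stringsAsFactors = FALSE)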

#Rename columns
names(df) <- trimws(df[1,])
#remove 1st row
df <- df[-1,]

library(tidyverse)
df %>% mutate(Name = ifelse(trimws(Name) == "", NA, trimws(Name))) %>%
     fill(Name) %>%
     group_by(Name) %>%
     summarise(Description = paste(Description, collapse=""))


# Name                 Description                                                                                                                              
# <chr>                <chr>                                                                                                                                    
# 1 Majority protein IDs These are the IDs of those proteins that have at least half ofthe peptides that the leading protein has.                                 
# 2 Peptide counts (all) Number of peptides associated with each protein in proteingroup, occuring in the order as the protein IDs occur in the'Protein IDs' colu~
# 3 Protein IDs          Identifier(s) of protein(s) contained in the protein group. Theyare sorted by number of identified peptides in descendingorder. 
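The output above shows two side effects of the summarise step: the pieces are pasted without a separator ("half ofthe"), and the rows come back in alphabetical order of Name. The following is a minimal base R sketch of the same fill-and-collapse idea that joins pieces with a single space and keeps the original row order. The data frame `pieces` is a hypothetical stand-in for `df` after the split step, not the answer's actual object.

```r
# Hypothetical stand-in for `df` after the split: one row per text line,
# with Name left blank on continuation lines.
pieces <- data.frame(
  Name = c("Protein IDs", "", "", "Majority protein IDs", ""),
  Description = c(
    "Identifier(s) of protein(s) contained in the protein group. They",
    "are sorted by number of identified peptides in descending",
    "order.",
    "These are the IDs of those proteins that have at least half of",
    "the peptides that the leading protein has."
  ),
  stringsAsFactors = FALSE
)

# A record starts wherever Name is non-blank. cumsum() turns those starts
# into a running group id, so continuation lines inherit the id of the
# record above them (assumes the first row is a record start).
is_start <- nzchar(trimws(pieces$Name))
group_id <- cumsum(is_start)

# Join the pieces of each record with a single space.
joined <- vapply(split(trimws(pieces$Description), group_id),
                 paste, character(1), collapse = " ")

result <- data.frame(
  Name = trimws(pieces$Name[is_start]),
  Description = unname(joined),
  stringsAsFactors = FALSE
)
```

Because the group id increases with the line number, `result` lists the records in the order they appear in the text rather than sorted by name.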

Data

x <- c("     Name                             Separator Description", 
  "    Protein IDs                                Identifier(s) of protein(s) contained in the protein group. They", 
  "                                               are sorted by number of identified peptides in descending", 
  "                                               order.", 
  "    Majority protein IDs                       These are the IDs of those proteins that have at least half of", 
  "                                               the peptides that the leading protein has.", 
  "    Peptide counts (all)                       Number of peptides associated with each protein in protein", 
  "                                               group, occuring in the order as the protein IDs occur in the", 
  "                                               'Protein IDs' column. Here distinct peptide sequences are", 
  "                                               counted. Modified forms or different charges are counted as", 
  "                                               one peptide."
)

Answer 1: (score: 0)

Here is a base R option that loops over the lines of text:

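# `text` is the character vector of extracted lines (my_lines in the question)
text <- my_lines
df <- data.frame(col=character(), content=character(), stringsAsFactors=FALSE)
col <- ""
content <- ""
for (row in 2:length(text)) {    # row 1 is the header line
    if (grepl("^\\s{1,10}[^[:space:]]", text[row])) {
        # A lightly indented line starts a new table row: store the previous one
        if (content != "") {
            df <- rbind(df, data.frame(col, content, stringsAsFactors=FALSE))
        }
        col <- gsub("^\\s*(.*?)(\\s{10,}).*", "\\1", text[row], perl=TRUE)
        content <- gsub(".*\\s{10,}(.*)$", "\\1", text[row], perl=TRUE)
    } else {
        # A deeply indented line continues the current description
        content <- paste(content, gsub("^\\s+(.*)", "\\1", text[row]))
    }
}
# Store the last row, which the loop never flushes
df <- rbind(df, data.frame(col, content, stringsAsFactors=FALSE))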

                      col
1          Protein IDs
2 Majority protein IDs
3 Peptide counts (all)

content
1 Identifier(s) of protein(s) contained in the protein group. They are sorted by number of identified peptides in descending order.
2 These are the IDs of those proteins that have at least half of the peptides that the leading protein has.
3 Number of peptides associated with each protein in protein group, occuring in the order as the protein IDs occur in the 'Protein IDs' column. Here distinct peptide sequences are counted. Modified forms or different charges are counted as one peptide.
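The same indentation rule can also be applied without growing the data frame inside a loop. The sketch below is one possible vectorized variant, not the answerer's code: `collapse_table` and its `max_indent` argument are hypothetical names, and it assumes that the first line is a header, that the first body line starts a row, and that a name is separated from its description by two or more spaces.

```r
# Hypothetical helper: collapse an indented plain-text table into a data frame.
collapse_table <- function(lines, max_indent = 10) {
  body <- lines[-1]                                   # drop the header line
  stripped <- trimws(body)
  # Number of leading whitespace characters on each line
  indent <- nchar(body) - nchar(sub("^\\s+", "", body))
  # Lightly indented, non-empty lines start a new table row
  is_start <- indent <= max_indent & nzchar(stripped)

  # Name: text before the first run of 2+ spaces (start lines only)
  name <- sub("\\s{2,}.*$", "", stripped[is_start])
  # Description piece: text after that run on start lines, whole line otherwise
  piece <- ifelse(is_start,
                  sub("^.*?\\s{2,}", "", stripped, perl = TRUE),
                  stripped)

  # cumsum() gives each continuation line the id of the row above it
  row_id <- cumsum(is_start)
  text <- vapply(split(piece, row_id), paste, character(1), collapse = " ")
  data.frame(name = name, text = unname(text), stringsAsFactors = FALSE)
}

# Usage on a shortened copy of the sample lines
txt <- c(
  "     Name                             Separator Description",
  "    Protein IDs                                Identifier(s) of protein(s) contained in the protein group. They",
  "                                               are sorted by number of identified peptides in descending",
  "                                               order.",
  "    Majority protein IDs                       These are the IDs of those proteins that have at least half of",
  "                                               the peptides that the leading protein has."
)
out <- collapse_table(txt)
```

Classifying lines by indentation avoids hard-coded column positions, at the cost of relying on the table's names never containing two consecutive spaces.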
