I have a PDF file containing a table like this:
 Name                        Separator          Description
 Protein IDs                                   Identifier(s) of protein(s) contained in the protein group. They
                                               are sorted by number of identified peptides in descending
                                               order.
 Majority protein IDs                          These are the IDs of those proteins that have at least half of
                                               the peptides that the leading protein has.
 Peptide counts (all)                          Number of peptides associated with each protein in protein
                                               group, occuring in the order as the protein IDs occur in the
                                               'Protein IDs' column. Here distinct peptide sequences are
                                               counted. Modified forms or different charges are counted as
                                               one peptide.
I extract the text from it and get a character vector with several lines that represent the table.
My problem is that the columns are separated only by whitespace and some cells span multiple lines. I am looking for a strategy to get this into a data frame. I played around a bit with readr::read_fwf(), but it complains that the number of elements per line does not match.
Here is the data:
my_lines <- c(" Name                        Separator          Description",
              " Protein IDs                                   Identifier(s) of protein(s) contained in the protein group. They",
              "                                               are sorted by number of identified peptides in descending",
              "                                               order.",
              " Majority protein IDs                          These are the IDs of those proteins that have at least half of",
              "                                               the peptides that the leading protein has.",
              " Peptide counts (all)                          Number of peptides associated with each protein in protein",
              "                                               group, occuring in the order as the protein IDs occur in the",
              "                                               'Protein IDs' column. Here distinct peptide sequences are",
              "                                               counted. Modified forms or different charges are counted as",
              "                                               one peptide."
)
Edit
My expected output would be a data frame like this:
Name Separator
1 Protein IDs
2 Majority protein IDs
3 Peptide counts (all)
Description
1 Identifier(s) of protein(s) contained in the protein group. They are sorted by number of identified peptides in descending order.
2 These are the IDs of those proteins that have at least half of the peptides that the leading protein has.
3 Number of peptides associated with each protein in protein group, occuring in the order as the protein IDs occur in the 'Protein IDs' column. Here distinct peptide sequences are counted. Modified forms or different charges are counted as one peptide.
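EDIT2:
Playing around with readr::read_fwf(), I got closer with the code below (I did not pay much attention to the column start and end positions, I just tested it).
writeLines(my_lines, 'test.txt')
readr::read_fwf('test.txt',
                readr::fwf_positions(c(1, 30, 45), c(29, 42, 300),
                                     c("Name", "Separator", "Description")),
                skip = 1)
The problem here is that I get NA in the Name column for the lines where it is empty. Since the Description column spans several lines, each of those lines would also need the value of Name, but it does not have one.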
Answer 0 (score: 0)
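A solution can be achieved using tidyr::fill and dplyr::summarise.
Approach:
The positions of the header words (i.e. Separator, Description) in the first line, x[1], can be taken as a guide for dividing the text in the subsequent lines. This rule is useful when the data has been extracted from a table in a PDF. Use those positions to split each line into 3 columns and build a data.frame. Finally, fill in the missing names and merge/summarise the rows to get the desired result.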
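As a minimal sketch of that first step, assuming a toy header in which each column's text starts directly under its header word, the start positions can be read from the header with regexpr() and reused to cut any line. The header string, the helper split_line() and the sample line are hypothetical and only serve as illustration.

```r
# Hypothetical header; strrep() makes the column widths explicit.
header <- paste0(" Name", strrep(" ", 6), "Separator", strrep(" ", 3), "Description")
cols <- c("Name", "Separator", "Description")

# Start position of each header word within the header line.
starts <- vapply(cols, function(nm) regexpr(nm, header, fixed = TRUE)[1], integer(1))

# Cut one line at those positions; the last column runs to the end of the line.
split_line <- function(line) {
  ends <- c(starts[-1] - 1L, nchar(line))
  trimws(substring(line, starts, ends))
}

# Hypothetical body line with an empty Separator cell.
line <- paste0(" abc", strrep(" ", 19), "some text")
fields <- split_line(line)
```

With real PDF output the body text may be shifted by a character relative to the header, so the computed positions are worth checking against a few lines before relying on them.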
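# Column boundaries come from the header line x[1]: Name ends just before
# "Separator", Separator ends at the hard-coded position 47, and Description
# starts one character before its header word.
df <- rbind.data.frame(cbind(substr(x, 1, (regexpr("Separator", x[1])[1] - 1)),
                             substr(x, regexpr("Separator", x[1])[1], 47),
                             substr(x, (regexpr("Description", x[1])[1] - 1), nchar(x))),
                       stringsAsFactors = FALSE)
# Rename columns using the header row
names(df) <- trimws(unlist(df[1, ]))
# Remove the header row
df <- df[-1, ]
library(tidyverse)
# Blank names become NA and are filled downwards; then the description
# fragments belonging to each name are pasted together.
df %>% mutate(Name = ifelse(trimws(Name) == "", NA, trimws(Name))) %>%
  fill(Name) %>%
  group_by(Name) %>%
  summarise(Description = paste(Description, collapse = ""))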
# Name Description
# <chr> <chr>
# 1 Majority protein IDs These are the IDs of those proteins that have at least half ofthe peptides that the leading protein has.
# 2 Peptide counts (all) Number of peptides associated with each protein in proteingroup, occuring in the order as the protein IDs occur in the'Protein IDs' colu~
# 3 Protein IDs Identifier(s) of protein(s) contained in the protein group. Theyare sorted by number of identified peptides in descendingorder.
Data
x <- c(" Name                        Separator          Description",
       " Protein IDs                                   Identifier(s) of protein(s) contained in the protein group. They",
       "                                               are sorted by number of identified peptides in descending",
       "                                               order.",
       " Majority protein IDs                          These are the IDs of those proteins that have at least half of",
       "                                               the peptides that the leading protein has.",
       " Peptide counts (all)                          Number of peptides associated with each protein in protein",
       "                                               group, occuring in the order as the protein IDs occur in the",
       "                                               'Protein IDs' column. Here distinct peptide sequences are",
       "                                               counted. Modified forms or different charges are counted as",
       "                                               one peptide."
)
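In the output shown in this answer the fragments are joined without a space ("Theyare", "ofthe") and the rows come back sorted by name. The following base R sketch is one possible variant, not the answer's own method: it trims each fragment, joins with a single space and keeps the original row order. The toy input and all variable names are hypothetical, and the text is assumed to start directly under the header word.

```r
# Toy input (hypothetical): two records, the first spanning two lines.
# sprintf("%-10s%s", ...) pads the first column to exactly 10 characters.
lines <- sprintf("%-10s%s",
                 c("Name", "zeta", "", "alpha"),
                 c("Description", "first part of", "the zeta text", "alpha text"))

# Column start taken from the header line.
desc_start <- regexpr("Description", lines[1], fixed = TRUE)[1]

body <- lines[-1]
name <- trimws(substr(body, 1, desc_start - 1))
desc <- trimws(substr(body, desc_start, nchar(body)))

# Every non-empty name starts a new record; cumsum() numbers the records,
# so continuation lines share the id of the line that opened them.
record <- cumsum(name != "")

result <- data.frame(
  Name = name[name != ""],
  Description = unname(vapply(split(desc, record), paste, character(1), collapse = " ")),
  stringsAsFactors = FALSE
)
```

Grouping by a running record id rather than by name also keeps two rows apart if the same name happens to occur twice in the table.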
Answer 1 (score: 0)
Here is a base R option that loops over the lines of text:
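# `text` is assumed to be the character vector of extracted lines
# (my_lines in the question).
df <- data.frame(col = character(), content = character())
col <- ""
content <- ""
for (row in 2:length(text)) {  # row 1 is the header
  if (grepl("^\\s{1,10}[^[:space:]]", text[row])) {
    # At most 10 leading spaces: this line starts a new table row,
    # so store the previous one first (if there is one).
    if (content != "") {
      df <- rbind(df, data.frame(col, content))
    }
    # Name is the text before the run of 10+ spaces, content the text after it.
    col <- gsub("^\\s*(.*?)(\\s{10,}).*", "\\1", text[row], perl = TRUE)
    content <- gsub(".*\\s{10,}(.*)$", "\\1", text[row], perl = TRUE)
  } else {
    # Continuation line: append its text to the current cell.
    content <- paste(content, trimws(text[row]))
  }
}
# Store the last row.
df <- rbind(df, data.frame(col, content))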
col
1 Protein IDs
2 Majority protein IDs
3 Peptide counts (all)
content
1 Identifier(s) of protein(s) contained in the protein group. They are sorted by number of identified peptides in descending order.
2 These are the IDs of those proteins that have at least half of the peptides that the leading protein has.
3 Number of peptides associated with each protein in protein group, occuring in the order as the protein IDs occur in the 'Protein IDs' column. Here distinct peptide sequences are counted. Modified forms or different charges are counted as one peptide.
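To see how the patterns in the loop classify lines, here is a small check on two hypothetical sample lines, assuming names are indented by at most 10 spaces and are separated from the text by at least 10 spaces. The helper is_new_row() and the sample strings are illustrative only.

```r
# Hypothetical sample lines: one starts a new row, one continues a cell.
new_row      <- paste0(" Protein IDs", strrep(" ", 35), "Identifier(s) of protein(s)")
continuation <- paste0(strrep(" ", 47), "are sorted by number")

# A new row has 1 to 10 leading spaces followed by a non-space character.
is_new_row <- function(line) grepl("^\\s{1,10}[^[:space:]]", line)

starts_row <- is_new_row(new_row)
continues  <- !is_new_row(continuation)

# Name: text before the run of 10+ spaces; content: text after it.
name    <- gsub("^\\s*(.*?)(\\s{10,}).*", "\\1", new_row, perl = TRUE)
content <- gsub(".*\\s{10,}(.*)$", "\\1", new_row, perl = TRUE)
```

If a table has names longer than the gap allows, or cells separated by fewer than 10 spaces, the thresholds in both patterns would need adjusting.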