Question

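I need to count the number of pages for each agenda item. I have extracted the text of a PDF document into a data frame, where each row essentially holds one page of text. Here is my data: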

mydf <- data.frame(text = c("AGENDA ITEM 1
        4", "This particular row contains a lot of text, really its all text present in one page", 
        "So ineffect, one page of text per row", "This is another page of text in this row", 
        "lets include another page for agenda 1", "AGENDA ITEM 2
        9",
        "now all the text in agenda 2 is included here","the 2nd page text of agenda 2", 
        "AGENDA ITEM 3
        12", "Now lets just add one row for this agenda, meaning it only has one page inside it"))

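The number below the AGENDA ITEM text is the page number, and it sits in the same row as the heading. To count the pages for each agenda item, I just need to count the rows until the next AGENDA ITEM appears. For the example above, the answer should be: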

AGENDA ITEM 1 = 4 Pages, AGENDA ITEM 2 = 2 Pages and AGENDA ITEM 3 = 1 Page.

How would I go about this? I am new to analyzing text. Thanks.

Answer 1

If the pattern "AGENDA ITEM ##" does not occur in your regular page text, you can use grep() as follows. I hope this works for you.

# get the row numbers of all rows containing the heading pattern
start_rows <- grep("AGENDA ITEM \\d+", mydf$text)

# get the last row of each "AGENDA ITEM" chapter:
# a chapter ends one row before the next chapter starts, so drop the
# first start row and subtract 1 from the remaining ones;
# the final chapter ends with the last row of the data
end_rows <- c(start_rows[-1] - 1,
              length(mydf$text))

# the heading row itself is not a page, so the difference is the page count
end_rows - start_rows
#[1] 4 2 1
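
To attach the counts to their headings, the same start/end logic can be combined with the heading text. This is only a minimal sketch: it assumes every heading row begins with "AGENDA ITEM" followed by a number, and the `^` anchor, the shortened sample text, and the names `txt`, `headings` and `page_counts` are my own additions rather than part of the answer above.

```r
# Sample data shaped like the question's (page text shortened for illustration)
mydf <- data.frame(text = c("AGENDA ITEM 1\n        4", "page", "page", "page", "page",
                            "AGENDA ITEM 2\n        9", "page", "page",
                            "AGENDA ITEM 3\n        12", "page"),
                   stringsAsFactors = FALSE)

txt <- as.character(mydf$text)

# Rows that open an agenda item (assumption: the heading starts the row)
start_rows <- grep("^AGENDA ITEM \\d+", txt)

# Each item ends one row before the next heading; the last one ends at the last row
end_rows <- c(start_rows[-1] - 1, length(txt))

# Extract the heading text itself so it can serve as a label
headings <- regmatches(txt[start_rows],
                       regexpr("AGENDA ITEM \\d+", txt[start_rows]))

# Named vector: heading -> number of pages that follow it
page_counts <- setNames(end_rows - start_rows, headings)
page_counts
```

With names attached, a single item can be looked up directly, for example `page_counts["AGENDA ITEM 2"]`.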

Answer 2

You can use grep like this:

mydf <- data.frame(text = c("AGENDA ITEM 1
                            4", "This particular row contains a lot of text, really its all text present in one page", 
                            "So ineffect, one page of text per row", "This is another page of text in this row", 
                            "lets include another page for agenda 1", "AGENDA ITEM 2
                            9",
                            "now all the text in agenda 2 is included here","the 2nd page text of agenda 2", 
                            "AGENDA ITEM 3
                            12", "Now lets just add one row for this agenda, meaning it only has one page inside it"))

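lst <- as.character(mydf$text)
# rows that contain an agenda heading
index <- grep(pattern = "AGENDA ITEM", lst)
# append the last row so that diff() also covers the final agenda item
index <- c(index, length(lst))

pages <- diff(index)
# every difference except the last runs from one heading to the next and so
# includes the heading row itself; subtract 1 from those.
# seq_len() avoids relying on how 1:length(pages)-1 is parsed as (1:n) - 1
not_last <- seq_len(length(pages) - 1)
pages[not_last] <- pages[not_last] - 1
pages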

[1] 4 2 1
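
A variation on the same idea, offered only as a sketch and not as part of the answer above: label every row with the running number of the agenda item it belongs to, then count the rows per label. The names `txt`, `is_heading` and `item_id` and the shortened sample text are assumptions made for illustration.

```r
# One element per page, shaped like the question's data (text shortened)
txt <- c("AGENDA ITEM 1\n        4", "page", "page", "page", "page",
         "AGENDA ITEM 2\n        9", "page", "page",
         "AGENDA ITEM 3\n        12", "page")

# TRUE for rows that contain a heading
is_heading <- grepl("AGENDA ITEM", txt, fixed = TRUE)

# Running count of headings seen so far: 1 1 1 1 1 2 2 2 3 3
item_id <- cumsum(is_heading)

# Rows per item, minus the heading row itself
pages <- tabulate(item_id) - 1
pages
```

Because `tabulate()` only counts positive integers, any rows that come before the first heading (where `item_id` is 0) are left out of the result.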

Counting the number of pages for each AGENDA - text mining in R
