Extracting the lines before and after a keyword in PDFs using R

Time: 2017-04-14 15:25:37

Tags: r regex

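I want to use R to extract information related to the keyword "cancer" from a list of PDFs.

In the text file, I want to extract the lines or paragraphs before and after each line that contains the word "cancer".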

{{1}}

The regular expression above does not work properly.

1 Answer:

Answer 0 (score: 0):

Here is one approach:

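library(textreadr)  # read_pdf()
library(tidyverse)  # %>% and slice()

# Return the indices of the elements of `var` that match `regex`,
# plus the `n` elements before and after each match.
loc <- function(var, regex, n = 1, ignore.case = TRUE){
    locs <- grep(regex, var, ignore.case = ignore.case)
    # add every offset in -n..n to every match position
    out <- sort(unique(c(outer(locs, -n:n, "+"))))
    # drop indices that fall outside the vector
    out <- out[out > 0]
    out[out <= length(var)]
}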

doc <- 'https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf' %>%
    read_pdf() %>%
    slice(loc(text, 'cancer'))

doc

##    page_id element_id                                                                                                                  text
## 1       24         28                              Ranjit Shahani applauds the National Pharmaceuticals Policy's proposal of public/private
## 2       24         29                              partnerships (PPPs) to tackle life-threatening diseases such as cancer and HIV/AIDS, but
## 3       24         30                                stresses that, in order for them to work, they should be voluntary, and the government
## 4       25          8                         the availability of medicines to treat life-threatening diseases. It notes, for example, that
## 5       25          9                             while an average estimate of the value of drugs to treat the country's cancer patients is
## 6       25         10                             $1.11 billion, the market is in fact worth only $33.5 million. “The big gap indicates the
## 7       25         12                           because of the high cost of these medicines,” says the Policy, which also calls for tax and
## 8       25         13                                                                              excise exemptions for anti-cancer drugs.
## 9       25         14                       Another area for which PPPs are proposed is for drugs to treat HIV/AIDS, India's biggest health
## 10      32         19                              Variegate Trading, a UB subsidiary. The firm's major products are in the anti-infective,
## 11      32         20                               anti-inflammatory, cancer, diabetes and allergy market segments and, for the year ended
## 12      32         21                             December 31, 2005, it reported net sales (excluding excise duty) up 9.9 percent to $181.1
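
The same idea can be tried without a PDF. The following is a minimal sketch in base R, assuming the text is already available as a character vector with one line per element. The helper name context_idx and the sample lines are made up for illustration and do not come from the PDF above. Here n sets how many lines of context are kept on each side of a match.

```r
# Hypothetical helper: indices of lines matching `pattern`,
# plus `n` lines of context before and after each match.
context_idx <- function(lines, pattern, n = 1, ignore.case = TRUE) {
  hits <- grep(pattern, lines, ignore.case = ignore.case)
  if (length(hits) == 0) return(integer(0))
  # build the window i-n .. i+n around every hit
  idx <- unlist(lapply(hits, function(i) seq(i - n, i + n)))
  idx <- sort(unique(idx))
  # keep only indices that exist in `lines`
  idx[idx >= 1 & idx <= length(lines)]
}

# Made-up sample text, one element per line
lines <- c(
  "Intro line",
  "Policy overview",
  "Drugs to treat cancer patients",
  "Market size",
  "Unrelated line",
  "More unrelated text",
  "Tax exemptions for anti-cancer drugs",
  "Closing line"
)

# One line of context on each side of every match
one_each_side <- lines[context_idx(lines, "cancer", n = 1)]
# Two lines of context; overlapping windows are merged
two_each_side <- lines[context_idx(lines, "cancer", n = 2)]

print(one_each_side)
```

Because the indices are de-duplicated and sorted, overlapping windows collapse into a single run and each line appears at most once, in its original order.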