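I want to use R to extract information related to the keyword "cancer" from a list of PDFs.
In the extracted text, I want to pull out the lines or paragraphs that contain the word "cancer", together with the lines or paragraphs immediately before and after them.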
{{1}}
The regular expression above does not work as expected.
Answer 0 (score: 0)
Here is one approach:
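library(textreadr)  # read_pdf()
library(tidyverse)  # %>% and dplyr::slice()

# Return the indices of the elements of `var` that match `regex`, plus the
# `n` elements before and after each match, clipped to the bounds of `var`.
loc <- function(var, regex, n = 1, ignore.case = TRUE){
    locs <- grep(regex, var, ignore.case = ignore.case)
    out <- sort(unique(c(outer(locs, -n:n, "+"))))
    out <- out[out > 0]
    out[out <= length(var)]
}

# Read the PDF into a data frame and keep the matching rows with their context
doc <- 'https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf' %>%
    read_pdf() %>%
    slice(loc(text, 'cancer'))

doc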
##    page_id element_id text
## 1       24         28 Ranjit Shahani applauds the National Pharmaceuticals Policy's proposal of public/private
## 2       24         29 partnerships (PPPs) to tackle life-threatening diseases such as cancer and HIV/AIDS, but
## 3       24         30 stresses that, in order for them to work, they should be voluntary, and the government
## 4       25          8 the availability of medicines to treat life-threatening diseases. It notes, for example, that
## 5       25          9 while an average estimate of the value of drugs to treat the country's cancer patients is
## 6       25         10 $1.11 billion, the market is in fact worth only $33.5 million. “The big gap indicates the
## 7       25         12 because of the high cost of these medicines,” says the Policy, which also calls for tax and
## 8       25         13 excise exemptions for anti-cancer drugs.
## 9       25         14 Another area for which PPPs are proposed is for drugs to treat HIV/AIDS, India's biggest health
## 10      32         19 Variegate Trading, a UB subsidiary. The firm's major products are in the anti-infective,
## 11      32         20 anti-inflammatory, cancer, diabetes and allergy market segments and, for the year ended
## 12      32         21 December 31, 2005, it reported net sales (excluding excise duty) up 9.9 percent to $181.1
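
Since the question mentions a list of PDFs, the same idea can be applied to several documents in turn. The following is only a minimal sketch in base R: it assumes the text of each PDF has already been read into a character vector (one element per line), and `context_rows`, `docs`, and the file names `a.pdf` and `b.pdf` are made-up illustrations, not part of the original answer.

```r
# Hypothetical helper: indices of the matching lines plus `n` lines of
# context on each side, clipped to the valid range of `lines`.
context_rows <- function(lines, pattern, n = 1, ignore.case = TRUE) {
  hits <- grep(pattern, lines, ignore.case = ignore.case)
  idx <- sort(unique(c(outer(hits, -n:n, "+"))))
  idx[idx >= 1 & idx <= length(lines)]
}

# Made-up stand-ins for text already extracted from two PDFs.
docs <- list(
  a.pdf = c("Intro line", "Policy on cancer drugs", "Tax exemptions",
            "Unrelated", "Closing"),
  b.pdf = c("No match here", "Still nothing")
)

# Build one data frame per document and stack them; a document without
# any match contributes zero rows.
result <- do.call(rbind, lapply(names(docs), function(nm) {
  idx <- context_rows(docs[[nm]], "cancer")
  data.frame(file = rep(nm, length(idx)),
             line = idx,
             text = docs[[nm]][idx],
             stringsAsFactors = FALSE)
}))

print(result)
```

Keeping the file name and line index next to each extracted line makes it possible to trace a hit back to its source document. For real PDFs, each element of `docs` would be replaced by the text column returned by whatever PDF reader is used.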