Co-occurrence of words within sentences of a PDF using R (tm package?)

Time: 2018-10-22 22:22:53

Tags: r

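My goal is to write something in R that can scrape a PDF and pull out the places where two things are mentioned together, for example where vasopressin and the anterior hypothalamus are both mentioned in https://pdfs.semanticscholar.org/403c/fd873feb7055c9140b7abfa4584fa7ee1c7f.pdf. Most of the text-analysis tutorials I have found strip punctuation and everything else before the analysis, so there is no way to check whether things are mentioned in the same sentence. Is this possible?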

Thanks!

1 answer:

Answer 0 (score: 0)

You will probably have to elaborate and give real example data, but in principle this is doable. Here is an example that will hopefully help:

# here are some 'documents' -- just text strings
doc1 <- "hello. apple horse."
doc2 <- "hello. banana legislature"
doc3 <- "hello, apple banana. horse legislature"

# store them in a list...
list_of_docs <- list(doc1, doc2, doc3)

# ...so we can apply a custom function to this list
lapply(list_of_docs, function(d) {

  # split each document on the '.' character 
  # (fixed = TRUE means interpret this as plain text, not regex)
  phrases_in_d <- unlist(strsplit(d, '.', fixed = TRUE))

  # now here's a regex pattern to search for:
  #   apple followed by anything followed by banana, 
  #     OR 
  #   banana followed by anything followed by apple
  search_regex <- 'apple.*banana|banana.*apple'

  # grepl() returns a logical vector (TRUE or FALSE) to say if there's a match
  # for 'search regex' among 'phrases in document d'
  # any() returns true if any phrases match
  any(grepl(search_regex, phrases_in_d))
})

As you would expect, the result is a list of FALSE, FALSE, TRUE.
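To get closer to the PDF use case, the same idea can be wrapped in a function that returns the matching sentences themselves rather than just TRUE or FALSE. The following is only a minimal sketch in base R: the helper name `sentences_with_both` and the sample `pages` vector are made up for illustration, and the sentence splitter is naive (it will break on abbreviations such as "e.g. "). It assumes you already have the PDF text as a character vector with one element per page, which is what `pdftools::pdf_text()` returns if you have the pdftools package installed.

```r
# Hypothetical helper (not part of any package): return the sentences in
# `text` that mention both `term_a` and `term_b`, ignoring case.
sentences_with_both <- function(text, term_a, term_b) {
  # join pages into one string and collapse line breaks / repeated spaces
  text <- gsub("\\s+", " ", paste(text, collapse = " "))

  # naive sentence split: break on whitespace that follows '.', '!' or '?'
  # (the lookbehind keeps the punctuation attached to its sentence)
  sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))

  # plain-text, case-insensitive matching of each term
  lower <- tolower(sentences)
  has_a <- grepl(tolower(term_a), lower, fixed = TRUE)
  has_b <- grepl(tolower(term_b), lower, fixed = TRUE)

  sentences[has_a & has_b]
}

# made-up stand-in for text extracted from a PDF, one element per page
pages <- c(
  "Vasopressin acts in the anterior hypothalamus. Oxytocin was not measured.",
  "The anterior hypothalamus is small. Vasopressin is a peptide."
)

hits <- sentences_with_both(pages, "vasopressin", "anterior hypothalamus")
print(hits)
```

Only the first sentence of the sample text mentions both terms, so it is the only one returned. Matching on lower-cased plain text avoids having to escape regex metacharacters in the search terms; for real papers a dedicated sentence tokenizer would likely be more robust than this split.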