Extracting age from text in R

Time: 2018-08-07 11:29:56

Tags: r string stringr text-extraction information-extraction

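I have a .csv file with a column containing book descriptions scraped from the web, which I import into R for further analysis. My goal is to extract the protagonists' ages from this column in R, so what I have in mind is:

  1. Use a regular expression to match strings such as "age" and "years old".
  2. Copy the sentences containing those strings into a new column (so that I can make sure the sentence is not something like "In the Middle Ages, 50 people lived in xy").
  3. Extract the numbers (and, if possible, some number words) from that column into a new column.

The resulting table (or probably data.frame) would then hopefully look like this: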

|Description             |Sentence           |Age
|YY is a novel by Mr. X  |The 12-year-old boy| 12
|about a boy. The 12-year|is named Dave.     |
|-old boy is named Dave..|                   |

Any help would be much appreciated, as my R skills are still very limited and I have not found a solution to this problem!

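2 answers:

Answer 0: (score: 3)

Here is another option for when the string contains other numbers or descriptions besides the age, but you only want the age.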

library(stringr)
description <- "YY is a novel by Mr. X about a boy. The boy is 5 feet tall. The 12-year-old boy is named Dave. Dave is happy. Dave lives at 42 Washington street."
# Split the description on periods once, then keep the piece containing "-year-old"
sentences <- str_split(description, "\\.")[[1]]
sentence <- sentences[str_detect(sentences, "-year-old")]
> sentence 
[1] " The 12-year-old boy is named Dave"

age <- as.numeric(str_extract(description, "\\d+(?=-year-old)"))
> age
[1] 12

Here, we use the string "-year-old" to tell us which sentence to pull, and then extract the number that immediately precedes that string as the age.
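To apply the same idea to a whole column, as the question asks, something along these lines might work. It is only a sketch: the `books` data frame and its second row are made up for illustration, and the column names `Description`, `Sentence` and `Age` are taken from the table in the question.

```r
library(stringr)

# Hypothetical example data; column names follow the table in the question
books <- data.frame(
  Description = c(
    "YY is a novel by Mr. X about a boy. The 12-year-old boy is named Dave. Dave is happy.",
    "A story about a town. In the Middle Ages, 50 people lived there."
  ),
  stringsAsFactors = FALSE
)

# str_split() on a vector returns a list with one character vector per row
sentences <- str_split(books$Description, "\\.")

# Keep the first piece that contains "-year-old", or NA when there is none
books$Sentence <- vapply(sentences, function(s) {
  hit <- s[str_detect(s, "-year-old")]
  if (length(hit) > 0) str_trim(hit[1]) else NA_character_
}, character(1))

# The lookahead keeps only digits that are directly followed by "-year-old"
books$Age <- as.numeric(str_extract(books$Description, "\\d+(?=-year-old)"))
```

Rows without a "-year-old" phrase end up with NA in both new columns, so the "50 people" in the second description is not mistaken for an age. Note that splitting on every period also cuts abbreviations such as "Mr." into separate pieces.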

Answer 1: (score: 2)

You can try the following:

library(stringr)

description <- "YY is a novel by Mr. X about a boy. The 12-year-old boy is named Dave. Dave is happy."

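# Match from one period to the next, requiring at least one digit in between;
# the closing period is escaped so that it matches a literal "."
sentence <- str_extract(description, pattern = "\\.[^\\.]*[0-9]+[^\\.]*\\.") %>% 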
  str_replace("^\\. ", "")
> sentence
[1] "The 12-year-old boy is named Dave."

age <- str_extract(sentence, pattern = "[0-9]+")
> age
[1] "12"
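The question also mentions number words. One possible way to cover them is a small lookup table combined with the "-year-old" lookahead. This is a sketch under assumptions: `number_words` and `extract_age` are hypothetical names introduced here, and the table only covers the words listed in it.

```r
library(stringr)

# Hypothetical lookup table; extend it with whatever number words the data contains
number_words <- c(one = 1, two = 2, three = 3, four = 4, five = 5, six = 6,
                  seven = 7, eight = 8, nine = 9, ten = 10, eleven = 11, twelve = 12)

extract_age <- function(text) {
  # Match digits or one of the listed words, but only directly before "-year-old"
  pattern <- paste0("(?i)\\b(\\d+|", paste(names(number_words), collapse = "|"),
                    ")(?=-year-old)")
  token <- str_to_lower(str_extract(text, pattern))
  # Use the digits when present, otherwise look the word up in the table
  ifelse(str_detect(token, "^\\d+$"),
         suppressWarnings(as.numeric(token)),
         unname(number_words[token]))
}

ages <- extract_age(c("The 12-year-old boy is named Dave.",
                      "A twelve-year-old girl moves to town.",
                      "In the Middle Ages, 50 people lived there."))
```

With these three inputs, `ages` should be 12, 12 and NA. Unlike the code above, this returns a numeric value rather than the string "12".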