Extract words from text and create a vector from them

Time: 2019-11-26 13:17:37

Tags: r regex gsub text-processing stringr

Suppose I have a txt file containing the following text:

Type: fruits
Title: retail
Date: 2015-11-10
Country: UK
Products:
  apple,
  passion fruit,
  mango
Documents: NDA
Export: 2.10

I read this file with the readLines function. I would then like to have a vector that looks like this:

x <- c("fruits", "apple", "passion fruit", "mango")

So I want to extract the word after "Type:" and all the words between "Products:" and "Documents:". How can I do this? Thanks!

2 Answers:

Answer 0: (score: 1)

As it stands, the file looks like YAML, so you could read it with the package of the same name, for example:

library(yaml)
info <- yaml::read_yaml("your file.txt")
# strsplit - split either side of the commas
# unlist - convert to vector
# trimws - remove trailing and leading white space
out <- trimws(unlist(strsplit(info$Products, ",")))

The other entries are available as list elements of info under their respective names, e.g. info$Type
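To get the exact vector asked for in the question, the two pieces can be combined. The following is a minimal sketch, assuming the yaml package is installed and that the file content parses as YAML as described above. The text is inlined via yaml::yaml.load so the example does not depend on a file on disk; the names txt, products and x are only illustrative.

```r
library(yaml)  # assumption: the yaml package is installed

# Inline copy of the file content from the question
txt <- "Type: fruits
Title: retail
Date: 2015-11-10
Country: UK
Products:
  apple,
  passion fruit,
  mango
Documents: NDA
Export: 2.10
"

# Parse the text; the result is a named list with one element per key
info <- yaml::yaml.load(txt)

# Split the Products entry on commas and strip surrounding whitespace
products <- trimws(unlist(strsplit(info$Products, ",")))

# Put the Type value in front of the product names
x <- c(info$Type, products)
print(x)
```

With a real file, yaml::read_yaml("your file.txt") would take the place of yaml::yaml.load(txt).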

Answer 1: (score: 0)

There may be a better solution, but in case it helps, you could try this. If you have a vector like this:

vec <- readLines("path\\file.txt")

where the file contains the text you posted, you can try the following:

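# collapse the remaining runs of three spaces into a single space
gsub("   "," ",
     # turn the first space (after the Type value) into ", "
     sub(" ",", ",
       # drop everything except the Type value and the product list
       gsub(".*Type:\\s*|Title.*Products:\\s*| Documents.*", "",
           # collapse all lines into a single string
           paste0(vec, collapse = " "))))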
[1] "fruits, apple, passion fruit, mango"
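Note that the result above is a single string, while the question asks for a vector with one element per word. A minimal sketch of the extra step, assuming the ", " separator produced above and using strsplit from base R (the names res and x are only illustrative):

```r
# Same lines as readLines would return for the file in the question
vec <- c("Type: fruits", "Title: retail", "Date: 2015-11-10", "Country: UK",
         "Products:", "  apple,", "  passion fruit,", "  mango",
         "Documents: NDA", "Export: 2.10")

# The nested gsub/sub call from the answer, stored in a variable
res <- gsub("   ", " ",
            sub(" ", ", ",
                gsub(".*Type:\\s*|Title.*Products:\\s*| Documents.*", "",
                     paste0(vec, collapse = " "))))

# Split on the literal ", " separator; strsplit returns a list,
# so take its first (and only) element to get a character vector
x <- strsplit(res, ", ", fixed = TRUE)[[1]]
print(x)
```

This relies on every product line being indented by exactly two spaces, as in the posted file; other indentation would need a different whitespace pattern.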

Here is the output of dput(vec), to make the code reproducible:

c("Type: fruits", "Title: retail", "Date: 2015-11-10", "Country: UK", 
"Products:", "  apple,", "  passion fruit,", "  mango", "Documents: NDA", 
"Export: 2.10")