Question

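I am trying to extract numbers and dates from text using R. Say I have a vector of text strings, V.text. The strings are sentences that contain numbers and dates. For example: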

"listed on 2/14/2015 for 150000 and sold for $160,000 on 3/1/2015"

I want to extract the numbers and the dates as separate vector components, so the output would be two vectors:

1  150000 160000
2  2/14/2015 3/1/2015

I tried using scan() but could not get the result I wanted. I would appreciate any help.

Answer 1

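First split the string into "words". The words that contain a slash are the dates, and the words made up only of $, digits, or commas are the numbers. For the numbers, strip the non-digit characters and convert to numeric: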

x <- "listed on 2/14/2015 for 150000 and sold for $160,000 on 3/1/2015"
s <- strsplit(x, " ")[[1]]

grep("/", s, value = TRUE) # dates
## [1] "2/14/2015" "3/1/2015" 

as.numeric(gsub("\\D", "", grep("^[$0-9,]+$", s, value = TRUE)))
## [1] 150000 160000
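
Since the question starts from a vector of strings, the same idea can be wrapped in a function and applied to each element. This is only a sketch: `extract_parts` is a hypothetical helper name, and the second sentence in `V.text` is made up for illustration.

```r
# Hypothetical helper: applies the split-and-classify idea to one string.
extract_parts <- function(x) {
  s <- strsplit(x, " ", fixed = TRUE)[[1]]
  # words with a slash are treated as dates
  dates <- grep("/", s, value = TRUE)
  # words made only of $, digits, or commas are treated as numbers
  nums <- as.numeric(gsub("\\D", "", grep("^[$0-9,]+$", s, value = TRUE)))
  list(numbers = nums, dates = dates)
}

# V.text is the vector name from the question; the second element is invented.
V.text <- c(
  "listed on 2/14/2015 for 150000 and sold for $160,000 on 3/1/2015",
  "listed on 5/3/2014 for $99,500"
)

# One list(numbers = ..., dates = ...) per input string.
res <- lapply(V.text, extract_parts)
res[[1]]$numbers
res[[1]]$dates
```

Each element of `res` holds the two vectors for the corresponding sentence, so strings with different counts of numbers or dates are kept apart.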

If negative or decimal numbers are possible, change the last line of code to:

as.numeric(gsub("[^-0-9.]", "", grep("^-?[$0-9,.]+$", s, value = TRUE)))
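
To see what that variant does, here is a small sketch on a made-up sentence (not from the original post) that contains a negative amount and a decimal amount:

```r
# Invented example with a negative and a decimal value.
y <- "balance moved from -1,250.75 to $300.5 on 4/2/2015"
s <- strsplit(y, " ")[[1]]

# Keep words made of an optional leading minus plus $, digits, commas, or dots,
# then drop everything except minus signs, digits, and dots before converting.
vals <- as.numeric(gsub("[^-0-9.]", "", grep("^-?[$0-9,.]+$", s, value = TRUE)))
vals
```

The date is skipped because its slashes are not in the allowed character set.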

Answer 2

How about:

txt <- "listed on 2/14/2015 for 150000 and sold for $160,000 on 3/1/2015"
lapply(c('[0-9,]{5,}',
         '[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}'),
       function(re) {
           matches <- gregexpr(re, txt)
           gsub(',', '', regmatches(txt, matches)[[1]])
       })
## [[1]]
## [1] "150000" "160000"
## [[2]]
## [1] "2/14/2015" "3/1/2015"

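(The first pattern, the one for the numbers, assumes they are 5 or more characters long, counting digits and commas. If your numbers are shorter, this simple regex will collide with the year part of the dates.)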
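
One possible way around that collision, offered as a sketch rather than as part of the answer above, is to blank out the dates first and only then look for numbers. The sentence and the variable names `date_re` and `no_dates` are assumptions made for this example.

```r
# Invented sentence with numbers shorter than five characters.
txt <- "listed on 2/14/2015 for 950 and sold for $1,200 on 3/1/2015"
date_re <- "[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}"

# Pull the dates out exactly as in the answer above.
dates <- regmatches(txt, gregexpr(date_re, txt))[[1]]

# Replace every date with a space, so the years can no longer be matched.
no_dates <- gsub(date_re, " ", txt)

# Any remaining run that starts with a digit and continues with digits or commas.
nums <- regmatches(no_dates, gregexpr("[0-9][0-9,]*", no_dates))[[1]]
nums <- as.numeric(gsub(",", "", nums, fixed = TRUE))
```

With the dates gone, the length restriction on the number pattern is no longer needed.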

Answer 3

A quick and dirty approach:

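x <- "listed on 2/14/2015 for 150000 and sold for $160,000 on 3/1/2015"
# dates: one or two digits, a slash, one or two digits, a slash, four digits
mydate <- regmatches(x, gregexpr("\\d{1,2}/\\d{1,2}/\\d{4}", x, perl = TRUE))
# numbers: drop every comma first, then match six consecutive digits
x2 <- gsub(",", "", x, fixed = TRUE)
mynumber <- regmatches(x2, gregexpr("\\d{6}", x2, perl = TRUE))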
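
Because `gregexpr` and `regmatches` accept a whole character vector, the same two patterns can be run over every sentence at once. The following is a sketch under the same six-digit assumption; the second sentence in `V.text` is invented.

```r
# V.text is the name used in the question; the second sentence is made up.
V.text <- c(
  "listed on 2/14/2015 for 150000 and sold for $160,000 on 3/1/2015",
  "relisted on 11/20/2015 for 175,000"
)

# One list element of date strings per input sentence.
date_list <- regmatches(V.text, gregexpr("\\d{1,2}/\\d{1,2}/\\d{4}", V.text, perl = TRUE))

# Remove all commas, then collect six-digit runs per sentence.
clean <- gsub(",", "", V.text, fixed = TRUE)
number_list <- regmatches(clean, gregexpr("\\d{6}", clean, perl = TRUE))
```

Both results are lists with one element per input string, and the numbers are still character strings until passed through `as.numeric`.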

You can run the code above in r-fiddle.
