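I have some articles in .rtf files and I want to extract the dates from them. The articles look like this:
The first line is the title, followed by a blank line. Then it lists the following, each on its own line:
I tried the code below, but it doesn't seem to work. The problem appears to be in the date extraction step.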
##First I read the file using this code:
htmlText <- read_file(paste("/Users/adhyantarahma/Desktop/Factiva-20190905-0316.rtf"))
##then I removed new lines tags
removeNewLines <- gsub("\n"," ",htmlText)
##and I changed " to ' in text
cleanLines <- gsub("\"", "'", removeNewLines)
print(cleanLines)
##the relevant part of cleanLines look like this
#\\ 347 words\\ 9 April 2016\\ FARS News Agency\\ FARSNA\\ English\\
##then I used this to extract date
date <- str_extract_all(htmlText, "words \\d{1,2} [A-Z][a-z]+ \\d{4}")[[1]]
But it doesn't seem to pick up the date. Whenever I run it, the result is always empty (no characters).
What should I do to pick up the date?
Answer 0 (score: 0)
I followed your approach and only modified it slightly.
library(readr)
library(stringr)
htmlText <- read_file("Factiva-20190905-0316.rtf")
# -------------------------------------------------------------------------
# output -- htmlText
# [1] "347 words\n9 April 2016\nFARS News Agency\nFARSNA\nEnglish\n"
# -------------------------------------------------------------------------
#replace "\n" with a space
removeNewLines <- gsub("\n"," ",htmlText)
# -------------------------------------------------------------------------
# output -- removeNewLines
# [1] "347 words 9 April 2016 FARS News Agency FARSNA English "
# -------------------------------------------------------------------------
# extract the date from removeNewLines, matching only the date itself
my_date <- str_extract_all(removeNewLines, "\\d{1,2} [A-Z][a-z]+ \\d{4}")[[1]]
# output -- my_date
#[1] "9 April 2016"
# -------------------------------------------------------------------------
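A likely reason the original pattern found nothing, judging from the printed `cleanLines` in the question: each field is followed by a literal backslash, so the text contains `words\ 9 April 2016` rather than `words 9 April 2016`, and the pattern `"words \\d{1,2} ..."` can never match. The question's code also passes `htmlText` (which still contains the newlines) instead of `cleanLines`. The sketch below reproduces this on a hypothetical inline string that mimics the question's output, so no file is needed. The names `clean_lines`, `date_pattern`, `no_match`, and `parsed` are illustrative only. Converting the month via `month.name` is one assumption-free way to get a `Date` without depending on the system locale.

```r
library(stringr)

# Hypothetical sample mimicking the printed cleanLines from the question.
# In an R string literal, "\\" is a single literal backslash.
clean_lines <- "\\ 347 words\\ 9 April 2016\\ FARS News Agency\\ FARSNA\\ English\\"

# The question's pattern expects "words" + space + digits, but the text has
# "words" + backslash + space + digits, so nothing matches.
no_match <- str_extract_all(clean_lines, "words \\d{1,2} [A-Z][a-z]+ \\d{4}")[[1]]
no_match
# character(0)

# Matching only the date avoids depending on whatever separates the fields.
date_pattern <- "\\d{1,2} [A-Z][a-z]+ \\d{4}"
my_dates <- str_extract_all(clean_lines, date_pattern)[[1]]
my_dates
# [1] "9 April 2016"

# Optional: turn the matches into Date objects. month.name is a base R
# constant with English month names, so this does not depend on the locale.
parts <- str_match(my_dates, "(\\d{1,2}) ([A-Z][a-z]+) (\\d{4})")
parsed <- as.Date(ISOdate(
  year  = as.integer(parts[, 4]),
  month = match(parts[, 3], month.name),
  day   = as.integer(parts[, 2])
))
```

If the surrounding context matters (for example, only dates that directly follow the word count), the separator could be made explicit in the pattern instead, e.g. `"words\\\\ \\d{1,2} [A-Z][a-z]+ \\d{4}"`, where `\\\\` in the R string stands for one literal backslash in the text. This again assumes the file really contains that backslash.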