Question

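I have some articles in .rtf format and I want to extract the dates from them. The articles look like this:

The first line is the headline, followed by a blank line. Then it lists the following, each on its own line:

Word count
Date
News agency
News agency abbreviation
Language
Copyright information

I tried the following code, but it does not seem to work. The problem appears to be in extracting the date.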

##First I read the file using this code: 
htmlText <- read_file(paste("/Users/adhyantarahma/Desktop/Factiva-20190905-0316.rtf"))

##then I removed the newline characters
removeNewLines <- gsub("\n"," ",htmlText) 

##and I changed " to ' in text
cleanLines <- gsub("\"", "'", removeNewLines) 

print(cleanLines)

##the relevant part of cleanLines looks like this
#\\ 347 words\\ 9 April 2016\\ FARS News Agency\\ FARSNA\\ English\\

##then I used this to extract date 
date <- str_extract_all(htmlText, "words \\d{1,2} [A-Z][a-z]+ \\d{4}")[[1]]

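But it does not seem to pick up the date. When I run it, it always returns an empty result with no characters.

What should I do to pick up the date?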

Answer 1

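I followed your approach and changed it only slightly. Your pattern requires "words " immediately before the day, but in your text "words" is followed by a backslash rather than a space, and the pattern was run on htmlText instead of cleanLines. Dropping the "words " prefix and searching the cleaned string lets the date match.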

library(readr)
library(stringr)
htmlText <- read_file("Factiva-20190905-0316.rtf")


# Output of htmlText:
# [1] "347 words\n9 April 2016\nFARS News Agency\nFARSNA\nEnglish\n"

# Replace each "\n" with a space
removeNewLines <- gsub("\n", " ", htmlText)

# Output of removeNewLines:
# [1] "347 words 9 April 2016 FARS News Agency FARSNA English "

# Extract the date from removeNewLines
my_date <- str_extract_all(removeNewLines, "\\d{1,2} [A-Z][a-z]+ \\d{4}")[[1]]

# Output of my_date:
# [1] "9 April 2016"
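If you also need the result as a Date rather than a string, here is a minimal sketch using base R only, so it runs without readr or stringr. The sample string is copied from the cleanLines output in the question. The names clean_lines, date_pattern, date_strings, to_iso and parsed_dates are my own placeholders, and I am assuming the month is always a full English month name.

```r
# Sample text copied from the question's printed cleanLines
# (each "\\" in the R literal is a single backslash in the string)
clean_lines <- "\\ 347 words\\ 9 April 2016\\ FARS News Agency\\ FARSNA\\ English\\"

# Day (1-2 digits), capitalised month name, 4-digit year
date_pattern <- "[0-9]{1,2} [A-Z][a-z]+ [0-9]{4}"

# gregexpr() + regmatches() return every match, similar to str_extract_all()
date_strings <- regmatches(clean_lines, gregexpr(date_pattern, clean_lines))[[1]]
print(date_strings)
# [1] "9 April 2016"

# Build an ISO date string by hand so the result does not depend on the
# session locale (as.Date() with "%B" would); month.name is a base R constant
# (assumption: months are spelled out in full English)
to_iso <- function(d) {
  p <- strsplit(d, " ")[[1]]
  sprintf("%s-%02d-%02d", p[3], match(p[2], month.name), as.integer(p[1]))
}

parsed_dates <- as.Date(vapply(date_strings, to_iso, character(1), USE.NAMES = FALSE))
print(parsed_dates)
```

The same pattern works on the raw file contents as well, because it no longer depends on what sits between "words" and the day.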

How to extract a date from an .rtf file in R
