Splitting a character vector containing an annual report into sentences

Time: 2017-10-22 08:38:37

Tags: r

I have read Microsoft's 2016 annual report filing into R. Now I want to clean up the file and split it into sentences. I used the following code:

MSFT <- paste(readLines("https://www.sec.gov/Archives/edgar/data/789019/000156459017014900/0001564590-17-014900.txt"), collapse = " ")

Can someone help me?

1 answer:

Answer 0: (score: 1)

Here is one way you could try:

MSFT <- paste(readLines("https://www.sec.gov/Archives/edgar/data/789019/000156459017014900/0001564590-17-014900.txt"), collapse = " ")
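For reference, `readLines()` returns one vector element per line, and `paste(..., collapse = " ")` joins those elements into a single string. A minimal sketch on an in-memory stand-in for the filing (the three lines fed to `textConnection()` are made up for illustration):

```r
# Hypothetical stand-in for the downloaded filing: three lines of text
con <- textConnection(c("<body>", "<p>First line.</p>", "</body>"))
toy_lines <- readLines(con)
close(con)

# readLines() gives one element per input line
length(toy_lines)

# Join all lines into one string, separated by single spaces
toy_text <- paste(toy_lines, collapse = " ")
toy_text
```

Working on one long string is what lets the later `gsub()` calls match tags that span several original lines.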

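First, discard everything that is not inside the body HTML tags (assumption: nothing outside them is needed).

#Keep only the text from the first "<body" up to the first "</body"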
MSFT_body <- substr(MSFT, gregexpr("<body", MSFT)[[1]], gregexpr("</body", MSFT)[[1]])
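As far as I can tell, `substr()` only uses the first start and stop position when its input is a single string, so the line above keeps just the first body block. If the file bundles several documents, each with its own body, the sketch below is one possible way to collect all of them. The `toy` string is made up, and this variation is an assumption rather than part of the original answer:

```r
# Made-up input with two <body> blocks, standing in for a multi-document filing
toy <- "<html><body>First doc.</body></html> <html><body class='x'>Second doc.</body></html>"

# Find every non-greedy <body ...> ... </body> span
# (perl = TRUE selects PCRE, which supports the lazy quantifier .*?)
matches <- gregexpr("<body.*?</body>", toy, perl = TRUE)
bodies <- regmatches(toy, matches)[[1]]

# Join the blocks back into one string for the later cleaning steps
all_bodies <- paste(bodies, collapse = " ")
all_bodies
```

This relies on the text already being collapsed into one line, because `.` does not match a newline in PCRE by default.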

Within the body, remove everything between < and >, along with character entities, to strip out the HTML, CSS, and so on:

#Remove all html tags and characters
MSFT_body_html_removed <- gsub("<.*?>|&[A-Za-z]+;|&#[0-9]+;", "", MSFT_body)

Replace every run of whitespace (spaces, newlines, tabs, ...) with a single space:

#Remove all whitespace and replace with space
MSFT_body_html_removed <- gsub("\\s+", " ", MSFT_body_html_removed)
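Deleting tags and entities outright can fuse neighbouring words, which may explain tokens such as "Section12(b)" and "Acceleratedfiler" in the output further down. One possible variation, not part of the original answer, is to substitute a space first and let the whitespace step tidy up afterwards. The `toy_html` fragment is made up for illustration:

```r
# Made-up fragment imitating the filing's markup
toy_html <- "<p>Section&nbsp;12(b)</p><p>Accelerated&nbsp;filer</p>"

# Same pattern as in the answer: tags, named entities, numeric entities
pattern <- "<.*?>|&[A-Za-z]+;|&#[0-9]+;"

# Original approach: delete matches outright, which glues words together
fused <- gsub(pattern, "", toy_html)

# Variation: substitute a space, then collapse whitespace runs and trim the ends
spaced <- gsub(pattern, " ", toy_html)
spaced <- trimws(gsub("\\s+", " ", spaced))

fused
spaced
```

The trade-off is that a tag placed in the middle of a word would now split that word, so it is worth checking a sample of the result either way.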

You can then use the pre-trained openNLP sentence tokenizer to find the sentences:

#Define function to tokenise text to sentences
sentence_tokeniser <- openNLP::Maxent_Sent_Token_Annotator(language = "en")

#convert to String class
text <- NLP::as.String(MSFT_body_html_removed)

Apply the tokenizer to the text with annotate:

#Annotate text
annotated_sentences <- NLP::annotate(text, sentence_tokeniser)

Extract the sentences:

#extract sentences
sentences <- text[annotated_sentences]
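If openNLP (which needs Java) is not available, a rough base-R fallback is to split on sentence-ending punctuation followed by whitespace and a capital letter. This is a swapped-in heuristic, not the Maxent model used above: it will split wrongly after abbreviations such as "U.S." when a capitalised word follows. The helper name `split_sentences` and the sample text are hypothetical:

```r
# Hypothetical helper: split where ., ! or ? is followed by whitespace
# and then an upper-case letter (lookarounds need perl = TRUE)
split_sentences <- function(x) {
  strsplit(x, "(?<=[.!?])\\s+(?=[A-Z])", perl = TRUE)[[1]]
}

# Made-up text; the real input would be the cleaned filing string
toy_text <- "The company filed its report. Revenue grew in 2017. Did costs rise? Yes."
toy_sentences <- split_sentences(toy_text)
toy_sentences
```

Because the punctuation sits in a lookbehind, it stays attached to the sentence it ends instead of being consumed by the split.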

Print the first 5 sentences:

# print first 5 sentences (or fewer, if the text has fewer than 5)
for (i in seq_len(min(5, length(sentences)))) {
  print(paste("Sentence", i))
  cat(paste(sentences[i], "\n"))
}

This gives you:

[1] "Sentence 1"
UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 FORM 10-K ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Fiscal Year Ended June 30, 2017 OR TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Transition Period From to Commission File Number 001-37845 MICROSOFT CORPORATION WASHINGTON 91-1144442 (STATE OF INCORPORATION) (I.R.S. ID) ONE MICROSOFT WAY, REDMOND, WASHINGTON 98052-6399 (425) 882-8080 www.microsoft.com/investor 
[1] "Sentence 2"
Securities registered pursuant to Section12(b) of the Act: COMMON STOCK, $0.00000625 par value per share NASDAQ Securities registered pursuant to Section12(g) of the Act: NONE Indicate by check mark if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act.YesNo Indicate by check mark if the registrant is not required to file reports pursuant to Section13 or Section15(d) of the Exchange Act.YesNo Indicate by check mark whether the registrant (1)has filed all reports required to be filed by Section13 or 15(d) of the Securities Exchange Act of 1934 during the preceding 12 months (or for such shorter period that the registrant was required to file such reports), and (2)has been subject to such filing requirements for the past 90 days.YesNo Indicate by check mark whether the registrant has submitted electronically and posted on its corporate website, if any, every Interactive Data File required to be submitted and posted pursuant to Rule 405 of Regulat... <truncated>
[1] "Sentence 3"
Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company, or an emerging growth company. 
[1] "Sentence 4"
See the definitions of large accelerated filer, accelerated filer, smaller reporting company, and emerging growth company in Rule12b-2 of the Exchange Act. 
[1] "Sentence 5"
Large accelerated filer Acceleratedfiler Non-acceleratedfiler (Donotcheckifasmallerreportingcompany) Smallerreportingcompany Emerging growth company If an emerging growth company, indicate by check mark if the registrant has elected not to use the extended transition period for complying with any new or revised financial accounting standards provided pursuant to Section 13(a) of the Exchange Act.