我有一个语料库对象,我想从中提取数据,因此我可以将它们添加为docvar。
对象看起来像这样
v1 <- c("(SE22-y -7 A go q ,, Document of The World Bank FOR OFFICIAL USE ONLY il I ( >I8.( )]i 1 t'f-l±E C 4'( | Report No. 9529-LSO l il .rt N ,- / . t ,!I . 1. 'i 1( T v f) (: AR.) STAFF APPRAISAL REPORT KINGDOM OF LESOTHO EDUCATION SECTOR DEVELOPMENT PROJECT JUNE 19, 1991 Population and Human Resources Division Southern Africa Department This document has a restricted distribution and may be used by reipients only in the performance of their official duties. Its contents may not otherwise be disclosed without World Bank authorization.",
"Document of The World Bank Report No. 13611-PAK STAFF APPRAISAL REPORT PAKISTAN POPULATION WELFARE PROGRAM PROJECT FREBRUARY 10, 1995 Population and Human Resources Division Country Department I South Asia Region",
"I Toward an Environmental Strategy for Asia A Summary of a World Bank Discussion Paper Carter Brandon Ramesh Ramankutty The World Bank Washliington, D.C. (C 1993 The International Bank for Reconstruction and Development / THiE WORLD BANK 1818 H Street, N.W. Washington, D.C. 20433 All rights reserved Manufactured in the United States of America First printing November 1993",
"Report No. PID9188 Project Name East Timor-TP-Emergency School (@) Readiness Project Region East Asia and Pacific Region Sector Other Education Project ID TPPE70268 Borrower(s) EAST TIMOR Implementing Agency Address UNTAET (UN TRANSITIONAL ADMINISTRATION FOR EAST TIMOR) Contact Person: Cecilio Adorna, UNTAET, Dili, East Timor Fax: 61-8 89 422198 Environment Category C Date PID Prepared June 16, 2000 Projected Appraisal Date May 27, 2000 Projected Board Date June 20, 2000",
"Page 1 CONFORMED COPY CREDIT NUMBER 2447-CHA (Reform, Institutional Support and Preinvestment Project) between PEOPLE'S REPUBLIC OF CHINA and INTERNATIONAL DEVELOPMENT ASSOCIATION Dated December 30, 1992")
c1 <- corpus(v1)
我想要做的第一件事就是提取第一个出现的日期,大部分时间是“月份年”(1990年12月)或“月日,年”(1991年6月19日),或者是1995年10月10日的拼写错误。在这种情况下,月份可以被丢弃。
我的代码是一个组合 Extract date text from string &安培; Extract Dates in any format from Text in R:
lapply(c1$documents$texts, function(x) anydate(str_extract_all(c1$documents$texts, "[[:alnum:]]+[ /]*\\d{2}[ /]*\\d{4}")))
并收到错误:
Error in anytime_cpp(x = x, tz = tz, asUTC = asUTC, asDate = TRUE, useR = useR, : Unsupported Type
但是,我不知道如何提供日期格式。此外,我真的不知道如何编写正确的正则表达式。
https://www.regular-expressions.info/dates.html&amp; https://www.regular-expressions.info/rlanguage.html
关于这个问题的其他问题是: Extract date from text
Need to extract date from a text file of strings in R
http://r.789695.n4.nabble.com/Regexp-extract-first-occurrence-of-date-in-string-td997254.html