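Extracting patterns with regexpr() and regmatches()

Time: 2019-04-07 23:11:28

Tags: r

I am trying to split each string into three parts: the name, the time (day and time of day), and the generic text. The data initially looks like this: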

# Each element of c() must be separated by a comma
data <- c(
  "JENNIFER [Day 1, 9:00 A.M.]: Generic text, it doesn't matter what is going on here. There are more than 2 lines.",
  "SAM [Day 2, 10:15 A.M.]: This doesn't matter. It has a lot of lines.",
  "DAN'S [Day 4, 12:00 P.M.]: It doesn't really matter what's going on in this part."
)

I was able to extract the first part of each string, NAME [TIME]:, but I am struggling to separate NAME from TIME.

match = regexpr("^[A-Z].*:", data)
regmatches(data, match)

This gives me:

JENNIFER [Day 1, 9:00 A.M.]:
SAM [Day 2, 10:15 A.M.]:
DAN'S [Day 4, 12:00 P.M.]:

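I can see that the names are all in uppercase, so I would use "^[A-Z]", but that would also match every other sentence that starts with a capital letter.

I want to create a data frame like this: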

   Name           Date             Content
JENNIFER     Day 1 9:00A.M    "combined text" 

1 Answer:

Answer 0: (score: 2)

With data written as valid R code (the elements of c() separated by commas), we can use strcapture from base R like this:

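strcapture("^(.*) \\[(.*)\\]: (.*)", data,
  list(Name = character(0), Date = character(0), Text = character(0)))

giving the following (the Text column appears to come from placeholder strings in the answer's own copy of data, which is not reproduced on this page):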

      Name              Date                                                  Text
1 JENNIFER  Day 1, 9:00 A.M. Blablablablablablbalbllalbalbalbl. Balalalbablablabl.
2      SAM Day 2, 10:15 A.M.  Balblablablabalbalbalblabalblablabl. Balaldfkemfeke.
3    DAN'S Day 4, 12:00 P.M.                                        DFnerke"dfsdf"
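
For anyone who wants to run this end to end, here is a minimal sketch that applies the same pattern to the strings from the question. It assumes the only change needed to data is the commas between elements. The second half is an alternative not taken from the answer: it pairs regexec() with regmatches(), which is closer to the functions named in the title. The variable names (pattern, df, m, parts, df2) are made up for illustration.

```r
# Sample strings from the question, with commas added between the elements.
data <- c(
  "JENNIFER [Day 1, 9:00 A.M.]: Generic text, it doesn't matter what is going on here. There are more than 2 lines.",
  "SAM [Day 2, 10:15 A.M.]: This doesn't matter. It has a lot of lines.",
  "DAN'S [Day 4, 12:00 P.M.]: It doesn't really matter what's going on in this part."
)

# Three capture groups: the name, the bracketed day/time, and the rest.
pattern <- "^(.*) \\[(.*)\\]: (.*)"

# strcapture() (utils, R >= 3.4.0) fills one column per capture group,
# using the prototype to set column names and types.
df <- strcapture(pattern, data,
  list(Name = character(0), Date = character(0), Text = character(0)))

# Alternative sketch: regexec() records the positions of the capture groups,
# and regmatches() then returns, per string, the full match followed by
# each group.
m <- regmatches(data, regexec(pattern, data))
parts <- do.call(rbind, m)  # one row per string; column 1 is the full match
df2 <- data.frame(Name = parts[, 2], Date = parts[, 3], Text = parts[, 4],
                  stringsAsFactors = FALSE)

print(df[, c("Name", "Date")])
```

Note that regexpr() on its own reports only the overall match, which is why the attempt in the question could not separate NAME from TIME; regexec() is the base R variant that keeps the parenthesized groups. The greedy (.*) groups rely on each string containing a single " [" and a single "]: ", as the sample data does.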