从字符串R中提取特定值

时间:2018-01-27 16:02:21

标签: r

我有一个数据框,我想从中提取具有条件的字符串中的特定值。

Sr. No. String

1.      ABCD, your Account XX1987 has been credited with EUR 22,500.00 on 30-
        Oct-17. Info: CAM*CASH DEPOSIT*ELISH SEC. The Available Balance is EUR 
        22,951.57.
2.      WXYZ, Your Ac XXXXXXXX1987 is debited with USD 5,000.00 on 14 
        May. Info. MMT*125485645*99999999. Your Net Available Balance is 
        USD 20,531.38.

DF

Conditions:
1. Take first coming word credited/debited/credit/debit as "Credit" or "Debit" in type.
2. Take last four digit after your Account/your Ac/your a/c or your acc (or the string lookes like XXXX1234) in Acc.
3. Take first value coming after credited/debited/credit/debit word in the sring as Fig.
4. Take date after word "on" or which lookes like date from string in Date column.
5. Take description in desc after word Info:
6. Take balance after word Available Balance/Net Balance/Balance or Last Numeric figure in the string.

DF2
    Sr.No.      Type      Acc      Fig     Date       Desc             Balance
    1           Credit    1987     22,500  30-10-2017 Info: CAM*CASH   22,951
                                                      DEPOSIT*ELISH SEC.
    2           Debit     1987     5,000   14-May     Info.            20,531.38
                                                      MMT*125485645*99999999.

从那个Dataframe我想要下面提到的具有特定条件的数据帧。

{{1}}

2 个答案:

答案 0 :(得分:2)

我尝试写一般表达式,因为我可以想出但如果数据的结构不一样,那么可能需要调整Regex

library(stringr)
input = structure(list(
  `Sr. No.`=c("1", "2"), 
  String=c(
    "ABCD, your Account XX1987 has been credited with EUR 22,500.00 on 30-Oct-17. Info: CAM*CASH DEPOSIT*ELISH SEC. The Available Balance is EUR 22,951.57.", 
    "WXYZ, Your Ac XXXXXXXX1987 is debited with USD 5,000.00 on 14 May. Info. MMT*125485645*99999999. Your Net Available Balance is USD 20,531.38.)")), 
  .Names=c("Sr. No.", "String"), row.names=1:2, class="data.frame")

rule_13 = str_match(input$String, "(credit|debit)ed[^0-9]*((?:EUR|USD|INR|Rs) [0-9,.]+)")
rule_2 = str_match(input$String, "(?:Account|your Ac|your a/c|your acc|XX)[^0-9]*([0-9]+)")
rule_4 = str_match(input$String, " on ([0-9]+[ -](?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|[0-9]+)(?:[ -][0-9]+)?)")
rule_5 = str_match(input$String, "\\bInfo\\b[^\\w\\d]+(.+)(?=\\. )")
rule_6 = str_match(input$String, "(?:Available Balance|Net Balance|Balance)[^0-9]*([0-9,.]+[0-9])")

data.frame(
    Sr.No=input$`Sr. No.`,
    Type=rule_13[,2],
    Acc=rule_2[,2],
    Fig=rule_13[,3],
    Data=rule_4[,2],
    Desc=rule_5[,2],
    Balance=rule_6[,2])

输出

Sr.No   Type  Acc       Fig      Data                       Desc   Balance
    1 credit 1987 22,500.00 30-Oct-17 CAM*CASH DEPOSIT*ELISH SEC 22,951.57
    2  debit 1987  5,000.00    14 May     MMT*125485645*99999999 20,531.38

答案 1 :(得分:-1)

在模式之后,您可以将所有正则表达式组合在一行中并提取信息:

 pat=c(Account="(?<=X)\\d+",
 Type="(credit|debit)",
 Fig="(\\w{1,3}\\s\\d+.*\\.\\d+\\s)",
 Date="(\\d+\\s\\w+\\.)|(?<=on\\s)(\\d+\\W\\w+\\W\\d+)",
 Decs="(Info.*\\.\\s)",
 Balance="(?<=Balance\\s\\is\\s).*\\.")
 data.frame(mapply(str_extract,DF[2],pat))
  String    NA.           NA..1     NA..2                              NA..3          NA..4
1   1987 credit  EUR 22,500.00  30-Oct-17 Info: CAM*CASH DEPOSIT*ELISH SEC.  EUR 22,951.57.
2   1987  debit   USD 5,000.00    14 May.     Info. MMT*125485645*99999999.  USD 20,531.38.
3   1234 credit INR 187,314.00   31/10/17