I have a data frame, and I want to extract specific values from the strings in it according to a set of conditions.
Sr. No.  String
1.       ABCD, your Account XX1987 has been credited with EUR 22,500.00 on 30-Oct-17. Info: CAM*CASH DEPOSIT*ELISH SEC. The Available Balance is EUR 22,951.57.
2.       WXYZ, Your Ac XXXXXXXX1987 is debited with USD 5,000.00 on 14 May. Info. MMT*125485645*99999999. Your Net Available Balance is USD 20,531.38.
DF
Conditions:
1. Take the first occurrence of credited/debited/credit/debit and record it as "Credit" or "Debit" in Type.
2. Take the last four digits after your Account/your Ac/your a/c/your acc (or from a string that looks like XXXX1234) as Acc.
3. Take the first value that comes after the word credited/debited/credit/debit in the string as Fig.
4. Take the date after the word "on", or whatever looks like a date in the string, as Date.
5. Take the description after the word "Info:" as Desc.
6. Take the balance after Available Balance/Net Balance/Balance, or the last numeric figure in the string, as Balance.
DF2
Sr.No.  Type    Acc   Fig     Date        Desc                               Balance
1       Credit  1987  22,500  30-10-2017  Info: CAM*CASH DEPOSIT*ELISH SEC.  22,951
2       Debit   1987  5,000   14-May      Info. MMT*125485645*99999999.      20,531.38
From that data frame (DF), I want to produce the data frame DF2, subject to the conditions listed.
Answer 0 (score: 2)
I tried to write the expressions as generally as I could, but if the structure of the data differs, the regex may need adjusting.
library(stringr)

# Sample input rebuilt from the question
input = structure(list(
    `Sr. No.` = c("1", "2"),
    String = c(
      "ABCD, your Account XX1987 has been credited with EUR 22,500.00 on 30-Oct-17. Info: CAM*CASH DEPOSIT*ELISH SEC. The Available Balance is EUR 22,951.57.",
      "WXYZ, Your Ac XXXXXXXX1987 is debited with USD 5,000.00 on 14 May. Info. MMT*125485645*99999999. Your Net Available Balance is USD 20,531.38.")),
  .Names = c("Sr. No.", "String"), row.names = 1:2, class = "data.frame")

# Rules 1 and 3: transaction type, then the first currency amount after it
rule_13 = str_match(input$String, "(credit|debit)ed[^0-9]*((?:EUR|USD|INR|Rs) [0-9,.]+)")
# Rule 2: digits following an account keyword or an XX mask
rule_2 = str_match(input$String, "(?:Account|your Ac|your a/c|your acc|XX)[^0-9]*([0-9]+)")
# Rule 4: date after " on " (day, then month name or number, optional year)
rule_4 = str_match(input$String, " on ([0-9]+[ -](?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)|[0-9]+)(?:[ -][0-9]+)?)")
# Rule 5: description after "Info", up to the last ". " in the string
rule_5 = str_match(input$String, "\\bInfo\\b[^\\w\\d]+(.+)(?=\\. )")
# Rule 6: amount after a "Balance" keyword
rule_6 = str_match(input$String, "(?:Available Balance|Net Balance|Balance)[^0-9]*([0-9,.]+[0-9])")

# Column 1 of each str_match() result is the full match; later columns are the capture groups
data.frame(
  Sr.No = input$`Sr. No.`,
  Type = rule_13[,2],
  Acc = rule_2[,2],
  Fig = rule_13[,3],
  Data = rule_4[,2],
  Desc = rule_5[,2],
  Balance = rule_6[,2])
Output
Sr.No  Type    Acc   Fig            Data       Desc                         Balance
1      credit  1987  EUR 22,500.00  30-Oct-17  CAM*CASH DEPOSIT*ELISH SEC   22,951.57
2      debit   1987  USD 5,000.00   14 May     MMT*125485645*99999999       20,531.38
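The output above still differs from the requested DF2 in a few respects: Type is lowercase, the amounts are text, and the date is not in dd-mm-yyyy form. The following is a minimal post-processing sketch, not part of the original answer. It uses base R only, the variable names are hypothetical, and the input vectors are copied by hand from the extracted values rather than taken from the `rule_*` objects. The date step assumes an English locale, because `%b` relies on the locale's month abbreviations.

```r
# Values as extracted by the regexes (copied by hand; hypothetical variable names)
type_raw <- c("credit", "debit")
fig_raw  <- c("EUR 22,500.00", "USD 5,000.00")
date_raw <- c("30-Oct-17", "14 May")

# Condition 1: capitalise the first letter, giving "Credit" / "Debit"
type <- paste0(toupper(substring(type_raw, 1, 1)), substring(type_raw, 2))

# Drop everything except digits and the decimal point, then convert to numeric
fig <- as.numeric(gsub("[^0-9.]", "", fig_raw))

# Parse day-month-year dates; entries without a year (such as "14 May") become NA.
# Assumes an English locale for the %b month abbreviations.
date <- format(as.Date(date_raw, format = "%d-%b-%y"), "%d-%m-%Y")
```

Dates without a year are left as NA here rather than guessed; whether to fill in a default year depends on the data.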
Answer 1 (score: -1)
Following the patterns below, you can combine all the regular expressions in one line and extract the information:
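library(stringr)

# One pattern per field; DF is the question's data frame with the text in column 2
pat = c(Account = "(?<=X)\\d+",
        Type    = "(credit|debit)",
        Fig     = "(\\w{1,3}\\s\\d+.*\\.\\d+\\s)",
        Date    = "(\\d+\\s\\w+\\.)|(?<=on\\s)(\\d+\\W\\w+\\W\\d+)",
        Desc    = "(Info.*\\.\\s)",
        Balance = "(?<=Balance\\sis\\s).*\\.")
# mapply() recycles the single text column across the six patterns
data.frame(mapply(str_extract, DF[2], pat))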
String NA. NA..1 NA..2 NA..3 NA..4
1 1987 credit EUR 22,500.00 30-Oct-17 Info: CAM*CASH DEPOSIT*ELISH SEC. EUR 22,951.57.
2 1987 debit USD 5,000.00 14 May. Info. MMT*125485645*99999999. USD 20,531.38.
3 1234 credit INR 187,314.00 31/10/17
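The columns in the output above come back with placeholder names (String, NA., NA..1, ...) because the pattern names are not carried over. The following is a minimal sketch of one way to get named columns. It is not the answerer's code: it uses base R instead of stringr, `extract_first` is a hypothetical helper, only three of the six fields are shown, and the Balance pattern is tightened for illustration.

```r
# Two sample rows taken from the question
DF <- data.frame(
  String = c(
    "ABCD, your Account XX1987 has been credited with EUR 22,500.00 on 30-Oct-17. Info: CAM*CASH DEPOSIT*ELISH SEC. The Available Balance is EUR 22,951.57.",
    "WXYZ, Your Ac XXXXXXXX1987 is debited with USD 5,000.00 on 14 May. Info. MMT*125485645*99999999. Your Net Available Balance is USD 20,531.38."),
  stringsAsFactors = FALSE)

# Named patterns; the names become the column names of the result
pat <- c(Acc     = "(?<=X)\\d+",
         Type    = "credit|debit",
         Balance = "(?<=Balance is )[A-Z]{3} [0-9,]+\\.[0-9]{2}")

# Hypothetical helper: first match of `pattern` in each element of `x`, NA if none
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)  # perl = TRUE enables lookbehind
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

# lapply() keeps the names of `pat`, so each list element is a named column
res <- as.data.frame(lapply(pat, extract_first, x = DF$String),
                     stringsAsFactors = FALSE)
print(res)
```

Iterating over the patterns rather than over the data frame is what preserves the names; rows where a pattern does not match stay NA instead of raising an error.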