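I have a DataFrame with a text column that I want to split into multiple columns, because each text string contains several variables. For example:
df = pd.read_csv('C:/Users/mydata.csv')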
print(df)
my_msg
1 "Acct:XXXXXX0000 Debit:NGN2,000.00 Details:ATM CASH WITHDRAWAL"
2 "Acct:XXXXXX0000 Credit:NGN135,000.00 Details:BY UE Date:03-05-2019 10:03
Available Bal:NGN135,454.78 Enquiries:123456"
3 "Prepaid Card Alert **** POS : Dr NGN 4,052.50 Desc:
GTB/*******330/*****939 9000005600 NDate: 03-05-2019 09:36 Bal : NGN
506,265.00 FEEDBACK? Call 123456"
4 "Acct:XXXXXX0001 Debit:NGN300.00 Details:MOBILE BANKING300.00
IRTIME********7061 Date:03-05-2019 00:09 Available Bal:NGN373,358.56
Enquiries:12346"
I'm expecting output in the following form:
Acct  Debit  Credit  Bal     Pos
0000  2000   NA      NA      NA
0000  NA     135000  135454  NA
NA    NA     NA      506265  4052
0001  300    NA      373358  NA
Answer 0 (score: 1)
Use regular expressions.
For example:
import pandas as pd
df = pd.DataFrame({"Col": ["Acct:XXXXXX0000 Debit:NGN2,000.00 Details:ATM CASH WITHDRAWAL",
"Acct:XXXXXX0000 Credit:NGN135,000.00 Details:BY UE Date:03-05-2019 10:03 Available Bal:NGN135,454.78 Enquiries:123456",
"Prepaid Card Alert **** POS : Dr NGN 4,052.50 Desc: GTB/*******330/*****939 9000005600 NDate: 03-05-2019 09:36 Bal : NGN 506,265.00 FEEDBACK? Call 123456",
"Acct:XXXXXX0001 Debit:NGN300.00 Details:MOBILE BANKING300.00 IRTIME********7061 Date:03-05-2019 00:09 Available Bal:NGN373,358.56 Enquiries:12346"
]})
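# Each pattern has one capture group; expand=False makes str.extract return a Series
# rather than a one-column DataFrame. Rows without a match get NaN.
df["Acct"] = df["Col"].str.extract(r"Acct\s*:\s*XXXXXX(\d+)\s+", expand=False)
df["Debit"] = df["Col"].str.extract(r"Debit\s*:\s*NGN\s*([0-9,\.]+)\s+", expand=False)
df["Credit"] = df["Col"].str.extract(r"Credit\s*:\s*NGN\s*([0-9,\.]+)\s+", expand=False)
df["Bal"] = df["Col"].str.extract(r"Bal\s*:\s*NGN\s*([0-9,\.]+)\s+", expand=False)
df["Pos"] = df["Col"].str.extract(r"POS\s*:\s*Dr\s*NGN\s*([0-9,\.]+)\s+", expand=False)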
print(df)
Output:
                                                 Col  Acct     Debit  \
0  Acct:XXXXXX0000 Debit:NGN2,000.00 Details:ATM ...  0000  2,000.00
1  Acct:XXXXXX0000 Credit:NGN135,000.00 Details:B...  0000       NaN
2    Prepaid Card Alert **** POS : Dr NGN 4,052.5...   NaN       NaN
3  Acct:XXXXXX0001 Debit:NGN300.00 Details:MOBILE...  0001    300.00

       Credit         Bal       Pos
0         NaN         NaN       NaN
1  135,000.00  135,454.78       NaN
2         NaN  506,265.00  4,052.50
3         NaN  373,358.56       NaN
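
Note that the extracted amounts are still strings with thousands separators (e.g. 2,000.00), whereas the question asks for plain numbers. A minimal sketch of one way to convert them is below. It assumes the amounts always use "," as the thousands separator and "." as the decimal point. The two-row sample and the `patterns` dict are illustrative only, and the patterns drop the trailing `\s+` so that an amount at the very end of a message would still match.

```python
import pandas as pd

# Illustrative sample based on the messages in the question (second one shortened).
df = pd.DataFrame({"Col": [
    "Acct:XXXXXX0000 Debit:NGN2,000.00 Details:ATM CASH WITHDRAWAL",
    "Prepaid Card Alert **** POS : Dr NGN 4,052.50 Desc: GTB Bal : NGN 506,265.00 FEEDBACK? Call 123456",
]})

# Hypothetical mapping: output column name -> regex with exactly one capture group.
patterns = {
    "Debit": r"Debit\s*:\s*NGN\s*([0-9,.]+)",
    "Bal": r"Bal\s*:\s*NGN\s*([0-9,.]+)",
    "Pos": r"POS\s*:\s*Dr\s*NGN\s*([0-9,.]+)",
}

for name, pattern in patterns.items():
    # expand=False returns a Series; rows without a match are NaN.
    extracted = df["Col"].str.extract(pattern, expand=False)
    # Drop the thousands separators, then convert to float.
    # errors="coerce" turns anything that still is not a number into NaN.
    cleaned = extracted.str.replace(",", "", regex=False)
    df[name] = pd.to_numeric(cleaned, errors="coerce")

print(df[["Debit", "Bal", "Pos"]])
```

The resulting columns are floats, so 135,454.78 becomes 135454.78. The expected output in the question shows whole numbers, so dropping the fractional part would be an extra step if that is really what is wanted.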