str = 'FW201703002082017MF0164EXESTBOPF01163500116000 0001201700258000580000116000.WALTERS BAY BOGAWANTALAWA 1M'
The expression above is a string that needs to be split, with each field extracted separately as follows:
Borkername = FW
Sale year = 2017
Saleno = 0300
sale_dte = 20.08.2017 # date needs to be formatted
Factoryno = MF0164
Catalogu code= EXEST
Grade =BOPF
Gross weight =01163.50 #decimal point needed
Net Weight = 01163.50 #decimal point needed
Lot_No = 0001
invoice_year = 2017
invoice_no = 00258
price = 000580.00 #decimal point needed
Netweight = 01160.00 #decimal point needed
Buyer = 'WALTERS BAY BOGAWANTALAWA'
Buyer_code = '1M'
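This is a single line without any delimiters. So please help me write a regular expression that splits each field into a pandas column in Python.
For example: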
(\A[A-Z]{2})
This gives me the first two characters. How can I get the next 4 digits?
Answer 0 (score: 0)
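You need to do this in two steps. First, use a regular expression to split the string into (mostly) fixed-length segments. Then take the list of fields it returns and manually fix up each field into the format you need. For example: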
import re
import csv

headings = [
    "Borkername", "Sale year", "Saleno", "sale_dte", "Factoryno", "Catalogu code", "Grade", "Gross weight",
    "Net Weight", "Lot_No", "invoice_year", "invoice_no", "price", "Netweight", "Buyer", "Buyer_code"]

# One capture group per field; the widths follow the fixed-length layout of the line
re_fields = re.compile(r'(.{2})(.{4})(.{3})(.{8})(.{6})(.{5})(.{4})(.{7})(.{7}) (.{4})(.{4})(.{5})(.{8})(.{7}).(.*?) (.{2})$')

with open('input.txt') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_writer = csv.writer(f_output)
    csv_writer.writerow(headings)

    for line in f_input:
        match = re_fields.match(line)
        if match is None:
            continue    # skip blank or malformed lines instead of raising AttributeError
        fields = list(match.groups())
        # ddmmyyyy -> dd.mm.yyyy
        fields[3] = "{}.{}.{}".format(fields[3][:2], fields[3][2:4], fields[3][4:])
        # Insert the implied decimal point (two decimal places) and convert to float
        fields[7] = float("{}.{}".format(fields[7][:5], fields[7][5:]))
        fields[8] = float("{}.{}".format(fields[8][:5], fields[8][5:]))
        fields[12] = float("{}.{}".format(fields[12][:6], fields[12][6:]))
        fields[13] = float("{}.{}".format(fields[13][:5], fields[13][5:]))
        csv_writer.writerow(fields)
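As a quick check of the pattern, it can be run against the sample string from the question before processing a whole file. The sketch below assumes every line has exactly the layout of that one sample. Each parenthesised group consumes the next fixed number of characters, which is how the four digits after the first two characters are obtained.

```python
import re

# Sample line from the question, split over two literals for readability.
line = ('FW201703002082017MF0164EXESTBOPF01163500116000 '
        '0001201700258000580000116000.WALTERS BAY BOGAWANTALAWA 1M')

# Same pattern as in the script: one capture group per field, in order.
re_fields = re.compile(
    r'(.{2})(.{4})(.{3})(.{8})(.{6})(.{5})(.{4})(.{7})(.{7}) '
    r'(.{4})(.{4})(.{5})(.{8})(.{7}).(.*?) (.{2})$')

match = re_fields.match(line)
if match is None:
    raise ValueError('line does not fit the fixed-width layout')

fields = match.groups()
print(fields[0])   # first two characters
print(fields[1])   # the next four characters
```

Here fields[0] is the two-character broker code and fields[1] is the four-digit sale year; the remaining groups follow in the same order as the headings.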
The CSV-writing script above would give you an output.csv containing:
Borkername,Sale year,Saleno,sale_dte,Factoryno,Catalogu code,Grade,Gross weight,Net Weight,Lot_No,invoice_year,invoice_no,price,Netweight,Buyer,Buyer_code
FW,2017,030,02.08.2017,MF0164,EXEST,BOPF,1163.5,1160.0,0001,2017,00258,580.0,1160.0,WALTERS BAY BOGAWANTALAWA,1M
This can then be read using Pandas:
import pandas as pd
data = pd.read_csv('output.csv')
print(data)
Giving:
   Borkername  Sale year  Saleno    sale_dte  Factoryno  Catalogu code  Grade  Gross weight  Net Weight  Lot_No  \
0          FW       2017      30  02.08.2017     MF0164          EXEST   BOPF        1163.5      1160.0       1

   invoice_year  invoice_no  price  Netweight                      Buyer  Buyer_code
0          2017         258  580.0     1160.0  WALTERS BAY BOGAWANTALAWA          1M
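Note that in the printed frame Saleno, Lot_No and invoice_no show up as 30, 1 and 258, because read_csv infers them as integers and the zero padding is lost. If the padding matters, one option is to pass dtype to read_csv. This is a minimal sketch, assuming those three columns are codes rather than quantities; the CSV text is inlined with io.StringIO only so that the example runs without output.csv.

```python
import io
import pandas as pd

# The CSV text shown earlier, inlined so the sketch is self-contained.
csv_text = (
    "Borkername,Sale year,Saleno,sale_dte,Factoryno,Catalogu code,Grade,Gross weight,"
    "Net Weight,Lot_No,invoice_year,invoice_no,price,Netweight,Buyer,Buyer_code\n"
    "FW,2017,030,02.08.2017,MF0164,EXEST,BOPF,1163.5,1160.0,0001,2017,00258,580.0,1160.0,"
    "WALTERS BAY BOGAWANTALAWA,1M\n"
)

# Read the code-like columns as strings so their leading zeros survive.
data = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Saleno": str, "Lot_No": str, "invoice_no": str},
)
print(data[["Saleno", "Lot_No", "invoice_no"]])
```

With a real file, the same dtype argument can be passed to pd.read_csv('output.csv', ...).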