Need a regular expression to split a string in Python

Asked: 2017-09-20 10:08:10

Tags: python regex data-extraction

str = 'FW201703002082017MF0164EXESTBOPF01163500116000 0001201700258000580000116000.WALTERS BAY BOGAWANTALAWA 1M'

The string above needs to be split, with each field extracted separately as follows:

Borkername = FW
Sale year = 2017
Saleno = 0300
sale_dte = 20.08.2017 # date needs to be formatted
Factoryno = MF0164
Catalogu code= EXEST
Grade =BOPF
Gross weight =01163.50 #decimal point needed
Net Weight = 01163.50 #decimal point needed
Lot_No = 0001
invoice_year = 2017
invoice_no = 00258
price = 000580.00 #decimal point needed
Netweight = 01160.00 #decimal point needed
Buyer = 'WALTERS BAY BOGAWANTALAWA'
Buyer_code = '1M'

This is a single line with no delimiters. So please help me write a regular expression that splits each field into its own pandas column in Python.

For example:

(\A[A-Z]{2}) 

This gives me the first two characters. How can I get the next 4 digits?

1 answer:

Answer 0 (score: 0)

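You need to do this in two steps. First, use a regular expression to split the string into (mostly) fixed-length fields. Then take the list that comes back and manually fix up each field into the format you need. For example: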

import re
import csv

headings = [
    "Borkername", "Sale year", "Saleno", "sale_dte", "Factoryno", "Catalogu code", "Grade", "Gross weight", 
    "Net Weight", "Lot_No", "invoice_year", "invoice_no", "price", "Netweight", "Buyer", "Buyer_code"]

re_fields = re.compile(r'(.{2})(.{4})(.{3})(.{8})(.{6})(.{5})(.{4})(.{7})(.{7}) (.{4})(.{4})(.{5})(.{8})(.{7}).(.*?) (.{2})$')

with open('input.txt') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_writer = csv.writer(f_output)
    csv_writer.writerow(headings)

    for line in f_input:
        match = re_fields.match(line)
        if match is None:
            continue  # skip blank or malformed lines instead of raising AttributeError

        fields = list(match.groups())

        # sale_dte: DDMMYYYY -> DD.MM.YYYY
        fields[3] = "{}.{}.{}".format(fields[3][:2], fields[3][2:4], fields[3][4:])
        # Weights and price carry two implied decimal places
        fields[7] = float("{}.{}".format(fields[7][:5], fields[7][5:]))
        fields[8] = float("{}.{}".format(fields[8][:5], fields[8][5:]))
        fields[12] = float("{}.{}".format(fields[12][:6], fields[12][6:]))
        fields[13] = float("{}.{}".format(fields[13][:5], fields[13][5:]))

        csv_writer.writerow(fields)

Running the script produces an output.csv containing:

Borkername,Sale year,Saleno,sale_dte,Factoryno,Catalogu code,Grade,Gross weight,Net Weight,Lot_No,invoice_year,invoice_no,price,Netweight,Buyer,Buyer_code
FW,2017,030,02.08.2017,MF0164,EXEST,BOPF,1163.5,1160.0,0001,2017,00258,580.0,1160.0,WALTERS BAY BOGAWANTALAWA,1M
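
Each parenthesized (.{n}) group in the pattern simply consumes the next n characters, which is how the four digits after the two-letter prefix are captured. As a quick check, the sketch below applies the same pattern to the sample line from the question and prints the raw groups before any formatting. The implied_decimal helper is hypothetical and not part of the original answer; it assumes the numeric fields always carry two implied decimal places.

```python
import re

# Pattern copied from the answer: every (.{n}) group takes the next n characters.
re_fields = re.compile(
    r'(.{2})(.{4})(.{3})(.{8})(.{6})(.{5})(.{4})(.{7})(.{7}) '
    r'(.{4})(.{4})(.{5})(.{8})(.{7}).(.*?) (.{2})$'
)

# Sample line from the question, split across two literals for readability.
sample = ('FW201703002082017MF0164EXESTBOPF01163500116000 '
          '0001201700258000580000116000.WALTERS BAY BOGAWANTALAWA 1M')


def implied_decimal(digits, places=2):
    """Hypothetical helper: insert a decimal point `places` digits from the right."""
    return float(digits[:-places] + '.' + digits[-places:])


raw = re_fields.match(sample).groups()

print(raw[:4])                   # broker, sale year, sale number, sale date
print(implied_decimal(raw[7]))   # gross weight
print(implied_decimal(raw[12]))  # price
print(raw[14], '|', raw[15])     # buyer and buyer code
```

Keeping the raw groups as strings until the last moment makes it easy to spot a width that is off by one, since every later field shifts visibly.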

The output.csv file can then be read with Pandas:

import pandas as pd

data = pd.read_csv('output.csv')
print(data)

This prints:

  Borkername  Sale year  Saleno    sale_dte Factoryno Catalogu code Grade  Gross weight  Net Weight  Lot_No  \
0         FW       2017      30  02.08.2017    MF0164         EXEST  BOPF        1163.5      1160.0       1   
   invoice_year  invoice_no  price  Netweight                      Buyer Buyer_code  
0          2017         258  580.0     1160.0  WALTERS BAY BOGAWANTALAWA         1M
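
Note that in the frame above read_csv inferred integer types, so the leading zeros in Saleno, Lot_No and invoice_no are gone (030 became 30). If those codes matter, one option is to skip the CSV round trip and let pandas apply the pattern directly with Series.str.extract, which keeps every captured field as a string. This is a minimal sketch, assuming the input lines are already held in a Series; the group names are illustrative and not from the original post.

```python
import pandas as pd

# Same widths as the answer's pattern, rewritten with named groups so that
# str.extract uses the names as column labels (the names are assumptions).
pattern = (
    r'^(?P<broker>.{2})(?P<sale_year>.{4})(?P<sale_no>.{3})(?P<sale_date>.{8})'
    r'(?P<factory_no>.{6})(?P<catalogue_code>.{5})(?P<grade>.{4})'
    r'(?P<gross_weight>.{7})(?P<net_weight>.{7}) '
    r'(?P<lot_no>.{4})(?P<invoice_year>.{4})(?P<invoice_no>.{5})'
    r'(?P<price>.{8})(?P<net_weight_2>.{7}).(?P<buyer>.*?) (?P<buyer_code>.{2})$'
)

lines = pd.Series([
    'FW201703002082017MF0164EXESTBOPF01163500116000 '
    '0001201700258000580000116000.WALTERS BAY BOGAWANTALAWA 1M',
])

# One column per named group; all values are still strings at this point.
df = lines.str.extract(pattern)

# The date is stored as DDMMYYYY.
df['sale_date'] = pd.to_datetime(df['sale_date'], format='%d%m%Y')

# Weights and price carry two implied decimal places.
for col in ['gross_weight', 'net_weight', 'price', 'net_weight_2']:
    df[col] = pd.to_numeric(df[col]) / 100

print(df.T)
```

If the CSV step is kept instead, passing dtype=str (or a per-column dict such as dtype={'Saleno': str}) to pd.read_csv should preserve the zeros in the same way.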