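I have developed a custom system that simulates web activity, such as downloading files. I also have a custom file format that feeds this system. I want to replace this old system, which is written in Perl, with a new one in Python. But first I have to parse the files somehow.
I want to parse out certain fields of the file, such as [settings], which holds any arguments I have for the system. I also have the [macro] section, which is where the important content (steps, etc.) begins.
The trouble I am having is parsing these sections so that my system can write them out in a different, simpler format. I have thousands of these files, and I just want to write a generator that takes an old file and writes the new format to a new file.
Old format: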
[settings]
email_to=people
special_websurf_processing=1
period_0_1_only=1
crc_recheck=0
[macro]
%::WebSurfRules =
(
'Step1' =>
{
action => 'NAVIGATE',
inputstring => 'http://www.tda-sgft.com/TdaWeb/jsp/fondos/Fondos.tda',
},
'Step2' =>
{
action => 'CLICK_REFERENCE',
matchtype => 'OUTER',
matchstring => 'phHttpDest->\{\'FirstClick\'\}',
pass => 'phHttpDest->\{\'Step2Pass\'\}',
},
'Step3' =>
{
action => 'CLICK_REFERENCE',
matchtype => 'OUTER',
matchstring => 'phHttpDest->\{\'SecondClick\'\}',
},
'Step4' =>
{
action => 'CLICK_REFERENCE',
matchtype => 'OUTER',
matchstring => 'phHttpDest->\{\'DealClick\'\}',
accept_multi_match => 'ANY_TOP_FIRST',
},
'Step5' =>
{
action => 'CLICK_REFERENCE',
matchtype => 'INNER',
matchstring => 'phHttpDest->\{\'LinkClick2\'\}',
fail => 'Step6',
# accept_multi_match => 'ANY_TOP_LAST',
},
'Step6' =>
{
action => 'CLICK_REFERENCE',
matchtype => 'INNER',
matchstring => 'phHttpDest->\{\'DocClick\'\}',
},
'Step7' =>
{
action => 'CLICK_DOWNLOAD_OK',
},
);
[data]
Print WebAddress______________ Destination_________________________________________________ FirstClick_________________ SecondClick________________ DealClick_________________________ LinkClick2________________________ DocClick___________________________________ PayInterval DueDay Step2Pass__________ QaRule_________________________________________________________________________________________________________________
0 http://www.tda-sgft.com/ d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_apl.pdf Mortgage Loan ABS Caixa Penedes 1 TDA MAINPAGE - FAIL Fund´s Allocation q1 Step3 qa_regexp=Report D?d?ate\\s+\\d\\d\/$MM{$n}\/$YYYY{$n}
0 http://www.tda-sgft.com/ d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf Mortgage Loan ABS Caixa Penedes 1 TDA MAINPAGE - FAIL Investors information on Payment Date q1 Step3 qa_regexp=PAYMENT DATE:\\s+$aCAPMONTHNAMES[$MM{$n}-1].+$YYYY{$n}
0 http://www.tda-sgft.com/ d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf Mortgage Loan ABS Caixa Penedes 1 TDA MAINPAGE - FAIL Investors information on Payment Date q1 Step3 qa_regexp=PAYMENT DATE:\\s+$aCAPSHORTMONTHNAMES[$MM{$n}-1] \\d\\d.+? ?.? $YYYY{$n}
0 http://www.tda-sgft.com/ d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf Mortgage Loan ABS Caixa Penedes 1 TDA MAINPAGE - FAIL Investors information on Payment Date q1 Step3 qa_regexp=PAYMENT DATE:\\s+$aCAPMONTHNAMESSPANISH[$MM{$n}-1] \\d\\d.+? ?.? $YYYY{$n}
What I want it to spit out:
[settings]
email_to=people
special_websurf_processing=1
period_0_1_only=1
crc_recheck=0
[macro]
%::WebSurfRules =
(
'1' => 'NAVIGATE,phHttpDest->\{\'WebAddress\'\}',
'2' => 'CLICK_REFERENCE,phHttpDest->\{\'FirstClick\'\}',
'3' => 'CLICK_REFERENCE,phHttpDest->\{\'SecondClick\'\}',
'4' => 'CLICK_REFERENCE,phHttpDest->\{\'DealClick\'\}',
'5' => 'CLICK_REFERENCE,phHttpDest->\{\'LinkClick2\'\}',
'6' => 'CLICK_REFERENCE,phHttpDest->\{\'DocClick\'\}',
);
[data]
Print WebAddress______________ Destination_________________________________________________ FirstClick_________________ SecondClick________________ DealClick_________________________ LinkClick2________________________ DocClick___________________________________ PayInterval DueDay Step2Pass__________ QaRule_________________________________________________________________________________________________________________
0 http://www.tda-sgft.com/ d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_apl.pdf Mortgage Loan ABS Caixa Penedes 1 TDA MAINPAGE - FAIL Fund´s Allocation q1 Step3 qa_regexp=Report D?d?ate\\s+\\d\\d\/$MM{$n}\/$YYYY{$n}
0 http://www.tda-sgft.com/ d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf Mortgage Loan ABS Caixa Penedes 1 TDA MAINPAGE - FAIL Investors information on Payment Date q1 Step3 qa_regexp=PAYMENT DATE:\\s+$aCAPMONTHNAMES[$MM{$n}-1].+$YYYY{$n}
0 http://www.tda-sgft.com/ d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf Mortgage Loan ABS Caixa Penedes 1 TDA MAINPAGE - FAIL Investors information on Payment Date q1 Step3 qa_regexp=PAYMENT DATE:\\s+$aCAPSHORTMONTHNAMES[$MM{$n}-1] \\d\\d.+? ?.? $YYYY{$n}
0 http://www.tda-sgft.com/ d:\\$YYYYMM{$n}\\raw\\remit\\wl\\CXPEN1_bond.pdf Mortgage Loan ABS Caixa Penedes 1 TDA MAINPAGE - FAIL Investors information on Payment Date q1 Step3 qa_regexp=PAYMENT DATE:\\s+$aCAPMONTHNAMESSPANISH[$MM{$n}-1] \\d\\d.+? ?.? $YYYY{$n}
Each phHttpDest click and action ties to a column header in the [data] section.
Answer 0 (score: 2)
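To make the section split concrete, here is a minimal sketch of cutting a file into its [settings], [macro] and [data] parts and reading the key=value settings. The helper names `split_sections` and `parse_settings` are hypothetical, and the sketch assumes every section header sits alone on its own line, as in the sample.

```python
import re


def split_sections(text):
    """Return {section_name: body_text} for headers such as [settings]."""
    sections = {}
    current = None
    for line in text.splitlines():
        # A header is a line containing nothing but [name].
        header = re.fullmatch(r"\[(\w+)\]", line.strip())
        if header:
            current = header.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body) for name, body in sections.items()}


def parse_settings(body):
    """Turn key=value lines into a dict of strings."""
    settings = {}
    for line in body.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


# Shortened sample in the old format (assumption: real files look like this).
sample = """[settings]
email_to=people
crc_recheck=0
[macro]
%::WebSurfRules =
(
);
[data]
Print WebAddress
"""

sections = split_sections(sample)
print(parse_settings(sections["settings"]))
```

Splitting first keeps the [settings] and [data] text untouched, so only the [macro] body needs to be rewritten.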
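So, one way to do this is to use a series of regular-expression substitutions to create the file in the new format. I did not fully understand your formatting rules, so I implemented the whole thing generically, and the result differs from your desired output in a few places. You will have to go in and make some adjustments to fine-tune it. The output.txt shown here is the file produced when your example is used as input.txt.
Code: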
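import re

# Read the whole old-format file into memory.
with open('input.txt') as f:
    data = f.read()

# Collapse "'StepN' => { action => " into "'N' => ".
data = re.sub(r" 'Step([0-9]+)' =>\s+{\s+action\s+=> ", r" '\1' => ", data)
# Drop the fields and comment lines that the new format does not carry.
data = re.sub(r"',\s+pass\s+[^,]+,", "", data)
data = re.sub(r"',\s+accept_multi_match\s+[^,]+,", "", data)
data = re.sub(r"\n +#.*\n", "\n", data)
data = re.sub(r"',\s+fail\s+[^,]+,", "", data)
data = re.sub(r"',\s+matchtype\s+[^,]+,", "", data)
# Join the action with its inputstring or matchstring, separated by a comma.
data = re.sub(r"',\s+inputstring\s+=> '", ",", data)
data = re.sub(r"\s+matchstring\s+=> '", ",", data)
# Replace each closing brace with the closing quote of the new entry.
data = re.sub(r"\n },", "',", data)

with open('output.txt', 'w') as f:
    f.write(data)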
output.txt:
[settings]
email_to=people
special_websurf_processing=1
period_0_1_only=1
crc_recheck=0
[macro]
%::WebSurfRules =
(
'1' => 'NAVIGATE,http://www.tda-sgft.com/TdaWeb/jsp/fondos/Fondos.tda',',
'2' => 'CLICK_REFERENCE,phHttpDest->\{\'FirstClick\'\}',
'3' => 'CLICK_REFERENCE,phHttpDest->\{\'SecondClick\'\}',',
'4' => 'CLICK_REFERENCE,phHttpDest->\{\'DealClick\'\}',
'5' => 'CLICK_REFERENCE,phHttpDest->\{\'LinkClick2\'\}',
'6' => 'CLICK_REFERENCE,phHttpDest->\{\'DocClick\'\}',',
'7' => 'CLICK_DOWNLOAD_OK',',
);
...
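As an alternative to chained substitutions, the steps can be parsed into dictionaries first and then written out again, which avoids stray quotes when a step has an unusual set of fields. This is only a sketch under assumptions: the names `parse_steps` and `render_steps` are hypothetical, each step is assumed to close with `},` on its own line, and field values are assumed to be single-quoted. It keeps the literal NAVIGATE URL and the final download step, because the rule for mapping them to the desired output is not stated in the question.

```python
import re

# One "'StepN' => { ... }," block; the body ends at a line holding only "},".
STEP_RE = re.compile(r"'Step(\d+)'\s*=>\s*\{(.*?)\n\s*\},", re.S)
# One "key => 'value'," line; commented-out lines start with '#' and are skipped.
FIELD_RE = re.compile(r"^[ \t]*(\w+)[ \t]*=>[ \t]*'((?:\\.|[^'\\])*)'", re.M)


def parse_steps(macro_text):
    """Return a list of (step_number, fields_dict) sorted by step number."""
    steps = []
    for number, body in STEP_RE.findall(macro_text):
        steps.append((int(number), dict(FIELD_RE.findall(body))))
    return sorted(steps, key=lambda step: step[0])


def render_steps(steps):
    """Write the steps as one "'N' => 'ACTION,target'," line each."""
    lines = ["%::WebSurfRules =", "("]
    for number, fields in steps:
        parts = [fields["action"]]
        target = fields.get("matchstring") or fields.get("inputstring")
        if target:
            parts.append(target)
        lines.append("'%d' => '%s'," % (number, ",".join(parts)))
    lines.append(");")
    return "\n".join(lines)


# Shortened [macro] body taken from the example in the question.
sample = r"""%::WebSurfRules =
(
'Step1' =>
{
action => 'NAVIGATE',
inputstring => 'http://www.tda-sgft.com/TdaWeb/jsp/fondos/Fondos.tda',
},
'Step2' =>
{
action => 'CLICK_REFERENCE',
matchtype => 'OUTER',
matchstring => 'phHttpDest->\{\'FirstClick\'\}',
pass => 'phHttpDest->\{\'Step2Pass\'\}',
},
'Step7' =>
{
action => 'CLICK_DOWNLOAD_OK',
},
);
"""

steps = parse_steps(sample)
print(render_steps(steps))
```

Because every field ends up in a dictionary, dropping or renaming a step later becomes a change in `render_steps` rather than another regular expression.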