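I want to use Python to print certain blocks of data from a file. Essentially it should work as a parser that outputs only the blocks matching my criteria.
My file contains call-center logs. I want each section to start with "####" and end with "</soap:Body>>", but it must also contain a specific number, called the msisdn: "<msisdn>any number</msisdn>".
The file is also fairly large. When I call readlines(), I can't apply a regular expression inside a loop like for i, line in enumerate(data), because the data is split into lines there and I can't search the whole block I need.
Here is part of the file: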
####<Jun 4, 2016 12:05:50 PM IST> <Debug> <MessagingBridgeRuntimeVerbose> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1465022150722> <BEA-000000> <Bridge NPGBridge doTrigger(): state = 4 stopped = false>
####<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1465022150886> <BEA-000000> < [PipelinePairNode1, PipelinePairNode1_request, CreateVASReportingStage, REQUEST] *** CreateVASWrapper Reprting Stage VAS V-3.0 ***: <soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<VASProxyType xmlns="http://xmlns.aircel.com/AircelTransformation/ProxyService/OrderProxy/1.0/CreateVASSubscriptionConsumerSchema">
<TransactionId>DATA030620160431128801011429ADD</TransactionId>
<msisdn>8801011429</msisdn>
<productCode>DATA</productCode>
<action>ADD</action>
<IMSI>405801124044563</IMSI>
<SubsType>PrePaid</SubsType>
</VASProxyType>
</soap:Body>>
####<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1465022150889> <BEA-000000> < [PipelinePairNode1, PipelinePairNode1_request, Authentication, REQUEST] ***REQUEST FOR VAS V-3.0 ****: <soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<VASProxyType xmlns="http://xmlns.aircel.com/AircelTransformation/ProxyService/OrderProxy/1.0/CreateVASSubscriptionConsumerSchema">
<TransactionId>DATA030620160431128801011429ADD</TransactionId>
The output should be:
<[ACTIVE] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <> <1465022150886> < [PipelinePairNode1, PipelinePairNode1_request, CreateVASReportingStage, REQUEST] * CreateVASWrapper Reprting Stage VAS V-3.0 *: DATA030620160431128801011429ADD 8801011429 DATA ADD 405801124044563 PrePaid >
Kindly help!
Answer 0 (score: 0)
As suggested in the comments: your XML is invalid. It would be better to make sure the XML is valid and then use a parser such as etree or Beautiful Soup.
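As a rough illustration of the parser route, the sketch below cuts the well-formed part of one record (from "<soap:Body" up to the first "</soap:Body>", dropping the stray extra ">") and hands it to the standard library's xml.etree.ElementTree. The record string is a shortened, hypothetical version of the sample in the question, and extract_fields is a made-up helper name, not something from the original post.

```python
import xml.etree.ElementTree as ET

# Hypothetical record, shortened from the sample log in the question.
record = (
    "####<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> "
    "<1465022150886> < [PipelinePairNode1, REQUEST] *** VAS V-3.0 ***: "
    '<soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">\n'
    '<VASProxyType xmlns="http://xmlns.aircel.com/AircelTransformation/'
    'ProxyService/OrderProxy/1.0/CreateVASSubscriptionConsumerSchema">\n'
    "<TransactionId>DATA030620160431128801011429ADD</TransactionId>\n"
    "<msisdn>8801011429</msisdn>\n"
    "<productCode>DATA</productCode>\n"
    "<action>ADD</action>\n"
    "<IMSI>405801124044563</IMSI>\n"
    "<SubsType>PrePaid</SubsType>\n"
    "</VASProxyType>\n"
    "</soap:Body>>\n"
)


def extract_fields(record):
    """Return {tag: text} for the payload inside <soap:Body>, or None.

    Hypothetical helper: assumes the text between "<soap:Body" and the
    first "</soap:Body>" is well-formed XML.
    """
    end_tag = "</soap:Body>"
    start = record.find("<soap:Body")
    end = record.find(end_tag, start)
    if start == -1 or end == -1:
        return None
    root = ET.fromstring(record[start:end + len(end_tag)])
    fields = {}
    for elem in root.iter():
        # Tags come back as "{namespace}name"; keep only the local name.
        name = elem.tag.rsplit("}", 1)[-1]
        if elem.text and elem.text.strip():
            fields[name] = elem.text.strip()
    return fields


fields = extract_fields(record)
print(fields["msisdn"], fields["action"], fields["SubsType"])
```

The header part of the record (everything before "<soap:Body") is not XML, so it would still have to be handled with string methods or a regular expression.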
But if you want to use regular expressions, you can try:
import re
mytext = [
'####<Jun 4, 2016 12:05:50 PM IST> <Debug> <MessagingBridgeRuntimeVerbose> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: \'13\' for queue: \'weblogic.kernel.Default (self-tuning)\'> <<WLS Kernel>> <> <> <1465022150722> <BEA-000000> <Bridge NPGBridge doTrigger(): state = 4 stopped = false>',
'####<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: \'13\' for queue: \'weblogic.kernel.Default (self-tuning)\'> <<anonymous>> <> <> <1465022150886> <BEA-000000> < [PipelinePairNode1, PipelinePairNode1_request, CreateVASReportingStage, REQUEST] *** CreateVASWrapper Reprting Stage VAS V-3.0 ***: <soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">',
'<VASProxyType xmlns="http://xmlns.aircel.com/AircelTransformation/ProxyService/OrderProxy/1.0/CreateVASSubscriptionConsumerSchema">',
' <TransactionId>DATA030620160431128801011429ADD</TransactionId>',
' <msisdn>8801011429</msisdn>',
' <productCode>DATA</productCode>',
' <action>ADD</action>',
' <IMSI>405801124044563</IMSI>',
' <SubsType>PrePaid</SubsType>',
'</VASProxyType>',
'</soap:Body>',
'<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: \'13\' for queue: \'weblogic.kernel.Default (self-tuning)\'> <<anonymous>> <> <> <1465022150889> <BEA-000000> < [PipelinePairNode1, PipelinePairNode1_request, Authentication, REQUEST] ***REQUEST FOR VAS V-3.0 ****: <soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">',
' <VASProxyType xmlns="http://xmlns.aircel.com/AircelTransformation/ProxyService/OrderProxy/1.0/CreateVASSubscriptionConsumerSchema">',
' <TransactionId>DATA030620160431128801011429ADD</TransactionId>',
]
# Each entry: "if_in" is a cheap substring test, "search" is the regex to apply.
# Raw strings keep backslashes such as \[ and \d intact for the regex engine.
searches = [
    {
        "if_in": "<[ACTIVE] ExecuteThread:",
        "search": r"<\[ACTIVE[^<>]+> <<WLS Kernel>> <> <> <\d+>",
    },
    {
        "if_in": "PipelinePairNode1, PipelinePairNode1_request, Create",
        "search": r"< \[PipelinePairNode1, PipelinePairNode1_request, Create[^\[\]]+\]",
    },
    {
        "if_in": "CreateVASWrapper Reprting Stage VAS",
        "search": r"CreateVASWrapper Reprting Stage VAS[^*]+",
    },
    {
        "if_in": "<TransactionId>",
        "search": r"(?<=<TransactionId>)[^<>]+",
    },
    {
        "if_in": "<msisdn>",
        "search": r"(?<=<msisdn>)[^<>]+",
    },
    {
        "if_in": "<action>",
        "search": r"(?<=<action>)[^<>]+",
    },
    {
        "if_in": "<IMSI>",
        "search": r"(?<=<IMSI>)[^<>]+",
    },
    {
        "if_in": "<SubsType>",
        "search": r"(?<=<SubsType>)[^<>]+",
    },
]
result = ""
found_once = []  # markers already matched, so each is reported only once
for item in mytext:
    for search in searches:
        if search['if_in'] in item and search['if_in'] not in found_once:
            f = re.findall(search['search'], item)
            if f:
                result += f[0] + " "
                found_once.append(search['if_in'])
print(result)
If you want to look for anything else, add it to searches.
The result will be:
<[ACTIVE] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1465022150722> < [PipelinePairNode1, PipelinePairNode1_request, CreateVASReportingStage, REQUEST] CreateVASWrapper Reprting Stage VAS V-3.0 DATA030620160431128801011429ADD 8801011429 ADD 405801124044563 PrePaid
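The result above has no product code, because searches has no entry for it. As a small, self-contained illustration of extending the list (the productCode entry is an addition for this example, and the two input lines are copied from the sample), one more dictionary in the same shape is enough:

```python
import re

# Entries in the same shape as the answer's `searches` list; the
# productCode entry is the new one added for this illustration.
searches = [
    {"if_in": "<msisdn>", "search": r"(?<=<msisdn>)[^<>]+"},
    {"if_in": "<productCode>", "search": r"(?<=<productCode>)[^<>]+"},
]

lines = [
    " <msisdn>8801011429</msisdn>",
    " <productCode>DATA</productCode>",
]

found = []
for line in lines:
    for search in searches:
        if search["if_in"] in line:
            match = re.search(search["search"], line)
            if match:
                found.append(match.group(0))

print(" ".join(found))  # 8801011429 DATA
```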
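Answer 1 (score: 0)
The canonical way to handle this kind of problem is to write some sort of "event-based" parser (like a SAX XML parser): the parser reads the file line by line (you don't need to read the whole file into memory), scans each line according to your own rules (you may want regular expressions, but plain string methods sometimes work just as well), and, depending on the line's content, emits a given "event" (handled by a callback method) with the relevant data.
In your case, you would have one event for the line that starts an interesting block of data (a line starting with "####"), another for lines containing XML data, and one for the last line of the block (the line containing "</soap:Body>>"), something like this: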
class Parser(object):
    def parse(self, logfile):
        self.in_block = False
        # Iterating over the file object reads one line at a time.
        for line in logfile:
            if self.is_block_start(line):
                self.in_block = True
                self.handle_block_start(line)
            elif self.in_block:
                if self.is_data(line):
                    self.handle_data(line)
                elif self.is_block_end(line):
                    self.in_block = False
                    self.handle_block_end(line)
            else:
                continue

    def is_block_start(self, line):
        # your code here
        raise NotImplementedError

    def is_data(self, line):
        # your code here
        raise NotImplementedError

    def is_block_end(self, line):
        # your code here
        raise NotImplementedError

    def handle_block_start(self, line):
        # your code here
        raise NotImplementedError

    def handle_data(self, line):
        # your code here
        raise NotImplementedError

    def handle_block_end(self, line):
        # your code here
        raise NotImplementedError
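To make the skeleton more concrete, here is one possible way to fill it in, written as a compact stand-alone sketch rather than a subclass. It assumes a block starts on a line beginning with "####" and ends on a later line containing "</soap:Body>>", as described in the question. BlockCollector, blocks_with_msisdn and the shortened sample text are hypothetical names and data chosen for this example.

```python
import io
import re

MSISDN_RE = re.compile(r"<msisdn>(\d+)</msisdn>")


class BlockCollector(object):
    """Hypothetical collector of '####' ... '</soap:Body>>' blocks."""

    def __init__(self):
        self.blocks = []      # finished blocks, each a list of lines
        self.current = None   # lines of the block being read, or None

    def parse(self, logfile):
        for line in logfile:  # one line at a time, never the whole file
            line = line.rstrip("\n")
            if line.startswith("####"):
                # A new header starts a block and drops an unfinished one.
                self.current = [line]
            elif self.current is not None:
                self.current.append(line)
                if "</soap:Body>>" in line:
                    self.blocks.append(self.current)
                    self.current = None
        return self.blocks


def blocks_with_msisdn(blocks):
    """Keep only blocks that contain a <msisdn>digits</msisdn> element."""
    return [b for b in blocks if any(MSISDN_RE.search(l) for l in b)]


# Shortened, hypothetical stand-in for the real log file.
sample = io.StringIO(u"""####<Jun 4, 2016 12:05:50 PM IST> <Debug> <Bridge NPGBridge doTrigger()>
####<Jun 4, 2016 12:05:50 PM IST> <Error> <1465022150886> <soap:Body>
<msisdn>8801011429</msisdn>
<action>ADD</action>
</soap:Body>>
####<Jun 4, 2016 12:05:50 PM IST> <Error> <1465022150889> <soap:Body>
<TransactionId>DATA030620160431128801011429ADD</TransactionId>
""")

matches = blocks_with_msisdn(BlockCollector().parse(sample))
for block in matches:
    print("\n".join(block))
```

With a real file, the io.StringIO object would be replaced by an open file handle, for example inside a with open(...) block. Only the block currently being read is buffered, although the list of finished blocks still grows with the number of complete blocks found.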