How to print only the parts of a data file that match a pattern, in Python

Time: 2016-07-19 07:58:45

Tags: python regex parsing

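I want to print certain blocks of data from a file using Python. Basically it should work as a parser that outputs only the blocks matching my criteria. My file contains call-center logs. I want each section that starts with "####" and ends with "</soap:Body>>", but only if it also contains a specific number, called the msisdn, in the form "<msisdn>any number</msisdn>".

The file is also fairly large. When I call readlines(), I can't apply a regular expression inside `for i, line in enumerate(data)`, because the data is split into lines and I can't search the whole block I need.

Part of the file is here: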

####<Jun 4, 2016 12:05:50 PM IST> <Debug> <MessagingBridgeRuntimeVerbose> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1465022150722> <BEA-000000> <Bridge NPGBridge doTrigger(): state = 4 stopped = false> 
####<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1465022150886> <BEA-000000> < [PipelinePairNode1, PipelinePairNode1_request, CreateVASReportingStage, REQUEST] *** CreateVASWrapper Reprting Stage VAS V-3.0 ***: <soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <VASProxyType xmlns="http://xmlns.aircel.com/AircelTransformation/ProxyService/OrderProxy/1.0/CreateVASSubscriptionConsumerSchema">
    <TransactionId>DATA030620160431128801011429ADD</TransactionId>
    <msisdn>8801011429</msisdn>
    <productCode>DATA</productCode>
    <action>ADD</action>
    <IMSI>405801124044563</IMSI>
    <SubsType>PrePaid</SubsType>
  </VASProxyType>
</soap:Body>> 
####<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1465022150889> <BEA-000000> < [PipelinePairNode1, PipelinePairNode1_request, Authentication, REQUEST] ***REQUEST FOR VAS V-3.0 ****: <soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <VASProxyType xmlns="http://xmlns.aircel.com/AircelTransformation/ProxyService/OrderProxy/1.0/CreateVASSubscriptionConsumerSchema">
    <TransactionId>DATA030620160431128801011429ADD</TransactionId>

The output should be:

    <[ACTIVE] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <> <1465022150886> < [PipelinePairNode1, PipelinePairNode1_request, CreateVASReportingStage, REQUEST] * CreateVASWrapper Reprting Stage VAS V-3.0 *

    DATA030620160431128801011429ADD     8801011429     DATA     ADD     405801124044563     PrePaid    >

Kindly help!

2 answers:

Answer 0: (score: 0)

As suggested in the comments: your XML is invalid. It would be better to make sure the XML is valid and then use a parser such as etree or Beautiful Soup.
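A minimal sketch of that parser route, assuming the only problems are the log prefix before `<soap:Body` and the stray `>` after `</soap:Body>`. The helper name `extract_fields` is hypothetical, and the sample block is shortened from the question's log:

```python
import xml.etree.ElementTree as ET

# One log block from the question, with the long log prefix shortened.
block = """####<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> *** CreateVASWrapper Reprting Stage VAS V-3.0 ***: <soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <VASProxyType xmlns="http://xmlns.aircel.com/AircelTransformation/ProxyService/OrderProxy/1.0/CreateVASSubscriptionConsumerSchema">
    <TransactionId>DATA030620160431128801011429ADD</TransactionId>
    <msisdn>8801011429</msisdn>
    <productCode>DATA</productCode>
    <action>ADD</action>
    <IMSI>405801124044563</IMSI>
    <SubsType>PrePaid</SubsType>
  </VASProxyType>
</soap:Body>>
"""


def extract_fields(block_text):
    """Cut the well-formed XML out of a log block and return its leaf fields."""
    end_tag = "</soap:Body>"
    start = block_text.find("<soap:Body")
    end = block_text.find(end_tag, start)
    if start == -1 or end == -1:
        return {}
    # Slicing up to the end tag drops both the log prefix and the stray ">".
    root = ET.fromstring(block_text[start:end + len(end_tag)])
    fields = {}
    for elem in root.iter():
        # Keep leaf elements with text; strip the "{namespace}" tag prefix.
        if len(elem) == 0 and elem.text and elem.text.strip():
            fields[elem.tag.split("}")[-1]] = elem.text.strip()
    return fields


fields = extract_fields(block)
print(fields["msisdn"])
```

Once the fields are in a dictionary, filtering on a particular msisdn is a plain comparison rather than another regular expression.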

But if you want to use regular expressions, you could try:

import re

mytext = [
    '####<Jun 4, 2016 12:05:50 PM IST> <Debug> <MessagingBridgeRuntimeVerbose> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: \'13\' for queue: \'weblogic.kernel.Default (self-tuning)\'> <<WLS Kernel>> <> <> <1465022150722> <BEA-000000> <Bridge NPGBridge doTrigger(): state = 4 stopped = false>',
    '####<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: \'13\' for queue: \'weblogic.kernel.Default (self-tuning)\'> <<anonymous>> <> <> <1465022150886> <BEA-000000> < [PipelinePairNode1, PipelinePairNode1_request, CreateVASReportingStage, REQUEST] *** CreateVASWrapper Reprting Stage VAS V-3.0 ***: <soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">',
    '<VASProxyType xmlns="http://xmlns.aircel.com/AircelTransformation/ProxyService/OrderProxy/1.0/CreateVASSubscriptionConsumerSchema">',
    '    <TransactionId>DATA030620160431128801011429ADD</TransactionId>',
    '    <msisdn>8801011429</msisdn>',
    '    <productCode>DATA</productCode>',
    '    <action>ADD</action>',
    '    <IMSI>405801124044563</IMSI>',
    '    <SubsType>PrePaid</SubsType>',
    '</VASProxyType>',
    '</soap:Body>',
    '<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: \'13\' for queue: \'weblogic.kernel.Default (self-tuning)\'> <<anonymous>> <> <> <1465022150889> <BEA-000000> < [PipelinePairNode1, PipelinePairNode1_request, Authentication, REQUEST] ***REQUEST FOR VAS V-3.0 ****: <soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">',
    '    <VASProxyType xmlns="http://xmlns.aircel.com/AircelTransformation/ProxyService/OrderProxy/1.0/CreateVASSubscriptionConsumerSchema">',
    '        <TransactionId>DATA030620160431128801011429ADD</TransactionId>',
]

searches = [
    {
        "if_in": "<[ACTIVE] ExecuteThread:",
        "search": r"<\[ACTIVE[^<>]+> <<WLS Kernel>> <> <> <\d+>",
    },
    {
        "if_in": "PipelinePairNode1, PipelinePairNode1_request, Create",
        "search": r"< \[PipelinePairNode1, PipelinePairNode1_request, Create[^\[\]]+\]",
    },
    {
        "if_in": "CreateVASWrapper Reprting Stage VAS",
        "search": "CreateVASWrapper Reprting Stage VAS[^*]+",
    },
    {
        "if_in": "<TransactionId>",
        "search": "(?<=<TransactionId>)[^<>]+",
    },
    {
        "if_in": "<msisdn>",
        "search": "(?<=<msisdn>)[^<>]+",
    },
    {
        "if_in": "<action>",
        "search": "(?<=<action>)[^<>]+",
    },
    {
        "if_in": "<IMSI>",
        "search": "(?<=<IMSI>)[^<>]+",
    },
    {
        "if_in": "<SubsType>",
        "search": "(?<=<SubsType>)[^<>]+",
    },
]

result = ""
found_once = []

for item in mytext:
    for search in searches:
        if search['if_in'] in item and search['if_in'] not in found_once:
            f = re.findall(search['search'], item)
            if f:
                result += f[0] + " "
                found_once.append(search['if_in'])

print(result)

If you want to find something else, add it to searches.

The result will be:

<[ACTIVE] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1465022150722> < [PipelinePairNode1, PipelinePairNode1_request, CreateVASReportingStage, REQUEST] CreateVASWrapper Reprting Stage VAS V-3.0  DATA030620160431128801011429ADD 8801011429 ADD 405801124044563 PrePaid
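The expected output in the question also lists the productCode value (DATA), which none of the searches entries above covers. A minimal sketch of one more entry in the same shape, checked here against a single line. The variable name `product_code_search` is only for illustration:

```python
import re

# Extra entry in the same shape as the items in the searches list.
product_code_search = {
    "if_in": "<productCode>",
    "search": r"(?<=<productCode>)[^<>]+",
}

line = "    <productCode>DATA</productCode>"
if product_code_search["if_in"] in line:
    match = re.findall(product_code_search["search"], line)
    print(match[0])
```

Because the outer loop walks the lines of the text, the captured value lands in the result in file order regardless of where the entry sits in the list.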

Answer 1: (score: 0)

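The canonical way to handle this kind of problem is to write some kind of "event-based" parser (like a SAX XML parser...). The parser reads the file line by line (you don't need to read the whole file into memory), scans each line according to your own rules (you may want to use regular expressions, but sometimes plain string methods work just as well), and, depending on the line content, emits a given "event" (which will be handled by a callback method) along with the relevant data.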

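In your case, you would have one event for the line that starts an interesting block of data (a line starting with "####"), another for lines containing XML data, and one for the last line of the block (the line containing "</soap:Body>>"). Something like this: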

class Parser(object):

    def parse(self, logfile):
        self.in_block = False
        for line in logfile:
            if self.is_block_start(line):
                self.in_block = True
                self.handle_block_start(line)
            elif self.in_block:
                if self.is_data(line):
                    self.handle_data(line)
                elif self.is_block_end(line):
                    self.in_block = False
                    self.handle_block_end(line)
            else:
                continue

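    def is_block_start(self, line):
        # your code here: return True if the line opens a block
        raise NotImplementedError

    def is_data(self, line):
        # your code here: return True if the line carries XML data
        raise NotImplementedError

    def is_block_end(self, line):
        # your code here: return True if the line closes the block
        raise NotImplementedError

    def handle_block_start(self, line):
        # your code here
        pass

    def handle_data(self, line):
        # your code here
        pass

    def handle_block_end(self, line):
        # your code here
        pass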
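One way the skeleton could be filled in for the question's log format is sketched below, as a simplified variation rather than a definitive implementation. It checks for the block end first and treats every other line inside a block as data, so `is_data` is dropped. The class name `BlockParser`, the `wanted_msisdn` argument, and the shortened sample log (including the second msisdn) are made up for illustration:

```python
import io
import re


class BlockParser(object):
    """Line-by-line parser that keeps blocks containing a given msisdn."""

    MSISDN_RE = re.compile(r"<msisdn>(\d+)</msisdn>")

    def __init__(self, wanted_msisdn):
        self.wanted_msisdn = wanted_msisdn
        self.matches = []     # completed blocks containing the wanted msisdn
        self.in_block = False
        self.lines = []       # lines of the block currently being read
        self.msisdn = None    # msisdn seen in the current block, if any

    def parse(self, logfile):
        for line in logfile:
            if self.is_block_start(line):
                self.in_block = True
                self.handle_block_start(line)
            elif self.in_block:
                if self.is_block_end(line):
                    self.in_block = False
                    self.handle_block_end(line)
                else:
                    self.handle_data(line)
        return self.matches

    def is_block_start(self, line):
        return line.startswith("####")

    def is_block_end(self, line):
        return line.rstrip().endswith("</soap:Body>>")

    def handle_block_start(self, line):
        # A new "####" header discards any unfinished block.
        self.lines = [line]
        self.msisdn = None

    def handle_data(self, line):
        self.lines.append(line)
        found = self.MSISDN_RE.search(line)
        if found:
            self.msisdn = found.group(1)

    def handle_block_end(self, line):
        self.lines.append(line)
        if self.msisdn == self.wanted_msisdn:
            self.matches.append("".join(self.lines))


# Made-up, shortened sample in the shape of the question's log.
sample = io.StringIO(u"""####<Jun 4, 2016 12:05:50 PM IST> <Debug> <Bridge NPGBridge doTrigger(): state = 4 stopped = false>
####<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> <soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
    <msisdn>8801011429</msisdn>
    <action>ADD</action>
</soap:Body>>
####<Jun 4, 2016 12:05:51 PM IST> <Error> <ALSB Logging> <soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
    <msisdn>9999999999</msisdn>
</soap:Body>>
""")

blocks = BlockParser("8801011429").parse(sample)
print(len(blocks))
print(blocks[0])
```

For a real log, passing an open file object to `parse` should work the same way, since iterating over a file yields one line at a time and only the current block is held in memory.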