Question

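I want to use Python to print certain blocks of data from a file. Basically it should work as a parser that outputs only the blocks matching my criteria. My file contains call-center logs. I want each section that starts with "####" and ends with "</soap:Body>>", but only if it also contains a specific number called msisdn: "<msisdn>any number</msisdn>"

The file is also fairly large. When I call readlines(), I can't apply a regular expression inside a loop like for i, line in enumerate(data), because the data is split into lines there and I can't search the whole block I need.

Part of the file is here: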

####<Jun 4, 2016 12:05:50 PM IST> <Debug> <MessagingBridgeRuntimeVerbose> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1465022150722> <BEA-000000> <Bridge NPGBridge doTrigger(): state = 4 stopped = false> 
####<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1465022150886> <BEA-000000> < [PipelinePairNode1, PipelinePairNode1_request, CreateVASReportingStage, REQUEST] *** CreateVASWrapper Reprting Stage VAS V-3.0 ***: <soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <VASProxyType xmlns="http://xmlns.aircel.com/AircelTransformation/ProxyService/OrderProxy/1.0/CreateVASSubscriptionConsumerSchema">
    <TransactionId>DATA030620160431128801011429ADD</TransactionId>
    <msisdn>8801011429</msisdn>
    <productCode>DATA</productCode>
    <action>ADD</action>
    <IMSI>405801124044563</IMSI>
    <SubsType>PrePaid</SubsType>
  </VASProxyType>
</soap:Body>> 
####<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1465022150889> <BEA-000000> < [PipelinePairNode1, PipelinePairNode1_request, Authentication, REQUEST] ***REQUEST FOR VAS V-3.0 ****: <soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <VASProxyType xmlns="http://xmlns.aircel.com/AircelTransformation/ProxyService/OrderProxy/1.0/CreateVASSubscriptionConsumerSchema">
    <TransactionId>DATA030620160431128801011429ADD</TransactionId>

The output should be:

<[ACTIVE] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <> <1465022150886> < [PipelinePairNode1, PipelinePairNode1_request, CreateVASReportingStage, REQUEST] * CreateVASWrapper Reprting Stage VAS V-3.0 * :

DATA030620160431128801011429ADD 8801011429 DATA ADD 405801124044563 PrePaid >

Kindly help!

Answer 1

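As suggested in the comments: your XML is invalid. It would be best to make sure the XML is valid and then use a parser such as [etree] [1] or [Beautiful Soup] [2].

But if you want to use regular expressions, you could try: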
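As a rough sketch of the parser route: the part of each record between the opening soap:Body tag and the first "</soap:Body>" is well-formed XML in the sample, so it can be cut out and handed to the standard library's xml.etree.ElementTree. The names parse_body, BODY_RE and the shortened sample record are made up for illustration, and the namespace URI is copied from the VASProxyType element in the question; adjust them if your real log differs.

```python
import re
import xml.etree.ElementTree as ET

# Default namespace of VASProxyType, copied from the sample log in the question.
VAS_NS = ("http://xmlns.aircel.com/AircelTransformation/ProxyService/"
          "OrderProxy/1.0/CreateVASSubscriptionConsumerSchema")
NS = {"vas": VAS_NS}

# Non-greedy match from the opening soap:Body tag to its closing tag. This leaves
# out the log header in front of it and the stray ">" that follows it.
BODY_RE = re.compile(r"<soap:Body\b.*?</soap:Body>", re.DOTALL)


def parse_body(block):
    """Return the VASProxyType children as a dict, or None if nothing usable is found."""
    match = BODY_RE.search(block)
    if match is None:
        return None
    root = ET.fromstring(match.group(0))
    proxy = root.find("vas:VASProxyType", NS)
    if proxy is None:
        return None
    # ElementTree reports tags as "{namespace}name"; keep only the name part.
    return {child.tag.split("}", 1)[-1]: child.text for child in proxy}


# Shortened stand-in for one record of the log (header fields abbreviated).
sample = "\n".join([
    '####<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> (header shortened): '
    '<soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">',
    '  <VASProxyType xmlns="%s">' % VAS_NS,
    '    <msisdn>8801011429</msisdn>',
    '    <action>ADD</action>',
    '  </VASProxyType>',
    '</soap:Body>> ',
])

print(parse_body(sample))
```

This only works for records whose body is complete; a record cut off before "</soap:Body>" simply yields None.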

import re

mytext = [
    '####<Jun 4, 2016 12:05:50 PM IST> <Debug> <MessagingBridgeRuntimeVerbose> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: \'13\' for queue: \'weblogic.kernel.Default (self-tuning)\'> <<WLS Kernel>> <> <> <1465022150722> <BEA-000000> <Bridge NPGBridge doTrigger(): state = 4 stopped = false>',
    '####<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: \'13\' for queue: \'weblogic.kernel.Default (self-tuning)\'> <<anonymous>> <> <> <1465022150886> <BEA-000000> < [PipelinePairNode1, PipelinePairNode1_request, CreateVASReportingStage, REQUEST] *** CreateVASWrapper Reprting Stage VAS V-3.0 ***: <soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">',
    '<VASProxyType xmlns="http://xmlns.aircel.com/AircelTransformation/ProxyService/OrderProxy/1.0/CreateVASSubscriptionConsumerSchema">',
    '    <TransactionId>DATA030620160431128801011429ADD</TransactionId>',
    '    <msisdn>8801011429</msisdn>',
    '    <productCode>DATA</productCode>',
    '    <action>ADD</action>',
    '    <IMSI>405801124044563</IMSI>',
    '    <SubsType>PrePaid</SubsType>',
    '</VASProxyType>',
    '</soap:Body>',
    '####<Jun 4, 2016 12:05:50 PM IST> <Error> <ALSB Logging> <ggneai29> <AircelESB_MS1> <[ACTIVE] ExecuteThread: \'13\' for queue: \'weblogic.kernel.Default (self-tuning)\'> <<anonymous>> <> <> <1465022150889> <BEA-000000> < [PipelinePairNode1, PipelinePairNode1_request, Authentication, REQUEST] ***REQUEST FOR VAS V-3.0 ****: <soap:Body xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">',
    '    <VASProxyType xmlns="http://xmlns.aircel.com/AircelTransformation/ProxyService/OrderProxy/1.0/CreateVASSubscriptionConsumerSchema">',
    '        <TransactionId>DATA030620160431128801011429ADD</TransactionId>',
]

# Each entry: "if_in" is a cheap substring test, "search" is the regex applied
# to lines that pass it. Raw strings keep the backslashes intact for re.
searches = [
    {
        "if_in": "<[ACTIVE] ExecuteThread:",
        "search": r"<\[ACTIVE[^<>]+> <<WLS Kernel>> <> <> <\d+>",
    },
    {
        "if_in": "PipelinePairNode1, PipelinePairNode1_request, Create",
        "search": r"< \[PipelinePairNode1, PipelinePairNode1_request, Create[^\[\]]+\]",
    },
    {
        "if_in": "CreateVASWrapper Reprting Stage VAS",
        "search": r"CreateVASWrapper Reprting Stage VAS[^*]+",
    },
    {
        "if_in": "<TransactionId>",
        "search": r"(?<=<TransactionId>)[^<>]+",
    },
    {
        "if_in": "<msisdn>",
        "search": r"(?<=<msisdn>)[^<>]+",
    },
    {
        "if_in": "<action>",
        "search": r"(?<=<action>)[^<>]+",
    },
    {
        "if_in": "<IMSI>",
        "search": r"(?<=<IMSI>)[^<>]+",
    },
    {
        "if_in": "<SubsType>",
        "search": r"(?<=<SubsType>)[^<>]+",
    },
]

result = ""
found_once = []  # "if_in" keys that have already produced a match

for item in mytext:
    for search in searches:
        # Apply each search only until its first hit.
        if search['if_in'] in item and search['if_in'] not in found_once:
            f = re.findall(search['search'], item)
            if f:
                result += f[0] + " "
                found_once.append(search['if_in'])

print(result)

If you want to look for something else, add it to searches.

The result will be:

<[ACTIVE] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1465022150722> < [PipelinePairNode1, PipelinePairNode1_request, CreateVASReportingStage, REQUEST] CreateVASWrapper Reprting Stage VAS V-3.0  DATA030620160431128801011429ADD 8801011429 ADD 405801124044563 PrePaid
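The loop above tests each line on its own and keeps only the first hit per search, so it does not select whole blocks by msisdn. Here is a minimal sketch of one way to do that without loading the file into memory: group the lines into records that start with "####", then keep only the records that contain both an msisdn element and the "</soap:Body>>" end marker. The names iter_blocks, matching_blocks and DEMO_TEXT are invented for this example, and the demo text is a shortened stand-in for the real log.

```python
import io
import re

MSISDN_RE = re.compile(r"<msisdn>\d+</msisdn>")


def iter_blocks(lines, start="####"):
    """Yield each record as one string: a line starting with `start` plus the lines after it."""
    block = []
    for line in lines:
        if line.startswith(start) and block:
            # A new record begins, so the previous one is complete.
            yield "".join(block)
            block = []
        if block or line.startswith(start):
            # Lines before the first start marker are ignored.
            block.append(line)
    if block:
        yield "".join(block)


def matching_blocks(lines, end="</soap:Body>>"):
    """Yield records cut at the end marker, keeping only those that contain an msisdn."""
    for block in iter_blocks(lines):
        stop = block.find(end)
        if stop == -1:
            continue  # record has no end marker (e.g. a one-line Debug entry)
        cut = block[:stop + len(end)]
        if MSISDN_RE.search(cut):
            yield cut


# Shortened stand-in for the log; a real run would pass an open file object instead.
DEMO_TEXT = (
    "####<t1> <Debug> <no xml here>\n"
    "####<t2> <Error> <request: <soap:Body>\n"
    "    <msisdn>8801011429</msisdn>\n"
    "</soap:Body>> \n"
    "####<t3> <Error> <request: <soap:Body>\n"
    "    <TransactionId>X</TransactionId>\n"
)

for block in matching_blocks(io.StringIO(DEMO_TEXT)):
    print(block)
```

Because a file object is itself an iterable of lines, passing the result of open() to matching_blocks reads the log one line at a time, and only the current record is held in memory.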

Answer 2

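The canonical way to handle this kind of problem is to write some kind of "event-based" parser (like a SAX XML parser): the parser reads the file line by line (so you don't need to hold the whole file in memory), scans each line according to your own rules (you may want regular expressions, but sometimes plain string methods work just as well), and, depending on the line's content, emits a given "event" with the relevant data, which is then handled by a callback method.

In your case you would have one event for the line that starts an interesting block of data (a line starting with "####"), another for the lines containing XML data, and one for the last line of the block (the line containing "</soap:Body>>"). Something like this: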

class Parser(object):

    def parse(self, logfile):
        self.in_block = False
        for line in logfile:
            if self.is_block_start(line):
                self.in_block = True
                self.handle_block_start(line)
            elif self.in_block:
                if self.is_data(line):
                    self.handle_data(line)
                elif self.is_block_end(line):
                    self.in_block = False
                    self.handle_block_end(line)
            else:
                continue

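    # Predicates: each receives the current line and returns True or False.

    def is_block_start(self, line):
        # your code here
        raise NotImplementedError

    def is_data(self, line):
        # your code here
        raise NotImplementedError

    def is_block_end(self, line):
        # your code here
        raise NotImplementedError

    # Callbacks: each receives the line that triggered the event.

    def handle_block_start(self, line):
        # your code here
        raise NotImplementedError

    def handle_data(self, line):
        # your code here
        raise NotImplementedError

    def handle_block_end(self, line):
        # your code here
        raise NotImplementedError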
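To make the skeleton concrete, here is one possible way to fill it in for this log. It is a sketch under my own assumptions, not the only way to do it: a block starts at a line beginning with "####", ends at a line containing "</soap:Body>>", and is kept only if it contains "<msisdn>". The class name MsisdnBlockParser, the blocks attribute and DEMO_LOG are invented for the example, and every line between start and end is treated as data, so there is no separate is_data check.

```python
import io


class MsisdnBlockParser(object):
    START = "####"
    END = "</soap:Body>>"

    def __init__(self):
        self.in_block = False
        self.lines = []   # lines of the block currently being read
        self.blocks = []  # finished blocks that contain an msisdn element

    def parse(self, logfile):
        for line in logfile:
            if self.is_block_start(line):
                self.in_block = True
                self.handle_block_start(line)
            elif self.in_block:
                if self.is_block_end(line):
                    self.in_block = False
                    self.handle_block_end(line)
                else:
                    self.handle_data(line)
        return self.blocks

    def is_block_start(self, line):
        return line.startswith(self.START)

    def is_block_end(self, line):
        return self.END in line

    def handle_block_start(self, line):
        # A new "####" line discards any unfinished block, such as a one-line Debug record.
        self.lines = [line]

    def handle_data(self, line):
        self.lines.append(line)

    def handle_block_end(self, line):
        self.lines.append(line)
        block = "".join(self.lines)
        if "<msisdn>" in block:
            self.blocks.append(block)
        self.lines = []


# Shortened stand-in for the log; pass an open file object for a real run.
DEMO_LOG = (
    "####<t1> <Debug> <no xml here>\n"
    "####<t2> <Error> <request: <soap:Body>\n"
    "    <msisdn>8801011429</msisdn>\n"
    "</soap:Body>> \n"
    "####<t3> <Error> <request: <soap:Body>\n"
    "    <TransactionId>X</TransactionId>\n"
)

for block in MsisdnBlockParser().parse(io.StringIO(DEMO_LOG)):
    print(block)
```

Collecting the blocks in a list is only for the demo; for a very large file you could print or write each block inside handle_block_end instead of storing it.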

How to print only certain parts of a data file that match our pattern in Python
