Question

My text file looks like the following. Lines of this type occur many times in the file.

[Nov 22 22:27:13] INFO - [com.macys.seo.business.impl.LinkBusinessImpl] - Executing Search(WS) Gateway request: KeywordVO (keyword = GUESS score = 83965 normalizedKeyword = GUESS productIds = [] categoryIds = [] hotListed = false blackListed = false globalHotList = false url = /buy/GUESS)

I want to extract only the following data into a file, like this:

keyword = Guess, Score = 83965, hotListed = false, globalHotList = false url = /buy/GUESS

This is what I have so far:

def get_sentences(filename):
    with open('d:\log.log') as file_contents:
        d1, d2 = '( ', ' )' # just example delimiters
        for line in file_contents:
            if d1 in line:
                results = []
            elif d2 in line:
                yield results
            else: results.append(line)
    print results

Please advise.

Answer 1

Regular expressions can help you do the parsing in one pass:

import re, pprint

with open(r'd:\log.log') as f:  # raw string so the backslash is taken literally
    s = f.read()
results = re.findall(r'KeywordVO \((.*?)\)', s)
pprint.pprint(results)

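The regular expression above uses KeywordVO to identify which parentheses are relevant (I am guessing you do not want to match the (WS) part of the sample text). You may need to look closely at the log file to work out the exact regular expression that extracts the data you need.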

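Once you have the long text string of all the keyword pairs, use another regular expression to split out the key/value pairs: r'[A-Za-z]+\s*=\s*[A-Za-z\[\]\,]'. Treat that pattern as a starting point: as written it matches only a single character of the value and does not allow digits or slashes. This regular expression is tricky because you want to capture complex values on the right-hand side of the equals sign without accidentally capturing the next key (unfortunately, the key/value pairs are not delimited by commas or anything similar).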
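One way to avoid swallowing the next key is a lazy value followed by a lookahead for the next "key =". The following is only a sketch: the sample string `body` and the name `PAIR_RE` are made up for illustration, and it assumes keys consist of word characters and that a value never itself contains something that looks like "word =".

```python
import re

# Hypothetical sample: the text captured between "KeywordVO (" and ")".
body = ("keyword = GUESS score = 83965 normalizedKeyword = GUESS "
        "productIds = [] categoryIds = [] hotListed = false "
        "blackListed = false globalHotList = false url = /buy/GUESS")

# A key is a run of word characters followed by "=". The value is matched
# lazily and stops just before the next "key =" or at the end of the string,
# so values may contain spaces without running into the following key.
PAIR_RE = re.compile(r'(\w+)\s*=\s*(.*?)(?=\s+\w+\s*=|\s*$)')

pairs = dict(PAIR_RE.findall(body))
print(pairs['keyword'], pairs['score'], pairs['url'])
```

Because the value is delimited by the next key rather than by whitespace, a multi-word value such as "GUESS JEANS" would stay in one piece.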

Good luck with your parsing :-)

Answer 2

You can use a regular expression:

>>> re.findall(r'\w+ = \S+', the_text)
['keyword = GUESS', 'score = 83965', 'normalizedKeyword = GUESS',
 'productIds = []', 'categoryIds = []', 'hotListed = false',
 'blackListed = false', 'globalHotList = false', 'url = /buy/GUESS']

Then you can split on = and grab the ones you need.

Something like:

>>> data = re.findall(r'\w+ = \S+', the_text)
>>> ok = ('keyword', 'score', 'hotListed', 'url')
>>> [i for i in [i.split(' = ') for i in data] if i[0] in ok]
[['keyword', 'GUESS'], ['score', '83965'], ['hotListed', 'false'], ['url', '/buy/GUESS']]
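Putting the pieces together for a whole file might look like the sketch below. The function names `extract_fields` and `convert`, the `WANTED` tuple, and the comma-separated output format are assumptions made for illustration; the field list is taken from the output requested in the question. It also assumes, like the pattern above, that values contain no whitespace.

```python
import re

# Fields the question asks for, in output order (an assumption based on
# the desired output shown in the question).
WANTED = ('keyword', 'score', 'hotListed', 'globalHotList', 'url')

def extract_fields(line, wanted=WANTED):
    """Return the wanted 'key = value' pairs from one log line.

    Returns None for lines that contain no KeywordVO entry.
    """
    match = re.search(r'KeywordVO \((.*?)\)', line)
    if match is None:
        return None
    # Same pattern as above: each value is a run of non-whitespace characters.
    found = re.findall(r'\w+ = \S+', match.group(1))
    pairs = dict(p.split(' = ', 1) for p in found)
    return ', '.join('%s = %s' % (k, pairs[k]) for k in wanted if k in pairs)

def convert(in_path, out_path):
    """Write one extracted line to out_path per matching line of in_path."""
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            result = extract_fields(line)
            if result is not None:
                dst.write(result + '\n')
```

Reading the log line by line keeps memory use flat for large files, and looking the keys up in a dict makes it easy to change which fields are written.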

Need to read data from a text file within parentheses in Python
