Question

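I am trying to parse a file and extract the content that precedes a certain character (in this case |), so that I can build a dictionary and filter out duplicates based on that content/key. My impression is that I should use a regular expression.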

Mock input data:

AK_0004: abc123|Abc1231301820 abc123|Abc1231308920 qwerty|Qwerty0202470 qwerty|Qwerty209910

AK_0005: abc123|Abc12302100 abc123|Abc12302110 qwerty|Qwerty0209580 qwerty|Qwerty0209600

AK_0062: abc123|Abc12300430 qwerty|Qwerty0211140

What I want:

AK_0004: abc123 abc123 qwerty qwerty

and so on for the remaining lines.

My attempt so far:

import re

for line in open('splittest.txt', 'r'):
    # Matches everything from the start of the line up to the first |
    m = re.compile(r"^[^|]*")
    print(re.findall(m, line))

Output:

['AK_0004: abc123']

['AK_0005: abc123']

['AK_0062: abc123']

Answer 1

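You can indeed use a regular expression. Specifically, you want a capture group whose pattern matches the text immediately before each |, which I will assume consists of word characters.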

import re

# Compile the regex pattern. (\w+) is our capture group.
p = re.compile(r'(\w+)\|')

line = 'AK_0004: abc123|Abc1231301820 abc123|Abc1231308920 qwerty|Qwerty0202470 qwerty|Qwerty209910'

# Get the AK_xxx
line_id = line.split(':')[0]

# Findall matches
m = p.findall(line)

print('{}: {}'.format(line_id, ' '.join(m)))

This produces:

AK_0004: abc123 abc123 qwerty qwerty
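
The question also mentions building a dictionary and filtering out duplicates by key. A minimal sketch of that step, reusing the same pattern, might look like the following. The helper names parse_line and unique_keys_per_line are hypothetical, and the sketch assumes Python 3.7+ (where dict preserves insertion order) and that every data line has the form "ID: key|value key|value ...".

```python
import re

# Same pattern as above: capture the word characters right before each '|'.
KEY_PATTERN = re.compile(r'(\w+)\|')


def parse_line(line):
    """Return (line_id, keys) for one line, or None if the line has no ':'."""
    line_id, sep, rest = line.partition(':')
    if not sep:
        return None
    return line_id.strip(), KEY_PATTERN.findall(rest)


def unique_keys_per_line(lines):
    """Map each line id to its keys, duplicates dropped, first-seen order kept."""
    result = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip blank or malformed lines
        line_id, keys = parsed
        # dict.fromkeys keeps the first occurrence of each key, in order
        result[line_id] = list(dict.fromkeys(keys))
    return result


sample = [
    'AK_0004: abc123|Abc1231301820 abc123|Abc1231308920 qwerty|Qwerty0202470 qwerty|Qwerty209910',
    '',
    'AK_0062: abc123|Abc12300430 qwerty|Qwerty0211140',
]

for line_id, keys in unique_keys_per_line(sample).items():
    print('{}: {}'.format(line_id, ' '.join(keys)))
```

Because unique_keys_per_line accepts any iterable of strings, an open file object can be passed in place of the sample list, for example inside a with open('splittest.txt') as f: block.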
