Question

I have a Python list in which each string is one of the following 4 possible options (the names will vary, of course):

Mr: Smith\n
Mr: Smith; John\n
Smith\n
Smith; John\n

I want to correct them to:

Mr,Smith,fname\n
Mr,Smith,John\n
title,Smith,fname\n
title,Smith,John\n

This is easy enough with 4 re.sub() calls:

import re

with open("path/to/file", 'r') as fileset:
    dataset = fileset.readlines()
dataset = [item.strip() for item in dataset]    # removes some misc. white noise
for item in dataset:
    item = re.sub(r'(.*):\W(.*);\W', r'\g<1>'+','+r'\g<2>'+',', item)
    item = re.sub(r'(.*);\W(.*)', 'title,'+r'\g<1>'+','+r'\g<2>', item)
    item = re.sub(r'(.*):\W(.*)', r'\g<1>'+','+r'\g<2>'+',fname', item)
    item = re.sub(r'(.*)', 'title,'+r'\g<1>'+',fname', item)

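While this works fine for the dataset I'm using, I'd like it to be more efficient. Is there a single operation that could simplify this process?

Please forgive me if I forgot a quote or made some other mistake; I'm not at my workstation right now, and I know I've already stripped the newline (\n).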

谢谢，

Answer 1

Brief

You can reduce this to a single line instead of running two loops. Adapted from How to iterate over the file in Python (and using the code from the Code section below):

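import re

# r and repl are defined in the Code section below
f = open("path/to/file", 'r')
while True:
    x = f.readline()
    if not x: break
    # strip the trailing newline so it is not captured into a group
    print(re.sub(r, repl, x.rstrip("\n")))
f.close()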

For other alternatives, see Python - How to use regexp on file, line by line, in Python.
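As a minimal sketch of the line-by-line approach, the loop can also be written by iterating over the file object directly inside a with block, which closes the file automatically. The sample data, the io.StringIO stand-in for the real file, and the helper name normalize_lines are assumptions for illustration; r and repl are the pattern and callback from the Code section of this answer.

```python
import io
import re

# Pattern and callback as given in the Code section of this answer.
r = r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?"

def repl(m):
    return (m.group(1) or "title") + "," + m.group(2) + "," + (m.group(3) or "fname")

def normalize_lines(lines):
    # Hypothetical helper: strip surrounding whitespace (including the
    # trailing newline), then run a single substitution per line.
    return [re.sub(r, repl, line.strip()) for line in lines]

# io.StringIO stands in for open("path/to/file", 'r') so the sketch runs as-is.
sample = io.StringIO(u"Mr: Smith\nMr: Smith; John\nSmith\nSmith; John\n")
with sample as f:
    result = normalize_lines(f)

for row in result:
    print(row)
```

Stripping each line first matters here: without it, the newline can end up inside capture group 2.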

Code

For ease of viewing, I've changed your file into an array.

See regex in use here

^(?:([^:\r\n]+):\W*)?([^;\r\n]+)(?:;\W*(.+))?

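Note: you don't need the \r\n inside the character classes in Python. I added them only so the pattern displays properly on regex101, so your regex is really just ^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?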

Usage

See code in use here

import re

a = [
    "Mr: Smith",
    "Mr: Smith; John",
    "Smith",
    "Smith; John"
]
r = r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?"

def repl(m):
    return (m.group(1) or "title" ) + "," + m.group(2) + "," + (m.group(3) or "fname")

for s in a:
    print(re.sub(r, repl, s))

Explanation

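^ asserts position at the start of the line
(?:([^:]+):\W*)? optionally matches the following
- ([^:]+) captures any character except : one or more times into capture group 1
- : matches this character literally
- \W* matches any number of non-word characters (copied from the OP's original code; I assume \s* could be used instead)
([^;]+) captures any character except ; one or more times into capture group 2
(?:;\W*(.+))? optionally matches the following
- ; matches this character literally
- \W* matches any number of non-word characters (copied from the OP's original code; I assume \s* could be used instead)
- (.+) captures any character one or more times into capture group 3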
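To make the group behaviour concrete, here is a small sketch that prints what each capture group holds for the four input shapes from the question. The names pattern, samples and captured are illustrative only.

```python
import re

pattern = re.compile(r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?")

samples = ["Mr: Smith", "Mr: Smith; John", "Smith", "Smith; John"]

# groups() returns None for an optional group that did not take part in the match.
captured = {s: pattern.match(s).groups() for s in samples}

for s in samples:
    print(s, "->", captured[s])
```

Groups 1 and 3 come back as None whenever the title or the first name is missing, which is what lets a callback substitute a default value.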

Given the above explanation of the regex part, repl works as follows:

In re.sub(r, repl, s), repl is a callback to the repl function, which returns the following, joined by commas:
- group 1 if it captured anything, title otherwise
- group 2 (it should always be set, again following the OP's logic)
- group 3 if it captured anything, fname otherwise
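The same fallback logic can be written against a precompiled pattern with the defaults kept in one place. This is only a sketch of an equivalent callback, not the code from the answer; PATTERN, DEFAULTS and fill_defaults are hypothetical names.

```python
import re

PATTERN = re.compile(r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?")

# One default per capture group; group 2 has none because it is
# always set whenever the pattern matches.
DEFAULTS = ("title", None, "fname")

def fill_defaults(match):
    # Use the captured text when present, otherwise the default for that group.
    parts = [value or default for value, default in zip(match.groups(), DEFAULTS)]
    return ",".join(parts)

rows = ["Mr: Smith", "Mr: Smith; John", "Smith", "Smith; John"]
fixed = [PATTERN.sub(fill_defaults, row) for row in rows]

for row in fixed:
    print(row)
```

Compiling the pattern once avoids repeating the pattern string at every call site; whether it makes a measurable difference depends on the dataset.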

Answer 2

IMHO, RegEx is overly complex here; you can use classic string functions to split your string item. For this, you can use partition (or rpartition).
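For readers unfamiliar with these two methods, the short sketch below shows what they return with and without the separator present. The variable names are illustrative.

```python
# str.partition splits at the first occurrence of the separator and always
# returns a 3-tuple: (head, separator, tail).
with_sep = "Smith; John".partition(";")
without_sep = "Smith".partition(";")

# str.rpartition splits at the last occurrence; when the separator is missing,
# the whole string ends up in the last element instead of the first.
r_with_sep = "Mr: Smith".rpartition(":")
r_without_sep = "Smith".rpartition(":")

print(with_sep)
print(without_sep)
print(r_with_sep)
print(r_without_sep)
```

The difference in where the unsplit string lands is what makes rpartition convenient for an optional prefix such as the title, and partition convenient for an optional suffix such as the first name.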

First, split the item string into "records", like this:

item = "Mr: Smith\n Mr: Smith; John\n Smith\n Smith; John\n"
records = item.splitlines()
# -> ['Mr: Smith', ' Mr: Smith; John', ' Smith', ' Smith; John']

Then you can write a short function to normalize each "record". Here is an example:

def normalize_record(record):
    # type: (str) -> str
    name, _, fname = record.partition(';')
    title, _, name = name.rpartition(':')
    title = title.strip() or 'title'
    name = name.strip()
    fname = fname.strip() or 'fname'
    return "{0},{1},{2}".format(title, name, fname)

This function is easier to understand than a set of regular expressions, and in most cases it is also faster.
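The speed claim is easy to check on your own data. Below is a rough timing sketch using timeit; it assumes the regex from Answer 1 (precompiled) and a compact form of the partition-based function, and the names with_regex, with_partition and records are illustrative. The numbers depend on the machine and the input, so no result is shown.

```python
import re
import timeit

PATTERN = re.compile(r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?")

def with_regex(record):
    # Regex approach from Answer 1, precompiled.
    def repl(m):
        return (m.group(1) or "title") + "," + m.group(2) + "," + (m.group(3) or "fname")
    return PATTERN.sub(repl, record)

def with_partition(record):
    # String-method approach from this answer.
    name, _, fname = record.partition(';')
    title, _, name = name.rpartition(':')
    return "{0},{1},{2}".format(title.strip() or 'title', name.strip(), fname.strip() or 'fname')

records = ["Mr: Smith", "Mr: Smith; John", "Smith", "Smith; John"]

# Both approaches should agree before their timings are compared.
assert [with_regex(r) for r in records] == [with_partition(r) for r in records]

regex_time = timeit.timeit(lambda: [with_regex(r) for r in records], number=10000)
partition_time = timeit.timeit(lambda: [with_partition(r) for r in records], number=10000)
print("regex:     {0:.4f}s".format(regex_time))
print("partition: {0:.4f}s".format(partition_time))
```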

For better integration, you can define another function to handle each item:

def normalize(row):
    records = row.splitlines()
    return "\n".join(normalize_record(record) for record in records) + "\n"

Demo:

item = "Mr: Smith\n Mr: Smith; John\n Smith\n Smith; John\n"
item = normalize(item)

You get:

'Mr,Smith,fname\nMr,Smith,John\ntitle,Smith,fname\ntitle,Smith,John\n'
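Since the question starts from a list of lines (as returned by readlines()) rather than one multi-line string, the record function can also be applied directly to that list. The following is a sketch under that assumption; dataset and normalized are illustrative names, and normalize_record is repeated so the block runs on its own.

```python
def normalize_record(record):
    # Same logic as the function in this answer, repeated here
    # so the sketch is self-contained.
    name, _, fname = record.partition(';')
    title, _, name = name.rpartition(':')
    title = title.strip() or 'title'
    name = name.strip()
    fname = fname.strip() or 'fname'
    return "{0},{1},{2}".format(title, name, fname)

# Stand-in for the list produced by fileset.readlines() in the question.
dataset = ["Mr: Smith\n", "Mr: Smith; John\n", "Smith\n", "Smith; John\n"]

# strip() inside normalize_record removes the trailing newline,
# so it is added back explicitly here.
normalized = [normalize_record(line) + "\n" for line in dataset]

for line in normalized:
    print(repr(line))
```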

Python re.sub() optimization
