Question

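I want all tags in a text that look like <Car:1234|Bob Alice> or <Bus:5678|Nelson Mandela> to be replaced with <a my-inner-type="CR:1234">Bob Alice</a> and <a my-inner-type="BS:5678">Nelson Mandela</a> respectively. So basically, depending on the tag's type (TypeA or TypeB; Car or Bus in the examples), I want to replace the text accordingly in a text string using Python 3 and regex.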

I tried the following in Python, but I am not sure whether it is the right approach:

import re
def my_replace():
    re.sub(r'\<(.*?)\>', replace_function, data)

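With the above, I am trying to run a regex over the < > tags and pass each tag I find to a function called replace_function, which splits the text inside the tag, determines whether it is TypeA or TypeB, computes the content, and returns the replacement tag dynamically. I am not even sure whether this is possible with re.sub, but any leads would help. Thank you.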

Examples:

<Car:1234|Bob Alice> becomes <a my-inner-type="CR:1234">Bob Alice</a>
<Bus:5678|Nelson Mandela> becomes <a my-inner-type="BS:5678">Nelson Mandela</a>

Answer 1

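This is perfectly possible with re.sub, and you are on the right track in using a replacement function (which is designed to allow dynamic replacements). See the example below, which works with the examples you gave. It will probably have to be modified for your use case, depending on what other data is present in the text (i.e. other tags you need to ignore).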

import re

def replace_function(m):
    # note: to not modify the text (ie if you want to ignore this tag),
    # simply do (return the entire original match):
    # return m.group(0)

    inner = m.group(1)
    t, name = inner.split('|')

    # split "Car:1234" into the type name and the numeric id
    kind, ident = t.split(':')

    # map each type name to its abbreviation; extend as needed
    if kind == 'Car':
        abbrev = 'CR'
    elif kind == 'Bus':
        abbrev = 'BS'
    else:
        # fall back to stripping the "Type" prefix, e.g. "TypeA" -> "A"
        abbrev = kind[4:]
    typename = '{}:{}'.format(abbrev, ident)

    return '<a my-inner-type="{}">{}</a>'.format(typename, name)

def my_replace(data):
    return re.sub(r'\<(.*?)\>', replace_function, data)



# let's just test it
data = 'I want all the tags in a text that look like <TypeA:1234|Bob Alice> or <TypeB:5678|Nelson Mandela> to be replaced with'
print(my_replace(data))

Warning: if the text is actually full HTML, regex matching will not be reliable. Use an HTML processor such as beautifulsoup instead. ;)
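The note about returning m.group(0) for tags that should be ignored can be sketched as follows. This is only an illustration under assumptions: the ABBREVIATIONS table and the names replace_known and replace_tags are hypothetical, and the mapping covers only the Car/Bus examples from the question.

```python
import re

# Hypothetical mapping from tag type to abbreviation (an assumption based
# on the Car/Bus examples in the question).
ABBREVIATIONS = {'Car': 'CR', 'Bus': 'BS'}

def replace_known(m):
    inner = m.group(1)
    # Tags that do not look like "Type:id|name" are returned unchanged.
    if '|' not in inner or ':' not in inner.split('|', 1)[0]:
        return m.group(0)
    head, name = inner.split('|', 1)
    kind, ident = head.split(':', 1)
    # Unknown types are also left exactly as they were.
    if kind not in ABBREVIATIONS:
        return m.group(0)
    return '<a my-inner-type="{}:{}">{}</a>'.format(ABBREVIATIONS[kind], ident, name)

def replace_tags(text):
    return re.sub(r'<(.*?)>', replace_known, text)

print(replace_tags('See <Car:1234|Bob Alice>, <b>bold</b> and <Tram:9|X>.'))
```

Only the Car tag is rewritten here; the <b> markup and the unknown Tram tag pass through untouched.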

Answer 2

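This is perhaps an extension of @swalladge's answer, but if we know the mapping, we can take advantage of a dictionary. (Think of replacing the dictionary with a custom mapping function.)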

import re

# map each tag type to the abbreviation used in the output
d = {'TypeA': 'A',
     'TypeB': 'B',
     'Car': 'CR',
     'Bus': 'BS'}

def repl(m):
    # group 1: type name, group 2: ':' plus the digits, group 3: display name
    return '<a my-inner-type="' + d[m.group(1)] + m.group(2) + '">' + m.group(3) + '</a>'

s = '<TypeA:1234|Bob Alice> or <TypeB:5678|Nelson Mandela>'
print(re.sub(r'<(.*?)(:\d+)\|(.*?)>', repl, s))
print()
s = '<Bus:1234|Bob Alice> or <Car:5678|Nelson Mandela>'
print(re.sub(r'<(.*?)(:\d+)\|(.*?)>', repl, s))

Output

<a my-inner-type="A:1234">Bob Alice</a> or <a my-inner-type="B:5678">Nelson Mandela</a>

<a my-inner-type="BS:1234">Bob Alice</a> or <a my-inner-type="CR:5678">Nelson Mandela</a>

Working example here.

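Regex
We capture the three groups we need and refer to them through the match object. The three capture groups in the regex are (.*?), (:\d+) and (.*?):
<(.*?)(:\d+)\|(.*?)>
We use these three groups in the repl function to return the right string.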
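To make the three groups concrete, here is a small sketch that prints what the match object holds for one tag. The sample sentence is made up for illustration; the pattern is the one from this answer.

```python
import re

pattern = re.compile(r'<(.*?)(:\d+)\|(.*?)>')

# Hypothetical sample text containing a single tag.
m = pattern.search('text <Bus:1234|Bob Alice> more text')

groups = m.groups()
print(m.group(0))  # the whole tag: <Bus:1234|Bob Alice>
print(groups)      # ('Bus', ':1234', 'Bob Alice')
```

Note that the second group keeps the leading colon, which is why the replacement function can concatenate it directly after the abbreviation.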

Answer 3

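Sorry this is not a complete answer, but I am falling asleep at my computer. Here is a regex that matches either of the strings you provided: (<Type)(\w:)(\d+\|)(\w+\s\w+>). Check out https://pythex.org/ to test your regexes.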
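The same kind of check can be done locally with re.fullmatch. The sketch below is only a quick test harness with made-up variable names; it suggests that this pattern matches the TypeA/TypeB form of the tags but not the Car/Bus form, because it requires the literal prefix "<Type".

```python
import re

pattern = re.compile(r'(<Type)(\w:)(\d+\|)(\w+\s\w+>)')

samples = [
    '<TypeA:1234|Bob Alice>',
    '<TypeB:5678|Nelson Mandela>',
    '<Car:1234|Bob Alice>',
    '<Bus:5678|Nelson Mandela>',
]

# True where the whole string matches the pattern, False otherwise.
results = [bool(pattern.fullmatch(s)) for s in samples]
for s, ok in zip(samples, results):
    print(s, ok)
```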

Answer 4

If your format is <Type:num|name>, then this code will work:

def replaceupdate(tag):
    replace = ''
    t = ''
    i = 1
    ident = ''
    name = ''
    typex = ''
    while t != ':':
        typex += tag[i]
        t = tag[i]
        i += 1
    t = ''
    while t != '|':
        if tag[i] == '|':
            break
        ident += tag[i]
        t = tag[i]
        i += 1
    t = ''
    i += 1
    # stop before the closing '>' so it is not appended to the name
    while tag[i] != '>':
        name += tag[i]
        i += 1
    replace = '<a my-inner-type="{}{}">{}</a>'.format(typex, ident, name)
    return replace

I know it does not use regex, and the text still has to be split into tags some other way, but this is the main part.
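One possible way to do that splitting without regex is sketched below, using str.find to locate each tag and str.partition to take it apart. The names convert_tag and replace_all are hypothetical, every <...> span is assumed to be a well-formed <Type:num|name> tag, and, as in the function above, the type name is copied through unchanged rather than abbreviated.

```python
def convert_tag(tag):
    # tag looks like "<Type:num|name>"; strip the angle brackets first
    inner = tag[1:-1]
    head, _, name = inner.partition('|')
    kind, _, ident = head.partition(':')
    return '<a my-inner-type="{}:{}">{}</a>'.format(kind, ident, name)

def replace_all(text):
    out = []
    pos = 0
    while True:
        start = text.find('<', pos)
        end = text.find('>', start + 1) if start != -1 else -1
        if start == -1 or end == -1:
            # no further complete tag: keep the rest of the text as it is
            out.append(text[pos:])
            break
        out.append(text[pos:start])                    # text before the tag
        out.append(convert_tag(text[start:end + 1]))   # the converted tag
        pos = end + 1
    return ''.join(out)

print(replace_all('a <Car:1234|Bob Alice> b <Bus:5678|Nelson Mandela>.'))
```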

Answer 5

Try:

import re

def get_tag(match):
    base = '<a my-inner-type="{}">{}</a>'
    inner_type = match.group(1).upper()
    my_inner_type = '{}{}:{}'.format(inner_type[0], inner_type[-1], match.group(2))
    return base.format(my_inner_type, match.group(3))

# match the closing '>' explicitly so text after the tag is left untouched
pattern = r'<(\w+):(\d+)\|([^>]+)>'

print(re.sub(pattern, get_tag, '<Bus:1234|Bob Alice>'))

print(re.sub(pattern, get_tag, '<Car:5678|Nelson Mandela>'))

Python3: replace tags conditionally based on tag type
