Question

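I'm trying to remove only selected characters from XML tags: the characters themselves, whatever number follows them, and the trailing colon. For example, <ns2:projectArea alias= should become <projectArea alias= and <ns9:name> should become <name>.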

Basically, the number is random (anything from 1 to 9), and it is always followed by a : that has to be removed as well.

What I have so far:

import argparse
import re

# Initiates argument
parser = argparse.ArgumentParser()

parser.add_argument("--input", "-i", help="Set the input xml to clean up")
parser.add_argument("--output", "-o", help="Set the output xml location")

args = parser.parse_args()
inputfile = args.input
outputfile = args.output
if args.input:
  print("inputfile location is %s" % args.input)
if args.output:
  print("outputfile location is %s" % args.output)
# End argument

text = re.sub('<[^<]+>', "", open(inputfile).read())
with open(outputfile, "w") as f:
    f.write(text)

This part is the problem: '<[^<]+>' removes the entire tag, so if I need to search the text later, I'm basically searching plain text rather than tags.

What can I replace '<[^<]+>' with so that it removes only ns + the number that follows (whatever number it may be) + the trailing : ?

Answer 1

It is probably happening because of the regular expression. Try this one instead:

   text = re.sub('^<[a-zA-Z0-9]+:','<',open(inputfile).read())
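
One caveat with the pattern above: without re.MULTILINE, the leading `^` only matches at the very start of the input, so at most one tag is rewritten, and closing tags such as </ns9:name> are not covered. A minimal sketch that drops the anchor and keeps an optional slash follows; the helper name `strip_prefix` and the sample string are illustrative assumptions, not part of the answer.

```python
import re

# Hypothetical helper: remove a "prefix:" from opening and closing tags.
def strip_prefix(xml_text):
    # (/?) captures the slash of a closing tag so it can be put back;
    # the prefix and its colon are dropped.
    return re.sub(r'<(/?)[a-zA-Z0-9]+:', r'<\1', xml_text)

# Illustrative input, built from the tags in the question.
sample = '<ns2:projectArea alias="x"><ns9:name>foo</ns9:name></ns2:projectArea>'
print(strip_prefix(sample))
```

Note that the character class accepts any alphanumeric prefix, not just ns followed by a digit, so it would also strip other namespace prefixes if the file contains them.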

Answer 2

This works:

Find: r"<(?:(?:(/?)\w+[1-9]:(\w+\s*/?))|(?:\w+[1-9]:(\w+\s+(?:\"[\S\s]*?\"|'[\S\s]*?'|[^>]?)+\s*/?)))>"
Replace: <$1$2$3>

https://regex101.com/r/yRhMI9/1

Readable version:

 <
 (?:
      (?:
           ( /? )                        # (1)
           \w+ [1-9] :
           ( \w+ \s* /? )                # (2)
      )
   |  (?:
           \w+ [1-9] :
           (                             # (3 start)
                \w+ \s+ 
                (?: " [\S\s]*? " | ' [\S\s]*? ' | [^>]? )+
                \s* /?
           )                             # (3 end)
      )
 )
 >
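
The $1$2$3 replacement is not Python syntax; in Python's re.sub the backreferences are written `\1\2\3`. A sketch of how the pattern might be applied in Python, assuming Python 3.5 or later (where groups that did not take part in the match are substituted as empty strings). The helper name `strip_ns` and the sample string are illustrative.

```python
import re

# Pattern copied verbatim from the answer above.
PATTERN = re.compile(
    r"<(?:(?:(/?)\w+[1-9]:(\w+\s*/?))|(?:\w+[1-9]:(\w+\s+(?:\"[\S\s]*?\"|'[\S\s]*?'|[^>]?)+\s*/?)))>"
)

# Hypothetical helper wrapping the substitution.
def strip_ns(xml_text):
    # Only one alternative matches per tag; the groups belonging to the
    # other alternative are unmatched and substitute as empty strings.
    return PATTERN.sub(r"<\1\2\3>", xml_text)

# Illustrative input, built from the tags in the question.
sample = '<ns2:projectArea alias="x"><ns9:name>foo</ns9:name></ns2:projectArea>'
print(strip_ns(sample))
```

Because the prefix is matched as \w+[1-9]:, the digit immediately before the colon must be 1 to 9, so a prefix such as ns10: would be left untouched.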

Answer 3

Regex: (?:(?<=<)|(?<=<\/))(ns[0-9]+:)(?=[^>]*?>)

Demo
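
A sketch of how this pattern might be used from Python: it matches only the ns prefix and its colon (the lookbehinds require a preceding < or </, the lookahead a later >), so substituting an empty string for each match removes the prefix and leaves the rest of the tag alone. The helper name `remove_ns_prefix` and the sample string are illustrative.

```python
import re

# Pattern from the answer above; both lookbehinds are fixed-width,
# which Python's re module requires.
NS_PREFIX = re.compile(r'(?:(?<=<)|(?<=<\/))(ns[0-9]+:)(?=[^>]*?>)')

# Hypothetical helper: delete every matched "nsN:" prefix.
def remove_ns_prefix(xml_text):
    return NS_PREFIX.sub('', xml_text)

# Illustrative input, built from the tags in the question.
sample = '<ns2:projectArea alias="x"><ns9:name>foo</ns9:name></ns2:projectArea>'
print(remove_ns_prefix(sample))
```

In the question's script this would take the place of the re.sub call that currently uses '<[^<]+>'.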

Remove selected characters from XML tags using regex
