Question

我正在尝试处理包含错误格式元素的xml文件。

错误形成的元素是不尊重以下模式的元素：<name attribute1=value1 attribute2=value2 ... attributeN=valueN>

可以有0到n个属性。

因此，<my element number>无效，而<my element=number>则无效。

以下是我的文字示例：

<product_name>
    A high wind in Jamaica <The innocent voyage>  The modern library of the world s best books   Books  Richard Arthur Warren Hughes
</product_name>

此处，<product_name>是一个很好的元素，而<The innocent voyage>则不是。

如果发现错误的元素，我希望将<>替换为中性字符，例如+。

由于包含这些标签的文件相当大（1.5 GB），我宁愿不使用强力方法。

你们会看到解决这个问题的快速（如果可能的话，优雅）方式吗？

Answer 1

当您声明您希望远离regex时，我能够创建以下不使用regex的代码（尽管我确信regex会是def valid_tag(tag): temp = tag.split() for word in temp[1:]: if "=" not in word: return False return True非常有用）

"<hello test=test>"

在这里，您将标记作为字符串作为参数传递。例如："<"

您可以在每个标记上运行此测试，方法是创建另一种获取标记的方法，方法是找到">"，然后找到后面的第一个<hello test=test>，并创建一个子字符串，该子字符串将作为标记你传入这种方法。

注意：这假定您的代码编写如下：< hello test = test >而不是{{1}}

这种方法仍然非常原始，并且如上所述做了一些假设，但希望它能为您提供所需的开始。

删除特定分隔符内的所有空格

1 个答案: