Question

假设我有一个包含XML输出的字符串，如下所示：

<dept-details>
     <dept-domain-id>1</dept-domain-id>
     <dept-req-status>no-vacancies-present</dept-req-status>
      .
      .
</dept-details>

我想用下划线（_）替换包含连字符（ - ）的所有标签，因为我看到Beautiful Soup不允许您直接访问包含的标签 - 除非使用find（）作为this帖子说this也是如此。

所以我的目的是将包含 - 的标签转换为_，以便字符串看起来像：

<dept_details>
     <dept_domain_id>1</dept_domain_id>
     <dept_req_status>no-vacancies-present</dept_req_status>
      .
      .
</dept_details>

我想知道如何使用python re方法实现这一点，或者如果我可以直接使用BeautifulSoup来实现这一点，那就太棒了！

提前致谢

Answer 1

这里需要正则表达式，请尝试此解决方案：

>>> s
'<dept-details><dept-domain-id>1</dept-domain-id><dept-req-status>no-vacancies</dept-req-status></dept-details>'
>>> re.sub('<(.*?)>', lambda x: x.group(0).replace('-','_'), s)
'<dept_details><dept_domain_id>1</dept_domain_id><dept_req_status>no-vacancies</dept_req_status></dept_details>'

正则表达式存在一些问题，例如它也会替换任何具有-的属性，但至少这会让你朝着正确的方向前进。

Answer 2

编辑：看到Burhan的回答，它好多了。

string = '<dept-details><dept-domain-id>1</dept-domain-id><dept-req-status>no-vacancies-present</dept-req-status></dept-details>'

import re

tags = re.finditer('<.*?-.*?>', string)

for x in tags:
    string = string[:x.span()[0]] + x.group(0).replace('-','_') + string[x.span()[1]:]

print string

其中string是您的实际XML代码字符串。肯定有更好的方法。

使用python中的BeautifulSoup用下划线替换xml标记中的所有连字符

2 个答案: