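Identifying and replacing XML elements with BeautifulSoup in Python

Time: 2015-06-26 09:17:20

Tags: python xml-parsing beautifulsoup

I am trying to use BeautifulSoup4 to find and replace specific elements in an XML file. More specifically, I want to find all instances of 'file_name' (in the example below the file name is 'Cyp26A1_atRA_minus_tet_plus.txt') and replace each one with the full path to that document, which is stored in the 'file_name_replacement_dir' variable. My first task, and the one I am stuck on, is to isolate the section of interest so that I can replace it with the replaceWith() method.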

XML

      <ParameterGroup name="Experiment_22">
        <Parameter name="Data is Row Oriented" type="bool" value="1"/>
        <Parameter name="Experiment Type" type="unsignedInteger" value="0"/>
        <Parameter name="File Name" type="file" value="Cyp26A1_atRA_minus_tet_plus.txt"/>
        <Parameter name="First Row" type="unsignedInteger" value="1"/>

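There are actually 44 experiments with 4 different file names (11 use file name 1, 11 use file name 2, and so on). So the XML snippet above is repeated 44 times, with only a different file stored in the "File Name" line.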

My code so far

from bs4 import BeautifulSoup

# Raw strings keep the backslashes in the Windows paths literal
xml_dir = r'D:\MPhil\Model_Building\Models\Retinoic_acid\[06]\RAR_Models\Model_Line_2'
xml_file_name = 'RARa_RXR_M22.cps'
xml = xml_dir + '\\' + xml_file_name
file_name_replacement_dir = r'D:\MPhil\Model_Building\Models\Retinoic_acid\[06]\RAR_Models'
soup = BeautifulSoup(open(xml))
print(soup.find_all('parametergroup name="Experiment_22"'))

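However, this returns an empty list. I have also tried a few other functions in place of 'soup.find_all()' but still cannot get a handle on the file name. Does anybody know how to do what I want?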

2 Answers:

Answer 0 (score: 3)

from bs4 import BeautifulSoup
import os

# Sample XML with the ParameterGroup tag closed
xml = '<ParameterGroup name="Experiment_22">\
<Parameter name="Data is Row Oriented" type="bool" value="1"/>\
<Parameter name="Experiment Type" type="unsignedInteger" value="0"/>\
<Parameter name="File Name" type="file" value="Cyp26A1_atRA_minus_tet_plus.txt"/>\
<Parameter name="First Row" type="unsignedInteger" value="1"/>\
</ParameterGroup>'

# An HTML parser lower-cases tag names, hence "parameter" below
soup = BeautifulSoup(xml, 'html.parser')

# Prepend the new directory to the value of every "File Name" parameter
for tag in soup.find_all("parameter", {'name': 'File Name'}):
    tag['value'] = os.path.join('new_dir', tag['value'])

print(soup)

  • Close your XML 'ParameterGroup' tag.
  • BeautifulSoup's HTML parsers convert tag names to lower case, so a capitalised tag name may not match; try 'parameter' in lower case.
  • Use os.path to manipulate paths so that the code works across platforms.
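
To make the point about capitalisation concrete, here is a minimal sketch that assumes the built-in 'html.parser' backend and a trimmed-down copy of the XML from the question. The directory name 'new_dir' is a placeholder, not a real path. It compares a search for "Parameter" with a search for "parameter" and then joins the directory onto the matching value:

```python
import os
from bs4 import BeautifulSoup

# Trimmed sample based on the XML in the question
xml = (
    '<ParameterGroup name="Experiment_22">'
    '<Parameter name="File Name" type="file" value="Cyp26A1_atRA_minus_tet_plus.txt"/>'
    '<Parameter name="First Row" type="unsignedInteger" value="1"/>'
    '</ParameterGroup>'
)

# html.parser is an HTML parser, so it stores tag names in lower case
soup = BeautifulSoup(xml, "html.parser")

# Tag-name matching is case-sensitive against the stored (lower-case) names
upper_hits = soup.find_all("Parameter", {"name": "File Name"})
lower_hits = soup.find_all("parameter", {"name": "File Name"})

new_dir = "new_dir"  # hypothetical replacement directory
for tag in lower_hits:
    # os.path.join picks the right separator for the current platform
    tag["value"] = os.path.join(new_dir, tag["value"])

print(len(upper_hits), len(lower_hits))
print(lower_hits[0]["value"])
```

Note that an HTML parser also writes the tag names back out in lower case, which matters if the program reading the file expects 'ParameterGroup' and 'Parameter' exactly as written.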

Answer 1 (score: 2)

Your selector for find_all is wrong; you need to separate the tag name and the attribute, like so:

find_all("Parameter",{'name':'File Name'})

That will get you all the file name tags directly. If you really need the parent tag then pass in "ParameterGroup" without the attribute dictionary.

I'm not sure whether BeautifulSoup requires lower-case tag names; you may have to experiment with that.
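
If the capitalisation of the tags has to survive the round trip, one option is to leave BeautifulSoup aside and use the standard library's xml.etree.ElementTree, which keeps tag names exactly as written. This is a different library from the one asked about, so treat it as a rough sketch: the sample XML is shortened from the question and 'new_dir' is a placeholder directory.

```python
import os
import xml.etree.ElementTree as ET

# Shortened sample based on the XML in the question
xml_text = """<ParameterGroup name="Experiment_22">
  <Parameter name="Data is Row Oriented" type="bool" value="1"/>
  <Parameter name="File Name" type="file" value="Cyp26A1_atRA_minus_tet_plus.txt"/>
</ParameterGroup>"""

root = ET.fromstring(xml_text)
replacement_dir = "new_dir"  # hypothetical replacement directory

# iter() walks the root and all of its descendants; tag names are case-sensitive
for param in root.iter("Parameter"):
    if param.get("name") == "File Name":
        param.set("value", os.path.join(replacement_dir, param.get("value")))

# Serialise back to text; the original capitalisation is preserved
result = ET.tostring(root, encoding="unicode")
print(result)
```

This assumes the elements are not in an XML namespace. If the real file declares a default namespace, ElementTree expects the tag name in the form '{namespace-uri}Parameter' instead of plain 'Parameter'.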