How do I scrape XML with Python?

Asked: 2012-10-08 07:01:57

Tags: python xml dom scrape

I am trying to parse the following XML with Python. I am using:

thumbnail_tag = dom.getElementsByTagName('media:thumbnail')[0].toxml()

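to select the first one. I know I can change [0] to [1] to get the tag with yt:name="mqdefault", but is there another way to select it by changing the argument in the statement above (adding something to media:thumbnail)?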

<entry>
<media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/default.jpg" height="90" width="120" time="00:01:48.500" yt:name="default" />
<media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/mqdefault.jpg" height="180" width="320" yt:name="mqdefault" />
<media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/hqdefault.jpg" height="360" width="480" yt:name="hqdefault" />
</entry>

<entry>
<media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/default.jpg" height="90" width="120" time="00:01:48.500" yt:name="default" />
<media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/mqdefault.jpg" height="180" width="320" yt:name="mqdefault" />
<media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/hqdefault.jpg" height="360" width="480" yt:name="hqdefault" />
</entry>

3 answers:

Answer 0 (score: 0)

Here is the documentation on Python XML parsers.

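For your implementation, you can iterate over the matching elements. Note that thumbnail_tag must be the node list returned by getElementsByTagName, not the toxml() string from the question:

# Keep the whole node list instead of calling [0].toxml() on it.
thumbnail_tag = dom.getElementsByTagName('media:thumbnail')
for element in thumbnail_tag:
    attr = element.getAttribute('yt:name')

To change the value of the attribute:

for element in thumbnail_tag:
    attr = element.getAttribute('yt:name')
    if attr == 'mqdefault':
        element.setAttribute('yt:name', 'new_value')
        break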
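
As a minimal sketch of how the loop above answers the original question (selecting a thumbnail by its yt:name instead of by index), the snippet below wraps it in a helper. The function name find_thumbnail, the SAMPLE string, and the wrapping feed element with its namespace declarations are assumptions added so that the fragment parses on its own; the namespace URIs are the ones used in the sample.xml shown in another answer.

```python
import xml.dom.minidom

# Hypothetical sample: the question's thumbnails wrapped in an element that
# declares the media: and yt: prefixes (the parser rejects undeclared prefixes).
SAMPLE = """<feed xmlns:media="http://search.yahoo.com/mrss/"
      xmlns:yt="http://gdata.youtube.com/schemas/2007">
  <entry>
    <media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/default.jpg" height="90" width="120" yt:name="default" />
    <media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/mqdefault.jpg" height="180" width="320" yt:name="mqdefault" />
    <media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/hqdefault.jpg" height="360" width="480" yt:name="hqdefault" />
  </entry>
</feed>"""


def find_thumbnail(dom, name):
    """Return the first media:thumbnail whose yt:name equals name, or None."""
    for element in dom.getElementsByTagName('media:thumbnail'):
        if element.getAttribute('yt:name') == name:
            return element
    return None


dom = xml.dom.minidom.parseString(SAMPLE)
thumbnail = find_thumbnail(dom, 'mqdefault')
print(thumbnail.getAttribute('url'))
```

Returning None for a missing name leaves it to the caller to decide how to handle a feed that lacks the requested thumbnail size.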

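Answer 1 (score: 0)

I suggest using the standard xml.etree.ElementTree instead of DOM. DOM is more traditional, but it is also more verbose and harder to use. Have a look at Dive Into Python 3, Chapter 12. XML

The standard module supports a subset of the XPath language, which may be useful in your case.

Here is sample code that extracts the wanted elements from sample.xml:
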
import xml.etree.ElementTree as et

tree = et.parse('sample.xml')

root = tree.getroot()     # the root element of the tree

##et.dump(root)             # here is how the input file looks inside

# print() with a single argument behaves the same on Python 2 and Python 3.
print('===========================================')
print('Iterate through all media:thumbnail:')

# XPath expressions that describe the wanted elements. There are three here;
# however, they are just strings and can be constructed on the fly.
xp_default = ".//{http://search.yahoo.com/mrss/}thumbnail[" \
                 "@{http://gdata.youtube.com/schemas/2007}name='default']"

xp_mqdefault = ".//{http://search.yahoo.com/mrss/}thumbnail[" \
                 "@{http://gdata.youtube.com/schemas/2007}name='mqdefault']"

xp_hqdefault = ".//{http://search.yahoo.com/mrss/}thumbnail[" \
                 "@{http://gdata.youtube.com/schemas/2007}name='hqdefault']"

for e in root.iterfind(xp_default):
    et.dump(e)
    print('-------------------------------------------')

for e in root.iterfind(xp_mqdefault):
    et.dump(e)
    print('-------------------------------------------')

for e in root.iterfind(xp_hqdefault):
    et.dump(e)
    print('-------------------------------------------')
    print('The e.attrib is a dictionary of attributes:')
    print(e.attrib)

It prints the following:

c:\tmp\___python\sharataka\so12776774>py a.py
===========================================
Iterate through all media:thumbnail:
<ns0:thumbnail xmlns:ns0="http://search.yahoo.com/mrss/" xmlns:ns1="http://gdata
.youtube.com/schemas/2007" height="90" time="00:01:41" url="http://img.youtube.c
om/vi/jXE6G9CYcJs/default.jpg" width="120" ns1:name="default" />

-------------------------------------------
<ns0:thumbnail xmlns:ns0="http://search.yahoo.com/mrss/" xmlns:ns1="http://gdata
.youtube.com/schemas/2007" height="180" url="http://img.youtube.com/vi/jXE6G9CYc
Js/mqdefault.jpg" width="320" ns1:name="mqdefault" />

-------------------------------------------
<ns0:thumbnail xmlns:ns0="http://search.yahoo.com/mrss/" xmlns:ns1="http://gdata
.youtube.com/schemas/2007" height="360" url="http://img.youtube.com/vi/jXE6G9CYc
Js/hqdefault.jpg" width="480" ns1:name="hqdefault" />

-------------------------------------------
The e.attrib is a dictionary of attributes:
{'url': 'http://img.youtube.com/vi/jXE6G9CYcJs/hqdefault.jpg', 'width': '480', '
height': '360', '{http://gdata.youtube.com/schemas/2007}name': 'hqdefault'}

...for the following sample.xml content (found somewhere and shortened):

<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns='http://www.w3.org/2005/Atom'
  xmlns:media='http://search.yahoo.com/mrss/'
  xmlns:yt='http://gdata.youtube.com/schemas/2007'>
  <entry>
    <media:group>
      <media:title type='plain'>Learning the ABCs</media:title>
      <media:description type='plain'>
        A great method for teaching kids the alphabet.
      </media:description>
      <media:keywords>alphabet, teaching, children</media:keywords>
      <yt:duration seconds='202'/>
      <yt:videoid>jXE6G9CYcJs</yt:videoid>
      <media:credit role='uploader' scheme='urn:youtube'
          yt:display='GoogleDeveloperssFriend'>GoogleDeveloperssFriend</media:credit>
      <media:category label='Education'
        scheme='http://gdata.youtube.com/schemas/2007/categories.cat'>
        Education</media:category>
      <media:content url='http://www.youtube.com/v/jXE6G9CYcJs'
        type='application/x-shockwave-flash' medium='video' isDefault='true'
        expression='full' duration='202' yt:format='5'/>
      <media:content
        url='rtsp://rtsp2.youtube.com/ChoLENySANFEgGDA==/0/0/0/video.3gp'
        type='video/3gpp' medium='video' expression='full'
        duration='202' yt:format='1'/>
      <media:content
        url='rtsp://rtsp2.youtube.com/ChoLENySARFEgGDA==/0/0/0/video.3gp'
        type='video/3gpp' medium='video' expression='full'
        duration='202' yt:format='6'/>
      <media:player url='https://www.youtube.com/watch?v=jXE6G9CYcJs'/>
      <media:thumbnail url='http://img.youtube.com/vi/jXE6G9CYcJs/default.jpg'
        height='90' width='120' time='00:01:41' yt:name='default'/>
      <media:thumbnail url='http://img.youtube.com/vi/jXE6G9CYcJs/hqdefault.jpg'
        height='360' width='480' yt:name='hqdefault'/>
      <media:thumbnail url='http://img.youtube.com/vi/jXE6G9CYcJs/mqdefault.jpg'
        height='180' width='320' yt:name='mqdefault'/>
      <media:thumbnail url='http://img.youtube.com/vi/jXE6G9CYcJs/1.jpg'
        height='90' width='120' time='00:00:50.500' yt:name='start'/>
      <media:thumbnail url='http://img.youtube.com/vi/jXE6G9CYcJs/2.jpg'
        height='90' width='120' time='00:01:41' yt:name='end'/>
      <media:thumbnail url='http://img.youtube.com/vi/jXE6G9CYcJs/3.jpg'
        height='90' width='120' time='00:02:31.500' yt:name='middle'/>
    </media:group>
    <yt:statistics viewCount='286355' favoriteCount='201'/>
  </entry>
</feed>
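
The long {URI} prefixes in the XPath strings above can be shortened. As a hedged sketch, the ElementTree search methods accept a prefix-to-URI mapping, so the tag can be written as media:thumbnail, while the attribute is read with its fully qualified name. The names NS, YT_NAME, and thumbnails are illustrative, and the inline XML is a cut-down stand-in for sample.xml.

```python
import xml.etree.ElementTree as et

# Cut-down stand-in for sample.xml (assumption: same namespaces, fewer elements).
SAMPLE = """<feed xmlns='http://www.w3.org/2005/Atom'
  xmlns:media='http://search.yahoo.com/mrss/'
  xmlns:yt='http://gdata.youtube.com/schemas/2007'>
  <entry>
    <media:group>
      <media:thumbnail url='http://img.youtube.com/vi/jXE6G9CYcJs/default.jpg'
        height='90' width='120' time='00:01:41' yt:name='default'/>
      <media:thumbnail url='http://img.youtube.com/vi/jXE6G9CYcJs/hqdefault.jpg'
        height='360' width='480' yt:name='hqdefault'/>
      <media:thumbnail url='http://img.youtube.com/vi/jXE6G9CYcJs/mqdefault.jpg'
        height='180' width='320' yt:name='mqdefault'/>
    </media:group>
  </entry>
</feed>"""

# Prefix-to-URI mapping used only by the search; the prefix names are our choice.
NS = {
    'media': 'http://search.yahoo.com/mrss/',
    'yt': 'http://gdata.youtube.com/schemas/2007',
}
# Namespaced attribute names are stored in e.attrib as {URI}localname.
YT_NAME = '{%s}name' % NS['yt']

root = et.fromstring(SAMPLE)

# Index every thumbnail URL by its yt:name value.
thumbnails = {
    e.get(YT_NAME): e.get('url')
    for e in root.iterfind('.//media:thumbnail', NS)
}

print(thumbnails['mqdefault'])
```

Building a dictionary keyed by yt:name makes the lookup independent of the order in which the thumbnails appear in the feed, which is the problem with indexing by [0] or [1].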

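Answer 2 (score: 0)

To create a DOM object from this XML string, you must define the XML namespaces, either in the root tag or in the tag that uses them.

A namespace is defined by the xmlns attribute in the start tag of an element.

A namespace declaration has the following syntax: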

xmlns:prefix="URI"

For example:

<root>
    <h:table xmlns:h="http://bluejson.com/W3C/">
        <h:tr>
            <h:td>JSON</h:td>
            <h:td>JavaScript</h:td>
            <h:td>Python</h:td>
        </h:tr>
    </h:table>

    <f:table xmlns:f="http://bluejson.com/W3C/">
        <f:name>My Study Room</f:name>
        <f:width>800</f:width>
        <f:height>420</f:height>
        <f:length>1120</f:length>
    </f:table>
</root>

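In the example above, the xmlns attribute in each table tag gives the h: and f: prefixes a qualified namespace.

When a namespace is defined for an element, all child elements with the same prefix are associated with the same namespace.

Namespaces can be declared in the elements where they are used or in the XML root element: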

<root xmlns:h="http://bluejson.com/W3C/" xmlns:f="http://bluejson.com/W3C/">
    <h:table>
        <h:tr>
            <h:td>JSON</h:td>
            <h:td>JavaScript</h:td>
            <h:td>Python</h:td>
        </h:tr>
    </h:table>

    <f:table>
        <f:name>My Study Room</f:name>
        <f:width>800</f:width>
        <f:height>420</f:height>
        <f:length>1120</f:length>
    </f:table>
</root>
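
To illustrate that a prefix is only a local alias for a namespace URI, here is a small sketch using xml.dom.minidom on a shortened version of the example above. Because h: and f: are bound to the same URI in this example, a namespace-aware lookup returns both table elements. The variable names are illustrative.

```python
import xml.dom.minidom

# Shortened version of the example above (assumption: child elements trimmed).
SAMPLE = """<root xmlns:h="http://bluejson.com/W3C/" xmlns:f="http://bluejson.com/W3C/">
    <h:table><h:td>JSON</h:td></h:table>
    <f:table><f:name>My Study Room</f:name></f:table>
</root>"""

dom = xml.dom.minidom.parseString(SAMPLE)

# A lookup by namespace URI and local name ignores which prefix was written.
tables = dom.getElementsByTagNameNS("http://bluejson.com/W3C/", "table")

for table in tables:
    # tagName keeps the prefix as written; namespaceURI is what the prefix stands for.
    print(table.tagName, table.localName, table.namespaceURI)
```

If the two tables are meant to be distinguishable by namespace, each prefix would need its own URI.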

Now, the Python code to create the XML DOM object and get the attributes:

import xml.dom.minidom

dom = xml.dom.minidom.parseString("""
<root xmlns:media="http://media/" xmlns:yt="http://media/yt/">
    <media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/default.jpg" height="90" width="120" time="00:01:48.500" yt:name="default" />
    <media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/mqdefault.jpg" height="180" width="320" yt:name="mqdefault" />
    <media:thumbnail url="http://i.ytimg.com/vi/k8J-72MmTGg/hqdefault.jpg" height="360" width="480" yt:name="hqdefault" />
</root>""")

media_thumbnail = dom.getElementsByTagNameNS("http://media/","thumbnail")
# print() with a single argument behaves the same on Python 2 and Python 3.
print(media_thumbnail[0].getAttribute("height"))
print(media_thumbnail[0].getAttribute("width"))
print(media_thumbnail[0].getAttribute("time"))
print(media_thumbnail[0].getAttributeNS("http://media/yt/","name"))
media_thumbnail[0].setAttribute("unit","px")
media_thumbnail[0].setAttributeNS("http://media/yt/","value","1")
print(dom.toxml())

Output:

90
120
00:01:48.500
default
<?xml version="1.0" ?><root xmlns:media="http://media/" xmlns:yt="http://media/yt/">
    <media:thumbnail height="90" time="00:01:48.500" unit="px" url="http://i.ytimg.com/vi/k8J-72MmTGg/default.jpg" value="1" width="120" yt:name="default"/>
    <media:thumbnail height="180" url="http://i.ytimg.com/vi/k8J-72MmTGg/mqdefault.jpg" width="320" yt:name="mqdefault"/>
    <media:thumbnail height="360" url="http://i.ytimg.com/vi/k8J-72MmTGg/hqdefault.jpg" width="480" yt:name="hqdefault"/>
</root>