Question

I am currently trying to parse an XML file fetched online and extract the data I need from it. My code is shown below:

import urllib2
from xml.dom.minidom import parse
import pandas as pd
import time

page = urllib2.urlopen('http://www.wrh.noaa.gov/mesowest/getobextXml.php?sid=KBFI&num=360')
page_content = page.read()
with open('KBFI.xml', 'w') as fid:
    fid.write(page_content)

data = []

xml = parse('KBFI.xml')
percp = 0
for station in xml.getElementsByTagName('station'):
    for ob in xml.getElementsByTagName('ob'):
        # Convert time string to time_struct ignoring last 4 chars ' PDT'
        ob_time = time.strptime(ob.getAttribute('time')[:-4],'%d %b %I:%M %p')
        for variable in xml.getElementsByTagName('variable'):
            if variable.getAttribute('var') == 'PCP1H':
                percp = True
                # UnIndent if you want all variables
                if variable.getAttribute('value') == 'T':
                    data.append([ob_time.tm_mday,
                                 ob_time.tm_hour,
                                 ob_time.tm_min,
                                 0])
                elif variable.getAttribute('value') >= 0:
                    data.append((ob_time.tm_mday,
                                ob_time.tm_hour,
                                ob_time.tm_min,
                                variable.getAttribute('value')))
        if not percp:
            # If PCP1H wasn't found add as 0
            data.append([ob_time.tm_mday,
                        ob_time.tm_hour,
                        ob_time.tm_min,
                        0])
print data

Unfortunately I cannot post an image of the XML file, but a copy of it is saved to the current directory when my script is run.

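I want the code to simply check whether the 'variable' PCP1H exists and print its 'value' if it does (there is only one such entry per 'ob'). If it does not exist, or if its value is 'T', I want it to print '0' for that particular hour. Currently the output (the script I provided can be run to see it) contains completely incorrect values, with six entries per hour instead of one. What is wrong with my code?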

Answer 1

The main problem in your code is that in every for loop you fetch the elements using:

xml.getElementsByTagName('ob')

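This starts the search from the xml object, which in your case is the root of the document. The same applies to xml.getElementsByTagName('variable'): it also searches from the root, so on every iteration you get every element with the tag variable in the whole file. That is why you get six entries per hour instead of one (the complete XML contains six of them).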

You should instead get the variables using:

ob.getElementsByTagName('variable')

and get the ob elements using:

station.getElementsByTagName('ob')

That way we only look inside the particular element we are currently iterating over, not the complete XML document.
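To make the difference in scope concrete, here is a minimal sketch on a made-up XML snippet. The station > ob > variable layout and every sample value below are assumptions for illustration, not the actual NOAA feed.

```python
from xml.dom.minidom import parseString

# Hypothetical sample data; the real feed's layout and values may differ.
SAMPLE = """<station id="KBFI">
  <ob time="27 Jul 10:53 am PDT">
    <variable var="T" value="70"/>
    <variable var="PCP1H" value="0.02"/>
  </ob>
  <ob time="27 Jul 9:53 am PDT">
    <variable var="T" value="68"/>
  </ob>
</station>"""

doc = parseString(SAMPLE)
first_ob = doc.getElementsByTagName('ob')[0]

# Called on the document: matches every <variable> in the whole file.
all_variables = doc.getElementsByTagName('variable')

# Called on one <ob>: matches only the descendants of that observation.
ob_variables = first_ob.getElementsByTagName('variable')

print(len(all_variables))
print(len(ob_variables))
```

With this sample the document-level call finds all three variable elements, while the call on the first ob finds only its own two.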

Also, as a side note, you are doing:

elif variable.getAttribute('value') >= 0:

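If I am not mistaken, getAttribute() returns a string, so this check is always true regardless of the actual value (Python 2 orders any string after any number). In the XML I see that value holds both strings and numbers, so I am not sure what condition you intended. This is not the main issue, though; the main issue is the one I described above.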
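If the intent is to keep numeric precipitation amounts and treat 'T' as zero, one option is to convert the attribute string explicitly before using it. This is only a sketch: `precip_value` is a hypothetical helper name, and mapping any non-numeric string to 0.0 is an assumption about the desired behavior.

```python
def precip_value(raw):
    """Convert a PCP1H attribute string to a float.

    'T' and any other non-numeric string are mapped to 0.0
    (an assumption about what the asker wants).
    """
    if raw == 'T':
        return 0.0
    try:
        return float(raw)
    except ValueError:
        return 0.0


print(precip_value('0.02'))
print(precip_value('T'))
print(precip_value(''))
```

Converting up front avoids comparing a string with a number, which is always true in Python 2 and raises a TypeError in Python 3.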

Example code changes:

import urllib2
from xml.dom.minidom import parse
import pandas as pd
import time

page = urllib2.urlopen('http://www.wrh.noaa.gov/mesowest/getobextXml.php?sid=KBFI&num=360')
page_content = page.read()
with open('KBFI.xml', 'w') as fid:
    fid.write(page_content.decode())

data = []

xml = parse('KBFI.xml')
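percp = False
for station in xml.getElementsByTagName('station'):
    for ob in station.getElementsByTagName('ob'):
        # Reset the flag for every observation so that a PCP1H hit in an
        # earlier 'ob' does not suppress the 0 fallback for later ones
        percp = False
        # Convert time string to time_struct ignoring last 4 chars ' PDT'
        ob_time = time.strptime(ob.getAttribute('time')[:-4],'%d %b %I:%M %p')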
        for variable in ob.getElementsByTagName('variable'):
            if variable.getAttribute('var') == 'PCP1H':
                percp = True
                # UnIndent if you want all variables
                if variable.getAttribute('value') == 'T':
                    data.append([ob_time.tm_mday,
                                 ob_time.tm_hour,
                                 ob_time.tm_min,
                                 0])
                elif variable.getAttribute('value') >= 0:
                    data.append((ob_time.tm_mday,
                                ob_time.tm_hour,
                                ob_time.tm_min,
                                variable.getAttribute('value')))
        if not percp:
            # If PCP1H wasn't found add as 0
            data.append([ob_time.tm_mday,
                        ob_time.tm_hour,
                        ob_time.tm_min,
                        0])
print data
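For readers who want to try the scoped lookup without fetching the feed, the following self-contained sketch applies the same idea to an inline sample and emits exactly one row per ob. The XML layout, the sample values, the function name `precip_rows`, and `ASSUMED_YEAR` are all assumptions for illustration. The year is prepended only because the timestamps carry none and newer Python versions warn when a day of month is parsed without a year.

```python
import time
from xml.dom.minidom import parseString

# Hypothetical sample; the real feed's layout and values may differ.
SAMPLE = """<station id="KBFI">
  <ob time="27 Jul 10:53 am PDT">
    <variable var="PCP1H" value="0.02"/>
  </ob>
  <ob time="27 Jul 9:53 am PDT">
    <variable var="PCP1H" value="T"/>
  </ob>
  <ob time="27 Jul 8:53 am PDT">
    <variable var="T" value="68"/>
  </ob>
</station>"""

ASSUMED_YEAR = '2015'  # assumption: the timestamps themselves have no year


def precip_rows(xml_text):
    """Return one (day, hour, minute, amount) tuple per <ob>."""
    rows = []
    doc = parseString(xml_text)
    for station in doc.getElementsByTagName('station'):
        for ob in station.getElementsByTagName('ob'):
            # Drop the trailing ' PDT' and parse the rest of the timestamp.
            stamp = ASSUMED_YEAR + ' ' + ob.getAttribute('time')[:-4]
            ob_time = time.strptime(stamp, '%Y %d %b %I:%M %p')
            amount = 0.0  # default when PCP1H is missing or is 'T'
            for variable in ob.getElementsByTagName('variable'):
                if variable.getAttribute('var') == 'PCP1H':
                    try:
                        amount = float(variable.getAttribute('value'))
                    except ValueError:
                        amount = 0.0  # e.g. 'T'
                    break
            rows.append((ob_time.tm_mday, ob_time.tm_hour,
                         ob_time.tm_min, amount))
    return rows


rows = precip_rows(SAMPLE)
for row in rows:
    print(row)
```

Starting each observation with a default of 0.0 removes the need for a separate found flag, so an earlier match cannot leak into later observations.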
