I hope you can help me. I need to create a function that parses text and extracts the data into a pandas DataFrame:
"""
Function
--------
rcp_poll_data
Extract poll information from an XML string, and convert to a DataFrame
Parameters
----------
xml : str
A string, containing the XML data from a page like
get_poll_xml(1044)
Returns
-------
A pandas DataFrame with the following columns:
date: The date for each entry
title_n: The data value for the gid=n graph (take the column name from the `title` tag)
This DataFrame should be sorted by date
Example
-------
Consider the following simple xml page:
<chart>
<series>
<value xid="0">1/27/2009</value>
<value xid="1">1/28/2009</value>
</series>
<graphs>
<graph gid="1" color="#000000" balloon_color="#000000" title="Approve">
<value xid="0">63.3</value>
<value xid="1">63.3</value>
</graph>
<graph gid="2" color="#FF0000" balloon_color="#FF0000" title="Disapprove">
<value xid="0">20.0</value>
<value xid="1">20.0</value>
</graph>
</graphs>
</chart>
Given this string, rcp_poll_data should return
result = pd.DataFrame({'date': pd.to_datetime(['1/27/2009', '1/28/2009']),
'Approve': [63.3, 63.3], 'Disapprove': [20.0, 20.0]})
def rcp_poll_data(xml):
    soup = BeautifulSoup(xml,'xml')
    dates=soup.find("series")
    datesval=soup.findChildren(string=True)
    del datesval[-7:]
    obama=soup.find("graph",gid="1")
    obamaval={"title":obama["title"],"color":obama["color"]}
    romney=soup.find("graph",gid="2")
    romneyval={"title":romney["title"],"color":romney["color"]}
    result = pd.DataFrame({'date': pd.to_datetime(datesval,errors="ignore"), 'GID1':obamaval, 'GID2':romneyval})
    return result
"""
But when I run the program, I keep getting this error: Mixing dicts with non-Series may lead to ambiguous ordering.
Please help! PS: the get_poll_xml function looks like this:
def get_poll_xml(poll_id):
    url="http://charts.realclearpolitics.com/charts/"+str(poll_id)+".xml"
    return requests.get(url).content
For example, poll_id = 1044
Answer 0 (score: 0)
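For context on the error itself: in the question's code, obamaval and romneyval are dicts of graph attributes (title, color), not lists of poll values, while the date column is a plain array. pandas cannot tell how to align a dict with an unindexed array, so the constructor raises the ValueError quoted above. A minimal reproduction, as a sketch assuming a reasonably recent pandas version (the variable names here are only for illustration):

```python
import pandas as pd

dates = pd.to_datetime(["1/27/2009", "1/28/2009"])

# A dict of graph attributes, like obamaval in the question.
# Note that it holds metadata, not the poll values themselves.
attrs = {"title": "Approve", "color": "#000000"}

try:
    # 'date' is an array without row labels, 'GID1' is a dict keyed by labels,
    # so pandas refuses to guess how the rows should line up.
    pd.DataFrame({"date": dates, "GID1": attrs})
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# A list of numbers with the same length as the dates is accepted.
ok = pd.DataFrame({"date": dates, "Approve": [63.3, 63.3]})
print(ok)
```

So the fix is to collect the text of each graph's value elements into a list, and use the title attribute only as the column name.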
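Consider using the built-in xml.etree.ElementTree instead of BeautifulSoup (which is better suited to HTML web scraping) to parse the XML content. It provides methods such as iterfind, findall, and find, which walk down to child nodes using XPath expressions, including predicates like @gid='1'. And since the <value> elements under the <series> and <graph> parent tags have the same length, you can loop over them together with zip():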
import requests
import pandas as pd
import xml.etree.ElementTree as et

def get_poll_xml(poll_id):
    # Download the raw XML for one poll chart
    url="http://charts.realclearpolitics.com/charts/{}.xml".format(poll_id)
    return requests.get(url).content

def rcp_poll_data(xml):
    tree = et.fromstring(xml)
    dates = []; graphlist1 = []; graphlist2 = []

    # Column names come from each graph's title attribute
    g1title = tree.find("./graphs/graph[@gid='1']").get('title')
    g2title = tree.find("./graphs/graph[@gid='2']").get('title')

    # Walk the dates and both graphs in parallel
    for s, g1, g2 in zip(tree.iterfind("./series/value"),
                         tree.iterfind("./graphs/graph[@gid='1']/value"),
                         tree.iterfind("./graphs/graph[@gid='2']/value")):
        dates.append(s.text)
        graphlist1.append(g1.text)
        graphlist2.append(g2.text)

    return pd.DataFrame({'Date':pd.to_datetime(dates, errors="ignore"),
                         g1title: graphlist1,
                         g2title: graphlist2})

poll_id = 1044
xml_str = get_poll_xml(poll_id)
df = rcp_poll_data(xml_str)
Output
print(df.head(20))
# Approve Date Disapprove
# 0 63.3 2009-01-27 20.0
# 1 63.3 2009-01-28 20.0
# 2 63.5 2009-01-29 19.3
# 3 63.5 2009-01-30 19.3
# 4 61.8 2009-01-31 19.4
# 5 61.8 2009-02-01 19.4
# 6 61.8 2009-02-02 19.4
# 7 61.8 2009-02-03 19.4
# 8 61.8 2009-02-04 19.4
# 9 61.8 2009-02-05 19.4
# 10 61.6 2009-02-06 21.4
# 11 61.6 2009-02-07 21.4
# 12 61.6 2009-02-08 21.4
# 13 65.4 2009-02-09 22.6
# 14 65.4 2009-02-10 22.6
# 15 64.2 2009-02-11 23.3
# 16 64.2 2009-02-12 23.3
# 17 64.2 2009-02-13 23.3
# 18 64.8 2009-02-14 25.4
# 19 65.5 2009-02-15 25.5
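The function above hard-codes gid 1 and 2 and leaves the poll values as strings. If you want something closer to the docstring in the question (one column per graph, named after its title, sorted by date), here is one possible sketch. The name rcp_poll_data_by_title and the helper values_by_xid are made up for this example, and aligning rows on the xid attribute is my assumption about how the feed is meant to be read. It is tested only against the small sample from the question, not against the live feed.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# The sample page from the question's docstring.
SAMPLE_XML = """<chart>
<series>
<value xid="0">1/27/2009</value>
<value xid="1">1/28/2009</value>
</series>
<graphs>
<graph gid="1" color="#000000" balloon_color="#000000" title="Approve">
<value xid="0">63.3</value>
<value xid="1">63.3</value>
</graph>
<graph gid="2" color="#FF0000" balloon_color="#FF0000" title="Disapprove">
<value xid="0">20.0</value>
<value xid="1">20.0</value>
</graph>
</graphs>
</chart>"""


def values_by_xid(parent):
    """Return the text of each <value> child as a Series indexed by its xid."""
    return pd.Series(
        {v.get("xid"): v.text for v in parent.iterfind("value")},
        dtype=object,
    )


def rcp_poll_data_by_title(xml):
    """Hypothetical variant: one column per <graph>, named after its title."""
    root = ET.fromstring(xml)

    # Dates come from <series>; each graph contributes one numeric column.
    columns = {"date": pd.to_datetime(values_by_xid(root.find("series")))}
    for graph in root.iterfind("./graphs/graph"):
        # errors="coerce" turns empty or malformed values into NaN.
        columns[graph.get("title")] = pd.to_numeric(
            values_by_xid(graph), errors="coerce"
        )

    # Because every column is a Series, pandas aligns the rows on xid.
    df = pd.DataFrame(columns)
    return df.sort_values("date").reset_index(drop=True)


df = rcp_poll_data_by_title(SAMPLE_XML)
print(df)
```

Because every column is built as a Series keyed by xid, a graph that is missing a few entries ends up with NaN in those rows instead of making the constructor fail on mismatched lengths.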