I am reading some XML data; in particular, I have the following string:
H.P. Dembinski, B. K\'{e}gl, I.C. Mari\c{s}, M. Roth, D. Veberi\v{c}
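This is LaTeX notation. I am using MathJax, but without a $ sign MathJax does not recognize the text, so it is displayed in my browser exactly as shown above. I am reading the XML data with the following code: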
today = 'some date'  # placeholder
base_url = "http://export.arxiv.org/oai2?verb=ListRecords&"
url = (base_url + "from=%s&until=%s&" % (today, today) + "metadataPrefix=arXivRaw")
try:
    response = urllib2.urlopen(url)
except urllib2.HTTPError, e:
    return
rawdata = response.read()
root = ET.fromstring(rawdata)
if root.find(OAI+'ListRecords') is not None:
    for record in root.find(OAI+'ListRecords').findall(OAI+"record"):
        # info is the arXivRaw element inside the record's metadata
        info = record.find(OAI+'metadata').find(ARXIVRAW+"arXivRaw")
        author_string = info.find(ARXIVRAW+"authors").text
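To make the namespace-qualified lookups above easier to try in isolation, here is a minimal sketch that runs the same kind of traversal over a small inline document. The XML in `SAMPLE` is hand-written for illustration and is not a real arXiv response; only the namespace URIs and element names are taken from the code in this question. The sketch is written for Python 3. If the real response has the same shape, it suggests that ElementTree simply returns whatever text the `authors` element contains, LaTeX macros included.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ARXIVRAW = "{http://arxiv.org/OAI/arXivRaw/}"

# Hand-written sample (an assumption for illustration), NOT a real arXiv response.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <arXivRaw xmlns="http://arxiv.org/OAI/arXivRaw/">
          <authors>B. K\\'{e}gl, I.C. Mari\\c{s}</authors>
        </arXivRaw>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

# Passing bytes lets the parser honour the encoding named in the XML declaration.
root = ET.fromstring(SAMPLE)

authors = []
for record in root.iter(OAI + "record"):
    # Walk record -> metadata -> arXivRaw -> authors, each with its namespace.
    info = record.find(OAI + "metadata").find(ARXIVRAW + "arXivRaw")
    authors.append(info.find(ARXIVRAW + "authors").text)

print(authors)
```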
I can read the same text with feedparser, in which case I get
u'H. P. Dembinski, B. K\xe9gl, I. C. Mari\u015f, M. Roth, D. Veberi\u010d'
My browser interprets all of these special characters correctly. This is my feedparser solution:
url = 'some url'
response = urllib.urlopen(url).read().decode('latin-1')
feed = feedparser.parse(response)
for entry in feed.entries:
    data = {}
    try:
        data['authors'] = ', '.join(author.name for author in entry.authors)
    except AttributeError:
        data['authors'] = ''
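As a quick check of what the feedparser string actually holds, the sketch below lists the Unicode names of its non-ASCII characters. The string is copied from the output quoted in this question; the variable names are only illustrative, and the snippet is written for Python 3.

```python
import unicodedata

# The string as feedparser returned it, copied from the question.
authors = u'H. P. Dembinski, B. K\xe9gl, I. C. Mari\u015f, M. Roth, D. Veberi\u010d'

# Collect the Unicode name of every non-ASCII character in the string.
names = [unicodedata.name(ch) for ch in authors if ord(ch) > 127]
for name in names:
    print(name)
```

So the escapes are ordinary accented letters (e with acute, s with cedilla, c with caron), which is why a browser can render them without MathJax.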
How can I change the ElementTree solution (the first one) so that it yields the same string as the feedparser solution?
Edit: here is a piece of code that produces the unwanted result
import urllib2
from itertools import ifilter
import xml.etree.ElementTree as ET
import feedparser

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ARXIV = "{http://arxiv.org/OAI/arXiv/}"
ARXIVRAW = "{http://arxiv.org/OAI/arXivRaw/}"

def main():
    url = "http://export.arxiv.org/oai2?verb=GetRecord&identifier=oai:arXiv.org:1503.09027&metadataPrefix=arXivRaw"
    try:
        response = urllib2.urlopen(url)
    except urllib2.HTTPError, e:
        return
    rawdata = response.read().decode('latin-1')
    root = ET.fromstring(rawdata)
    record = root.find(OAI+'GetRecord').findall(OAI+"record")
    meta = record[0].find(OAI+'metadata')
    info = meta.find(ARXIVRAW+"arXivRaw")
    print "author = ", info.find(ARXIVRAW+"authors").text

    base_url = 'http://export.arxiv.org/api/query?'
    search_query = 'id:1503.09027'
    max_results = 2000
    sortBy = 'submittedDate'
    sortOrder = 'ascending'
    query = 'search_query=%s&max_results=%i&sortBy=%s&sortOrder=%s' % (search_query, max_results, sortBy, sortOrder)
    response = urllib2.urlopen(base_url+query).read().decode('latin-1')
    feed = feedparser.parse(response)
    for entry in feed.entries:
        print "entry.authors = ", entry.authors

if __name__ == "__main__":
    main()
Output of python test.py:
author = H.P. Dembinski, B. K\'{e}gl, I.C. Mari\c{s}, M. Roth, D. Veberi\v{c}
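One possible direction, sketched here under stated assumptions rather than as a definitive fix: if the LaTeX macros are already present in the arXivRaw metadata itself, then changing how ElementTree is called will not remove them, and the string has to be post-processed. The helper below, `latex_accents_to_unicode`, and its `COMBINING` table are hypothetical names introduced for this illustration. It covers only a handful of accent macros of the form `\x{letter}`, by appending the matching Unicode combining mark and normalizing to NFC. It is written for Python 3.

```python
import re
import unicodedata

# Hypothetical, deliberately incomplete table: LaTeX accent macro -> combining mark.
COMBINING = {
    "'": u"\u0301",  # acute
    "`": u"\u0300",  # grave
    "^": u"\u0302",  # circumflex
    '"': u"\u0308",  # diaeresis
    "~": u"\u0303",  # tilde
    "c": u"\u0327",  # cedilla
    "v": u"\u030c",  # caron
}

# Matches forms such as \'{e}, \c{s}, \v{c}: backslash, accent, one braced letter.
ACCENT_RE = re.compile(r"""\\(['`^"~cv])\{([A-Za-z])\}""")


def latex_accents_to_unicode(text):
    """Replace the supported LaTeX accent macros with precomposed characters."""
    def repl(match):
        accent, letter = match.groups()
        # NFC merges letter + combining mark into one code point where one exists.
        return unicodedata.normalize("NFC", letter + COMBINING[accent])
    return ACCENT_RE.sub(repl, text)


raw = u"H.P. Dembinski, B. K\\'{e}gl, I.C. Mari\\c{s}, M. Roth, D. Veberi\\v{c}"
converted = latex_accents_to_unicode(raw)
print(repr(converted))
```

For this input the accented letters come out as the same code points that feedparser reports (`\xe9`, `\u015f`, `\u010d`). The spacing of the initials ("H.P." versus "H. P.") is left untouched, so the result is not character-for-character identical to the feedparser string. Macros written without braces, nested macros, and other LaTeX constructs are not handled.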