How to handle special characters in HTML

Time: 2015-09-15 06:59:11

Tags: python xml elementtree feedparser

I am reading some XML data, and in particular I have the following string

H.P. Dembinski, B. K\'{e}gl, I.C. Mari\c{s}, M. Roth, D. Veberi\v{c}

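This is LaTeX notation. I am using MathJax, but without the $ delimiters MathJax does not recognize this text, so it is displayed in my browser exactly as shown above. I am reading the XML data with the following code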

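import urllib2
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ARXIVRAW = "{http://arxiv.org/OAI/arXivRaw/}"

today = 'some date'
base_url = "http://export.arxiv.org/oai2?verb=ListRecords&"
url = (base_url + "from=%s&until=%s&" % (today, today) + "metadataPrefix=arXivRaw")

try:
    response = urllib2.urlopen(url)

except urllib2.HTTPError, e:
    return  # this excerpt runs inside a function

rawdata = response.read()
root = ET.fromstring(rawdata)

if root.find(OAI+'ListRecords') is not None:
    for record in root.find(OAI+'ListRecords').findall(OAI+"record"):
        # descend to the arXivRaw element that holds the author list
        meta = record.find(OAI+'metadata')
        info = meta.find(ARXIVRAW+"arXivRaw")
        author_string = info.find(ARXIVRAW+"authors").text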
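To make the traversal above easier to try without hitting the network, here is a minimal sketch in Python 3 (the code in this question is Python 2). The `SAMPLE` document is a hand-written stand-in, not a real arXiv response, and `author_strings` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ARXIVRAW = "{http://arxiv.org/OAI/arXivRaw/}"

# Hand-written stand-in for an OAI-PMH ListRecords response (an assumption;
# a real response carries many more elements per record).
SAMPLE = rb"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <arXivRaw xmlns="http://arxiv.org/OAI/arXivRaw/">
          <authors>H.P. Dembinski, B. K\'{e}gl</authors>
        </arXivRaw>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def author_strings(xml_bytes):
    """Return the text of every arXivRaw <authors> element in the response."""
    root = ET.fromstring(xml_bytes)
    list_records = root.find(OAI + "ListRecords")
    if list_records is None:  # e.g. a response that contains no records
        return []
    found = []
    for record in list_records.findall(OAI + "record"):
        # iter() searches the whole subtree, so the intermediate
        # metadata/arXivRaw elements do not need to be named explicitly
        for authors in record.iter(ARXIVRAW + "authors"):
            found.append(authors.text)
    return found


print(author_strings(SAMPLE))
```

The LaTeX escapes come back unchanged: ElementTree decodes XML, and the accent macros are ordinary text as far as XML is concerned.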

I can read the same text with feedparser, in which case I get

u'H. P. Dembinski, B. K\xe9gl, I. C. Mari\u015f, M. Roth, D. Veberi\u010d'

My browser interprets all of these special characters correctly. Here is my feedparser solution

url = 'some url'
response = urllib.urlopen(url).read().decode('latin-1')

feed = feedparser.parse(response)

for entry in feed.entries:
    data = {}

    try:
        data['authors'] = ', '.join(author.name for author in entry.authors)
    except AttributeError:
        data['authors'] = ''

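How can I change the ElementTree solution (the first one) so that it returns the same string as the feedparser solution?

Edit: here is a piece of code that produces the unwanted result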

import urllib2
from itertools import ifilter
import xml.etree.ElementTree as ET
import feedparser

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ARXIV = "{http://arxiv.org/OAI/arXiv/}"
ARXIVRAW = "{http://arxiv.org/OAI/arXivRaw/}"

def main():

    url = "http://export.arxiv.org/oai2?verb=GetRecord&identifier=oai:arXiv.org:1503.09027&metadataPrefix=arXivRaw"

    try:
        response = urllib2.urlopen(url)

    except urllib2.HTTPError, e:
        return

    rawdata = response.read().decode('latin-1')
    root = ET.fromstring(rawdata)

    record = root.find(OAI+'GetRecord').findall(OAI+"record")
    meta = record[0].find(OAI+'metadata')
    info = meta.find(ARXIVRAW+"arXivRaw")
    print "author = ", info.find(ARXIVRAW+"authors").text  

    base_url = 'http://export.arxiv.org/api/query?'

    search_query = 'id:1503.09027'              
    max_results = 2000
    sortBy = 'submittedDate'
    sortOrder = 'ascending'

    query = 'search_query=%s&max_results=%i&sortBy=%s&sortOrder=%s' % (search_query, max_results, sortBy, sortOrder)

    response = urllib2.urlopen(base_url+query).read().decode('latin-1')
    feed = feedparser.parse(response)

    for entry in feed.entries:
        print "entry.authors = ", entry.authors

if __name__ == "__main__":
    main()

Output of python test.py:

author = H.P. Dembinski, B. K\'{e}gl, I.C. Mari\c{s}, M. Roth, D. Veberi\v{c}
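One possible direction, sketched here in Python 3 and not tested against the full arXiv data: since the arXivRaw record stores the accents as LaTeX macros, they could be converted after parsing. The sketch below maps a small, illustrative subset of accent macros to Unicode combining characters and lets `unicodedata.normalize` compose them. The names `COMBINING`, `ACCENT` and `latex_accents_to_unicode` are hypothetical, and only the braced form (`\'{e}`, not `\'e`) is handled.

```python
import re
import unicodedata

# LaTeX accent macro -> Unicode combining character.
# Illustrative subset only; other macros are left untouched.
COMBINING = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    '"': "\u0308",  # diaeresis
    "~": "\u0303",  # tilde
    "c": "\u0327",  # cedilla
    "v": "\u030c",  # caron
}

# Matches a backslash, one accent macro, and a single braced letter,
# e.g. \'{e} or \v{c}.
ACCENT = re.compile(r"""\\(['`^"~cv])\s*\{(\w)\}""")


def latex_accents_to_unicode(text):
    """Replace braced LaTeX accent macros with precomposed Unicode letters."""
    def compose(match):
        accent, letter = match.groups()
        # NFC merges base letter + combining mark into one code point
        return unicodedata.normalize("NFC", letter + COMBINING[accent])
    return ACCENT.sub(compose, text)


raw = r"H.P. Dembinski, B. K\'{e}gl, I.C. Mari\c{s}, M. Roth, D. Veberi\v{c}"
print(latex_accents_to_unicode(raw))
```

For the author string above this yields the same accented letters as the feedparser result (`\xe9`, `\u015f`, `\u010d`). The spacing of the initials still differs ("H.P." versus "H. P.").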

0 answers:

No answers yet.