Trying to add text to XML

Time: 2017-01-02 00:52:11

Tags: python xml wordpress unicode lxml

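I have been trying to solve this problem for the past several hours, and reading the documentation and existing forum posts has not gotten me anywhere. So before giving up, I figured I would ask here as a last resort.

Basically, the task at hand is to open a file (actually many files) full of text that I want to place into new XML elements. The text files were themselves created by a Python script, which handled both UTF-16 and UTF-8 without problems. But whenever I try to read the text content into memory in order to put it into a new XML tag (instead of writing it to a new text file as before), I get the following error message: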

"Traceback (most recent call last):
  File "K:\Users\Johnny\My Documents\PythonSandbox\websiteMigrationScripts\createXmlFile.py", line 87, in <module>
    root[k][directionsBodyIndex].text = '<![CDATA[' + "".join(directionsBuffer) + ']]>'
  File "src/lxml/lxml.etree.pyx", line 1031, in lxml.etree._Element.text.__set__ (src\lxml\lxml.etree.c:55337)
  File "src/lxml/apihelpers.pxi", line 711, in lxml.etree._setNodeText (src\lxml\lxml.etree.c:24657)
  File "src/lxml/apihelpers.pxi", line 699, in lxml.etree._createTextNode (src\lxml\lxml.etree.c:24506)
  File "src/lxml/apihelpers.pxi", line 1431, in lxml.etree._utf8 (src\lxml\lxml.etree.c:32293)
  UnicodeEncodeError: 'utf-8' codec can't encode character '\udc92' in position 1862: surrogates not allowed"
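
For reference, a minimal sketch of one way a '\udc92' like the one in this traceback can arise. It assumes (this is a guess, not something confirmed for the real file) that the input contains the single byte 0x92, which is how Windows-1252 encodes the right single quotation mark (U+2019). The sample bytes are made up for illustration.

```python
# Assumption: the file holds the Windows-1252 byte 0x92 (a curly apostrophe).
raw = b"O\x92Shaughnessy Dam"

# 0x92 is not valid UTF-8 on its own, so errors="surrogateescape" does not
# raise; it maps the byte to the lone surrogate U+DC92 instead.
text = raw.decode("utf-8", errors="surrogateescape")
has_surrogate = "\udc92" in text

# A strict UTF-8 encode (which is what an XML library has to do with the
# string) rejects lone surrogates.
try:
    text.encode("utf-8")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc.reason

# Encoding with the same error handler restores the original bytes unchanged.
restored = text.encode("utf-8", errors="surrogateescape")
```

In other words, surrogateescape defers the decoding problem rather than solving it: the string only round-trips if it is encoded again with the same error handler.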

My script looks like this:

from bs4 import BeautifulSoup
import os, codecs
import imageFilesSub
import utf16FilesList
import openpyxl, lxml
from openpyxl.utils import get_column_letter, column_index_from_string

# First get the list of files to parse
filesDir = r'K:\Users\Johnny\My Documents\_World_of_Waterfalls\Website\tier 2 pages\tier 3 pages\tier 4 pages'
filesInDir = os.listdir(filesDir)
filesOutDir = r'.\blogsToParse'
filesToParse = []
for file in filesInDir:
    if (file.endswith('-template.html')) and not('travel-blog' in file) and not('accommodations' in file) and not('best-time-to-visit' in file) and not('activities' in file) and not('how-to-get-there' in file) and not('planning-and-preparing' in file) and not('restaurants' in file) and not('which-side' in file) and not('books-and-maps' in file):
        filesToParse.append(file)

# Then get a list of (unique) slugs that represent a unique row entry in the WoW Database
wowDatabaseDir = r'K:\Users\Johnny\My Documents\_World_of_Waterfalls\WordPressSite'
wowSpreadsheet = r'WoW Database for WP.xlsm'
wb = openpyxl.load_workbook(wowDatabaseDir + '\\' + wowSpreadsheet, data_only=True)
sheet = wb.active

# the following loop returns to maxRow the highest non-empty row
maxRow = 1  # openpyxl indexes from 1 not 0
for i in range(1, sheet.max_row): 
    if sheet.cell(row=i, column=33).value is None:
        pass
    else:
        maxRow = maxRow + 1

# now make a list containing the directory names of the writeups
writeupDirs = []
slugList = []
for i in range(3, maxRow + 1):
    writeupDirs.append(sheet.cell(row=i, column=18).value)
    slugList.append(sheet.cell(row=i, column=33).value)

from lxml import etree

xmlFile = 'WoW Database for WP 2017-01-01.xml'
data_file = wowDatabaseDir + '\\' + xmlFile
tree = etree.ElementTree(file=data_file)
root = tree.getroot()

k = 0
for element in root:
    try:
        element.attrib[root[k][0].tag] = root[k][0].text  # this puts Entry_No as an attribute of Row
        element.attrib[root[k][1].tag] = root[k][1].text  # this puts Waterfall Name as an attribute of Row
        root[k].append(etree.Element("Introduction_Body"))
        root[k].append(etree.Element("Directions_Body"))

        # need to go through some hoops and hurdles just to find the index of the desired tag (there must be a better way)
        children = []
        for child in root[k]:
            children.append(child.tag)
        fileDirIndex = children.index('File_directory')
        postSlugIndex = children.index('Post_Slug')
        introFilePtrIndex = children.index('Introduction_File_Ptr')
        introBodyIndex = children.index('Introduction_Body')
        introFile = wowDatabaseDir + '\\' + root[k][fileDirIndex].text + '\\' + root[k][introFilePtrIndex].text
        if root[k][postSlugIndex].text in utf16FilesList.utf16List:  # check the slug for unicode special handling
            inFile = open(introFile, 'r', encoding="utf-16", errors="surrogateescape")  # utf-16 works for Chinese, but not anything else
        else:
            inFile = open(introFile, 'r', encoding="utf-8", errors="surrogateescape")
        introBuffer = []
        for line in inFile:
            introBuffer.append(line)
        root[k][introBodyIndex].text = '<![CDATA[' + "".join(introBuffer) + ']]>'
        inFile.close()

        directionsFilePtrIndex = children.index('Directions_File_Ptr')
        directionsBodyIndex = children.index('Directions_Body')
        directionsFile = wowDatabaseDir + '\\' + root[k][fileDirIndex].text + '\\' + root[k][directionsFilePtrIndex].text
        if root[k][postSlugIndex].text in utf16FilesList.utf16List:  # check the slug for unicode special handling
            inFile = open(directionsFile, 'r', encoding="utf-16", errors="surrogateescape")  # utf-16 works for Chinese, but not anything else
        else:
            inFile = open(directionsFile, 'r', encoding="utf-8", errors="surrogateescape")
        directionsBuffer = []
        for line in inFile:
            directionsBuffer.append(line)
        root[k][directionsBodyIndex].text = '<![CDATA[' + "".join(directionsBuffer) + ']]>'
        inFile.close()
    except IndexError:
        pass
    k = k+1
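
On the comment in the script about jumping through hoops to find the index of a tag: a minimal sketch of looking children up by tag name rather than by position. It uses the standard library's xml.etree.ElementTree so that it runs without extra packages, on the assumption that lxml.etree offers the same find(), findtext() and SubElement() calls. The row content below is made up for illustration.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# Hypothetical stand-in for one <Row> element of the database export.
row = ET.fromstring(
    "<Row>"
    "<Entry_No>35</Entry_No>"
    "<File_directory>wapama-falls</File_directory>"
    "<Post_Slug>wapama-falls</Post_Slug>"
    "</Row>"
)

# find() returns the first direct child with the given tag, or None.
file_dir = row.find("File_directory").text
# findtext() returns that child's text directly.
post_slug = row.findtext("Post_Slug")
missing = row.find("No_Such_Tag")

# SubElement() creates the child, appends it, and returns it, so no index
# lookup is needed before setting its text.
directions = ET.SubElement(row, "Directions_Body")
directions.text = "Turn right onto Hetch Hetchy Road."
```

Separately, lxml documents an etree.CDATA wrapper for text that should be serialized as a CDATA section; a literal '<![CDATA[' string assigned to .text would likely be escaped on output instead.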

The offending text file (at least the first one that was flagged) looks like this:

<div class="ad-right">[adrotate banner="17"]</div>

Wapama Falls sits in Hetch Hetchy, which is in the remote northwest corner of Yosemite National Park.  We generally drive up to Yosemite Valley from Los Angeles before getting up to Hetch Hetchy so we'll describe this route first.  It typically takes us about 6 hours to make the drive from <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014181&aid=825833" target="_blank">Los Angeles</a> to Yosemite Valley.  We normally go from <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014181&aid=825833" target="_blank">Los Angeles</a> to <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20013079&aid=825833" target="_blank">Fresno</a> via the I-5 and Hwy 99, then through <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014886&aid=825833" target="_blank">Oakhurst</a> and <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20016735&aid=825833" target="_blank">Wawona</a> via the Hwy 41.  Once in Yosemite Valley, we'd drive west towards the Big Oak Flat Road where the Hwy 120 and Hwy 140 junction.  Then, we'd drive uphill on the Hwy 140 towards the Big Oak Flat Entrance (the Northwest Entrance), where we'd leave the park. 

From the Big Oak Flat Entrance on the Big Oak Flat Road (Route 120), we'd shortly have to turn right at the signed turnoff for <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014336&aid=825833" target="_blank">Mather</a> and the Evergreen Road.  Then, we'd follow Evergreen Road for 7.5 miles to its junction with Hetch Hetchy Road in <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014336&aid=825833" target="_blank">Mather</a>.  Turning right onto Hetch Hetchy Road, we'd follow it to the parking lot by the O’Shaughnessy Dam after about seven miles.  On the way, we'd have passed through another entrance fee station.  The two-lane road was a bit narrow in places so we had to drive slowly.  Eventually, we'd reach a car park next to the dam.  The drive from Yosemite Valley to the car park at the O'Shaugnessy Dam took us less than 90 minutes.

From <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20015732&aid=825833" target="_blank">San Francisco</a>, we'd drive east towards <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20015274&aid=825833" target="_blank">Pleasanton</a>, then continue east on the I-205 towards the Hwy 120 passing through <a rel="nofollow" href="    http://www.booking.com/searchresults.html?city=20013298&aid=825833" target="_blank">Groveland</a> and eventually through the town of <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014336&aid=825833" target="_blank">Mather</a>.  Once we were east of <a rel="nofollow" href="http://www.booking.com/searchresults.html?city=20014336&aid=825833" target="_blank">Mather</a>, we'd follow the road to the O'Shaugnessy Dam as described above.  Overall, this drive would take around 4 hours without traffic.

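This happens to be the 35th record being processed, so the script successfully parsed the previous 34 records and populated the modified XML for them. Those files are similar text files (basically text with HTML markup and some WordPress shortcodes).

This is what frustrates me. I don't understand how this 35th file differs from the previous 34. I'm also not sure how the illegal character got in there in the first place, or how to get past it.

Any help from the community would be greatly appreciated.

Thanks!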
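
One way to narrow this down might be to list which characters in the decoded text are escaped bytes, and to try a fallback encoding for files that are not valid UTF-8. The sketch below is only that, a sketch: the helper names find_escaped_bytes and decode_with_fallback are hypothetical, and the choice of cp1252 as the fallback is an assumption based on the curly apostrophe in "O’Shaughnessy" in the sample above.

```python
def find_escaped_bytes(text):
    """Return (character index, original byte) for each surrogate-escaped byte.

    errors="surrogateescape" maps an undecodable byte b to U+DC00 + b,
    so these characters fall in the range U+DC80..U+DCFF.
    """
    return [
        (index, ord(ch) - 0xDC00)
        for index, ch in enumerate(text)
        if 0xDC80 <= ord(ch) <= 0xDCFF
    ]


def decode_with_fallback(data, fallback="cp1252"):
    """Decode as strict UTF-8, else with the fallback encoding (an assumption)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)


# Made-up sample: the same phrase as in the text file, encoded as cp1252.
sample = "the O\u2019Shaughnessy Dam".encode("cp1252")

escaped = sample.decode("utf-8", errors="surrogateescape")
hits = find_escaped_bytes(escaped)

clean = decode_with_fallback(sample)
utf8_bytes = clean.encode("utf-8")  # no lone surrogates, so this succeeds
```

If the reported indexes line up with typographic quotes or dashes in the file, that would suggest the file was saved in a single-byte Windows encoding rather than UTF-8.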

0 answers:

No answers