Struggling to use Python to parse and manipulate an XML file that pulls in content from external text files

Time: 2016-11-14 17:24:37

Tags: python xml excel wordpress parsing

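I have spent months trying to accomplish the following task.

I have an Excel-generated XML file that captures the database behind a website I have been building. My dream is to import this XML file, or some manipulated form of it, into WordPress so that I no longer have to edit each post or page by hand, one at a time (especially when I make a change that affects several pieces of content, or every page/post on the site).

The Excel-generated XML file (which I call 'test_of_2016-09-19.xml') looks like this: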

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Row>
        <Entry_No>1</Entry_No>
        <Waterfall_Name>Bridalveil Fall</Waterfall_Name>
        <Continent___Super_Region>North America</Continent___Super_Region>
        <Country>USA</Country>
        <State___Province>California</State___Province>
        <Subregion>Southern and Central Sierras</Subregion>
        <locale___political_or_official>Mariposa County</locale___political_or_official>
        <alt__locale__unofficial_or_more_recognized>Yosemite National Park</alt__locale__unofficial_or_more_recognized>
        <Misc__Tags>oakhurst, el portal, mariposa, yosemite, yosemite valley, sierra, california, waterfall, fresno, modesto, wawona, tunnel, merced, pohono, wheelchair</Misc__Tags>
        <scenic_rating>4.5</scenic_rating>
        <difficulty_rating>1</difficulty_rating>
        <distance>roadside; 1/2 mile round trip to base; wheelchair</distance>
        <time_commitment>20 minutes</time_commitment>
        <GPS_Coordinates>37.71736, -119.64901</GPS_Coordinates>
        <date_first_visited>1999-09-04</date_first_visited>
        <date_last_visited>2011-06-04</date_last_visited>
        <Old_Web_Address>http://www.world-of-waterfalls.com/yosemite-bridalveil-fall.html</Old_Web_Address>
        <Post_Slug>yosemite-bridalveil-fall.html</Post_Slug>
    </Row>
    <Row>
        <Entry_No>52</Entry_No>
        <Waterfall_Name>Switzer Falls</Waterfall_Name>
        <Continent___Super_Region>North America</Continent___Super_Region>
        <Country>USA</Country>
        <State___Province>California</State___Province>
        <Subregion>Southern California</Subregion>
        <locale___political_or_official>Los Angeles County</locale___political_or_official>
        <alt__locale__unofficial_or_more_recognized>Angeles National Forest, La Canada Flintridge</alt__locale__unofficial_or_more_recognized>
        <Misc__Tags>la canada, flintridge, altadena, pasadena, san gabriel, angeles national forest, angeles crest, los angeles, southern california, california, waterfall, arroyo seco, gabrielino trail, clear creek station, adventure pass, picnic</Misc__Tags>
        <scenic_rating>2</scenic_rating>
        <difficulty_rating>3.5</difficulty_rating>
        <distance>4.6 miles round trip (to base of main drop)</distance>
        <time_commitment>3.5 hours (to base of main drop)</time_commitment>
        <GPS_Coordinates>34.25828, -118.15474</GPS_Coordinates>
        <date_first_visited>2003-02-02</date_first_visited>
        <date_last_visited>2016-04-23</date_last_visited>
        <Old_Web_Address>http://www.world-of-waterfalls.com/california-switzer-falls.html</Old_Web_Address>
        <File_directory>./waterfall_writeups/52_Switzer_Falls/</File_directory>
        <Introduction>introduction-switzer-falls.html</Introduction>
        <Directions>directions-switzer-falls.html</Directions>
        <Nearby_Waterfalls_Tags>southern california, pasadena, angeles crest, waterfall</Nearby_Waterfalls_Tags>
        <Itinerary_Tags>itinerary, switzer falls</Itinerary_Tags>
        <Trip_Report_Tags>trip report, switzer falls</Trip_Report_Tags>
        <Trip_Planning_Article_Tags>featured article, switzer falls</Trip_Planning_Article_Tags>
        <Post_Slug>california-switzer-falls.html</Post_Slug>
    </Row>
    <Row>
        <Entry_No>657</Entry_No>
        <Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
        <Continent___Super_Region>Asia</Continent___Super_Region>
        <Country>China</Country>
        <locale___political_or_official>Guangxi</locale___political_or_official>
        <alt__locale__unofficial_or_more_recognized>Daxin County</alt__locale__unofficial_or_more_recognized>
        <Misc__Tags>daxin, guichin river, guangxi, vietnam, china, waterfall, ban gioc, transnational, border</Misc__Tags>
        <scenic_rating>4</scenic_rating>
        <difficulty_rating>1.5</difficulty_rating>
        <distance>1km round trip</distance>
        <time_commitment>30-45 minutes</time_commitment>
        <GPS_Coordinates>22.85577, 106.72273</GPS_Coordinates>
        <date_first_visited>2009-04-23</date_first_visited>
        <date_last_visited>2009-04-23</date_last_visited>
        <Old_Web_Address>http://www.world-of-waterfalls.com/asia-detian-waterfall.html</Old_Web_Address>
        <Post_Slug>asia-detian-waterfall.html</Post_Slug>
    </Row>
    <Row>
        <Entry_No>1125</Entry_No>
    </Row>
    <Row>
        <Entry_No>1126</Entry_No>
    </Row>
    <Row>
        <Entry_No>1127</Entry_No>
    </Row>
</Root>

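What I want to do is this: wherever certain elements or tags exist in a Row (specifically File_directory, Introduction, and Directions), open the files they point to, grab their text contents, put those contents into new elements or tags (e.g. Introduction_Body, Directions_Body, etc.), and then write out the newly modified XML file.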
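One piece of that task, adding a new element directly after an existing one, can be sketched on its own without touching any files. This is only a minimal illustration: the helper name insert_after and the placeholder text are made up for the example, and it assumes Python 3 with the standard-library ElementTree module.

```python
import xml.etree.ElementTree as ET


def insert_after(parent, anchor_tag, new_tag, text):
    """Add <new_tag>text</new_tag> right after the first <anchor_tag> child.

    Returns the new element, or None if parent has no <anchor_tag> child.
    (Hypothetical helper, not part of ElementTree itself.)
    """
    anchor = parent.find(anchor_tag)
    if anchor is None:
        return None
    new_elem = ET.Element(new_tag)
    new_elem.text = text
    # Copy the anchor's trailing whitespace so the indentation stays consistent.
    new_elem.tail = anchor.tail
    position = list(parent).index(anchor)
    parent.insert(position + 1, new_elem)
    return new_elem


# A cut-down Row with only the tags that matter for this example.
row = ET.fromstring(
    "<Row><Entry_No>52</Entry_No>"
    "<Introduction>introduction-switzer-falls.html</Introduction>"
    "<Directions>directions-switzer-falls.html</Directions></Row>"
)
insert_after(row, "Introduction", "Introduction_Body", "placeholder intro text")
insert_after(row, "Missing_Tag", "Missing_Body", "never added")

tags = [child.tag for child in row]
print(tags)
```

Rows that lack the anchor tag are simply left alone, which matches the mostly empty rows (1125 to 1127) in the sample data.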

The new XML file would look like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Row>
        <Entry_No>1</Entry_No>
        <Waterfall_Name>Bridalveil Fall</Waterfall_Name>
        <Continent___Super_Region>North America</Continent___Super_Region>
        <Country>USA</Country>
        <State___Province>California</State___Province>
        <Subregion>Southern and Central Sierras</Subregion>
        <locale___political_or_official>Mariposa County</locale___political_or_official>
        <alt__locale__unofficial_or_more_recognized>Yosemite National Park</alt__locale__unofficial_or_more_recognized>
        <Misc__Tags>oakhurst, el portal, mariposa, yosemite, yosemite valley, sierra, california, waterfall, fresno, modesto, wawona, tunnel, merced, pohono, wheelchair</Misc__Tags>
        <scenic_rating>4.5</scenic_rating>
        <difficulty_rating>1</difficulty_rating>
        <distance>roadside; 1/2 mile round trip to base; wheelchair</distance>
        <time_commitment>20 minutes</time_commitment>
        <GPS_Coordinates>37.71736, -119.64901</GPS_Coordinates>
        <date_first_visited>1999-09-04</date_first_visited>
        <date_last_visited>2011-06-04</date_last_visited>
        <Old_Web_Address>http://www.world-of-waterfalls.com/yosemite-bridalveil-fall.html</Old_Web_Address>
        <Post_Slug>yosemite-bridalveil-fall.html</Post_Slug>
    </Row>
    <Row>
        <Entry_No>52</Entry_No>
        <Waterfall_Name>Switzer Falls</Waterfall_Name>
        <Continent___Super_Region>North America</Continent___Super_Region>
        <Country>USA</Country>
        <State___Province>California</State___Province>
        <Subregion>Southern California</Subregion>
        <locale___political_or_official>Los Angeles County</locale___political_or_official>
        <alt__locale__unofficial_or_more_recognized>Angeles National Forest, La Canada Flintridge</alt__locale__unofficial_or_more_recognized>
        <Misc__Tags>la canada, flintridge, altadena, pasadena, san gabriel, angeles national forest, angeles crest, los angeles, southern california, california, waterfall, arroyo seco, gabrielino trail, clear creek station, adventure pass, picnic</Misc__Tags>
        <scenic_rating>2</scenic_rating>
        <difficulty_rating>3.5</difficulty_rating>
        <distance>4.6 miles round trip (to base of main drop)</distance>
        <time_commitment>3.5 hours (to base of main drop)</time_commitment>
        <GPS_Coordinates>34.25828, -118.15474</GPS_Coordinates>
        <date_first_visited>2003-02-02</date_first_visited>
        <date_last_visited>2016-04-23</date_last_visited>
        <Old_Web_Address>http://www.world-of-waterfalls.com/california-switzer-falls.html</Old_Web_Address>
        <File_directory>./waterfall_writeups/52_Switzer_Falls/</File_directory>
        <Introduction>introduction-switzer-falls.html</Introduction>
        <Introduction_Body>This would be text from the file ./waterfall_writeups/52_Switzer_Falls/introduction-switzer-falls.html complete with links, img tags, and other lorem ipsum; would I need to do anything special for special characters like Chinese and Japanese Characters, accent markings, etc?</Introduction_Body>
        <Directions>directions-switzer-falls.html</Directions>
        <Directions_Body>This would be text from the file ./waterfall_writeups/52_Switzer_Falls/directions-switzer-falls.html complete with links, img tags, and other lorem ipsum; would I need to do anything special for special characters like Chinese and Japanese Characters, accent markings, etc?</Directions_Body>
        <Nearby_Waterfalls_Tags>southern california, pasadena, angeles crest, waterfall</Nearby_Waterfalls_Tags>
        <Itinerary_Tags>itinerary, switzer falls</Itinerary_Tags>
        <Trip_Report_Tags>trip report, switzer falls</Trip_Report_Tags>
        <Trip_Planning_Article_Tags>featured article, switzer falls</Trip_Planning_Article_Tags>
        <Post_Slug>california-switzer-falls.html</Post_Slug>
    </Row>
    <Row>
        <Entry_No>657</Entry_No>
        <Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
        <Continent___Super_Region>Asia</Continent___Super_Region>
        <Country>China</Country>
        <locale___political_or_official>Guangxi</locale___political_or_official>
        <alt__locale__unofficial_or_more_recognized>Daxin County</alt__locale__unofficial_or_more_recognized>
        <Misc__Tags>daxin, guichin river, guangxi, vietnam, china, waterfall, ban gioc, transnational, border</Misc__Tags>
        <scenic_rating>4</scenic_rating>
        <difficulty_rating>1.5</difficulty_rating>
        <distance>1km round trip</distance>
        <time_commitment>30-45 minutes</time_commitment>
        <GPS_Coordinates>22.85577, 106.72273</GPS_Coordinates>
        <date_first_visited>2009-04-23</date_first_visited>
        <date_last_visited>2009-04-23</date_last_visited>
        <Old_Web_Address>http://www.world-of-waterfalls.com/asia-detian-waterfall.html</Old_Web_Address>
        <Post_Slug>asia-detian-waterfall.html</Post_Slug>
    </Row>
    <Row>
        <Entry_No>1125</Entry_No>
    </Row>
    <Row>
        <Entry_No>1126</Entry_No>
    </Row>
    <Row>
        <Entry_No>1127</Entry_No>
    </Row>
</Root>
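On the question embedded in the placeholder text above (links, img tags, Chinese or Japanese characters, accents): as far as I can tell, nothing special is needed with ElementTree on Python 3, as long as the text is assigned to an element's .text and the output is serialized as UTF-8. The sketch below uses a made-up sample string to show that markup characters are escaped on output and restored on parsing, while non-ASCII characters pass through unchanged.

```python
import xml.etree.ElementTree as ET

row = ET.Element("Row")
body = ET.SubElement(row, "Introduction_Body")
# Sample text with markup and non-ASCII characters (made up for illustration).
body.text = '<p>See <a href="x.html">德天瀑布 (Détiān Pùbù)</a> &amp; more</p>'

# Serialize to UTF-8 bytes; ElementTree escapes &, < and > in text content.
xml_bytes = ET.tostring(row, encoding="utf-8")
xml_text = xml_bytes.decode("utf-8")
print(xml_text)

# Parsing the serialized form gives back exactly the original string.
round_trip = ET.fromstring(xml_bytes).find("Introduction_Body").text
assert round_trip == body.text
```

The HTML ends up stored as escaped text (&lt;p&gt; and so on) rather than as nested XML elements, so the write-up files do not need to be well-formed XML themselves. Whatever later reads the file, such as a WordPress importer, gets the original HTML back once it unescapes the element text.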

With guidance from some people on this forum, I was at least able to get as far as the following code:

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
import os

data_file = 'test_of_2016-09-19.xml'

tree = ET.ElementTree(file=data_file)
root = tree.getroot()

for element in root:
    if element.find('File_directory') is not None: 
        directory = element.find('File_directory').text
    if element.find('Introduction') is not None:
        introduction = element.find('Introduction').text
    if element.find('Directions') is not None:
        directions = element.find('Directions').text

    #The following code was suggested to me, but I'm having trouble getting them to work and understanding what each line is doing
    intro_tree = ET.ElementTree(directory+introduction) #throws NameError: name 'ET' is not defined
    intro_text = intro_tree.find('body').text #won't work since intro_tree not defined, but even then, I'm not sure what this line is trying to do
    intro = SubElement(element,'Introduction') #throws NameError: name 'SubElement' is not defined
    intro.text = intro_text #didn't get this far, but what is the intent of this line?
    # Do the same for Directions
    directions_tree = ET.ElementTree(directory+directions)
    directions_text = directions_tree.find('body').text
    directions = SubElement(element,'Direction')

# After the loop, write the file back with new elements added
tree.write('new_' + data_file)
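For comparison, here is one possible reworking of the loop above, offered as a sketch under stated assumptions rather than a definitive fix. A few things in the suggested lines stand out: ET.ElementTree(directory+introduction) passes the string as the tree's root element instead of parsing the file it names; SubElement has to be written as ET.SubElement (or imported by name); and directory, introduction and directions keep their values from an earlier Row when the current Row lacks those tags. The sketch reads each write-up as plain text instead of parsing it, since HTML fragments are not necessarily well-formed XML. It assumes Python 3 and UTF-8 encoded write-ups; the function name embed_writeups, the BODY_TAGS mapping and the throwaway demo files are all hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Maps the tag holding a file name to the tag that will hold the file's text.
BODY_TAGS = {"Introduction": "Introduction_Body", "Directions": "Directions_Body"}


def embed_writeups(root, base_dir="."):
    """For each Row with a File_directory, embed the text of the named files."""
    for row in root.findall("Row"):
        directory = row.findtext("File_directory")
        if directory is None:
            continue  # nothing to embed for this row
        for name_tag, body_tag in BODY_TAGS.items():
            name_elem = row.find(name_tag)
            if name_elem is None or not name_elem.text:
                continue
            path = os.path.join(base_dir, directory, name_elem.text)
            # Assumption: the write-ups are UTF-8 encoded.
            with open(path, encoding="utf-8") as handle:
                content = handle.read()
            body = ET.Element(body_tag)
            body.text = content
            body.tail = name_elem.tail
            # Place the new element right after the one naming the file.
            row.insert(list(row).index(name_elem) + 1, body)


SAMPLE_XML = (
    "<Root><Row><Entry_No>1</Entry_No></Row>"
    "<Row><Entry_No>52</Entry_No>"
    "<File_directory>./waterfall_writeups/52_Switzer_Falls/</File_directory>"
    "<Introduction>introduction-switzer-falls.html</Introduction>"
    "<Directions>directions-switzer-falls.html</Directions>"
    "<Post_Slug>california-switzer-falls.html</Post_Slug></Row></Root>"
)
INTRO_HTML = '<p>Intro with a <a href="x.html">link</a> and Détiān</p>'
DIRECTIONS_HTML = "<p>Drive north, then hike.</p>"

# Demo with throwaway files standing in for the real write-ups.
with tempfile.TemporaryDirectory() as tmp:
    writeup_dir = os.path.join(tmp, "waterfall_writeups", "52_Switzer_Falls")
    os.makedirs(writeup_dir)
    for file_name, html in (
        ("introduction-switzer-falls.html", INTRO_HTML),
        ("directions-switzer-falls.html", DIRECTIONS_HTML),
    ):
        with open(os.path.join(writeup_dir, file_name), "w", encoding="utf-8") as f:
            f.write(html)
    data_file = os.path.join(tmp, "test.xml")
    with open(data_file, "w", encoding="utf-8") as f:
        f.write(SAMPLE_XML)

    tree = ET.parse(data_file)
    embed_writeups(tree.getroot(), base_dir=tmp)
    out_file = os.path.join(tmp, "new_test.xml")
    tree.write(out_file, encoding="utf-8", xml_declaration=True)

    # Read the output back to check what was written.
    result = ET.parse(out_file).getroot()

print([child.tag for child in result.findall("Row")[1]])
```

With the real data, the call would presumably be embed_writeups(tree.getroot()) after ET.parse('test_of_2016-09-19.xml'), run from the folder that contains waterfall_writeups.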

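Because I am new to Python, I have had a lot of difficulty with what seems like a simple task, and I feel I am flying blind because of my ignorance of the syntax, the right keywords, the methods, and even which library to use. Is there an easier way to do this? Am I right to use Python with the ElementTree library for this job, or would lxml or minidom be better? I really don't know; given all the options and my lack of Python background, all of the literature is confusing.

I would be very grateful to anyone who can help me get past this impasse.

Thanks, Johnny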

0 answers:

No answers yet.