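I have spent months trying to accomplish the following task.
I have an Excel-generated XML file that captures the database I have been building for my website. My dream is to import this XML file, or some manipulated form of it, into WordPress so that I no longer have to hand-edit each post or page one by one (especially when I make a change that affects several, or all, of the pages/posts on the site).
The Excel-generated file (which I call 'test_of_2016-09-19.xml') looks like this: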
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Row>
<Entry_No>1</Entry_No>
<Waterfall_Name>Bridalveil Fall</Waterfall_Name>
<Continent___Super_Region>North America</Continent___Super_Region>
<Country>USA</Country>
<State___Province>California</State___Province>
<Subregion>Southern and Central Sierras</Subregion>
<locale___political_or_official>Mariposa County</locale___political_or_official>
<alt__locale__unofficial_or_more_recognized>Yosemite National Park</alt__locale__unofficial_or_more_recognized>
<Misc__Tags>oakhurst, el portal, mariposa, yosemite, yosemite valley, sierra, california, waterfall, fresno, modesto, wawona, tunnel, merced, pohono, wheelchair</Misc__Tags>
<scenic_rating>4.5</scenic_rating>
<difficulty_rating>1</difficulty_rating>
<distance>roadside; 1/2 mile round trip to base; wheelchair</distance>
<time_commitment>20 minutes</time_commitment>
<GPS_Coordinates>37.71736, -119.64901</GPS_Coordinates>
<date_first_visited>1999-09-04</date_first_visited>
<date_last_visited>2011-06-04</date_last_visited>
<Old_Web_Address>http://www.world-of-waterfalls.com/yosemite-bridalveil-fall.html</Old_Web_Address>
<Post_Slug>yosemite-bridalveil-fall.html</Post_Slug>
</Row>
<Row>
<Entry_No>52</Entry_No>
<Waterfall_Name>Switzer Falls</Waterfall_Name>
<Continent___Super_Region>North America</Continent___Super_Region>
<Country>USA</Country>
<State___Province>California</State___Province>
<Subregion>Southern California</Subregion>
<locale___political_or_official>Los Angeles County</locale___political_or_official>
<alt__locale__unofficial_or_more_recognized>Angeles National Forest, La Canada Flintridge</alt__locale__unofficial_or_more_recognized>
<Misc__Tags>la canada, flintridge, altadena, pasadena, san gabriel, angeles national forest, angeles crest, los angeles, southern california, california, waterfall, arroyo seco, gabrielino trail, clear creek station, adventure pass, picnic</Misc__Tags>
<scenic_rating>2</scenic_rating>
<difficulty_rating>3.5</difficulty_rating>
<distance>4.6 miles round trip (to base of main drop)</distance>
<time_commitment>3.5 hours (to base of main drop)</time_commitment>
<GPS_Coordinates>34.25828, -118.15474</GPS_Coordinates>
<date_first_visited>2003-02-02</date_first_visited>
<date_last_visited>2016-04-23</date_last_visited>
<Old_Web_Address>http://www.world-of-waterfalls.com/california-switzer-falls.html</Old_Web_Address>
<File_directory>./waterfall_writeups/52_Switzer_Falls/</File_directory>
<Introduction>introduction-switzer-falls.html</Introduction>
<Directions>directions-switzer-falls.html</Directions>
<Nearby_Waterfalls_Tags>southern california, pasadena, angeles crest, waterfall</Nearby_Waterfalls_Tags>
<Itinerary_Tags>itinerary, switzer falls</Itinerary_Tags>
<Trip_Report_Tags>trip report, switzer falls</Trip_Report_Tags>
<Trip_Planning_Article_Tags>featured article, switzer falls</Trip_Planning_Article_Tags>
<Post_Slug>california-switzer-falls.html</Post_Slug>
</Row>
<Row>
<Entry_No>657</Entry_No>
<Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
<Continent___Super_Region>Asia</Continent___Super_Region>
<Country>China</Country>
<locale___political_or_official>Guangxi</locale___political_or_official>
<alt__locale__unofficial_or_more_recognized>Daxin County</alt__locale__unofficial_or_more_recognized>
<Misc__Tags>daxin, guichin river, guangxi, vietnam, china, waterfall, ban gioc, transnational, border</Misc__Tags>
<scenic_rating>4</scenic_rating>
<difficulty_rating>1.5</difficulty_rating>
<distance>1km round trip</distance>
<time_commitment>30-45 minutes</time_commitment>
<GPS_Coordinates>22.85577, 106.72273</GPS_Coordinates>
<date_first_visited>2009-04-23</date_first_visited>
<date_last_visited>2009-04-23</date_last_visited>
<Old_Web_Address>http://www.world-of-waterfalls.com/asia-detian-waterfall.html</Old_Web_Address>
<Post_Slug>asia-detian-waterfall.html</Post_Slug>
</Row>
<Row>
<Entry_No>1125</Entry_No>
</Row>
<Row>
<Entry_No>1126</Entry_No>
</Row>
<Row>
<Entry_No>1127</Entry_No>
</Row>
</Root>
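What I want to do is this: if certain elements or tags exist in a Row (specifically File_directory, Introduction, Directions), open the files they point to, grab their text content, put it into new elements or tags (e.g. Introduction_Body, Directions_Body, etc.), and then write out the newly modified XML file.
The new XML file would look like this: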
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Row>
<Entry_No>1</Entry_No>
<Waterfall_Name>Bridalveil Fall</Waterfall_Name>
<Continent___Super_Region>North America</Continent___Super_Region>
<Country>USA</Country>
<State___Province>California</State___Province>
<Subregion>Southern and Central Sierras</Subregion>
<locale___political_or_official>Mariposa County</locale___political_or_official>
<alt__locale__unofficial_or_more_recognized>Yosemite National Park</alt__locale__unofficial_or_more_recognized>
<Misc__Tags>oakhurst, el portal, mariposa, yosemite, yosemite valley, sierra, california, waterfall, fresno, modesto, wawona, tunnel, merced, pohono, wheelchair</Misc__Tags>
<scenic_rating>4.5</scenic_rating>
<difficulty_rating>1</difficulty_rating>
<distance>roadside; 1/2 mile round trip to base; wheelchair</distance>
<time_commitment>20 minutes</time_commitment>
<GPS_Coordinates>37.71736, -119.64901</GPS_Coordinates>
<date_first_visited>1999-09-04</date_first_visited>
<date_last_visited>2011-06-04</date_last_visited>
<Old_Web_Address>http://www.world-of-waterfalls.com/yosemite-bridalveil-fall.html</Old_Web_Address>
<Post_Slug>yosemite-bridalveil-fall.html</Post_Slug>
</Row>
<Row>
<Entry_No>52</Entry_No>
<Waterfall_Name>Switzer Falls</Waterfall_Name>
<Continent___Super_Region>North America</Continent___Super_Region>
<Country>USA</Country>
<State___Province>California</State___Province>
<Subregion>Southern California</Subregion>
<locale___political_or_official>Los Angeles County</locale___political_or_official>
<alt__locale__unofficial_or_more_recognized>Angeles National Forest, La Canada Flintridge</alt__locale__unofficial_or_more_recognized>
<Misc__Tags>la canada, flintridge, altadena, pasadena, san gabriel, angeles national forest, angeles crest, los angeles, southern california, california, waterfall, arroyo seco, gabrielino trail, clear creek station, adventure pass, picnic</Misc__Tags>
<scenic_rating>2</scenic_rating>
<difficulty_rating>3.5</difficulty_rating>
<distance>4.6 miles round trip (to base of main drop)</distance>
<time_commitment>3.5 hours (to base of main drop)</time_commitment>
<GPS_Coordinates>34.25828, -118.15474</GPS_Coordinates>
<date_first_visited>2003-02-02</date_first_visited>
<date_last_visited>2016-04-23</date_last_visited>
<Old_Web_Address>http://www.world-of-waterfalls.com/california-switzer-falls.html</Old_Web_Address>
<File_directory>./waterfall_writeups/52_Switzer_Falls/</File_directory>
<Introduction>introduction-switzer-falls.html</Introduction>
<Introduction_Body>This would be text from the file ./waterfall_writeups/52_Switzer_Falls/introduction-switzer-falls.html complete with links, img tags, and other lorem ipsum; would I need to do anything special for special characters like Chinese and Japanese Characters, accent markings, etc?</Introduction_Body>
<Directions>directions-switzer-falls.html</Directions>
<Directions_Body>This would be text from the file ./waterfall_writeups/52_Switzer_Falls/directions-switzer-falls.html complete with links, img tags, and other lorem ipsum; would I need to do anything special for special characters like Chinese and Japanese Characters, accent markings, etc?</Directions_Body>
<Nearby_Waterfalls_Tags>southern california, pasadena, angeles crest, waterfall</Nearby_Waterfalls_Tags>
<Itinerary_Tags>itinerary, switzer falls</Itinerary_Tags>
<Trip_Report_Tags>trip report, switzer falls</Trip_Report_Tags>
<Trip_Planning_Article_Tags>featured article, switzer falls</Trip_Planning_Article_Tags>
<Post_Slug>california-switzer-falls.html</Post_Slug>
</Row>
<Row>
<Entry_No>657</Entry_No>
<Waterfall_Name>Detian Waterfall (德天瀑布 [Détiān Pùbù])</Waterfall_Name>
<Continent___Super_Region>Asia</Continent___Super_Region>
<Country>China</Country>
<locale___political_or_official>Guangxi</locale___political_or_official>
<alt__locale__unofficial_or_more_recognized>Daxin County</alt__locale__unofficial_or_more_recognized>
<Misc__Tags>daxin, guichin river, guangxi, vietnam, china, waterfall, ban gioc, transnational, border</Misc__Tags>
<scenic_rating>4</scenic_rating>
<difficulty_rating>1.5</difficulty_rating>
<distance>1km round trip</distance>
<time_commitment>30-45 minutes</time_commitment>
<GPS_Coordinates>22.85577, 106.72273</GPS_Coordinates>
<date_first_visited>2009-04-23</date_first_visited>
<date_last_visited>2009-04-23</date_last_visited>
<Old_Web_Address>http://www.world-of-waterfalls.com/asia-detian-waterfall.html</Old_Web_Address>
<Post_Slug>asia-detian-waterfall.html</Post_Slug>
</Row>
<Row>
<Entry_No>1125</Entry_No>
</Row>
<Row>
<Entry_No>1126</Entry_No>
</Row>
<Row>
<Entry_No>1127</Entry_No>
</Row>
</Root>
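The transformation described above could be sketched roughly as follows with the standard-library ElementTree module. This is only a minimal sketch under stated assumptions, not a definitive implementation: the helper names `read_text`, `add_body_elements` and `convert` are hypothetical, the HTML files are assumed to be UTF-8 encoded, and the paths in File_directory are assumed to be relative to the folder that holds the XML file. Rows that have no File_directory, or whose referenced file is missing, are left untouched.

```python
import os
import xml.etree.ElementTree as ET


def read_text(path):
    """Return the file's contents as text, or None if the file does not exist."""
    if not os.path.isfile(path):
        return None
    # Assumption: the write-up files are saved as UTF-8.
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def add_body_elements(root, base_dir=".", fields=("Introduction", "Directions")):
    """For each Row, add <X_Body> right after <X> when the referenced file exists."""
    for row in root.findall("Row"):
        directory = row.findtext("File_directory")
        if directory is None:
            continue  # nothing to look up for this row
        for field in fields:
            source = row.find(field)
            if source is None or not source.text:
                continue
            text = read_text(os.path.join(base_dir, directory, source.text))
            if text is None:
                continue
            body = ET.Element(field + "_Body")
            body.text = text          # markup characters are escaped on write
            body.tail = source.tail   # keep the same line break after the new element
            # Insert directly after the element that names the file.
            row.insert(list(row).index(source) + 1, body)
    return root


def convert(data_file, out_file):
    """Read data_file, add the *_Body elements, and write the result to out_file."""
    tree = ET.parse(data_file)
    add_body_elements(tree.getroot(), base_dir=os.path.dirname(data_file) or ".")
    tree.write(out_file, encoding="utf-8", xml_declaration=True)
```

With this sketch, calling `convert('test_of_2016-09-19.xml', 'new_test_of_2016-09-19.xml')` would be the whole job. Note that the whole file content is stored as escaped text here, rather than parsed as XML and searched for a body element, because HTML fragments are often not well-formed XML.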
With guidance from some people on this forum, I was at least able to get as far as the following code:
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
import os

data_file = 'test_of_2016-09-19.xml'
tree = ET.ElementTree(file=data_file)
root = tree.getroot()

for element in root:
    if element.find('File_directory') is not None:
        directory = element.find('File_directory').text
    if element.find('Introduction') is not None:
        introduction = element.find('Introduction').text
    if element.find('Directions') is not None:
        directions = element.find('Directions').text

    # The following code was suggested to me, but I'm having trouble getting it to work and understanding what each line is doing
    intro_tree = ET.ElementTree(directory+introduction)  # throws NameError: name 'ET' is not defined
    intro_text = intro_tree.find('body').text  # won't work since intro_tree not defined, but even then, I'm not sure what this line is trying to do
    intro = SubElement(element,'Introduction')  # throws NameError: name 'SubElement' is not defined
    intro.text = intro_text  # didn't get this far, but what is the intent of this line?

    # Do the same for Directions
    directions_tree = ET.ElementTree(directory+directions)
    directions_text = directions_tree.find('body').text
    directions = SubElement(element,'Direction')

# After the loop, write the file back with new elements added
tree.write('new_' + data_file)
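As a point of reference for the lines that were suggested, here is a small self-contained sketch of what `SubElement` and `.text` do in the standard-library ElementTree module. The sample Row and the placeholder string are made up for illustration; the sketch does not read any file.

```python
import xml.etree.ElementTree as ET

row = ET.fromstring(
    "<Row><Introduction>introduction-switzer-falls.html</Introduction></Row>"
)

# SubElement is a function inside the ElementTree module, so it has to be
# reached through the module alias (ET.SubElement) or imported by name.
# It creates a new child element and appends it at the end of the parent.
intro_body = ET.SubElement(row, "Introduction_Body")

# Assigning to .text sets the character data between the opening and closing tags.
intro_body.text = "placeholder standing in for the file contents"

# Two notes on the other suggested lines:
# - ET.ElementTree(x) treats x as a root element, not as a file name;
#   ET.parse(path) is the call that opens and parses a file, and it only
#   accepts well-formed XML.
# - some_element.find('body').text returns only the text before the first
#   child tag of <body>, not the full markup inside it.
print(ET.tostring(row, encoding="unicode"))
```

Because `SubElement` always appends at the end, a new element that must sit right after an existing one would need `parent.insert(index, new_element)` instead.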
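Because I'm new to Python, I'm having a lot of trouble with what should be a simple task, and I feel like I'm flying blind because of my ignorance of the syntax, the right keywords, the approach, and even which library to use. Is there an easier way? Am I on the right track using Python and the ElementTree library for this? Or would lxml or minidom be better? I honestly don't know; given all the options and my lack of Python background, all the literature is confusing.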
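On the special-characters question raised in the sample output (Chinese and Japanese characters, accent marks, embedded links and img tags), the following sketch shows how the standard-library ElementTree behaves when text is stored and serialised as UTF-8. The HTML fragment is a hypothetical stand-in for real write-up content.

```python
import xml.etree.ElementTree as ET

row = ET.Element("Row")
name = ET.SubElement(row, "Waterfall_Name")
name.text = "Detian Waterfall (德天瀑布 [Détiān Pùbù])"

body = ET.SubElement(row, "Introduction_Body")
# Hypothetical HTML fragment containing markup characters (<, >, &).
body.text = '<p>See <a href="x.html">this & that</a></p>'

# Serialising as UTF-8 keeps the non-ASCII characters as UTF-8 bytes and
# escapes <, > and & inside text, so the result is still well-formed XML.
xml_bytes = ET.tostring(row, encoding="utf-8")

# Parsing the bytes back gives the original strings again.
round_trip = ET.fromstring(xml_bytes)
print(round_trip.findtext("Waterfall_Name"))
print(round_trip.findtext("Introduction_Body"))
```

So, as far as ElementTree itself is concerned, nothing special seems to be needed beyond reading the source files with the right encoding and writing the output as UTF-8. Whether the importing tool on the WordPress side expects escaped text or a CDATA section is a separate matter that this sketch does not address.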
I would greatly appreciate any help getting past this impasse.
Thanks, Johnny