Question

I am trying to convert:

<doc id="123" url="http://url.org/thing?curid=123" title="title"> 
Title

text text text more text

</doc>

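into a CSV file (the file contains a large number of similarly formatted "documents"). If it were a regular XML file, I think I could handle it with a solution like this one, but since the snippet above is not well-formed XML, I am stuck.

What I want to do is import the data into PostgreSQL and, from what I have gathered, importing it would be easier if it were in CSV format (if there is another way, please let me know). I need to separate out the "id", "url", "title" and "text/body".

Bonus question: the first line of the text/body is the document's title. Is it possible to remove or manipulate that first line during the conversion?

Thanks!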

Answer 1

As far as Python goes:

Given an XML file (thedoc.xml) such as:

<?xml version="1.0" encoding="UTF-8"?>
<docCollection>
    <doc id="123" url="http://url.org/thing?curid=123" title="Farenheit451"> 
    Farenheit451

    It was a pleasure to burn...
    </doc>

    <doc id="456" url="http://url.org/thing?curid=456" title="Sense and sensitivity"> 
    Sense and sensitivity

    It was sensibile to be sensitive &amp; nice...
    </doc>        
</docCollection>
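
Note that the sample above has an XML declaration and a single root element, which the file in the question lacks, so the raw dump would need to be wrapped before an XML parser accepts it. A minimal sketch, assuming the dump is just a sequence of <doc> blocks whose contents are otherwise valid XML (no stray & or < characters); the helper name wrap_docs is hypothetical and the root name docCollection simply mirrors the sample:

```python
import xml.etree.ElementTree as ET


def wrap_docs(raw_text, root_tag="docCollection"):
    """Wrap a sequence of <doc> blocks in a single root element.

    Hypothetical helper: assumes the blocks contain no unescaped
    '&' or '<' characters, which an XML parser would reject.
    """
    return "<{0}>\n{1}\n</{0}>".format(root_tag, raw_text.strip())


raw = '''<doc id="123" url="http://url.org/thing?curid=123" title="title">
Title

text text text more text

</doc>
<doc id="456" url="http://url.org/thing?curid=456" title="other">
Other

more text

</doc>'''

# The wrapped text now parses as one well-formed document.
root = ET.fromstring(wrap_docs(raw))
doc_ids = [doc.get("id") for doc in root.iter("doc")]
print(doc_ids)
```

This builds the wrapped text in memory, so it only suits inputs of moderate size; for a very large dump the opening and closing tags would have to be written around the content on disk instead.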

A script that uses lxml (thecode.py), such as the following, can then process it:

import html

import pandas
from lxml import etree

inFile = "./thedoc.xml"
outFile = "./theprocdoc.csv"

# Your XML may well be too big to be parsed into memory in one go, so it is
# better to use the incremental parser from lxml. It is initialised here to
# trigger an "end" event each time a complete "doc" element has been parsed.
ctx = etree.iterparse(inFile, events=("end",), tag=("doc",))

csvData = []
# For every parsed element in the "context"...
for event, elem in ctx:
    # ...isolate the tag's attributes and apply some formatting to its text.
    # The html.unescape call can be removed if you are not interested in
    # HTML unescaping. The body is simply split at the newline character
    # and rejoined from the third line onwards, which omits the title.
    csvData.append({"id": elem.get("id"),
                    "url": elem.get("url"),
                    "title": elem.get("title"),
                    "body": html.unescape("".join(elem.text.split("\n")[2:]))})
    # It is important to call clear() here, to release the memory occupied
    # by the element's parsed data.
    elem.clear()

# Finally, turn the list of dictionaries into a DataFrame and write out the
# CSV. pandas' to_csv is used here for convenience; the column order is given
# explicitly so that it does not depend on the pandas version.
pandas.DataFrame(csvData, columns=["body", "id", "title", "url"]).to_csv(outFile, index=False)

The script produces a CSV (theprocdoc.csv) similar to:

body,id,title,url
    It was a pleasure to burn...    ,123,Farenheit451,http://url.org/thing?curid=123
    It was sensibile to be sensitive & nice...    ,456,Sense and sensitivity,http://url.org/thing?curid=456
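
On the bonus question, the title removal done inside the script's loop can be tried on its own. A minimal sketch, assuming the element text has the shape shown in thedoc.xml (a line break after the opening tag, then the title line, then the body); drop_title is a hypothetical helper name and, unlike the script, it also trims the whitespace padding that is visible in the CSV above:

```python
def drop_title(text):
    """Return the element text without its first non-empty line (the title)."""
    lines = text.split("\n")
    # Skip any leading blank lines, then skip the title line itself.
    start = 0
    while start < len(lines) and not lines[start].strip():
        start += 1
    body_lines = lines[start + 1:]
    # Rejoin the remaining lines and trim indentation and trailing blanks.
    return "\n".join(body_lines).strip()


sample = " \n    Farenheit451\n\n    It was a pleasure to burn...\n    "
print(repr(drop_title(sample)))
```

Skipping blank lines first, rather than slicing at a fixed index, keeps the helper working whether or not there is a line break directly after the opening tag.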

For more details (since I cannot format links in inline comments), see the documentation for lxml.etree.iterparse, html.unescape and pandas.DataFrame.to_csv.

Hope this helps.
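
If lxml and pandas are not available, the same approach can be sketched with the standard library alone. This is an illustration under assumptions, not the method used in the answer: it relies on xml.etree.ElementTree.iterparse and csv.DictWriter, the function name docs_to_csv is hypothetical, the input is assumed to be well-formed XML like thedoc.xml, and the body is stripped of surrounding whitespace.

```python
import csv
import io
import xml.etree.ElementTree as ET


def docs_to_csv(xml_source, csv_target):
    """Stream <doc> elements from xml_source and write one CSV row per element."""
    writer = csv.DictWriter(csv_target, fieldnames=["id", "url", "title", "body"])
    writer.writeheader()
    for event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "doc":
            continue
        # Drop the line break after the opening tag and the title line,
        # then trim the indentation around what is left.
        body = "\n".join((elem.text or "").split("\n")[2:]).strip()
        writer.writerow({"id": elem.get("id"), "url": elem.get("url"),
                         "title": elem.get("title"), "body": body})
        elem.clear()  # release the data held by the parsed element


xml_text = b"""<docCollection>
    <doc id="123" url="http://url.org/thing?curid=123" title="Farenheit451">
    Farenheit451

    It was a pleasure to burn...
    </doc>
</docCollection>"""

out = io.StringIO()
docs_to_csv(io.BytesIO(xml_text), out)
print(out.getvalue())
```

With real files, xml_source can be a file name and csv_target a file opened with newline="" so that the csv module controls the line endings.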

Converting a "doc format" / XML to CSV
