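I want to use Python to parse the Gutenberg catalog available here. I have plenty of experience with web scraping and parsing HTML, but this format has me stumped. I have tried lxml etree, as well as the following attempt with RDFlib: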
import rdflib
path = 'epub/10/pg%s.rdf'
g = rdflib.Graph()
g.parse(path)
s = g.serialize(format='nt')
print(g)
I am looking for various metadata values (title, author, Gutenberg URL, and so on). I have included a sample file below.
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xml:base="http://www.gutenberg.org/"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:cc="http://web.resource.org/cc/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/"
xmlns:dcam="http://purl.org/dc/dcam/"
>
<cc:Work rdf:about="">
<cc:license rdf:resource="http://www.gnu.org/licenses/gpl.html"/>
<rdfs:comment>Archives containing the RDF files for *all* our books can be downloaded at
http://www.gutenberg.org/wiki/Gutenberg:Feeds#The_Complete_Project_Gutenberg_Catalog</rdfs:comment>
</cc:Work>
<pgterms:ebook rdf:about="ebooks/100">
<dcterms:title>The Complete Works of William Shakespeare</dcterms:title>
<pgterms:bookshelf>
<rdf:Description rdf:nodeID="Ncc8361d84fc142969cf27b77ac8d0c24">
<rdf:value>Plays</rdf:value>
<dcam:memberOf rdf:resource="2009/pgterms/Bookshelf"/>
</rdf:Description>
</pgterms:bookshelf>
<dcterms:issued rdf:datatype="http://www.w3.org/2001/XMLSchema#date">1994-01-01</dcterms:issued>
<dcterms:publisher>Project Gutenberg</dcterms:publisher>
<dcterms:rights>Copyrighted. Read the copyright notice inside this book for details.</dcterms:rights>
<dcterms:hasFormat>
<pgterms:file rdf:about="http://www.gutenberg.org/files/100/100.txt">
<dcterms:isFormatOf rdf:resource="ebooks/100"/>
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">5589917</dcterms:extent>
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-08-29T12:08:52</dcterms:modified>
<dcterms:format>
<rdf:Description rdf:nodeID="N19fd61f986a94cc18f5dce9ed07e8517">
<rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">text/plain; charset=us-ascii</rdf:value>
<dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
</rdf:Description>
</dcterms:format>
</pgterms:file>
</dcterms:hasFormat>
<dcterms:license rdf:resource="license"/>
<dcterms:hasFormat>
<pgterms:file rdf:about="http://www.gutenberg.org/ebooks/100.kindle.images">
<dcterms:isFormatOf rdf:resource="ebooks/100"/>
<dcterms:format>
<rdf:Description rdf:nodeID="N0ee902d343e44cb5a8f639fa55fc6334">
<dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
<rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">application/x-mobipocket-ebook</rdf:value>
</rdf:Description>
</dcterms:format>
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">9509392</dcterms:extent>
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-04-01T01:18:40.171080</dcterms:modified>
</pgterms:file>
</dcterms:hasFormat>
<dcterms:subject>
<rdf:Description rdf:nodeID="N0e2195113aa34bf7abfe001edf6a03a2">
<rdf:value>English drama -- Early modern and Elizabethan, 1500-1600</rdf:value>
<dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCSH"/>
</rdf:Description>
</dcterms:subject>
<dcterms:creator>
<pgterms:agent rdf:about="2009/agents/65">
<pgterms:name>Shakespeare, William</pgterms:name>
<pgterms:birthdate rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1564</pgterms:birthdate>
<pgterms:deathdate rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1616</pgterms:deathdate>
<pgterms:alias>Shakspeare, William</pgterms:alias>
<pgterms:webpage rdf:resource="http://en.wikipedia.org/wiki/William_Shakespeare"/>
<pgterms:alias>Shakspere, William</pgterms:alias>
</pgterms:agent>
</dcterms:creator>
<dcterms:subject>
<rdf:Description rdf:nodeID="Ncb26996951d44761901e30445fc8a9dc">
<dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCC"/>
<rdf:value>PR</rdf:value>
</rdf:Description>
</dcterms:subject>
<dcterms:hasFormat>
<pgterms:file rdf:about="http://www.gutenberg.org/files/100/100.zip">
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">2035857</dcterms:extent>
<dcterms:format>
<rdf:Description rdf:nodeID="Nb4f5881241fd42e9a0f8a07cb1462008">
<dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
<rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">application/zip</rdf:value>
</rdf:Description>
</dcterms:format>
<dcterms:format>
<rdf:Description rdf:nodeID="Nc3c66052298f482488fb8f13215f92ba">
<dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
<rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">text/plain; charset=us-ascii</rdf:value>
</rdf:Description>
</dcterms:format>
<dcterms:isFormatOf rdf:resource="ebooks/100"/>
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2014-08-29T12:09:20</dcterms:modified>
</pgterms:file>
</dcterms:hasFormat>
<pgterms:downloads rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">4605</pgterms:downloads>
<dcterms:hasFormat>
<pgterms:file rdf:about="http://www.gutenberg.org/ebooks/100.epub.noimages">
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">2376083</dcterms:extent>
<dcterms:isFormatOf rdf:resource="ebooks/100"/>
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-04-01T01:18:13.998200</dcterms:modified>
<dcterms:format>
<rdf:Description rdf:nodeID="N9dc27629e3164dba98c659dcaf47c7fe">
<dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
<rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">application/epub+zip</rdf:value>
</rdf:Description>
</dcterms:format>
</pgterms:file>
</dcterms:hasFormat>
<dcterms:hasFormat>
<pgterms:file rdf:about="http://www.gutenberg.org/ebooks/100.html.noimages">
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">6944416</dcterms:extent>
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-04-01T01:18:00.715792</dcterms:modified>
<dcterms:format>
<rdf:Description rdf:nodeID="N7140e760a0f14ae4ba4b027bd7f7f4f6">
<rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">text/html</rdf:value>
<dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
</rdf:Description>
</dcterms:format>
<dcterms:isFormatOf rdf:resource="ebooks/100"/>
</pgterms:file>
</dcterms:hasFormat>
<dcterms:hasFormat>
<pgterms:file rdf:about="http://www.gutenberg.org/ebooks/100.kindle.noimages">
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">9509383</dcterms:extent>
<dcterms:format>
<rdf:Description rdf:nodeID="N34666f5ebdd8461ca1c6b8cfba5103e5">
<rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">application/x-mobipocket-ebook</rdf:value>
<dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
</rdf:Description>
</dcterms:format>
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-04-01T01:19:07.134922</dcterms:modified>
<dcterms:isFormatOf rdf:resource="ebooks/100"/>
</pgterms:file>
</dcterms:hasFormat>
<dcterms:hasFormat>
<pgterms:file rdf:about="http://www.gutenberg.org/ebooks/100.epub.images">
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">2376084</dcterms:extent>
<dcterms:isFormatOf rdf:resource="ebooks/100"/>
<dcterms:format>
<rdf:Description rdf:nodeID="N1e32eb8531504d378e05acb6440d24b0">
<rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">application/epub+zip</rdf:value>
<dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
</rdf:Description>
</dcterms:format>
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-04-01T01:18:09.062427</dcterms:modified>
</pgterms:file>
</dcterms:hasFormat>
<dcterms:hasFormat>
<pgterms:file rdf:about="http://www.gutenberg.org/ebooks/100.rdf">
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-04-28T05:00:49.076168</dcterms:modified>
<dcterms:format>
<rdf:Description rdf:nodeID="N1d915c961af44ab7ac9c71e7ec068bde">
<dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
<rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">application/rdf+xml</rdf:value>
</rdf:Description>
</dcterms:format>
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">11275</dcterms:extent>
<dcterms:isFormatOf rdf:resource="ebooks/100"/>
</pgterms:file>
</dcterms:hasFormat>
<dcterms:language>
<rdf:Description rdf:nodeID="N5ff08142477c4bfeb3bac48c18ba23a4">
<rdf:value rdf:datatype="http://purl.org/dc/terms/RFC4646">en</rdf:value>
</rdf:Description>
</dcterms:language>
<dcterms:hasFormat>
<pgterms:file rdf:about="http://www.gutenberg.org/ebooks/100.txt.utf-8">
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-04-01T01:17:42.102580</dcterms:modified>
<dcterms:format>
<rdf:Description rdf:nodeID="N98845b3d16bd42d787e9d7cba42bf44b">
<dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
<rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">text/plain</rdf:value>
</rdf:Description>
</dcterms:format>
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">5589889</dcterms:extent>
<dcterms:isFormatOf rdf:resource="ebooks/100"/>
</pgterms:file>
</dcterms:hasFormat>
<dcterms:type>
<rdf:Description rdf:nodeID="N47bb369dd96248ffb1f412145cdb0713">
<rdf:value>Text</rdf:value>
<dcam:memberOf rdf:resource="http://purl.org/dc/terms/DCMIType"/>
</rdf:Description>
</dcterms:type>
<dcterms:hasFormat>
<pgterms:file rdf:about="http://www.gutenberg.org/ebooks/100.html.images">
<dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">6944416</dcterms:extent>
<dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2016-04-01T01:17:55.634002</dcterms:modified>
<dcterms:format>
<rdf:Description rdf:nodeID="Nd1733441ad824cff97a5d9ad50f0307b">
<dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
<rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">text/html</rdf:value>
</rdf:Description>
</dcterms:format>
<dcterms:isFormatOf rdf:resource="ebooks/100"/>
</pgterms:file>
</dcterms:hasFormat>
</pgterms:ebook>
<rdf:Description rdf:about="http://en.wikipedia.org/wiki/William_Shakespeare">
<dcterms:description>Wikipedia</dcterms:description>
</rdf:Description>
</rdf:RDF>
Answer 0 (score: 2)
Could you parse it with a regular expression? For example:
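import re
# xml is assumed to hold the contents of the RDF file as a string
match = re.search(r"<dcterms:title>([^<]*)", xml)
title = match.group(1) if match else None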
Edit: if you want to do this with an XML parser, you need to declare the namespaces (they are defined at the top of the XML file):
import xml.etree.ElementTree as et
tree = et.parse(path)
ns = {"dcterms": "http://purl.org/dc/terms/"}
# find() returns an Element (or None if nothing matches); .text holds the title string
title = tree.find(".//dcterms:title", ns).text
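The same namespace mapping can be reused to pull out several fields at once. The following is only a sketch under a few assumptions: `SAMPLE` is a trimmed-down copy of the file from the question, `NS`, `XML_BASE`, and `extract_metadata` are names made up for this illustration, and each file is assumed to contain a single `pgterms:ebook` element directly under the root.

```python
import xml.etree.ElementTree as et
from urllib.parse import urljoin

# Trimmed-down stand-in for the sample file shown in the question.
SAMPLE = """<rdf:RDF xml:base="http://www.gutenberg.org/"
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dcterms="http://purl.org/dc/terms/"
  xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/100">
    <dcterms:title>The Complete Works of William Shakespeare</dcterms:title>
    <dcterms:creator>
      <pgterms:agent rdf:about="2009/agents/65">
        <pgterms:name>Shakespeare, William</pgterms:name>
      </pgterms:agent>
    </dcterms:creator>
  </pgterms:ebook>
</rdf:RDF>"""

# Prefix-to-URI mapping, copied from the xmlns declarations in the file.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}
# ElementTree exposes the xml:base attribute under this expanded name.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"


def extract_metadata(xml_text):
    """Return title, author names, and absolute URL (hypothetical helper)."""
    root = et.fromstring(xml_text)
    ebook = root.find("pgterms:ebook", NS)
    # rdf:about is relative ("ebooks/100"), so resolve it against xml:base.
    about = ebook.get("{%s}about" % NS["rdf"])
    base = root.get(XML_BASE, "")
    names = ebook.findall("dcterms:creator/pgterms:agent/pgterms:name", NS)
    return {
        "title": ebook.findtext("dcterms:title", namespaces=NS),
        "authors": [name.text for name in names],
        "url": urljoin(base, about),
    }


metadata = extract_metadata(SAMPLE)
print(metadata)
```

This treats the file as plain XML, so it relies on the element nesting staying the way it is in the sample; an RDF library does not have that limitation.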
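Answer 1 (score: 2)
I know you have already been given a quick shortcut, but I thought I would also briefly explain the RDF-based approach, since you were already very close: you managed to create a Graph object and load the RDF file into it. The way forward is to query that Graph object for the properties you are interested in.
As a simple example, to retrieve the title of the ebook with ID http://www.gutenberg.org/ebooks/100, you could do something like this (caveat: I am not a Python programmer, so there may be errors):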
from rdflib import URIRef, Namespace
ebook_id = URIRef("http://www.gutenberg.org/ebooks/100")
# we create a Namespace for the relationship names, to make them easy to reuse
# (in the sample file the title is dcterms:title, not pgterms:title)
dcterms = Namespace("http://purl.org/dc/terms/")
# print out the object value(s) of the 'title' relation for ebook 100.
for title in g.objects(ebook_id, dcterms["title"]):
    print(title)
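Note that I may well have missed some useful shortcuts here. I do not know RDFLib well and put this example together after looking at their documentation for a few minutes. It is quite likely that you can get the namespaces directly from the graph you loaded earlier, rather than having to define them manually like this.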
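The general principle is this: RDF is a graph made up of individual statements, each with a subject, a predicate, and an object. You work with it by querying that graph. The above is a very simple query that only retrieves the values of a single relation for a single subject, but of course you can do loops, paths, lists, and so on.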