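I am trying to use the metadata harvesting package https://pypi.python.org/pypi/pyoai to harvest the data on this site: https://www.duo.uio.no/oai/request?verb=Identify
I tried the example from the pyoai website, but it did not work. When I test it, I get an error. The code is: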
from oaipmh.client import Client
from oaipmh.metadata import MetadataRegistry, oai_dc_reader
URL = 'http://uni.edu/ir/oaipmh'
registry = MetadataRegistry()
registry.registerReader('oai_dc', oai_dc_reader)
client = Client(URL, registry)
for record in client.listRecords(metadataPrefix='oai_dc'):
    print(record)
Here is the stack trace:
Traceback (most recent call last):
  File "/Users/arashsaidi/PycharmProjects/get-new-DUO/get-files.py", line 8, in <module>
    for record in client.listRecords(metadataPrefix='oai_dc'):
  File "/Users/arashsaidi/.virtualenvs/lbk/lib/python2.7/site-packages/oaipmh/common.py", line 115, in method
    return obj(self, **kw)
  File "/Users/arashsaidi/.virtualenvs/lbk/lib/python2.7/site-packages/oaipmh/common.py", line 110, in __call__
    return bound_self.handleVerb(self._verb, kw)
  File "/Users/arashsaidi/.virtualenvs/lbk/lib/python2.7/site-packages/oaipmh/client.py", line 65, in handleVerb
    kw, self.makeRequestErrorHandling(verb=verb, **kw))
  File "/Users/arashsaidi/.virtualenvs/lbk/lib/python2.7/site-packages/oaipmh/client.py", line 273, in makeRequestErrorHandling
    raise error.XMLSyntaxError(kw)
oaipmh.error.XMLSyntaxError: {'verb': 'ListRecords', 'metadataPrefix': 'oai_dc'}
I need to access all the files on the page I linked above and generate an additional file with some metadata.
Any suggestions?
Answer 0 (score: 3)
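I ended up using the Sickle package, which I found to be better documented and easier to use:
This code gets all the sets and then retrieves every record from each set. This seems like the best solution, since there are more than 30,000 records to process. Doing it set by set gives more control. Hope this helps someone else. I don't know why libraries use OAI; it doesn't seem like a very good way to organize data to me...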
from sickle import Sickle

# Excerpt from a method of a class: self.spec_list and
# self.write_file_and_metadata() are defined elsewhere and not shown here.
# gets sickle from OAI
sickle = Sickle('http://www.duo.uio.no/oai/request')
sets = sickle.ListSets()  # gets all sets
for recs in sets:
    # each rec is a (field name, list of values) pair
    for rec in recs:
        if rec[0] == 'setSpec':
            try:
                print('{} {}'.format(rec[1][0], self.spec_list[rec[1][0]]))
                records = sickle.ListRecords(metadataPrefix='xoai', set=rec[1][0], ignore_deleted=True)
                self.write_file_and_metadata()
            except Exception as e:
                # simple exception handling if not possible to retrieve record
                print('Exception: {}'.format(e))
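To make the "get all sets first" step more concrete, here is a minimal offline sketch of what a ListSets response looks like and how the setSpec values can be pulled out of it with the standard library. The sample XML and the helper name parse_set_specs are made up for illustration (they are not real DUO data and not part of Sickle); the snippet is written for Python 3.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {'oai': 'http://www.openarchives.org/OAI/2.0/'}

# Made-up sample response for illustration only (not real DUO data).
SAMPLE_LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>col_1</setSpec><setName>First collection</setName></set>
    <set><setSpec>col_2</setSpec><setName>Second collection</setName></set>
  </ListSets>
</OAI-PMH>"""


def parse_set_specs(xml_text):
    """Return a list of (setSpec, setName) pairs from a ListSets response.

    Hypothetical helper, not part of Sickle or pyoai.
    """
    root = ET.fromstring(xml_text)
    pairs = []
    for set_el in root.findall('.//oai:set', OAI_NS):
        spec = set_el.findtext('oai:setSpec', default='', namespaces=OAI_NS)
        name = set_el.findtext('oai:setName', default='', namespaces=OAI_NS)
        pairs.append((spec, name))
    return pairs


for spec, name in parse_set_specs(SAMPLE_LIST_SETS):
    print(spec, name)
```

Each setSpec found this way is the value that would be passed as the set argument when listing records for one set at a time.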
Answer 1 (score: 1)
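It seems the link used in the pyoai website's example (http://uni.edu/ir/oaipmh) is dead, since it returns a 404.
You should still be able to get data from your own site, though: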
from oaipmh.client import Client
from oaipmh.metadata import MetadataRegistry, oai_dc_reader
URL = 'https://www.duo.uio.no/oai/request'
registry = MetadataRegistry()
registry.registerReader('oai_dc', oai_dc_reader)
client = Client(URL, registry)
# identify info
identify = client.identify()
print("Repository name: {0}".format(identify.repositoryName()))
print("Base URL: {0}".format(identify.baseURL()))
print("Protocol version: {0}".format(identify.protocolVersion()))
print("Granularity: {0}".format(identify.granularity()))
print("Compression: {0}".format(identify.compression()))
print("Deleted record: {0}".format(identify.deletedRecord()))
# list records
records = client.listRecords(metadataPrefix='oai_dc')
for record in records:
    # do something with the record
    pass
# list metadata formats
formats = client.listMetadataFormats()
for f in formats:
    # do something with f
    pass
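When a harvest fails with an XMLSyntaxError like the one in the question, the server most likely returned something that is not valid XML (for example an HTML 404 page), so it can help to open the request in a browser and look at the raw response. An OAI-PMH request is an ordinary HTTP request with a verb parameter plus the other arguments in the query string. The sketch below only builds that URL; build_oai_url is a hypothetical helper, not part of pyoai, and the snippet assumes Python 3.

```python
from urllib.parse import urlencode


def build_oai_url(base_url, verb, **params):
    """Build an OAI-PMH request URL: the verb first, then the other arguments.

    Hypothetical helper for debugging, not part of pyoai or Sickle.
    """
    query = [('verb', verb)] + sorted(params.items())
    return base_url + '?' + urlencode(query)


BASE_URL = 'https://www.duo.uio.no/oai/request'

# The request corresponding to client.listRecords(metadataPrefix='oai_dc')
print(build_oai_url(BASE_URL, 'ListRecords', metadataPrefix='oai_dc'))
# The request corresponding to client.identify()
print(build_oai_url(BASE_URL, 'Identify'))
```

If the printed URL shows XML in a browser while the example URL from the pyoai page does not, the base URL is the problem rather than the client code.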