I am trying to analyze a Wikipedia dump file. I am using gensim.scripts, from the Python library gensim, and running this command in cmd.exe on Windows 10:
python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2 wiki_en_output
This gives me the following error:
Microsoft Windows [Version 10.0.10586]
(c) 2015 Microsoft Corporation. All rights reserved.
2015-12-03 15:47:20,459 : INFO : running C:\Python27\lib\site-packages\gensim-0.12.3-py2.7-win32.egg\gensim\scripts\make_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki_en_output
Traceback (most recent call last):
File "C:\Python27\lib\runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "C:\Python27\lib\runpy.py", line 72, in _run_code
exec code in run_globals
File "C:\Python27\lib\site-packages\gensim-0.12.3-py2.7-win32.egg\gensim\scripts\make_wiki.py", line 84, in <module>
wiki = WikiCorpus(inp, lemmatize=lemmatize) # takes about 9h on a macbook pro, for 3.5m articles (june 2011)
File "C:\Python27\lib\site-packages\gensim-0.12.3-py2.7-win32.egg\gensim\corpora\wikicorpus.py", line 270, in __init__
self.dictionary = Dictionary(self.get_texts())
File "C:\Python27\lib\site-packages\gensim-0.12.3-py2.7-win32.egg\gensim\corpora\dictionary.py", line 58, in __init__
self.add_documents(documents, prune_at=prune_at)
File "C:\Python27\lib\site-packages\gensim-0.12.3-py2.7-win32.egg\gensim\corpora\dictionary.py", line 119, in add_documents
for docno, document in enumerate(documents):
File "C:\Python27\lib\site-packages\gensim-0.12.3-py2.7-win32.egg\gensim\corpora\wikicorpus.py", line 290, in get_texts
texts = ((text, self.lemmatize, title, pageid) for title, text, pageid in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
IOError: [Errno 2] No such file or directory: 'enwiki-latest-pages-articles.xml.bz2'
Any ideas on what I should do to fix this?
I am on Windows 10 and have gensim.scripts installed.
Answer 0 (score: 1)
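The IOError means Python cannot find enwiki-latest-pages-articles.xml.bz2 in the directory where the command was run. Just pass the full path to the downloaded enwiki-latest-pages-articles.xml.bz2, or run the gensim script from the folder the file was downloaded to.
If you do not have the archive, you can find and download it from the Wikimedia dumps site.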
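As a rough sketch of the check described above, the snippet below looks for the dump file before running anything and prints the make_wiki command with an absolute path. The helper names resolve_dump_path and build_make_wiki_command are hypothetical, and the Downloads folder under the user's home directory is only an assumed location; adjust it to wherever the archive was actually saved.

```python
import os
import sys


def resolve_dump_path(filename, search_dirs=None):
    """Return the absolute path of `filename`, or None if it cannot be found.

    The name is tried as given (relative to the current directory) first,
    then inside each directory listed in `search_dirs`.
    """
    candidates = [filename]
    for directory in search_dirs or []:
        candidates.append(os.path.join(directory, filename))
    for candidate in candidates:
        if os.path.isfile(candidate):
            return os.path.abspath(candidate)
    return None


def build_make_wiki_command(dump_path, output_prefix):
    """Build the argument list for the make_wiki invocation from the question."""
    return [sys.executable, "-m", "gensim.scripts.make_wiki", dump_path, output_prefix]


if __name__ == "__main__":
    # Assumption: the archive was saved to the user's Downloads folder.
    downloads = os.path.join(os.path.expanduser("~"), "Downloads")
    dump = resolve_dump_path("enwiki-latest-pages-articles.xml.bz2", [downloads])
    if dump is None:
        print("Dump not found in %s or %s" % (os.getcwd(), downloads))
    else:
        # Print the command instead of running it; the conversion takes hours.
        print(" ".join(build_make_wiki_command(dump, "wiki_en_output")))
```

If the script reports that the dump was not found, the file is in neither location, which is the same condition that triggers the IOError in the traceback.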