Question

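I am building an application in Python that fetches news articles from RSS feeds. As part of the project, I decided to use boilerpipe to extract each article's content from the HTML page it appears on.

Although boilerpipe was originally written for Java, it has also been ported to Python. You can see its GitHub page here: https://github.com/misja/python-boilerpipe

The problem is that I get an exception when I try to import it with: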

from boilerpipe.extract import Extractor

The error I get is:

Traceback (most recent call last):
File "", line 1, in
File "build\bdist.win32\egg\boilerpipe\extract\__init__.py", line 12, in
File "C:\Python26\lib\site-packages\jpype\_jclass.py", line 54, in JClass
raise _RUNTIMEEXCEPTION.PYEXC("Class %s not found" % name)
jpype._jexception.ExceptionPyRaisable: java.lang.Exception: Class de.l3s.boilerpipe.sax.HTMLHighlighter not found

What could be causing this, and how can I fix it?

Answer 1

This works on Mac OS X 10.8.5 with Python 2.7.9.

pip install JPype1    # to install https://pypi.python.org/pypi/JPype1
pip install charade
git clone https://github.com/misja/python-boilerpipe.git
cd python-boilerpipe
sudo python setup.py install

Then you should be able to do the following in the Python console:

>>> from boilerpipe.extract import Extractor
>>> extractor = Extractor(extractor='ArticleExtractor', url="http://en.wikipedia.org/wiki/Main_Page")
>>> print extractor.getText()

Answer 2

You are missing the boilerpipe Java package. You can find it here: http://code.google.com/p/boilerpipe/downloads/list

You have only installed the Python boilerpipe wrapper.
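One way to test this diagnosis is to look inside whichever boilerpipe jar you have and see whether it contains the class named in the traceback. The following is only a sketch: the helper name `jar_has_class` is made up for illustration, it is written for Python 3 (or 2.7), and the jar path you pass in depends entirely on your own installation.

```python
import zipfile

# The class from the traceback, written as its entry path inside a jar.
HIGHLIGHTER = "de/l3s/boilerpipe/sax/HTMLHighlighter.class"


def jar_has_class(jar_path, class_entry=HIGHLIGHTER):
    """Return True if the jar (a zip archive) contains the given .class entry.

    Hypothetical helper, not part of boilerpipe or JPype.
    """
    with zipfile.ZipFile(jar_path) as jar:
        return class_entry in jar.namelist()
```

Call it as `jar_has_class("path/to/your/boilerpipe.jar")`. If it returns False, or no jar exists at all, JPype has nothing to load the class from, which would match the "Class ... not found" error.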

Answer 3

The following worked best for me:

git clone https://github.com/misja/python-boilerpipe.git
cd python-boilerpipe
sudo python setup.py install

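You may also need to:

install JPype, on Ubuntu (sudo apt-get install python-jpype)
install charade (sudo pip install charade)

But you do not have to install the boilerpipe Java jar yourself, because the installer loads it for you.

I tried installing python-boilerpipe from pip, but had no luck. I could run the boilerpipe Java code successfully, but kept getting the same error.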

Answer 4

The class HTMLHighlighter cannot be found. Have you set JAVA_HOME? The documentation states:

Be sure to have set JAVA_HOME properly since jpype depends on this setting.
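A quick way to check this from Python before importing boilerpipe is sketched below. The function name `check_java_home` is made up for illustration; it only confirms that the variable is set and points at an existing directory, not that the directory holds a usable JVM.

```python
import os


def check_java_home(environ=None):
    """Return (ok, message) saying whether JAVA_HOME points at a directory.

    Hypothetical helper; pass a dict to check something other than os.environ.
    """
    environ = os.environ if environ is None else environ
    java_home = environ.get("JAVA_HOME")
    if not java_home:
        return False, "JAVA_HOME is not set"
    if not os.path.isdir(java_home):
        return False, "JAVA_HOME is not a directory: %s" % java_home
    return True, "JAVA_HOME = %s" % java_home


if __name__ == "__main__":
    ok, message = check_java_home()
    print(message)
```

If this reports that JAVA_HOME is unset or wrong, fix the environment variable first and then retry the import.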

Answer 5

I had the same problem. I looked at the setup details provided by the web mining author. Here is the link to his GitHub page for boilerpipe:

https://github.com/misja/python-boilerpipe/blob/master/setup.py

Unable to import boilerpipe in Python
