Unable to import sparknlp after installing sparknlp

Date: 2017-12-07 22:52:38

Tags: apache-spark pyspark apache-spark-mllib spark-packages

The following runs successfully on a Cloudera CDSW cluster gateway, producing this output:

Ivy Default Cache set to: /home/cdsw/.ivy2/cache
The jars for the packages stored in: /home/cdsw/.ivy2/jars
:: loading settings :: url = jar:file:/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
JohnSnowLabs#spark-nlp added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found JohnSnowLabs#spark-nlp;1.2.3 in spark-packages
    found com.typesafe#config;1.3.0 in central
    found org.fusesource.leveldbjni#leveldbjni-all;1.8 in central
downloading http://dl.bintray.com/spark-packages/maven/JohnSnowLabs/spark-nlp/1.2.3/spark-nlp-1.2.3.jar ...
    [SUCCESSFUL ] JohnSnowLabs#spark-nlp;1.2.3!spark-nlp.jar (3357ms)
downloading https://repo1.maven.org/maven2/com/typesafe/config/1.3.0/config-1.3.0.jar ...
    [SUCCESSFUL ] com.typesafe#config;1.3.0!config.jar(bundle) (348ms)
downloading https://repo1.maven.org/maven2/org/fusesource/leveldbjni/leveldbjni-all/1.8/leveldbjni-all-1.8.jar ...
    [SUCCESSFUL ] org.fusesource.leveldbjni#leveldbjni-all;1.8!leveldbjni-all.jar(bundle) (382ms)
:: resolution report :: resolve 3836ms :: artifacts dl 4095ms
    :: modules in use:
    JohnSnowLabs#spark-nlp;1.2.3 from spark-packages in [default]
    com.typesafe#config;1.3.0 from central in [default]
    org.fusesource.leveldbjni#leveldbjni-all;1.8 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   3   |   3   |   3   |   0   ||   3   |   3   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    3 artifacts copied, 0 already retrieved (5740kB/37ms)
Setting default log level to "ERROR".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

But when I try to import sparknlp in pyspark as described by John Snow Labs:

import sparknlp
# or
from sparknlp.annotator import *

I get:

ImportError: No module named sparknlp
ImportError: No module named sparknlp.annotator

What do I need to do to use sparknlp? Presumably this generalizes to any Spark package.

3 Answers:

Answer 0 (score: 2)

I figured it out. The jar files, which loaded correctly, contain only compiled Scala. I still had to put the Python files containing the wrapper code somewhere I could import from. Once I did that, everything worked fine.
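
For illustration, a minimal sketch of that fix, assuming the spark-nlp source tree has been unpacked to ~/spark-nlp (a hypothetical location; use wherever the Python wrapper files actually live):

import os
import sys

# Make the directory containing the sparknlp wrapper package importable.
sys.path.insert(0, os.path.expanduser("~/spark-nlp/python"))

import sparknlp  # the Python wrappers, backed by the already-loaded Scala jars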

Answer 1 (score: 1)

You can use the SparkNLP package in PySpark with the following command:

pyspark --packages JohnSnowLabs:spark-nlp:1.3.0
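
The same thing can be done from inside a Python script rather than on the command line; a minimal sketch (the app name is arbitrary, and spark.jars.packages must be set before the session is created):

from pyspark.sql import SparkSession

# Resolves the package from the spark-packages/Maven repositories,
# just like the --packages flag does.
spark = (SparkSession.builder
         .appName("spark-nlp-demo")
         .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:1.3.0")
         .getOrCreate())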

Either way, this only ships the jars; it doesn't tell Python where to find the bindings. As described in a similar report here, you can fix this by adding the jar to PYTHONPATH:

export PYTHONPATH="~/.ivy2/jars/JohnSnowLabs_spark-nlp-1.3.0.jar:$PYTHONPATH"

Or, equivalently, from within Python:

import sys, glob, os
sys.path.extend(glob.glob(os.path.join(os.path.expanduser("~"), ".ivy2/jars/*.jar")))
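
If the path is set up, the import should now succeed. Since a jar is just a zip archive, Python's zipimport machinery can load the module straight out of it, assuming the jar bundles the Python sources at its root, as this answer implies. A quick sanity check:

import sparknlp
print(sparknlp.__file__)  # should point inside the spark-nlp jar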

Answer 2 (score: 0)

Thanks, Clay. Here is how I set up PYTHONPATH:

git clone --branch 3.0.3 https://github.com/JohnSnowLabs/spark-nlp
export PYTHONPATH="./spark-nlp/python:$PYTHONPATH"

It then worked for me, because my ./spark-nlp/python folder now contained the elusive sparknlp module:

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.3

>>> import sparknlp
>>>
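
From there, a short smoke test, as a sketch (this assumes spark-nlp 3.x, where start() and version() are part of the package API):

import sparknlp

spark = sparknlp.start()   # builds a SparkSession with spark-nlp on the classpath
print(sparknlp.version())  # e.g. "3.0.3"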