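pyspark: StopWordsRemover parameter locale given invalid value

Time: 2019-03-19 16:44:53

Tags: apache-spark pyspark stop-words

I have loaded several text files into a DataFrame with pyspark and split them into words, and now I want to filter out stop words with StopWordsRemover.

However, when I try to instantiate the StopWordsRemover class, it fails with this error: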

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
/usr/local/Cellar/apache-spark/2.4.0/libexec/python/pyspark/sql/utils.py in deco(*a, **kw)
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:

/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:

Py4JJavaError: An error occurred while calling None.org.apache.spark.ml.feature.StopWordsRemover.
: java.lang.IllegalArgumentException: StopWordsRemover_daf8924a73f7 parameter locale given invalid value pl_US.
    at org.apache.spark.ml.param.Param.validate(params.scala:77)
    at org.apache.spark.ml.param.ParamPair.<init>(params.scala:656)
    at org.apache.spark.ml.param.Param.$minus$greater(params.scala:87)
    at org.apache.spark.ml.feature.StopWordsRemover.<init>(StopWordsRemover.scala:109)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)


During handling of the above exception, another exception occurred:

IllegalArgumentException                  Traceback (most recent call last)
<ipython-input-17-3dbcf7d12cb6> in <module>
----> 1 remover = StopWordsRemover(inputCol="words", outputCol="filtered")

/usr/local/Cellar/apache-spark/2.4.0/libexec/python/pyspark/__init__.py in wrapper(self, *args, **kwargs)
    108             raise TypeError("Method %s forces keyword arguments." % func.__name__)
    109         self._input_kwargs = kwargs
--> 110         return func(self, **kwargs)
    111     return wrapper
    112 

/usr/local/Cellar/apache-spark/2.4.0/libexec/python/pyspark/ml/feature.py in __init__(self, inputCol, outputCol, stopWords, caseSensitive, locale)
   2595         super(StopWordsRemover, self).__init__()
   2596         self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.StopWordsRemover",
-> 2597                                             self.uid)
   2598         self._setDefault(stopWords=StopWordsRemover.loadDefaultStopWords("english"),
   2599                          caseSensitive=False, locale=self._java_obj.getLocale())

/usr/local/Cellar/apache-spark/2.4.0/libexec/python/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
     65             java_obj = getattr(java_obj, name)
     66         java_args = [_py2java(sc, arg) for arg in args]
---> 67         return java_obj(*java_args)
     68 
     69     @staticmethod

/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1523         answer = self._gateway_client.send_command(command)
   1524         return_value = get_return_value(
-> 1525             answer, self._gateway_client, None, self._fqn)
   1526 
   1527         for temp_arg in temp_args:

/usr/local/Cellar/apache-spark/2.4.0/libexec/python/pyspark/sql/utils.py in deco(*a, **kw)
     77                 raise QueryExecutionException(s.split(': ', 1)[1], stackTrace)
     78             if s.startswith('java.lang.IllegalArgumentException: '):
---> 79                 raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
     80             raise
     81     return deco

IllegalArgumentException: 'StopWordsRemover_daf8924a73f7 parameter locale given invalid value pl_US.'

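I have tried setting the locale parameter to "en_US" and passing a list of stopWords, as in e.g. "pyspark : how to configure StopWordsRemover with french language on spark 1.6.3".

I am running Spark v2.4.0.

3 answers:

Answer 0: (score: 1)

Adding the following code before using StopWordsRemover solved my problem.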

gcc -zexecstack shell.c

By the way, my pyspark version is 2.4.0.
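
The command shown in this answer is a C compiler invocation and does not interact with Spark. As a hedged sketch of the kind of step that could run before constructing StopWordsRemover, one option is to reset the driver JVM's default locale through the py4j gateway. This is an assumption about the intent, not necessarily the code the answerer used. The helper names to_language_tag and set_jvm_default_locale are hypothetical, and _jvm is a private SparkContext attribute.

```python
def to_language_tag(locale_name):
    """Convert a POSIX-style name such as 'en_US' to a BCP 47 tag such as 'en-US'."""
    return locale_name.replace("_", "-")


def set_jvm_default_locale(spark, locale_name="en_US"):
    """Reset the driver JVM's default locale and return its new string form.

    `spark` is an existing SparkSession. `_jvm` is a private attribute of
    SparkContext, so this is a workaround rather than a supported API.
    """
    jvm_locale = spark.sparkContext._jvm.java.util.Locale
    # java.util.Locale.forLanguageTag expects a BCP 47 tag ("en-US").
    jvm_locale.setDefault(jvm_locale.forLanguageTag(to_language_tag(locale_name)))
    return jvm_locale.getDefault().toString()
```

With a running session, set_jvm_default_locale(spark) would be called before StopWordsRemover(inputCol="words", outputCol="filtered"), so that the constructor reads a default locale Spark accepts.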

Answer 1: (score: 0)

For me, setting the JVM arguments to the correct country and language solved the problem:

  

-Duser.country=US -Duser.language=en
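
One way to pass such JVM arguments from Python is the PYSPARK_SUBMIT_ARGS environment variable, which PySpark reads when it launches the driver JVM. The following is a minimal sketch, assuming a locally launched driver and that no SparkSession or SparkContext has been created yet in the process. The helper names jvm_locale_options and configure_driver_locale are hypothetical.

```python
import os


def jvm_locale_options(language="en", country="US"):
    """Build the JVM system-property flags that select the default locale."""
    return "-Duser.language={} -Duser.country={}".format(language, country)


def configure_driver_locale(language="en", country="US"):
    """Set PYSPARK_SUBMIT_ARGS so the driver JVM starts with the given locale.

    Note: this overwrites any existing PYSPARK_SUBMIT_ARGS value, and it only
    takes effect if it runs before the JVM is launched.
    """
    options = jvm_locale_options(language, country)
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        '--driver-java-options "{}" pyspark-shell'.format(options)
    )
    return os.environ["PYSPARK_SUBMIT_ARGS"]


submit_args = configure_driver_locale()
```

The same flags can also be given on the command line through spark-submit's --driver-java-options option.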

Answer 2: (score: 0)

In Spark 2.4.0 you can use the setLocale function, and since version 2.0.0 you can use the StopWordsRemover.loadDefaultStopWords method.

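The default locale appears to be the system locale; see line 96 of the Scala source linked below. For example, to load the default English stop words you can use val stopWords = StopWordsRemover.loadDefaultStopWords("english").toSet

Finally, to set a new locale, take a valid StopWordsRemover instance (here named swr) and call setLocale on it:

swr.setLocale("en_US")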
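
The same two calls exist in PySpark 2.4.0. The sketch below combines them and assumes an active SparkContext and that the StopWordsRemover constructor itself succeeds, which the traceback in the question suggests requires a JVM default locale that Spark accepts. The helper name build_remover and its default column names are hypothetical.

```python
def build_remover(input_col="words", output_col="filtered",
                  language="english", locale="en_US"):
    """Create a StopWordsRemover with an explicit stop-word list and locale."""
    # Imported inside the function so that defining the helper does not
    # itself require a Spark installation.
    from pyspark.ml.feature import StopWordsRemover

    # loadDefaultStopWords needs an active SparkContext.
    stop_words = StopWordsRemover.loadDefaultStopWords(language)
    remover = StopWordsRemover(inputCol=input_col, outputCol=output_col,
                               stopWords=stop_words)
    # setLocale is available from Spark 2.4.0 onwards.
    remover.setLocale(locale)
    return remover
```

Calling build_remover().transform(df) on a DataFrame with an array-of-strings column named words would then add the filtered column.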

See the documentation here: https://spark.apache.org/docs/2.4.0/api/python/pyspark.ml.html#pyspark.ml.feature.StopWordsRemover

Scala: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala

Python: https://github.com/apache/spark/blob/52671d631d2a64ed1cfa0c6e01168908faf92df8/python/pyspark/ml/feature.py

Update: to create a StopWordsRemover instance in Python, just do the following:

from pyspark.ml.feature import StopWordsRemover

remover = StopWordsRemover(inputCol="text", outputCol="words", locale="en_US")

remover.getStopWords()

Output:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves',
 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their',
 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an',
 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about',
 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up',
 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when',
 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don',
 'should', 'now', "i'll", "you'll", "he'll", "she'll", "we'll", "they'll", "i'd", "you'd", "he'd", "she'd",
 "we'd", "they'd", "i'm", "you're", "he's", "she's", "it's", "we're", "they're", "i've", "we've", "you've",
 "they've", "isn't", "aren't", "wasn't", "weren't", "haven't", "hasn't", "hadn't", "don't", "doesn't", "didn't",
 "won't", "wouldn't", "shan't", "shouldn't", "mustn't", "can't", "couldn't", 'cannot', 'could', "here's",
 "how's", "let's", 'ought', "that's", "there's", "what's", "when's", "where's", "who's", "why's", 'would']