PySpark: replacing values with another value

Asked: 2017-07-21 15:35:34

Tags: apache-spark dataframe pyspark

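I am trying to replace all values whose number of occurrences falls below a set threshold with another value, using Spark 2.1.0, the PySpark API and DataFrames.

When I test the function on an example (df1 below), it works, but it fails on my real data. Both have the same dtype: string. With df1, I run the function in a for loop over the columns 'cat' and 'integer', which is also what I want to do with the real DataFrame, and there it works perfectly.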

df1 = spark.createDataFrame([
    (0, "a","1"),
    (1, "b","1"),
    (2, "c","1"),
    (3, "a","1"),
    (4, "a","2"),
    (5, "c","2"),
    (6,"b","1"),
    (7,"b","1"),
], ["id", "cat","integer"])

def cutoff(df,feat,threshold, otherclass='other'):
    if isinstance(threshold, float) and threshold<1:
        threshold = str(int(threshold*df.count()))

    if isinstance(threshold,int):
        threshold = str(threshold)

    temp = df.groupBy(feat).count().orderBy('count')
    replace = temp.filter("count<"+threshold).select(feat).rdd.map(lambda r:r[0]).collect()
    print "replacing ", replace,replace.__class__, " with ", otherclass, " subset ", feat
    df = df.replace(replace,otherclass,feat)

    return df

But when I use the real data (imported from Hive via SQL), I get:

mydata (just a part):
+---------------+-----+
|   browser_name|count|
+---------------+-----+
|         Chrome| 2197|
|             IE|  719|
|        Firefox|  542|
|  Mobile Safari|  370|
|Android Browser|  361|
|           Edge|  265|
| Chrome WebView|  203|

replacing  [u'Iron', u'UCBrowser', u'Puffin', u'Opera Mini', u'Yandex', u'Maxthon', u'Silk', u'Vivaldi', None, u'MIUI Browser', u'Chromium', u'WebKit', u'IEMobile', u'Facebook', u'Chrome WebView', u'Safari', u'Opera', u'Android Browser', u'Mobile Safari', u'Edge', u'IE', u'Firefox'] <type 'list'>  with  other  subset  browser_name


Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-1704248642413819893.py", line 267, in <module>
    raise Exception(traceback.format_exc())
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-1704248642413819893.py", line 265, in <module>
    exec(code)
  File "<stdin>", line 12, in <module>
  File "<stdin>", line 10, in cutoff
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 1345, in replace
    self._jdf.na().replace(self._jseq(subset), self._jmap(rep_dict)), self.sql_ctx)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o4471.replace.
: scala.MatchError: null
    at org.apache.spark.sql.DataFrameNaFunctions.replace0(DataFrameNaFunctions.scala:351)
    at org.apache.spark.sql.DataFrameNaFunctions.replace(DataFrameNaFunctions.scala:336)
    at sun.reflect.GeneratedMethodAccessor359.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)

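So it complains about the replace function. But when I run just df.replace(['Chrome','Firefox'],'hovno','browser_name').show(10), it works like a charm again (I even tried passing the list as unicode strings, the way the function generates it, and that was fine too). So I wonder what I am doing in the function on the DataFrame that makes it fail. My understanding is that a MatchError means a source value to be replaced cannot be found, but the values are definitely there.

Many thanks!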

1 Answer:

Answer 0 (score: 0)

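The problem was that my DataFrame contained None values, which replace cannot match (note the None in the list printed before the traceback). As a workaround, I first used fillna to replace the None values with the string 'None', and then it worked like a charm.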
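
A minimal sketch of that workaround, assuming the same arguments as the cutoff function in the question. The names rare_values, cutoff_null_safe and null_label are hypothetical and introduced only for illustration; fillna, groupBy, count, collect and replace are the DataFrame methods already used above. Nulls are filled with a placeholder string before counting, and any None still present in the list is dropped so that it never reaches replace.

```python
def rare_values(counts, threshold):
    """Return the values whose count is below threshold, skipping None.

    counts is an iterable of (value, count) pairs, for example the rows
    collected from df.groupBy(feat).count().
    """
    return [value for value, count in counts
            if value is not None and count < threshold]


def cutoff_null_safe(df, feat, threshold, otherclass='other', null_label='None'):
    """Hypothetical null-safe variant of the cutoff function in the question."""
    # A float below 1 is treated as a fraction of the row count, as in the question.
    if isinstance(threshold, float) and threshold < 1:
        threshold = int(threshold * df.count())

    # Workaround from the answer: turn nulls into a placeholder string first.
    # With a string value, fillna only affects string columns (assumed here).
    df = df.fillna(null_label, subset=[feat])

    counts = [(row[0], row[1]) for row in df.groupBy(feat).count().collect()]
    to_replace = rare_values(counts, threshold)

    # Nothing is rare enough: return the DataFrame unchanged.
    if not to_replace:
        return df
    return df.replace(to_replace, otherclass, feat)
```

With the real data this would be called as cutoff_null_safe(df, 'browser_name', 300), where 300 is an arbitrary example threshold. Rows that were null end up labelled 'None', or 'other' if that label is itself below the threshold.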