FPGrowth in PySpark

Time: 2018-06-06 13:33:20

Tags: python apache-spark pyspark apache-spark-sql

I ran into an error when creating a DataFrame from an RDD.

from pyspark.ml.fpm import FPGrowth

sogou = sc.textFile("SogouQ.sample.utf8", use_unicode = False)

def parse(line):
    value = [ x for x in line.split(",") if x]
    return list(set(value))

rdd = sogou.map(parse)
df = sogou.toDF('items')

I get the following error:

pyspark.sql.utils.ParseException: u"\nmismatched input '<EOF>' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', ..., 'ANTI', 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 5)\n\n== SQL ==\nitems\n-----^^^\n"

The text contains Chinese. Does that matter? The text looks like this:

360,安全卫士,
123,123,范冰冰,

The RDD works fine when I use pyspark.mllib.fpgrowth. How can I convert it to a DataFrame?

1 Answer:

Answer 0 (score: 0)

There are two different problems here:

  • The toDF call. RDD.toDF has the following signature (see the sketch after this list for why the plain-string form fails):

    Signature: rdd.toDF(schema=None, sampleRatio=None)
    

    where schema should be:

      :param schema: pyspark.sql.types.StructType or list of column names

    so in your case it should be:

    sogou.toDF(["items"])
    
  • The parse method:

    createDataFrame, which toDF calls under the hood, requires an RDD[tuple] or an equivalent that can be mapped to structs, unless a schema with a DataType is provided. If you want to use only column names, parse should return a tuple:

    def parse(line):
        value = [ x for x in line.split(",") if x]
        return list(set(value)),  # the trailing comma wraps the list in a 1-tuple
    

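As an aside, the ParseException quoted in the question is triggered by the string argument itself: when schema is a plain string, Spark parses it as a data-type/DDL definition rather than as a column name, and the bare word items is not a valid type, hence the "mismatched input" at line 1, pos 5. A minimal sketch of the two forms (assuming the same rdd as above):

>>> rdd.map(parse).toDF("items")    # string is parsed as a DDL schema -> ParseException
>>> rdd.map(parse).toDF(["items"])  # a list is treated as column names -> works
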
Combined:

>>> def parse(line):
...     value = [ x for x in line.split(",") if x]
...     return list(set(value)),  
... 
... 
>>> rdd = sc.parallelize(["360,安全卫士,", "123,123,范冰冰,"])
>>> rdd.map(parse).toDF(["items"]).show()
+-----------+
|      items|
+-----------+
|[安全卫士, 360]|
| [123, 范冰冰]|
+-----------+
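
With the DataFrame in this shape it can be fed straight into the FPGrowth estimator imported in the question. A minimal sketch (the minSupport/minConfidence values are arbitrary placeholders); note that ml's FPGrowth rejects transactions containing duplicate items, which is presumably why parse deduplicates with set():

>>> from pyspark.ml.fpm import FPGrowth
>>> df = rdd.map(parse).toDF(["items"])
>>> fp = FPGrowth(itemsCol="items", minSupport=0.2, minConfidence=0.6)  # placeholder thresholds
>>> model = fp.fit(df)
>>> model.freqItemsets.show()
>>> model.associationRules.show()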

An alternative (keeping your current parse implementation) would be:

>>> from pyspark.sql.types import ArrayType, StringType
>>> def parse(line):
...     value = [ x for x in line.split(",") if x]
...     return list(set(value))
... 
>>> rdd.map(parse).toDF(ArrayType(StringType())).toDF("items").show()
+-----------+
|      items|
+-----------+
|[安全卫士, 360]|
| [123, 范冰冰]|
+-----------+
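
Equivalently, the schema can be spelled out as a StructType, matching the :param schema: form quoted above, which names the column in one step instead of renaming via a second toDF call. A sketch assuming a SparkSession bound to spark and the tuple-returning parse from the "Combined" example:

>>> from pyspark.sql.types import StructType, StructField, ArrayType, StringType
>>> schema = StructType([StructField("items", ArrayType(StringType()))])
>>> spark.createDataFrame(rdd.map(parse), schema).show()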