I ran into an error while converting an RDD to a DataFrame.
from pyspark.ml.fpm import FPGrowth

sogou = sc.textFile("SogouQ.sample.utf8", use_unicode=False)

def parse(line):
    value = [x for x in line.split(",") if x]
    return list(set(value))

rdd = sogou.map(parse)
df = sogou.toDF('items')
I get the following error:
pyspark.sql.utils.ParseException: u"\nmismatched input '' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', ..., 'ANTI', 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 5)\n\n== SQL ==\nitems\n-----^^^\n"
The text contains Chinese characters. Could that be the problem?
The data looks like this:
360,安全卫士,
123,123,范冰冰,
Everything works fine with the RDD when I use pyspark.mllib.fpgrowth. How can I convert the RDD to a DataFrame?
Answer 0 (score: 0)
There are two separate problems here:
The toDF call. RDD.toDF has the following signature:
Signature: rdd.toDF(schema=None, sampleRatio=None)
where schema should be:

:param schema: a pyspark.sql.types.StructType or a list of column names
So in your case it should be:
sogou.toDF(["items"])
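A minimal sketch of the difference, assuming Spark 2.x behavior where a plain string passed as schema is parsed as a DDL schema definition rather than used as a column name (the data below is illustrative, not from the question):

pairs = sc.parallelize([(["360", "安全卫士"],)])

pairs.toDF("items")    # raises ParseException: "items" is parsed as a DDL
                       # schema string, and a bare name without a type fails
pairs.toDF(["items"])  # works: a list is interpreted as column names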
The parse method. toDF (which goes through createDataFrame) requires an RDD[tuple], or an equivalent that can be mapped to structs, unless a schema is provided. If you want to use only column names, parse should return a one-element tuple (hence the trailing comma below):
def parse(line):
    value = [x for x in line.split(",") if x]
    return list(set(value)),
Combined:
>>> def parse(line):
...     value = [x for x in line.split(",") if x]
...     return list(set(value)),
...
>>> rdd = sc.parallelize(["360,安全卫士,", "123,123,范冰冰,"])
>>> rdd.map(parse).toDF(["items"]).show()
+--------------+
| items|
+--------------+
| [安全卫士, 360]|
|  [123, 范冰冰]|
+--------------+
An alternative (keeping your current parse implementation) would be:
>>> from pyspark.sql.types import ArrayType, StringType
>>> def parse(line):
...     value = [x for x in line.split(",") if x]
...     return list(set(value))
...
>>> rdd.map(parse).toDF(ArrayType(StringType())).toDF("items").show()
+--------------+
| items|
+--------------+
| [安全卫士, 360]|
|  [123, 范冰冰]|
+--------------+
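To connect this back to the pyspark.ml.fpm import in the question, either DataFrame can be fed straight into FPGrowth. A minimal sketch; the minSupport and minConfidence values are illustrative, not taken from the question:

from pyspark.ml.fpm import FPGrowth

df = rdd.map(parse).toDF(["items"])  # tuple-returning parse from "Combined" above

# FPGrowth expects an array column of unique items, which set() already ensures
fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.5)
model = fp.fit(df)
model.freqItemsets.show()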