Read a CSV and join lines on an ASCII character in PySpark

Date: 2018-01-29 09:32:54

Tags: apache-spark pyspark pyspark-sql

I have a CSV file in the following format -

id1,"When I think about the short time that we live and relate it to á
the periods of my life when I think that I did not use this á
short time."
id2,"[ On days when I feel close to my partner and other friends.  á
When I feel at peace with myself and also experience a close á
contact with people whom I regard greatly.]"

I want to read it in PySpark. My code is -

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("Id", StringType()),
    StructField("Sentence", StringType()),
])

df = sqlContext.read.format("com.databricks.spark.csv") \
        .option("header", "false") \
        .option("inferSchema", "false") \
        .option("delimiter", "\"") \
        .schema(schema) \
        .load("mycsv.csv")

But the result I get is -

+----------------------------------------------------------------+--------------------------------------------------------------------+
| Id                                                             | Sentence                                                           |
+----------------------------------------------------------------+--------------------------------------------------------------------+
| id1,                                                           | When I think about the short time that we live and relate it to á |
| the periods of my life when I think that I did not use this á  | null                                                               |
| short time.                                                    | "                                                                  |

...

I want this read into two columns, one holding the Id and the other the Sentence. The sentence should also be joined back together at the ASCII character á, since I can see that it is being read onto the next line without the parser reaching a delimiter.

My output should look like this -

+------+------------------------------------------------------------------------------------------------------------------------------------------+
| Id   | Sentence                                                                                                                                 |
+------+------------------------------------------------------------------------------------------------------------------------------------------+
| id1, | When I think about the short time that we live and relate it to the periods of my life when I think that I did not use this short time. |

I have only considered one id in the example. What changes do I need to make to my code?

1 answer:

Answer 0 (score: 1)

Just update Spark to 2.2 or later, if you have not done so already, and use the multiLine option:

df = spark.read \
    .option("header", "false") \
    .option("inferSchema", "false") \
    .option("delimiter", "\"") \
    .schema(schema) \
    .csv("mycsv.csv", multiLine=True)

Once you have done that, you can remove the á with regexp_replace:

from pyspark.sql.functions import regexp_replace

df.withColumn("Sentence", regexp_replace("Sentence", "á", ""))
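
Putting the two steps together, here is a minimal end-to-end sketch. It assumes Spark 2.2+, the sample saved locally as mycsv.csv, and reproduces the answer's read options unchanged; the one liberty taken is the pattern "á\s*", which also consumes the line break that multiLine keeps inside the field (the plain "á" pattern above would leave those newlines in the text):

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("multiline-csv").getOrCreate()

schema = StructType([
    StructField("Id", StringType()),
    StructField("Sentence", StringType()),
])

# multiLine=True lets the parser continue a record across physical
# line breaks instead of starting a new row at every newline.
df = (spark.read
      .option("header", "false")
      .option("inferSchema", "false")
      .option("delimiter", "\"")
      .schema(schema)
      .csv("mycsv.csv", multiLine=True))

# "á\s*" removes each á marker together with the newline and any
# whitespace after it, so the pieces join with a single space.
df = df.withColumn("Sentence", regexp_replace("Sentence", "á\\s*", ""))

df.show(truncate=False)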