I have a CSV file in the following format:
id1,"When I think about the short time that we live and relate it to á
the periods of my life when I think that I did not use this á
short time."
id2,"[ On days when I feel close to my partner and other friends. á
When I feel at peace with myself and also experience a close á
contact with people whom I regard greatly.]"
I want to read it in PySpark. My code is:
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("Id", StringType()),
    StructField("Sentence", StringType()),
])
df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "false") \
    .option("inferSchema", "false") \
    .option("delimiter", "\"") \
    .schema(schema) \
    .load("mycsv.csv")
But the result I get is:
+--------------------------------------------------------------+-------------------------------------------------------------------+
| Id | Sentence |
+--------------------------------------------------------------+-------------------------------------------------------------------+
|id1, |When I think about the short time that we live and relate it to á |
|the periods of my life when I think that I did not use this á |null |
|short time. |" |
...
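I want to read it into two columns, one containing Id and the other Sentence. The lines of each sentence should be joined at the character á, because as far as I can see the reader moves on to the next line without having reached the delimiter.
My output should look like this: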
+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
| Id | Sentence |
+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|id1, |When I think about the short time that we live and relate it to the periods of my life when I think that I did not use this short time. |
I have shown only one id in this example. What changes does my code need?
Answer 0 (score: 1)
Just update Spark to 2.2 or later, if you haven't done so already, and use the multiLine option:
df = spark.read \
    .option("header", "false") \
    .option("inferSchema", "false") \
    .option("delimiter", "\"") \
    .schema(schema) \
    .csv("mycsv.csv", multiLine=True)
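As a sanity check outside Spark, the sketch below uses Python's standard csv module to show what a quote-aware parse of the sample from the question should yield: two records, each with a sentence that still contains its embedded line breaks. It assumes the file is comma-delimited with double-quoted fields, as the sample suggests. `RAW` and `parse_records` are illustrative names, not part of the original code.

```python
import csv
import io

# Sample data copied from the question: two records, each spanning three lines.
RAW = (
    'id1,"When I think about the short time that we live and relate it to á\n'
    'the periods of my life when I think that I did not use this á\n'
    'short time."\n'
    'id2,"[ On days when I feel close to my partner and other friends. á\n'
    'When I feel at peace with myself and also experience a close á\n'
    'contact with people whom I regard greatly.]"\n'
)


def parse_records(text):
    """Parse comma-delimited text, keeping line breaks that sit inside quotes."""
    # newline="" leaves embedded newlines untouched so the csv module sees them.
    buffer = io.StringIO(text, newline="")
    reader = csv.reader(buffer, delimiter=",", quotechar='"')
    return [row for row in reader]


rows = parse_records(RAW)
for row_id, sentence in rows:
    print(row_id, repr(sentence))
```

If Spark's multiLine read returns more than two rows for this sample, the quote and delimiter settings are the likely place to look.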
Once the file is read this way, you can remove the á with regexp_replace:
from pyspark.sql.functions import regexp_replace
df = df.withColumn("Sentence", regexp_replace("Sentence", "á", ""))
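Note that replacing only "á" leaves the embedded line break in place, while the desired output in the question has each sentence on a single line. A minimal sketch of one way to handle that, shown in plain Python so it can be tried without a cluster: match the á together with the whitespace around it (including the newline) and substitute a single space. The pattern and the name `join_sentence` are assumptions for illustration. The same pattern should carry over to regexp_replace, though that is not verified here, and it would also strip any legitimate á that occurs inside the text.

```python
import re

# Hypothetical pattern: an "á" marker plus any whitespace around it,
# which includes the line break that follows the marker.
JOIN_MARKER = re.compile(r"\s*á\s*")


def join_sentence(sentence):
    """Replace each 'á' continuation marker and its line break with one space."""
    return JOIN_MARKER.sub(" ", sentence)


sample = (
    "When I think about the short time that we live and relate it to á\n"
    "the periods of my life when I think that I did not use this á\n"
    "short time."
)
joined = join_sentence(sample)
print(joined)
```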