I am trying to read a CSV from an AWS S3 bucket with PySpark. Because the CSV has a column with embedded commas, it is not being loaded into the data frame correctly.
My Spark version is 2.4.4. Any help is appreciated.
The code is below.
from pyspark import SparkFiles

url = "<url of the file from AWS S3>"
spark.sparkContext.addFile(url)
df = spark.read.option('header', 'true').csv(
    SparkFiles.get("drugsComTrain_raw.csv"),
    inferSchema=True, quote='"', escape='"',
    sep=",", timestampFormat="mm/dd/yy",
)
The output looks like this.
+--------------------+----------+--------------------+--------------------+------+---------+-----------+
| uniqueID| drugName| condition| review|rating| date|usefulCount|
+--------------------+----------+--------------------+--------------------+------+---------+-----------+
| 206461| Valsartan|Left Ventricular ...|"It has no side e...| 9|20-May-12| 27|
| 95260|Guanfacine| ADHD|"My son is halfwa...| null| null| null|
|We have tried man...| 8| 27-Apr-10| 192| null| null| null|
+--------------------+----------+--------------------+--------------------+------+---------+-----------+
Below is the column with the embedded commas.
df.select("review").show(1,truncate=False)
+-------------------------------------------------------------------------------+
|review |
+-------------------------------------------------------------------------------+
|"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"|
+-------------------------------------------------------------------------------+
only showing top 1 row
df.select("uniqueID").show(5)
+--------------------+
| uniqueID|
+--------------------+
| 206461|
| 95260|
|We have tried man...|
| 92703|
|The positive side...|
+--------------------+
only showing top 5 rows
uniqueID should only contain numbers.
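For what it's worth, the way the continuation text ("We have tried man...") spills into uniqueID suggests the quoted review field contains embedded newlines in addition to embedded commas, so each physical line is being treated as a row. A quote-aware CSV parser keeps such a record together. Here is a minimal sketch with Python's stdlib csv module; the sample data is invented to mirror the symptom, not taken from the real file:

import csv
import io

# Invented sample shaped like the dataset: the review field is quoted
# and contains both an embedded comma and an embedded newline.
raw = (
    'uniqueID,drugName,review,rating\n'
    '206461,Valsartan,"It has no side effect, I take it daily",9\n'
    '95260,Guanfacine,"My son is halfway.\nWe have tried many meds.",8\n'
)

# csv.reader honors the quotes, so the embedded newline stays inside
# one logical record instead of starting a new row.
rows = list(csv.reader(io.StringIO(raw)))
print(len(rows))    # -> 3 (header + 2 data rows, not 4 physical lines)
print(rows[2][0])   # -> 95260 (uniqueID stays numeric)

In Spark's CSV reader the equivalent behavior is opt-in: quoted fields spanning lines require the multiLine read option (available since Spark 2.2), e.g. adding .option("multiLine", True) to the reader above, alongside the existing quote and escape settings.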