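Creating a DataFrame from space-delimited data with PySpark

Time: 2018-10-17 09:17:33

Tags: python pyspark

How do I create a DataFrame from space-delimited columns? The data looks like this: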

yyyy  mm   tmax    tmin      af    rain     sun
1853   1    ---     ---     ---    57.3     ---
1853   2    ---     ---     ---    32.3     ---
1853   3    ---     ---     ---    65.5     ---
1853   4    ---     ---     ---    46.2     ---
1853   5    ---     ---     ---    13.2     ---
1853   6    ---     ---     ---    53.3     ---
1853   7    ---     ---     ---    78.0     ---
1853   8    ---     ---     ---    56.6     ---
1853   9    ---     ---     ---    24.5     ---
1853  10    ---     ---     ---    94.8     ---
1853  11    ---     ---     ---    75.5     ---

3 answers:

Answer 0 (score: 3)

Since you tagged the question pyspark (rather than pandas), you can try the following:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Space Import Test').getOrCreate()
df = spark.read.csv('/path/to/your/file',inferSchema=True,header=True,sep=' ',ignoreLeadingWhiteSpace=True)
df.show(10)
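
If the columns are padded with a varying number of spaces, as in the sample data, a single-space separator may produce extra empty fields. One possible alternative is to read the file as plain text and split each line on runs of whitespace. The sketch below is only an illustration under that assumption: `split_fields` and `read_whitespace_delimited` are hypothetical helper names, the first line is assumed to be the header, and every value is left as a string.

```python
def split_fields(line):
    """Split one line on runs of whitespace, ignoring leading and trailing blanks."""
    return line.split()


def read_whitespace_delimited(spark, path):
    """Hypothetical helper: build a DataFrame of string columns from a
    whitespace-aligned text file whose first line is the header."""
    lines = spark.sparkContext.textFile(path)
    header = lines.first()
    columns = split_fields(header)
    rows = (
        lines
        # Skip the header line and any blank lines.
        .filter(lambda l: l != header and l.strip())
        .map(split_fields)
        # Assumption: rows with an unexpected number of fields are discarded.
        .filter(lambda fields: len(fields) == len(columns))
    )
    return spark.createDataFrame(rows, schema=columns)
```

With an existing session this could be called as `read_whitespace_delimited(spark, '/path/to/your/file')`; placeholders such as `---` stay as literal strings and would need to be cast or replaced afterwards.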

Answer 1 (score: 0)

You can use pandas and set the delim_whitespace parameter to True.

  

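delim_whitespace: boolean, default False

Specifies whether whitespace (e.g. ' ' or '\t') is used as the separator. Equivalent to setting sep='\s+'. If this option is set to True, nothing should be passed in for the delimiter parameter.

Source: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html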

In your case:

import pandas

pandas.read_csv("data.txt", delim_whitespace=True)
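
As the quoted documentation notes, `sep='\s+'` is the equivalent spelling, and recent pandas releases appear to prefer it over `delim_whitespace`. The sketch below uses that form on two rows copied from the question, held in an in-memory string rather than a file. Passing `na_values=["---"]` is an added assumption that the `---` placeholders stand for missing readings.

```python
import io

import pandas as pd

# Header and two rows copied from the question, kept in memory instead of a file.
sample = (
    "yyyy  mm   tmax    tmin      af    rain     sun\n"
    "1853   1    ---     ---     ---    57.3     ---\n"
    "1853   2    ---     ---     ---    32.3     ---\n"
)

# Split on runs of whitespace and treat '---' as a missing value (assumption).
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", na_values=["---"])
print(df)
```

For the real file, the `io.StringIO(sample)` argument would be replaced by the path, for example `"data.txt"`.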

Answer 2 (score: 0)

import pandas as pd

# The separator is a single space because the .txt file is space-separated.
data = pd.read_csv('text.txt', sep=" ")
# Spaces before the first column produce empty (all-NaN) columns, so drop them.
data = data.dropna(axis=1, how='all')
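
One thing to watch with a single-space separator: in right-aligned data like the sample, the number of spaces between fields changes from row to row. The plain-Python sketch below (no pandas involved, variable names chosen for illustration) compares splitting on one space with splitting on runs of whitespace for two rows copied from the question.

```python
# Two rows copied from the question; the month column is right-aligned,
# so the second row has one space fewer before '10'.
row_month_1 = "1853   1    ---     ---     ---    57.3     ---"
row_month_10 = "1853  10    ---     ---     ---    94.8     ---"

# Splitting on a single space yields an empty string for every extra space.
single_1 = row_month_1.split(" ")
single_10 = row_month_10.split(" ")

# Splitting on runs of whitespace yields only the actual values.
runs_1 = row_month_1.split()
runs_10 = row_month_10.split()

print(len(single_1), len(single_10))  # field counts with sep=" " semantics
print(runs_1)
print(runs_10)
```

Because the two rows split into different numbers of fields on a single space, the values may not line up in the same columns, which is why a whitespace-run separator is usually the safer choice for this kind of file.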