I am trying to manually create a pyspark DataFrame from some given data:
row_in=[(1566429545575348),(40.353977),(-111.701859)]
rdd=sc.parallelize(row_in)
schema = StructType([StructField("time_epocs", DecimalType(), True),StructField("lat", DecimalType(),True),StructField("long", DecimalType(),True)])
df_in_test=spark.createDataFrame(rdd,schema)
This throws an error when I try to display the DataFrame, so I'm not sure how to do this. The Spark documentation (here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=createdataframe#pyspark.sql.SparkSession.createDataFrame) seemed somewhat unwieldy to me, and I ran into similar errors when trying to follow those instructions.
Does anyone know how to do this?
Answer 0 (score: 2)
This answer demonstrates how to create a PySpark DataFrame with createDataFrame, create_df, and toDF.
df = spark.createDataFrame([("joe", 34), ("luisa", 22)], ["first_name", "age"])
df.show()
+----------+---+
|first_name|age|
+----------+---+
| joe| 34|
| luisa| 22|
+----------+---+
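Since the schema here is just a list of column names, Spark infers each column's type from the data. A quick way to confirm what was inferred (a minimal sketch, not part of the original answer):
# Python ints are inferred as long (bigint) by default
df.printSchema()
# root
#  |-- first_name: string (nullable = true)
#  |-- age: long (nullable = true)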
You can also pass createDataFrame an RDD and a schema to construct the DataFrame more precisely:
from pyspark.sql import Row
from pyspark.sql.types import *
rdd = spark.sparkContext.parallelize([
    Row(name='Allie', age=2),
    Row(name='Sara', age=33),
    Row(name='Grace', age=31)])
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), False)])
df = spark.createDataFrame(rdd, schema)
df.show()
+-----+---+
| name|age|
+-----+---+
|Allie| 2|
| Sara| 33|
|Grace| 31|
+-----+---+
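Because the schema declares age as non-nullable, this construction path also validates the data. A sketch of the expected behavior when a None sneaks in (exact error wording varies by Spark version):
# With schema verification (the default), a None in a non-nullable field is rejected
spark.createDataFrame([Row(name='Eve', age=None)], schema)
# raises ValueError: field age: This field is not nullable, but got None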
The create_df method from my Quinn project provides the best of both worlds: it's concise yet fully descriptive:
from pyspark.sql.types import *
from quinn.extensions import *
df = spark.create_df(
    [("jose", "a"), ("li", "b"), ("sam", "c")],
    [("name", StringType(), True), ("blah", StringType(), True)]
)
df.show()
+----+----+
|name|blah|
+----+----+
|jose| a|
| li| b|
| sam| c|
+----+----+
toDF offers no advantages over the other approaches:
from pyspark.sql import Row
rdd = spark.sparkContext.parallelize([
    Row(name='Allie', age=2),
    Row(name='Sara', age=33),
    Row(name='Grace', age=31)])
df = rdd.toDF()
df.show()
+-----+---+
| name|age|
+-----+---+
|Allie| 2|
| Sara| 33|
|Grace| 31|
+-----+---+
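One convenience worth noting: when the RDD holds plain tuples instead of Rows, toDF also accepts a list of column names (a minimal sketch):
# Column names can be supplied directly when the RDD contains bare tuples
df = spark.sparkContext.parallelize([('Allie', 2), ('Sara', 33)]).toDF(['name', 'age'])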
Answer 1 (score: 1)
Try:
spark.createDataFrame(
    [
        (1, 'foo'),  # create your data here, be consistent in the types.
        (2, 'bar'),
    ],
    ['id', 'txt']  # add your column labels here
)
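The comment about type consistency matters: if a column mixes types, schema inference fails. A sketch of that failure mode (exact error text varies by Spark version):
# Mixing an int and a str in the same column breaks type inference
spark.createDataFrame([(1, 'foo'), ('two', 'bar')], ['id', 'txt'])
# raises TypeError: Can not merge type LongType and StringType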
Answer 2 (score: 1)
Extending @Steven's answer:
data = [(i, 'foo') for i in range(1000)] # random data
columns = ['id', 'txt']  # add your column labels here
df = spark.createDataFrame(data, columns)
Note: when schema is a list of column names, the type of each column is inferred from the data.
If you want to define the schema explicitly, do the following:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([StructField("id", IntegerType(), True), StructField("txt", StringType(), True)])
df1 = spark.createDataFrame(data, schema)
Output:
>>> df1
DataFrame[id: int, txt: string]
>>> df
DataFrame[id: bigint, txt: string]
Note the difference: without an explicit schema, Python ints are inferred as bigint (LongType), while the explicit schema declared int (IntegerType).
Answer 3 (score: 0)
Elaborating on and building from @Steven's answer:
from pyspark.sql.types import StructType, StructField, FloatType, StringType

field = [StructField("MULTIPLIER", FloatType(), True), StructField("DESCRIPTION", StringType(), True)]
schema = StructType(field)
multiplier_df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
This will create a blank DataFrame.
We can now simply add a row to it:
l = [(2.3, 'this is a sample description')]
rdd = sc.parallelize(l)
multiplier_df_temp = spark.createDataFrame(rdd, schema)
multiplier_df = multiplier_df.union(multiplier_df_temp)
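As a side note, recent Spark versions can build the same empty starting DataFrame without going through an RDD (a minimal sketch, assuming Spark 2.x or later):
# An empty list plus an explicit schema also yields an empty DataFrame
multiplier_df = spark.createDataFrame([], schema)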
Answer 4 (score: -1)
For beginners, here is a complete example that imports data from a file:
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    ShortType,
    StringType,
    StructType,
    StructField,
    TimestampType,
)
import os
here = os.path.abspath(os.path.dirname(__file__))
spark = SparkSession.builder.getOrCreate()
schema = StructType(
    [
        StructField("id", ShortType(), nullable=False),
        StructField("string", StringType(), nullable=False),
        StructField("datetime", TimestampType(), nullable=False),
    ]
)
# read file or construct rows manually
df = spark.read.csv(os.path.join(here, "data.csv"), schema=schema, header=True)
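For reference, a data.csv matching this schema could look like the following (hypothetical contents; depending on your Spark version you may need to set timestampFormat explicitly when reading):
id,string,datetime
1,foo,2021-01-01 00:00:00
2,bar,2021-01-02 12:30:00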
Answer 5 (score: -1)
With formatting:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, IntegerType, StringType
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        (1, "foo"),
        (2, "bar"),
    ],
    StructType(
        [
            StructField("id", IntegerType(), False),
            StructField("txt", StringType(), False),
        ]
    ),
)
print(df.dtypes)
df.show()
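Run as written, the dtypes line and the table should come out roughly as:
[('id', 'int'), ('txt', 'string')]
+--+---+
|id|txt|
+--+---+
| 1|foo|
| 2|bar|
+--+---+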