I'm new to Python and PySpark. How can I write the following Spark DataFrame call (Scala) in PySpark?
val df = spark.read.format("jdbc").options(
Map(
"url" -> "jdbc:someDB",
"user" -> "root",
"password" -> "password",
"dbtable" -> "tableName",
"driver" -> "someDriver")).load()
I tried to write it in PySpark as follows, but I get a syntax error:
df = spark.read.format("jdbc").options(
map(lambda : ("url","jdbc:someDB"), ("user","root"), ("password","password"), ("dbtable","tableName"), ("driver","someDriver"))).load()
Thanks in advance.
Answer 0 (score: 0)
Try using option() instead:
df = spark.read \
.format("jdbc") \
.option("url","jdbc:someDB") \
.option("user","root") \
.option("password","password") \
.option("dbtable","tableName") \
.option("driver","someDriver") \
.load()
Answer 1 (score: 0)
To load a CSV file with multiple parameters, pass them to load() as keyword arguments:
df = spark.read.load("examples/src/main/resources/people.csv",
format="csv", sep=":", inferSchema="true", header="true")
Here is the documentation.
Answer 2 (score: 0)
In PySpark, pass the options as keyword arguments:
df = spark.read\
.format("jdbc")\
.options(
url="jdbc:someDB",
user="root",
password="password",
dbtable="tableName",
driver="someDriver",
)\
.load()
Sometimes it is convenient to keep them in a dict and unpack them later with the splat operator (**):
options = {
"url": "jdbc:someDB",
"user": "root",
"password": "password",
"dbtable": "tableName",
"driver": "someDriver",
}
df = spark.read\
.format("jdbc")\
.options(**options)\
.load()
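To see why the two call styles above are interchangeable, here is a minimal plain-Python sketch. collect_options is a hypothetical stand-in written for illustration, not a Spark function; it only mimics the shape of a method that accepts keyword options.

```python
# Hypothetical stand-in for a method that accepts keyword options,
# similar in shape to options(**options) on the reader. Not part of Spark.
def collect_options(**options):
    # Keyword arguments arrive inside the function as a plain dict.
    return dict(options)


jdbc_options = {
    "url": "jdbc:someDB",
    "user": "root",
    "password": "password",
    "dbtable": "tableName",
    "driver": "someDriver",
}

# Unpacking the dict with ** is equivalent to spelling out each keyword.
unpacked = collect_options(**jdbc_options)
explicit = collect_options(
    url="jdbc:someDB",
    user="root",
    password="password",
    dbtable="tableName",
    driver="someDriver",
)

print(unpacked == explicit)
```

Keeping the settings in a dict also makes it easy to load them from a config file or reuse them across several reads.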
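About the snippet in your question: you happened to mix up two different concepts of "map":
Map is a data structure, also known as an "associative array" or "dictionary", equivalent to Python's dict.
map is a higher-order function that applies a function to each element of an iterable, for example:
In [1]: def square(x: int) -> int: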
   ...:     return x**2
   ...:
In [2]: list(map(square, [1, 2, 3, 4, 5]))
Out[2]: [1, 4, 9, 16, 25]
In [3]: # or just use a lambda
In [4]: list(map(lambda x: x**2, [1, 2, 3, 4, 5]))
Out[4]: [1, 4, 9, 16, 25]
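If the settings already exist as a sequence of key/value pairs, as in the attempt from the question, one possible approach is to build a dict from them with dict() rather than map(), then unpack it into options(). This is only a sketch: the variable names are illustrative, and the Spark call is left as a comment because it needs a live SparkSession and a JDBC driver.

```python
# Key/value pairs like the ones in the question's attempt.
pairs = [
    ("url", "jdbc:someDB"),
    ("user", "root"),
    ("password", "password"),
    ("dbtable", "tableName"),
    ("driver", "someDriver"),
]

# dict() turns an iterable of 2-tuples into a mapping, which is the
# Python counterpart of the Scala Map used in the original code.
options = dict(pairs)
print(options["dbtable"])

# With a running SparkSession named `spark`, the dict could then be unpacked:
# df = spark.read.format("jdbc").options(**options).load()
```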