Question

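My data file describes graph edges. Each line has the format (src node, dst node). This is my schema definition:

 eschema = StructType([StructField("src", StringType(), True), StructField("dst", StringType(), True)])

I am trying to read each line, split it on the delimiter (','), and convert each element to an int. But this fails somehow.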

 lines = sc.textFile(filename)
 lines = lines.map(lambda l : map(int, l.split(delim)))
 lines = lines.map(lambda l : Row(l[0], l[1]))

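When I run this, I get the error: StructType can not accept object 0 in type <type 'int'>. I am using Python 2.7 and Spark > 2.0. After splitting a line, the objects are of type unicode rather than str; does that make a difference? How can I solve this? Any suggestions would be a great help. Thanks.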

Answer 1

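If the delimiter is ',', then this is just a regular CSV file. Since you are using Spark > 2.0, you can use the modern DataFrame API; instead of the Spark context (sc by convention), you can use the Spark session: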

# Omit the "header" option if the file has no header row.
df = (spark.read.format("csv")
      .option("header", "true")
      .schema(eschema)
      .load(filename))

Instead of supplying the schema with .schema(eschema), you can also use .option("inferSchema", "true") to have Spark try to guess the file's structure by looking at the data.
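If you would rather keep the RDD approach from the question, the parsing step can be sketched in plain Python, independent of Spark. The helper name parse_edge and the sample lines are assumptions made for illustration. The sketch also shows that int() accepts unicode strings, so the unicode-versus-str difference should not be the cause of the failure.

```python
def parse_edge(line, delim=","):
    """Split one 'src,dst' line and return the pair as a tuple of ints."""
    src, dst = line.strip().split(delim)
    return int(src), int(dst)


# Hypothetical sample lines standing in for the contents of the edge file.
sample_lines = [u"0,1", u"1,2", u"2,0"]

# int() works on unicode strings, so no extra conversion to str is needed.
edges = [parse_edge(line) for line in sample_lines]
print(edges)
```

In Spark, a function like this could be applied with sc.textFile(filename).map(parse_edge). Note that if the values are ints, the schema fields would presumably need to be declared as IntegerType() rather than StringType(), because a StringType field does not accept a Python int when the schema is verified.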

PySpark error: StructType can not accept object 0 in type <type 'int'>
