Question

I have all the supporting libraries in pyspark, and I am able to create a dataframe for the parent:

import java.util.function.BinaryOperator;

class MyClass {

    public static String myToString(String a, String b) {
        return a + ", " + b;
    }

    // notice the boxing in here
    public static int mySum(int a, int b) {
        return a + b;
    }

    // not a revolutionary function, just for demonstration
    public static <T> T Invoke(BinaryOperator<T> bo, T o1, T o2) {
        return bo.apply(o1, o2);
    }

    public static void main(String[] args) {

        int sum = Invoke(MyClass::mySum, 10, 20);
        String str = Invoke(MyClass::myToString, "a", "b");

        System.out.println(sum);
        System.out.println(str);
    }
}

I am unable to create the child dataframe:

def xmlReader(root, row, filename):
    # `spark` is the active SparkSession; rowTag picks the XML element that becomes one row.
    df = (spark.read.format("com.databricks.spark.xml")
          .options(rowTag=row, rootTag=root)
          .load(filename))
    # Select the nested genericEntity fields as flat columns.
    xref = df.select(
        "genericEntity.entityId", "genericEntity.entityName",
        "genericEntity.entityType", "genericEntity.inceptionDate",
        "genericEntity.updateTimestamp", "genericEntity.entityLongName")
    return xref

df1 = xmlReader("BOBML", "entityTransaction", "s3://dev.xml")

df1.head()

I am not getting any output. I intend to join the parent and child dataframes. Any help would be appreciated!

Answer 1

After 24 hours I was able to solve the problem. Thanks to everyone who at least looked at my question.

Solution:

Step 1: Import a couple of libraries

from pyspark.sql import SparkSession

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

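Step 2 (parent): Read the xml file, print the schema, register a temp table, and create the dataframe.

Step 3 (child): Repeat step 2.

Step 4: Create the final dataframe by joining the child and parent dataframes.

Step 5: Load the data into S3 (write.csv / s3://path) or into a database.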
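
Steps 2 to 5 could look roughly like the sketch below. It is not the exact code from this answer: the helper names `read_xml` and `join_and_write`, the view names, the join key, and the output path are all hypothetical placeholders. It assumes `spark` is an existing SparkSession with the spark-xml package available, and it uses `createOrReplaceTempView`, the current replacement for the older `registerTempTable`.

```python
# Sketch only: row tags, view names, join key and paths are placeholders,
# not values taken from the original post.

XML_FORMAT = "com.databricks.spark.xml"  # format name used in the question


def read_xml(spark, row_tag, path, view_name):
    """Steps 2 and 3: read one level of the XML file into a dataframe."""
    # Each element named `row_tag` becomes one row of the dataframe.
    df = spark.read.format(XML_FORMAT).option("rowTag", row_tag).load(path)
    df.printSchema()                       # inspect the inferred schema
    df.createOrReplaceTempView(view_name)  # make it queryable via spark.sql
    return df


def join_and_write(parent_df, child_df, key, out_path):
    """Steps 4 and 5: join parent and child, then write the result as CSV."""
    # `key` must be a column that exists in both dataframes.
    final_df = parent_df.join(child_df, on=key, how="inner")
    final_df.write.mode("overwrite").csv(out_path, header=True)
    return final_df


# Hypothetical usage, with placeholders left for you to fill in:
# parent = read_xml(spark, "<parent row tag>", "<xml path>", "parent")
# child = read_xml(spark, "<child row tag>", "<xml path>", "child")
# final = join_and_write(parent, child, "<shared key column>", "<s3 output path>")
```

Reading the same file twice with different row tags keeps each level in its own dataframe, so the join key has to be a column that the schema printout shows on both sides.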

How to create a child dataframe from an xml file using Pyspark?
