Question


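The data is written partitioned by a column, but the saved output does not contain that column's values. For example: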

<script type="text/javascript">
function printOrder(orderId = null) {
    if (orderId) {
        $.ajax({
            url: 'printorder.php',
            type: 'post',
            data: {id: orderId},
            dataType: 'text',
            success: function(response) {
                // Open a print window and fill it with the invoice markup.
                var mywindow = window.open('', 'Stock Management System', 'height=400,width=600');

                // Set the head once; repeated .html() calls would overwrite each other.
                $(mywindow.document.head).html(
                    '<title>Order Invoice</title>' +
                    '<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.4.0/css/bootstrap.min.css" />');
                $(mywindow.document.body).html(response);
                mywindow.document.close();
                mywindow.focus();
                mywindow.print();
                mywindow.close();
            }
        });
    }
}
</script>

The expected output of the overcast file is:

val df = sparkSession.read
  .option("delimiter", delimiter)
  .option("header", true) // Use first line of all files as header
  // .schema(customSchema)
  .option("inferSchema", "true") // Automatically infer data types
  .format("csv")
  .load(filePath)
df.show()
df.write.partitionBy("outlook").csv("output/weather.csv")

Answer 1

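When you partition data on write, Spark creates subfolders that follow the HDFS (Hive-style) partitioning convention. Here you get one subfolder for each "outlook" value found in the dataset. All files in the "outlook=overcast" subdirectory contain only records whose outlook is overcast. There is therefore no need to store the outlook column in the data itself: its value is the same for every file in a given subdirectory.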
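The layout described above can be checked with a small sketch. It assumes a local SparkSession and a made-up three-row dataset; the temperature and play columns, the sample values, and the temporary output directory are illustrative assumptions and do not come from the question.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session, only for illustration.
val spark = SparkSession.builder()
  .appName("partition-layout-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data with one row per outlook value.
val df = Seq(
  ("sunny", 85, "no"),
  ("overcast", 83, "yes"),
  ("rainy", 70, "yes")
).toDF("outlook", "temperature", "play")

// Write into a fresh temporary directory, partitioned by outlook.
val out = Files.createTempDirectory("weather").resolve("weather.csv").toString
df.write.partitionBy("outlook").option("header", "true").csv(out)

// One subdirectory per distinct outlook value, named outlook=<value>.
val partitionDirs = new File(out).listFiles()
  .filter(_.isDirectory)
  .map(_.getName)
  .sorted
partitionDirs.foreach(println)

// Reading a single partition directory directly shows what is stored
// in the files themselves: every column except outlook.
val overcastColumns = spark.read
  .option("header", "true")
  .csv(s"$out/outlook=overcast")
  .columns
println(overcastColumns.mkString(","))

spark.stop()
```

The directory names carry the partition value, which is why the files inside them can omit the column.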

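When you read the data back, for example through Hive or Spark, you have to indicate that the outlook subdirectories are in fact partitions, so that the logical column becomes available for projection, grouping, filtering, or anything else you want to do.

In Spark, this can be expressed by specifying the basePath option: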

val df = spark.read.option("basePath", "output/weather.csv").csv("output/weather.csv/*")
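As a hedged illustration of the read side, the sketch below writes a tiny partitioned dataset and reads it back in two ways: from the root output directory, where Spark's partition discovery restores the outlook column, and from a single partition subdirectory combined with basePath. The sample rows and the temporary path are assumptions made for the example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session, only for illustration.
val spark = SparkSession.builder()
  .appName("read-partitions-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data written partitioned by outlook.
val out = Files.createTempDirectory("weather").resolve("weather.csv").toString
Seq(("sunny", 85), ("overcast", 83), ("rainy", 70))
  .toDF("outlook", "temperature")
  .write.partitionBy("outlook").option("header", "true").csv(out)

// Reading the root directory: the partition column is appended
// after the columns stored in the files.
val allColumns = spark.read.option("header", "true").csv(out).columns

// Reading one partition only: basePath tells Spark where the
// partitioned layout starts, so outlook is still reconstructed.
val overcast = spark.read
  .option("header", "true")
  .option("basePath", out)
  .csv(s"$out/outlook=overcast")
val overcastValues = overcast.select("outlook").as[String].collect().distinct

spark.stop()
```

Without basePath, the second read would treat the subdirectory as the root of the dataset and the outlook column would not appear.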

If you really need the outlook column stored in every file, then partitioning is probably not what you want.
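If both the partitioned directory layout and the column inside each file are wanted, one possible workaround (not part of the original answer) is to partition by a duplicate of the column. In this sketch the name outlook_part, the sample rows, and the temporary path are all hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session, only for illustration.
val spark = SparkSession.builder()
  .appName("keep-partition-column-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq(("sunny", 85), ("overcast", 83), ("rainy", 70))
  .toDF("outlook", "temperature")

val out = Files.createTempDirectory("weather").resolve("weather.csv").toString

// Partition by a copy of the column; the copy is moved into the
// directory names while the original outlook column stays in the files.
df.withColumn("outlook_part", col("outlook"))
  .write
  .partitionBy("outlook_part")
  .option("header", "true")
  .csv(out)

// Columns physically stored in the files of one partition.
val fileColumns = spark.read
  .option("header", "true")
  .csv(s"$out/outlook_part=overcast")
  .columns

spark.stop()
```

The trade-off is that the value is stored twice, once in the path and once in every row, so dropping partitionBy entirely may be simpler when the directory layout is not needed.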
