Question

我有一个USQL脚本和CSV提取器来加载我的文件。但是有些月份文件可能包含4列，有些月份可能包含5列。

如果我使用4或5个字段的列列表设置我的提取器，则会收到有关文件预期宽度的错误。去检查分隔符等等。毫不奇怪。

对于这个问题的解决方法是什么，请给出USQL仍然是新手并且缺少一些基本的错误处理？

我尝试使用提取器中的 silent 子句来忽略更宽的列，这对4列很方便。然后获取具有IF条件的行集的行计数，然后该条件具有5列的提取器。然而，这导致行集变量的世界不被用作IF表达式中的标量变量。

我还尝试了C＃样式计数和sizeof（@ AttemptExtractWith4Cols）。都没有工作。

代码段，让您了解我采取的方法：

DECLARE @SomeFilePath string = @"/MonthlyFile.csv";

@AttemptExtractWith4Cols =
    EXTRACT Col1 string,
            Col2 string,
            Col3 string,
            Col4 string
    FROM @SomeFilePath
    USING Extractors.Csv(silent : true); //can't be good.

//can't assign rowset to scalar variable!
DECLARE @RowSetCount int = (SELECT COUNT(*) FROM @AttemptExtractWith4Cols);

//tells me @AttemptExtractWith4Cols doesn't exist in the current context!
DECLARE @RowSetCount int = @AttemptExtractWith4Cols.Count();

IF (@RowSetCount == 0) THEN
    @AttemptExtractWith5Cols =
        EXTRACT Col1 string,
                Col2 string,
                Col3 string,
                Col4 string,
                Col5 string
        FROM @SomeFilePath
        USING Extractors.Csv(); //not silent
END;


//etc

当然，如果在USQL中存在 TRY CATCH 块这样的事情，那将会容易得多。

这甚至是一种合理的方法吗？

非常感谢任何输入。

感谢您的时间。

Answer 1

U-SQL现在支持OUTER UNION，因此您可以像这样处理它：

// Scenario 1; file has 4 columns
DECLARE @file1 string = @"/input/file1.csv";

// Scenario 2; file has 5 columns
//DECLARE @file1 string = @"/input/file2.csv";


@file =
    EXTRACT col1 string,
            col2 string,
            col3 string,
            col4 string
    FROM @file1
    USING Extractors.Csv(silent : true)

    OUTER UNION ALL BY NAME ON (col1, col2, col3, col4)

    EXTRACT col1 string,
            col2 string,
            col3 string,
            col4 string,
            col5 string
    FROM @file1
    USING Extractors.Csv(silent : true);


@output =
    SELECT *
    FROM @file;


OUTPUT @output
    TO "/output/output.csv"
USING Outputters.Csv();

在我的示例中，file1有4列，file2有5列。该脚本在任一情况下都可以成功运行。

我的结果：

希望这是有道理的。

Answer 2

OUTER UNION是一个很好的解决方案。或者，如果您希望文件中的行不同，也可以编写自己的通用提取器。有关示例，请参阅this blog post。

Answer 3

这是另一个我觉得有用的解决方案。您可以将文件作为单个文本列读取（使用“\ t”作为分隔符，因为没有），然后使用C＃字符串函数动态拆分。我在类似的问题上测试了这个。这种方法的优点是您可以对任意数量的列使用相同的方法。

SELECT
      (String)(ColList[0])   AS ColA
    , (String)(ColList[1])   AS ColB
    , (String)(ColList[2])   AS ColC
    , (String)(ColList[3])   AS ColD
    , (int?)(NumColumns >= 5 ? (String)(ColList[4]) : (String)null)
                             AS ColE
FROM (
    SELECT ColList
         , ColList.Count AS NumColumns
    FROM (
        SELECT SqlArray.Create(RowText.Split(','))   AS ColList
        FROM (
            EXTRACT RowText string
            FROM @SomeFilePath
            USING Extractors.Text(delimiter: '\t', quoting: false)
        ) AS [T1]
    ) AS [T2]
) AS [T3]

警告：此解决方案不知道文字引用。字段值中的任何逗号都会破坏这种逻辑。

在USQL中处理具有不同列的文件

3 个答案: