Question

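I have a source data set that consists of text files in which the columns are separated by one or more spaces, depending on the width of the column value. The data is right-aligned, i.e. the spaces are added before the actual data.

Can I use one of the built-in extractors, or do I have to implement a custom extractor?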

Answer 1

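@wBob's solution works if your row fits into a string (128kB). Otherwise, write a custom extractor that does the clean-up during extraction. Depending on what you know about the format, you can use input.Split() to split the input into rows and then split each row according to your whitespace rules, as shown below. A full example of the extractor pattern is here, or you could write one similar to the one described in this blog post.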

    // _row_delim, _encoding and _col_delim are assumed to be fields of the enclosing extractor class.
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow outputrow)
    {
        foreach (Stream current in input.Split(this._row_delim))
        {
            using (StreamReader streamReader = new StreamReader(current, this._encoding))
            {
                // Split on the column delimiter, then drop the empty pieces left by runs of spaces.
                // Where() returns an IEnumerable<string>, so ToArray() is needed (requires System.Linq).
                string[] array = streamReader.ReadToEnd()
                    .Split(new string[] { this._col_delim }, StringSplitOptions.None)
                    .Where(x => !String.IsNullOrWhiteSpace(x))
                    .ToArray();
                for (int i = 0; i < array.Length; i++)
                {
                    // Now write your code to convert array[i] into the extract schema
                }
            }
            yield return outputrow.AsReadOnly();
        }
    }
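
The placeholder comment in the loop is where each token is converted to the target column type. Here is a minimal sketch of that step, written without the U-SQL SDK so it can be run on its own. It assumes a hypothetical three-column schema (id, firstName, lastName). `RowParser`, `Person` and `ParseLine` are illustrative names, not part of any U-SQL API. Inside a real extractor the converted values would be written to the output row rather than returned.

```csharp
using System;
using System.Globalization;

public static class RowParser
{
    // Hypothetical target schema: id (int), firstName (string), lastName (string).
    public sealed class Person
    {
        public int Id;
        public string FirstName;
        public string LastName;
    }

    // Splits one right-aligned line on runs of whitespace and converts each token.
    public static Person ParseLine(string line)
    {
        // A null separator array means "split on any whitespace character";
        // RemoveEmptyEntries collapses runs of spaces and drops the leading padding.
        string[] tokens = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        if (tokens.Length != 3)
        {
            throw new FormatException(
                "Expected 3 columns but found " + tokens.Length + ": '" + line + "'");
        }
        return new Person
        {
            Id = int.Parse(tokens[0], CultureInfo.InvariantCulture),
            FirstName = tokens[1],
            LastName = tokens[2]
        };
    }
}
```

Failing loudly on an unexpected column count is one possible design choice. It surfaces malformed lines early, but it also means a value that itself contains a space would be rejected instead of silently shifting the columns.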

Answer 2

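You could create a custom extractor or, more simply, import each line as a single column, then split and clean it using the C# methods available to you within U-SQL, such as Split and IsNullOrWhiteSpace, something like this: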

My right-aligned sample data

// Import the row as one column to be split later; NB use a delimiter that will NOT be in the import file
@input =
    EXTRACT rawString string
    FROM "/input/input.txt"
    USING Extractors.Text(delimiter : '|');


// Add a row number to the line and remove white space elements
@working =
    SELECT ROW_NUMBER() OVER() AS rn, new SqlArray<string>(rawString.Split(' ').Where(x => !String.IsNullOrWhiteSpace(x))) AS columns
    FROM @input;


// Prepare the output, referencing the column's position in the array
@output =
    SELECT rn,
           columns[0] AS id,
           columns[1] AS firstName,
           columns[2] AS lastName
    FROM @working;


OUTPUT @output
TO "/output/output.txt"
USING Outputters.Tsv(quoting : false);
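
The script indexes `columns[0]` to `columns[2]` directly, which presumes every line yields at least three tokens. Below is a small, hedged C# sketch of the same split expression together with a bounds-checked lookup. `ColumnHelper`, `SplitColumns` and `ColumnOrNull` are hypothetical names introduced for illustration. In U-SQL, a helper like this would typically live in code-behind or a registered assembly.

```csharp
using System;
using System.Linq;

public static class ColumnHelper
{
    // Mirrors the expression used in the script: split on single spaces,
    // then discard the empty strings produced by consecutive spaces.
    public static string[] SplitColumns(string rawString)
    {
        return rawString.Split(' ')
                        .Where(x => !String.IsNullOrWhiteSpace(x))
                        .ToArray();
    }

    // Returns the column at the given position, or null when the line has fewer columns.
    public static string ColumnOrNull(string rawString, int index)
    {
        string[] columns = SplitColumns(rawString);
        return index >= 0 && index < columns.Length ? columns[index] : null;
    }
}
```

Note that `Split(' ')` only splits on the space character, so a tab stays inside its token unless it forms a token on its own. If the files might contain tabs, they would need to be added to the separators.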

My results:

HTH

How to process text files that use multiple spaces as the delimiter
