U-SQL: filtering out null/empty strings (Microsoft Academic Graph)

Time: 2018-03-09 18:27:39

Tags: azure-data-lake u-sql

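I am new to U-SQL in Azure Data Lake Analytics. I want to do what I think is a very simple operation, but I ran into trouble. Basically, I want to create a query that ignores empty strings. Using the check in the SELECT works, but using it in the WHERE clause does not.

Below are the statement I am running and the cryptic error I get.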

JOB

@xsel_res_1 = 
EXTRACT 
x_paper_id  long,
x_Rank  uint,
x_doi   string,
x_doc_type  string,
x_paper_title   string,
x_original_title    string,
x_book_title    string,
x_paper_year    int,
x_paper_date    DateTime?,
x_publisher string,
x_journal_id    long?,
x_conference_series_id  long?,
x_conference_instance_id    long?,
x_volume    string,
x_issue string,
x_first_page    string,
x_last_page string,
x_reference_count   long,
x_citation_count    long?,
x_estimated_citation    int?
FROM @"adl://xmag.azuredatalakestore.net/graph/2018-02-02/Papers.txt"
USING Extractors.Tsv()
; 

@xsel_res_2 = 
SELECT 
x_paper_id        AS x_paper_id,
x_doi.ToLower()   AS x_doi,
x_doi.Length     AS x_doi_length
FROM @xsel_res_1
WHERE NOT string.IsNullOrEmpty(x_doi)
;

@xsel_res_3 = 
SELECT 
* 
FROM @xsel_res_2
SAMPLE ANY (5)
;

OUTPUT @xsel_res_3
TO @"/graph/2018-02-02/x_output/x_papers_x6.tsv"
USING Outputters.Tsv();

Error

Vertex failed
Vertex failure triggered quick job abort. Vertex failed: SV1_Extract[0][1]             with error: Vertex user code error.
VertexFailedFast: Vertex failed with a fail-fast error

E_RUNTIME_USER_EXTRACT_ROW_ERROR: Error occurred while extracting row    after processing 10 record(s) in the vertex' input split. Column index: 5, column name: 'x_original_title'.

E_RUNTIME_USER_EXTRACT_EXTRACT_INVALID_CHARACTER_AFTER_QUOTED_FIELD:     Invalid character following the ending quote character in a quoted field.

Row selected
Component
RUNTIME
Message
Invalid character following the ending quote character in a quoted field.
Resolution

Column should be fully surrounded with double-quotes and double-quotes within the field escaped as two double-quotes.

Description
Invalid character is detected following the ending quote character in a quoted field. A column delimiter, row delimiter or EOF is expected. This error can occur if double-quotes within the field are not correctly escaped as two double-quotes.
Details

Row Delimiter: 0x0
Column Delimiter: 0x9
HEX: 61 76 6E 69 20 74 65 72 6D 69 6E 20 75 20 70 6F 76 61 6C 6A 73 6B 6F 6A 20 6C 69 73 74 69 6E 69 20 69 20 6E 61 74 70 69 73 75 20 67 20 31 31 38 35 09 22 50 6F 20 6B 6F 6E 63 75 22 ### 20 28 73 74 61 72 69 20 68 72

Update: by the way, the same operations work on other datasets, so as far as I can tell the problem is not the syntax.

// Define the schema of the file; all columns must be mapped
@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int,
            Urls            string,
            ClickedUrls     string
    FROM @"/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@searchlog_1 =
    SELECT *
    FROM @searchlog
    WHERE NOT string.IsNullOrEmpty(ClickedUrls);

OUTPUT @searchlog_1
TO @"/Samples/Output/SearchLog_output_x1.tsv"
USING Outputters.Tsv();

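2 answers:

Answer 0 (score: 3)

This is an unfortunate way to display the error in this case.

Assuming the text is UTF-8, you can use a site like www.hexutf8.com to convert the hex dump to: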

avni termin u povaljskoj listini i natpisu g 1185 "Po koncu" (stari hr

It looks like the input row contains at least one " character that is not properly escaped. It should look like this:

avni termin u povaljskoj listini i natpisu g 1185 ""Po koncu"" (stari hr

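Answer 1 (score: 3)

@Saveenr's answer assumes that all the values in your file are quoted. Alternatively, if they are not quoted (and no value contains the column delimiter), setting Extractors.Tsv(quoting:false) may also help.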
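
For illustration, here is roughly how the quoting:false suggestion might be applied to the job from the question. Treat it as an untested sketch. The rowset names @papers and @papers_with_doi and the output file name are made up for this example, and it assumes that no field in Papers.txt contains a tab or a line break.

```usql
// Sketch only: the question's EXTRACT with quote handling turned off,
// so a literal " inside a field is read as ordinary text.
// Assumption: no value in Papers.txt contains a tab or a line break.
@papers =
    EXTRACT x_paper_id                  long,
            x_Rank                      uint,
            x_doi                       string,
            x_doc_type                  string,
            x_paper_title               string,
            x_original_title            string,
            x_book_title                string,
            x_paper_year                int,
            x_paper_date                DateTime?,
            x_publisher                 string,
            x_journal_id                long?,
            x_conference_series_id      long?,
            x_conference_instance_id    long?,
            x_volume                    string,
            x_issue                     string,
            x_first_page                string,
            x_last_page                 string,
            x_reference_count           long,
            x_citation_count            long?,
            x_estimated_citation        int?
    FROM @"adl://xmag.azuredatalakestore.net/graph/2018-02-02/Papers.txt"
    USING Extractors.Tsv(quoting:false);

// Same filter as in the question: drop rows whose DOI is null or empty.
@papers_with_doi =
    SELECT x_paper_id,
           x_doi.ToLower() AS x_doi,
           x_doi.Length AS x_doi_length
    FROM @papers
    WHERE NOT string.IsNullOrEmpty(x_doi);

// Hypothetical output file name, chosen for this example only.
OUTPUT @papers_with_doi
TO @"/graph/2018-02-02/x_output/x_papers_noquote.tsv"
USING Outputters.Tsv();
```

With quoting turned off, any quote characters in the source stay in the extracted values as plain text. This also suggests why the WHERE clause itself was not at fault: the failure is reported in the extract vertex (SV1_Extract) on column x_original_title, which is before the filter is applied.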