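How can I extract the value of an XML element's attribute in U-SQL, using the XML extractor in my Azure Data Lake Analytics job?
Update: more details about the problem
My XML file looks like this: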
<?xml version="1.0" encoding="utf-8"?>
<testelement testatr="xyz">
</testelement>
This is my U-SQL script:
DECLARE @testfile string = "sample2.xml";
@logText =
EXTRACT log string
FROM @testfile
USING Extractors.Tsv();
@gethID = SELECT Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(@logText.log, "testelement/attribute::testatr").ElementAt(0) AS siteName FROM @logText;
OUTPUT @gethID TO "result.out" USING Outputters.Tsv();
While debugging I observed that the exception occurs when the Load method of the XPath class tries to load:
"<?xml version=1.0 encoding=utf-8?>"
This is the exception:
Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException was unhandled
Message: An unhandled exception of type 'Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException' occurred in Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.dll
Additional information: {"diagnosticCode":195887111,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXPRESSIONEVALUATION","message":"Error while evaluating expression Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(log, \"testelement/attribute::testatr\").ElementAt(0)","description":"Inner exception from user expression: '1.0' is an unexpected token. The expected token is '\"' or '''. Line 1, position 15.\nCurrent row dump: \tlog:\t\"<?xml version=1.0 encoding=utf-8?>\"
\n","resolution":"","helpLink":"","details":"==== Caught exception System.Xml.XmlException\n\n at System.Xml.XmlTextReaderImpl.Throw(Exception e)
\n at System.Xml.XmlTextReaderImpl.ParseXmlDeclaration(Boolean isTextDecl)
\n at System.Xml.XmlTextReaderImpl.Read()
\n at System.Xml.XmlLoader.Load(XmlDocument doc, XmlReader reader, Boolean preserveWhitespace)
\n at System.Xml.XmlDocument.Load(XmlReader reader)
\n at System.Xml.XmlDocument.LoadXml(String xml)
\n at Microsoft.Analytics.Samples.Formats.Xml.XPath.Load(String xml)
\n at Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(String xml, String xpath)
\n at ___Scope_Generated_Classes___.SqlFilterTransformer_2.Process(IRow row, IUpdatableRow output) in c:\\workarea\\bswbigdata\\USQLAppForLogs\\USQLAppForLogs\\bin\\Debug\\A06D46624BBA798\\ReadBlobs.usql.Debug_A54F30D359F939C7\\__ScopeCodeGen__.dll.cs:line 53","internalDiagnostics":""}
Update 2:
After using quoting:false I get a different exception:
Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException was unhandled
Message: An unhandled exception of type 'Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException' occurred in Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.dll
Additional information: {"diagnosticCode":195887111,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXPRESSIONEVALUATION","message":"Error while evaluating expression Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(log, \"testelement/attribute::testatr\").ElementAt(0)","description":"Inner exception from user expression: Root element is missing.\nCurrent row dump: \tlog:\t\"<?xml version=\"1.0\" encoding=\"utf-8\"?>\"
\n","resolution":"","helpLink":"","details":"==== Caught exception System.Xml.XmlException\n\n at System.Xml.XmlTextReaderImpl.Throw(Exception e)
\n at System.Xml.XmlTextReaderImpl.ParseDocumentContent()
\n at System.Xml.XmlLoader.LoadDocSequence(XmlDocument parentDoc)
\n at System.Xml.XmlDocument.Load(XmlReader reader)
\n at System.Xml.XmlDocument.LoadXml(String xml)
\n at Microsoft.Analytics.Samples.Formats.Xml.XPath.Load(String xml)
\n at Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(String xml, String xpath)
\n at ___Scope_Generated_Classes___.SqlFilterTransformer_2.Process(IRow row, IUpdatableRow output) in c:\\workarea\\bswbigdata\\USQLAppForLogs\\USQLAppForLogs\\bin\\Debug\\A06D46624BBA798\\ReadBlobs.usql.Debug_A54F30D359F939C7\\__ScopeCodeGen__.dll.cs:line 53","internalDiagnostics":""}
Answer 0 (score: 3)
Use an XPath expression to identify the value. To query an attribute, use @attr_name (or the full axis expression attribute::attr_name).
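As a rough local illustration of what the attribute query should return for the sample document, here is a minimal Python sketch. Python is used only because it is easy to run outside an Azure Data Lake Analytics job; it is not part of the U-SQL script. The standard library's ElementTree implements only a subset of XPath and does not return attribute nodes, so the sketch locates the element by tag and reads the attribute with get(). The helper name attribute_values is hypothetical.

```python
import xml.etree.ElementTree as ET

# The sample document from the question, written on a single line.
SAMPLE = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<testelement testatr="xyz"></testelement>'
)


def attribute_values(xml_text, tag, attr):
    """Return the value of `attr` for every `tag` element that carries it."""
    root = ET.fromstring(xml_text)
    # iter() walks the root element itself as well as its descendants.
    return [el.get(attr) for el in root.iter(tag) if el.get(attr) is not None]


values = attribute_values(SAMPLE, "testelement", "testatr")
print(values)
```

In the U-SQL job, the equivalent selection is the XPath expression testelement/@testatr (or testelement/attribute::testatr, as in the question).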
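Update based on the question's update:
It looks like the parser is somehow tripping over the " in the XML declaration. I see that you use the built-in Tsv() extractor, which by default currently treats a " inside a field as a quoting character and then removes it. This is a bug that we plan to fix.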
Until then, I suggest you use Extractors.Tsv(quoting:false).
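To see why losing the quotes is fatal, the following Python sketch imitates the effect locally. It assumes that the extractor simply drops every " character, which matches the row dump in the first exception; the real extractor logic is not reproduced here. The helper name parses is hypothetical.

```python
import xml.etree.ElementTree as ET

DECLARATION = '<?xml version="1.0" encoding="utf-8"?>'
BODY = '<testelement testatr="xyz"></testelement>'


def parses(xml_text):
    """Return True if the text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


intact = DECLARATION + BODY
# Assumption: quoting removes every double quote from the field.
stripped = intact.replace('"', "")

print(parses(intact))    # the untouched document is well-formed
print(parses(stripped))  # the declaration now reads version=1.0 and is rejected
```

With quoting:false the quotes survive, so the XML declaration itself is no longer the part that fails.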
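Please make sure that your XML documents do not contain any CR/LF if you use any of the built-in text extractors (Extractors.*), and that they do not contain any tab characters if you use .Tsv.
If your XML will contain CR and/or LF, you will have to use a custom extractor that uses a different row delimiter. If you need to do that, please leave me a comment, because I am currently tracking such requests to understand what we can improve in the built-in extractors.
If your file contains only a single XML document (rather than one XML document per line), I recommend using the XML extractor, which is also part of the XML samples on GitHub.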
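Answer 1 (score: 0)
On the new error message: it looks like the XML document contains a CR or LF (or both) after the XML declaration, so the Tsv() extractor splits the XML document. See the comment in the previous answer: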
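Please make sure that your XML documents do not contain any CR/LF if you use any of the built-in text extractors (Extractors.*), and that they do not contain any tab characters if you use .Tsv.
If your XML will contain CR and/or LF, you will have to use a custom extractor that uses a different row delimiter. If you need to do that, please leave me a comment, because I am currently tracking such requests to understand what we can improve in the built-in extractors.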
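The splitting can be reproduced locally with a short Python sketch. It assumes the file uses CR/LF line endings and that a line-oriented extractor hands each line to the XML parser as a separate row; neither detail is confirmed by the question beyond the row dump in the second exception. The helper name parse_error is hypothetical.

```python
import xml.etree.ElementTree as ET

# The sample file, assuming CR/LF after the declaration and after each tag.
DOCUMENT = (
    '<?xml version="1.0" encoding="utf-8"?>\r\n'
    '<testelement testatr="xyz">\r\n'
    '</testelement>\r\n'
)


def parse_error(xml_text):
    """Return the parser's error message, or None if the text parses."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as exc:
        return str(exc)


# A line-oriented extractor sees one row per line, so each fragment is
# parsed on its own and none of them is a complete document.
rows = DOCUMENT.splitlines()
row_errors = [parse_error(row) for row in rows]

# With the line breaks removed, the same content is one well-formed row.
single_line = "".join(rows)
root = ET.fromstring(single_line)

print(row_errors)
print(root.get("testatr"))
```

The first row holds only the XML declaration, which is why the .NET parser in the job reports that the root element is missing. Joining the lines is safe for this sample because the line breaks sit between tags; in documents where line breaks occur inside text content, removing them would change the data.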