
ADLA XMLExtractor cannot read attributes?


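I have been using the sample XMLExtractor (cloned from https://github.com/Azure/usql/tree/master/Examples/DataFormats) to extract attributes from my XML elements.

The extractor does not work if the root element has any attributes defined.

For example, I need to get the "sTime" attribute of the "rec" element from the following XML file: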

<lics xmlns="***" lVer="*" pID="*" aKey="*" cTime="*" gDel="*" country="*" fStr="*">
   <rec Ver="*" hID="*.*.*" cSID="Y5/*=" uID="*\Rad.*" uSID="*/*=" cAttrs="*" sTime="*" eTime="*" projID="*" docID="*" imsID="*">
   </rec>
</lics>

I am using the following U-SQL script:

@e = EXTRACT a string, b string
 FROM @"D:\file.xml"
 USING new Microsoft.Analytics.Samples.Formats.Xml.XmlDomExtractor(rowPath:"rec",
                         columnPaths:new SQL.MAP<string, string> { {"@sTime", "a"} });

OUTPUT @e TO "D:/output.csv" USING Outputters.Csv(quoting:false);

This writes an empty file. However, if I remove the attributes from the "lics" tag, it works:

<lics>
   <rec Ver="*" hID="*.*.*" cSID="Y5/*=" uID="*\Rad.*" uSID="*/*=" cAttrs="*" sTime="*" eTime="*" projID="*" docID="*" imsID="*">
   </rec>
</lics>

Is this a problem with the extractor, or is there something that needs to be defined in one of the extractor's parameters?

2 Answers

  • 1

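    The problem is that Microsoft.Analytics.Samples.Formats.Xml.XmlDomExtractor completely ignores XML namespaces. The other attributes on "lics" are not what breaks the extraction; the xmlns="***" declaration is, because it puts "lics" and "rec" into a default namespace that the unprefixed row path "rec" never matches.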
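
    To see the effect in isolation, here is a minimal C# sketch outside of U-SQL. The class name DefaultNamespaceDemo, the helper CountMatches, and the namespace URI "urn:example:lics" are all made up for illustration (the real URI is masked in the question); only XmlDocument and XmlNamespaceManager come from System.Xml.

    ```csharp
    using System;
    using System.Xml;

    public static class DefaultNamespaceDemo
    {
        // Hypothetical stand-in for the question's file; the namespace URI is an assumption.
        public const string Xml =
            "<lics xmlns=\"urn:example:lics\" lVer=\"1\">" +
            "<rec sTime=\"2016-01-01T00:00:00\" />" +
            "</lics>";

        // Counts the nodes an XPath selects from the root element,
        // optionally binding a single prefix to a namespace URI first.
        public static int CountMatches(string xpath, string prefix = null, string uri = null)
        {
            var doc = new XmlDocument();
            doc.LoadXml(Xml);

            var nsmgr = new XmlNamespaceManager(doc.NameTable);
            if (prefix != null)
            {
                nsmgr.AddNamespace(prefix, uri);
            }

            return doc.DocumentElement.SelectNodes(xpath, nsmgr).Count;
        }
    }
    ```

    With this input, CountMatches("rec") returns 0, while CountMatches("lics:rec", "lics", "urn:example:lics") returns 1. The attribute step "@sTime" needs no prefix, because unprefixed attributes are not in any namespace.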

    A better implementation would look something like this (untested, though):

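    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;
    using System.Xml;
    using Microsoft.Analytics.Interfaces;
    using Microsoft.Analytics.Types.Sql;

    [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
    public class XmlDomExtractorNs : IExtractor
    {
        private string rowPath;
        private SqlMap<string, string> columnPaths;
        private string namespaces;
        private Regex xmlns = new Regex("(?:xmlns:)?(\\S+)\\s*=\\s*([\"']?)(\\S+)\\2");
    
        // The constructor name has to match the class name.
        public XmlDomExtractorNs(string rowPath, SqlMap<string, string> columnPaths, string namespaces)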
        {
            this.rowPath = rowPath;
            this.columnPaths = columnPaths;
            this.namespaces = namespaces;
        }
    
        public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
        {
            IColumn column = output.Schema.FirstOrDefault(col => col.Type != typeof(string));
            if (column != null)
            {
                throw new ArgumentException(string.Format("Column '{0}' must be of type 'string', not '{1}'", column.Name, column.Type.Name));
            }
    
            XmlDocument xmlDocument = new XmlDocument();
            xmlDocument.Load(input.BaseStream);
    
            XmlNamespaceManager nsmgr = new XmlNamespaceManager(xmlDocument.NameTable);
            if (this.namespaces != null)
            {
                foreach (Match nsdef in xmlns.Matches(this.namespaces))
                {
                    string prefix = nsdef.Groups[1].Value;
                    string uri = nsdef.Groups[3].Value;
                    nsmgr.AddNamespace(prefix, uri);
                }
            }
    
            foreach (XmlNode xmlNode in xmlDocument.DocumentElement.SelectNodes(this.rowPath, nsmgr))
            {
                foreach(IColumn col in output.Schema)
                {
                    var explicitColumnMapping = this.columnPaths.FirstOrDefault(columnPath => columnPath.Value == col.Name);
                    XmlNode xml = xmlNode.SelectSingleNode(explicitColumnMapping.Key ?? col.Name, nsmgr);
                    output.Set(explicitColumnMapping.Value ?? col.Name, xml == null ? null : xml.InnerXml);
                }
                yield return output.AsReadOnly();
            }
        }
    }
    

    And use it like this:

    @e = EXTRACT a string, b string
      FROM @"D:\file.xml"
      USING new Your.Namespace.XmlDomExtractorNs(
        rowPath:"lics:rec",
        columnPaths:new SQL.MAP<string, string> { {"@sTime", "a"} },
        namespaces:"lics=http://the/namespace/of/the/doc"
      );
    
    OUTPUT @e TO "D:/output.csv" USING Outputters.Csv(quoting:false);
    

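    The namespaces parameter is parsed into namespace-prefix and namespace-URI pairs, which are then used to drive the XPath queries. For convenience, it supports any of the following value formats: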

    • 'xmlns:foo="http://uri/1" xmlns:bar="http://uri/2"'

    • "xmlns:foo='http://uri/1' xmlns:bar='http://uri/2'"

    • "xmlns:foo=http://uri/1 xmlns:bar=http://uri/2"

    • "foo=http://uri/1 bar=http://uri/2"

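    So you can either copy the declarations straight from the XML source or write them by hand without much fuss.

    Because your XML document uses a default namespace, and XPath requires a prefix in the expression for every namespace you want to address, you have to pick a namespace prefix for the namespace URI. I chose lics above.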


    FWIW, the regular expression that parses the namespaces parameter breaks down as follows:

    (?:            # non-capturing group
      xmlns:       #   literal "xmlns:"
    )?             # end non-capturing group, make optional
    (\S+)          # GROUP 1 (prefix): any number of non-whitespace characters
    \s*=\s*        # a literal "=" optionally surrounded by whitespace
    (["']?)        # GROUP 2 (delimiter): either single or double quote, optional
    (\S+)          # GROUP 3 (uri): any number of non-whitespace characters
    \2             # whatever was in group 2 to end the namespace URI
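
    As a quick check of that breakdown, the parsing step can be pulled out into a small standalone helper. This is only a sketch: the class NamespaceDeclParser and its Parse method are hypothetical names, and the pattern is the same one used in the extractor above.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    public static class NamespaceDeclParser
    {
        // Same pattern as in XmlDomExtractorNs: optional "xmlns:", prefix, "=", optionally quoted URI.
        private static readonly Regex Xmlns =
            new Regex("(?:xmlns:)?(\\S+)\\s*=\\s*([\"']?)(\\S+)\\2");

        // Returns a prefix -> URI map for every declaration found in the input string.
        public static Dictionary<string, string> Parse(string namespaces)
        {
            var result = new Dictionary<string, string>();
            foreach (Match m in Xmlns.Matches(namespaces))
            {
                result[m.Groups[1].Value] = m.Groups[3].Value;
            }
            return result;
        }
    }
    ```

    All four formats listed above yield foo mapped to http://uri/1 and bar mapped to http://uri/2. One caveat: a default declaration copied verbatim, such as xmlns="http://uri/1", yields the prefix "xmlns", which XmlNamespaceManager.AddNamespace rejects as reserved, so a default namespace still needs a prefix of your own choosing.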
    
  • 3

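    I would probably use another SQL.MAP to define the prefix-to-namespace mappings (the prefixes do not need to be the same as the ones used in the document).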
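
    A rough sketch of that idea in plain C#, using an IDictionary in place of SQL.MAP so that it runs outside U-SQL. The class NamespaceMapDemo and the method ReadValue are hypothetical names, and this is not the sample extractor's actual code. It also shows that the prefix in the query only has to be bound to the same URI as in the document, not spelled the same way.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Xml;

    public static class NamespaceMapDemo
    {
        // Evaluates an XPath from the root element, with prefixes taken from a prefix -> URI map.
        public static string ReadValue(string xml, string xpath, IDictionary<string, string> namespaceDecls)
        {
            var doc = new XmlDocument();
            doc.LoadXml(xml);

            var nsmgr = new XmlNamespaceManager(doc.NameTable);
            foreach (KeyValuePair<string, string> decl in namespaceDecls)
            {
                nsmgr.AddNamespace(decl.Key, decl.Value);
            }

            XmlNode node = doc.DocumentElement.SelectSingleNode(xpath, nsmgr);
            return node == null ? null : node.InnerXml;
        }
    }
    ```

    For example, given the document <d:lics xmlns:d="urn:example:lics"><d:rec sTime="t1"/></d:lics> and the map { "x" -> "urn:example:lics" }, the path "x:rec/@sTime" returns "t1" even though the document itself uses the prefix d.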

    I have filed a feature request here: https://feedback.azure.com/forums/327234-data-lake/suggestions/11675604-add-xml-namespace-support-to-xml-extractor . Please add your vote.

    UPDATE: The XmlDomExtractor now supports XML Namespaces. Use the following USING clause:

    USING new Microsoft.Analytics.Samples.Formats.Xml.XmlDomExtractor(rowPath:"ns:rec",
                         columnPaths:new SQL.MAP<string, string> { {"@sTime", "a"} },
                         namespaceDecls: new SqlMap<string,string>{{"ns","***"}});
    
