
How do you remove invalid hexadecimal characters from an XML-based data source prior to constructing an XmlReader or XPathDocument that uses the data?


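Is there any easy/general way to clean an XML-based data source prior to using it in an XmlReader, so that I can gracefully consume XML data that does not conform to the hexadecimal character restrictions placed on XML?

Note:

  • The solution needs to handle XML data sources that use character encodings other than UTF-8, e.g. by specifying the character encoding in the XML document declaration. Not mangling the character encoding of the source while stripping invalid hexadecimal characters has been a major sticking point.

  • The removal of invalid hexadecimal characters should only remove hexadecimal-encoded values, as you can often find href values in the data that happen to contain a string that would be a textual match for a hexadecimal character.

Background:

I need to consume an XML-based data source that conforms to a specific format (think Atom or RSS feeds), but I want to be able to consume published data sources that contain characters which are invalid per the XML specification.

In .NET, if you have a Stream that represents the XML data source and then attempt to parse it using an XmlReader and/or XPathDocument, an exception is raised because of the invalid hexadecimal characters in the XML data. My current attempt to resolve this issue is to parse the Stream as a string and use a regular expression to remove and/or replace the invalid hexadecimal characters, but I am looking for a more performant solution.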
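
To make the failure described above concrete, here is a minimal sketch (not part of the original post) that hands a string to XmlReader with default settings and reports whether it could be read to the end. The names `InvalidCharDemo` and `TryParse` are made up for illustration.

```csharp
using System.IO;
using System.Xml;

public static class InvalidCharDemo
{
    // Returns true if the text can be read to the end as XML,
    // false if XmlReader rejects it with an XmlException.
    public static bool TryParse(string xml)
    {
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
            {
                while (reader.Read()) { } // walk every node so the whole input is checked
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

With this helper, `TryParse("<a>ok</a>")` succeeds, while `TryParse("<a>\u0001</a>")` and `TryParse("<a>&#x1;</a>")` both fail: the default reader settings check raw characters as well as numeric character references against the XML character range.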

13 Answers

  • 1
    private static String removeNonUtf8CompliantCharacters( final String inString ) {
        if (null == inString ) return null;
        // Work on chars rather than bytes: Java bytes are signed, so a byte-level
        // range test would blank out every non-ASCII character.
        final StringBuilder sb = new StringBuilder( inString.length() );
        for ( int i = 0; i < inString.length(); i++ ) {
            final char ch = inString.charAt( i );
            // replace control characters (except tab, LF and CR) and everything
            // from U+00FD upward with a space
            if ( ( ch > 31 && ch < 253 ) || ch == '\t' || ch == '\n' || ch == '\r' ) {
                sb.append( ch );
            } else {
                sb.append( ' ' );
            }
        }
        return sb.toString();
    }
    
  • -1

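    You can use the following to pass non-UTF characters through:

    // UTFCHAR is the input string
    var sb = new StringBuilder();
    foreach (char ch in UTFCHAR)
    {
        if ((ch < 0x00FD && ch > 0x001F) || ch == '\t' || ch == '\n' || ch == '\r')
        {
            sb.Append(ch);
        }
        else
        {
            // emit everything else as a decimal numeric character reference
            sb.Append("&#").Append((int)ch).Append(';');
        }
    }
    string sFinalString = sb.ToString();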
    
  • -5

    Try PHP!

    $goodUTF8 = iconv("utf-8", "utf-8//IGNORE", $badUTF8);
    
  • 12

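    It may not be perfect (emphasis added because people keep missing this disclaimer), but what I have done in that situation is below. You can adjust it to work with a stream.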

    /// <summary>
    /// Removes control characters and other non-UTF-8 characters
    /// </summary>
    /// <param name="inString">The string to process</param>
    /// <returns>A string with no control characters or entities above 0x00FD</returns>
    public static string RemoveTroublesomeCharacters(string inString)
    {
        if (inString == null) return null;
    
        StringBuilder newString = new StringBuilder();
        char ch;
    
        for (int i = 0; i < inString.Length; i++)
        {
    
            ch = inString[i];
            // remove any characters outside the valid UTF-8 range as well as all control characters
            // except tabs and new lines
            //if ((ch < 0x00FD && ch > 0x001F) || ch == '\t' || ch == '\n' || ch == '\r')
            //if using .NET version prior to 4, use above logic
            if (XmlConvert.IsXmlChar(ch)) //this method is new in .NET 4
            {
                newString.Append(ch);
            }
        }
        return newString.ToString();
    
    }
    
  • 9

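    I like Eugene's whitelist concept. I needed to do something similar to the original poster, but I needed to support all Unicode characters, not just those up to 0x00FD. The XML spec is:

    Char = #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

    In .NET, the internal representation of Unicode characters is only 16 bits, so we can't "allow" 0x10000-0x10FFFF explicitly. The XML spec explicitly disallows the surrogate code points starting at 0xD800 from appearing. However, if we allowed these surrogate code points in our whitelist, UTF-8 encoding our string might still produce valid XML in the end, as long as proper UTF-8 was produced from the surrogate pairs of UTF-16 characters in the .NET string. I haven't explored this though, so I went with the safer bet and didn't allow the surrogates in my whitelist.

    The comments in Eugene's solution are misleading, though. The problem is that the characters we are excluding are not valid in XML; they are perfectly valid Unicode code points. We are not removing "non-UTF-8 characters". We are removing UTF-8 characters that may not appear in well-formed XML documents.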

    public static string XmlCharacterWhitelist( string in_string ) {
        if( in_string == null ) return null;
    
        StringBuilder sbOutput = new StringBuilder();
        char ch;
    
        for( int i = 0; i < in_string.Length; i++ ) {
            ch = in_string[i];
            if( ( ch >= 0x0020 && ch <= 0xD7FF ) || 
                ( ch >= 0xE000 && ch <= 0xFFFD ) ||
                ch == 0x0009 ||
                ch == 0x000A || 
                ch == 0x000D ) {
                sbOutput.Append( ch );
            }
        }
        return sbOutput.ToString();
    }
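
    The open question about surrogates can be sketched as follows. This is an illustrative variant, not the answer author's code: it keeps a high surrogate only when it is immediately followed by a low surrogate (together they encode one code point in U+10000-U+10FFFF, which the Char production allows) and drops lone surrogates. The names `XmlWhitelist` and `Filter` are hypothetical.

    ```csharp
    using System.Text;

    public static class XmlWhitelist
    {
        // Keeps BMP characters allowed by the XML 1.0 Char production plus
        // well-formed surrogate pairs; drops everything else.
        public static string Filter(string input)
        {
            if (input == null) return null;

            var sb = new StringBuilder(input.Length);
            for (int i = 0; i < input.Length; i++)
            {
                char ch = input[i];
                if (char.IsHighSurrogate(ch) && i + 1 < input.Length && char.IsLowSurrogate(input[i + 1]))
                {
                    sb.Append(ch).Append(input[i + 1]);
                    i++; // the low surrogate has been consumed as well
                }
                else if ((ch >= 0x0020 && ch <= 0xD7FF) ||
                         (ch >= 0xE000 && ch <= 0xFFFD) ||
                         ch == 0x0009 || ch == 0x000A || ch == 0x000D)
                {
                    sb.Append(ch);
                }
                // anything else (control characters, lone surrogates, U+FFFE, U+FFFF) is skipped
            }
            return sb.ToString();
        }
    }
    ```

    For example, `XmlWhitelist.Filter("a\u0001b")` yields `"ab"`, and a string containing the pair `\uD83D\uDE00` comes back unchanged.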
    
  • 5

    As a way to remove invalid XML characters, I suggest you use the XmlConvert.IsXmlChar method. It was added in .NET Framework 4 and is also available in Silverlight. Here is a small sample:

    void Main() {
        string content = "\v\f\0";
        Console.WriteLine(IsValidXmlString(content)); // False
    
        content = RemoveInvalidXmlChars(content);
        Console.WriteLine(IsValidXmlString(content)); // True
    }
    
    static string RemoveInvalidXmlChars(string text) {
        char[] validXmlChars = text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
        return new string(validXmlChars);
    }
    
    static bool IsValidXmlString(string text) {
        try {
            XmlConvert.VerifyXmlChars(text);
            return true;
        } catch {
            return false;
        }
    }
    
  • 4

    A DRY implementation of this answer's solution (using a different constructor; feel free to use whichever one your application needs):

    public class InvalidXmlCharacterReplacingStreamReader : StreamReader
    {
        private readonly char _replacementCharacter;
    
        public InvalidXmlCharacterReplacingStreamReader(string fileName, char replacementCharacter) : base(fileName)
        {
            this._replacementCharacter = replacementCharacter;
        }
    
        public override int Peek()
        {
            int ch = base.Peek();
            if (ch != -1 && IsInvalidChar(ch))
            {
                return this._replacementCharacter;
            }
            return ch;
        }
    
        public override int Read()
        {
            int ch = base.Read();
            if (ch != -1 && IsInvalidChar(ch))
            {
                return this._replacementCharacter;
            }
            return ch;
        }
    
        public override int Read(char[] buffer, int index, int count)
        {
            int readCount = base.Read(buffer, index, count);
            for (int i = index; i < readCount + index; i++)
            {
                char ch = buffer[i];
                if (IsInvalidChar(ch))
                {
                    buffer[i] = this._replacementCharacter;
                }
            }
            return readCount;
        }
    
        private static bool IsInvalidChar(int ch)
        {
            return (ch < 0x0020 || ch > 0xD7FF) &&
                   (ch < 0xE000 || ch > 0xFFFD) &&
                    ch != 0x0009 &&
                    ch != 0x000A &&
                    ch != 0x000D;
        }
    }
    
  • 26

    Modernising dnewcombe's answer, you could take a slightly simpler approach:

    public static string RemoveInvalidXmlChars(string input)
    {
        var isValid = new Predicate<char>(value =>
            (value >= 0x0020 && value <= 0xD7FF) ||
            (value >= 0xE000 && value <= 0xFFFD) ||
            value == 0x0009 ||
            value == 0x000A ||
            value == 0x000D);
    
        return new string(Array.FindAll(input.ToCharArray(), isValid));
    }
    

    Or, with Linq:

    public static string RemoveInvalidXmlChars(string input)
    {
        return new string(input.Where(value =>
            (value >= 0x0020 && value <= 0xD7FF) ||
            (value >= 0xE000 && value <= 0xFFFD) ||
            value == 0x0009 ||
            value == 0x000A ||
            value == 0x000D).ToArray());
    }
    

    I'd be interested to know how the performance of these methods compares, and how they compare to a blacklist approach using Buffer.BlockCopy.

  • 1

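    Here is dnewcome's answer in a custom StreamReader. It simply wraps a real stream reader and replaces the invalid characters as they are read.

    I only implemented a few methods to save myself time. I used this in conjunction with XDocument.Load and a file stream, and only the Read(char[] buffer, int index, int count) method was called, so it worked as is. You may need to implement additional methods to make it work for your application. I used this approach because it seems more efficient than the other answers. I also implemented only one constructor; you can obviously implement any of the StreamReader constructors you need, since it is just a pass-through.

    I chose to replace the characters rather than remove them because it greatly simplifies the solution. This way the length of the text stays the same, so there is no need to keep track of a separate index.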

    public class InvalidXmlCharacterReplacingStreamReader : TextReader
    {
        private StreamReader implementingStreamReader;
        private char replacementCharacter;
    
        public InvalidXmlCharacterReplacingStreamReader(Stream stream, char replacementCharacter)
        {
            implementingStreamReader = new StreamReader(stream);
            this.replacementCharacter = replacementCharacter;
        }
    
        public override void Close()
        {
            implementingStreamReader.Close();
        }
    
        public override ObjRef CreateObjRef(Type requestedType)
        {
            return implementingStreamReader.CreateObjRef(requestedType);
        }
    
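        // Override the protected hook rather than hiding TextReader.Dispose()
        protected override void Dispose(bool disposing)
        {
            if (disposing) implementingStreamReader.Dispose();
            base.Dispose(disposing);
        }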
    
        public override bool Equals(object obj)
        {
            return implementingStreamReader.Equals(obj);
        }
    
        public override int GetHashCode()
        {
            return implementingStreamReader.GetHashCode();
        }
    
        public override object InitializeLifetimeService()
        {
            return implementingStreamReader.InitializeLifetimeService();
        }
    
        public override int Peek()
        {
            int ch = implementingStreamReader.Peek();
            if (ch != -1)
            {
                if (
                    (ch < 0x0020 || ch > 0xD7FF) &&
                    (ch < 0xE000 || ch > 0xFFFD) &&
                    ch != 0x0009 &&
                    ch != 0x000A &&
                    ch != 0x000D
                    )
                {
                    return replacementCharacter;
                }
            }
            return ch;
        }
    
        public override int Read()
        {
            int ch = implementingStreamReader.Read();
            if (ch != -1)
            {
                if (
                    (ch < 0x0020 || ch > 0xD7FF) &&
                    (ch < 0xE000 || ch > 0xFFFD) &&
                    ch != 0x0009 &&
                    ch != 0x000A &&
                    ch != 0x000D
                    )
                {
                    return replacementCharacter;
                }
            }
            return ch;
        }
    
        public override int Read(char[] buffer, int index, int count)
        {
            int readCount = implementingStreamReader.Read(buffer, index, count);
            for (int i = index; i < readCount+index; i++)
            {
                char ch = buffer[i];
                if (
                    (ch < 0x0020 || ch > 0xD7FF) &&
                    (ch < 0xE000 || ch > 0xFFFD) &&
                    ch != 0x0009 &&
                    ch != 0x000A &&
                    ch != 0x000D
                    )
                {
                    buffer[i] = replacementCharacter;
                }
            }
            return readCount;
        }
    
        public override Task<int> ReadAsync(char[] buffer, int index, int count)
        {
            throw new NotImplementedException();
        }
    
        public override int ReadBlock(char[] buffer, int index, int count)
        {
            throw new NotImplementedException();
        }
    
        public override Task<int> ReadBlockAsync(char[] buffer, int index, int count)
        {
            throw new NotImplementedException();
        }
    
        public override string ReadLine()
        {
            throw new NotImplementedException();
        }
    
        public override Task<string> ReadLineAsync()
        {
            throw new NotImplementedException();
        }
    
        public override string ReadToEnd()
        {
            throw new NotImplementedException();
        }
    
        public override Task<string> ReadToEndAsync()
        {
            throw new NotImplementedException();
        }
    
        public override string ToString()
        {
            return implementingStreamReader.ToString();
        }
    }
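
    The usage described above (handing the wrapper to XDocument.Load) can be sketched as follows. This is a compact, hypothetical variant that decorates any TextReader, so the caller decides how the underlying stream is decoded; `XmlCharReplacingReader`, `ReplacingReaderDemo` and `LoadRootValue` are illustrative names. Like the code above, it also replaces surrogate code units, so characters outside the BMP are lost.

    ```csharp
    using System.IO;
    using System.Xml.Linq;

    // Hypothetical compact variant of the wrapper above.
    public sealed class XmlCharReplacingReader : TextReader
    {
        private readonly TextReader _inner;
        private readonly char _replacement;

        public XmlCharReplacingReader(TextReader inner, char replacement)
        {
            _inner = inner;
            _replacement = replacement;
        }

        private static bool IsInvalid(int ch)
        {
            return (ch < 0x0020 || ch > 0xD7FF) &&
                   (ch < 0xE000 || ch > 0xFFFD) &&
                   ch != 0x0009 && ch != 0x000A && ch != 0x000D;
        }

        public override int Peek()
        {
            int ch = _inner.Peek();
            return ch != -1 && IsInvalid(ch) ? _replacement : ch;
        }

        public override int Read()
        {
            int ch = _inner.Read();
            return ch != -1 && IsInvalid(ch) ? _replacement : ch;
        }

        public override int Read(char[] buffer, int index, int count)
        {
            int readCount = _inner.Read(buffer, index, count);
            for (int i = index; i < index + readCount; i++)
            {
                if (IsInvalid(buffer[i])) buffer[i] = _replacement;
            }
            return readCount;
        }

        protected override void Dispose(bool disposing)
        {
            if (disposing) _inner.Dispose();
            base.Dispose(disposing);
        }
    }

    public static class ReplacingReaderDemo
    {
        // Parses the text through the wrapper (invalid characters become spaces)
        // and returns the text content of the root element.
        public static string LoadRootValue(string xml)
        {
            using (var reader = new XmlCharReplacingReader(new StringReader(xml), ' '))
            {
                return XDocument.Load(reader).Root.Value;
            }
        }
    }
    ```

    The remaining TextReader methods (ReadLine, ReadToEnd and so on) fall back to the base class, which calls the overridden Read methods, so they do not need to be implemented separately in this sketch.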
    
  • 69

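    A regex-based approach:

    // requires: using System.Text.RegularExpressions;
    // The first alternative must not match surrogates, otherwise valid pairs are stripped
    // before the lone-surrogate alternatives are ever tried.
    private static readonly Regex InvalidXmlCharactersRegex = new Regex("[^\u0009\u000a\u000d\u0020-\ufffd]|([\ud800-\udbff](?![\udc00-\udfff]))|((?<![\ud800-\udbff])[\udc00-\udfff])");

    public static string StripInvalidXmlCharacters(string str)
    {
        return InvalidXmlCharactersRegex.Replace(str, "");
    }

    See my blog post for more details.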

  • -1

    The solutions above seem to be aimed at removing invalid characters before converting to XML.

    Use this code to remove invalid XML character references from an XML string, e.g. &#x1A;

    public static string CleanInvalidXmlChars( string Xml, string XMLVersion )
    {
        string pattern = String.Empty;
        switch( XMLVersion )
        {
            case "1.0":
                pattern = @"&#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|7F|8[0-46-9A-F]9[0-9A-F]);";
                break;
            case "1.1":
                pattern = @"&#x((10?|[2-F])FFF[EF]|FDD[0-9A-F]|[19][0-9A-F]|7F|8[0-46-9A-F]|0?[1-8BCEF]);";
                break;
            default:
                throw new Exception( "Error: Invalid XML Version!" );
        }

        Regex regex = new Regex( pattern, RegexOptions.IgnoreCase );
        if( regex.IsMatch( Xml ) )
            Xml = regex.Replace( Xml, String.Empty );
        return Xml;
    }
    

    http://balajiramesh.wordpress.com/2008/05/30/strip-illegal-xml-characters-based-on-w3c-standard/
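
    A different way to target only encoded values, as the question asks, is to match every numeric character reference and decide per code point whether it is allowed. The following is a sketch under that assumption, not the linked article's method; it checks against the XML 1.0 Char production only, and `CharRefCleaner`, `IsXml10Char` and `RemoveInvalidCharRefs` are hypothetical names.

    ```csharp
    using System.Globalization;
    using System.Text.RegularExpressions;

    public static class CharRefCleaner
    {
        // Matches hexadecimal (&#x1A;) and decimal (&#26;) numeric character references.
        private static readonly Regex CharRef =
            new Regex(@"&#(?:x([0-9A-Fa-f]+)|([0-9]+));", RegexOptions.CultureInvariant);

        // XML 1.0 Char production, applied to a full code point.
        private static bool IsXml10Char(int cp)
        {
            return cp == 0x9 || cp == 0xA || cp == 0xD ||
                   (cp >= 0x20 && cp <= 0xD7FF) ||
                   (cp >= 0xE000 && cp <= 0xFFFD) ||
                   (cp >= 0x10000 && cp <= 0x10FFFF);
        }

        public static string RemoveInvalidCharRefs(string xml)
        {
            return CharRef.Replace(xml, m =>
            {
                int cp;
                bool parsed = m.Groups[1].Success
                    ? int.TryParse(m.Groups[1].Value, NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture, out cp)
                    : int.TryParse(m.Groups[2].Value, NumberStyles.None, CultureInfo.InvariantCulture, out cp);
                // Keep references to valid code points; drop the rest,
                // including values too large to parse.
                return parsed && IsXml10Char(cp) ? m.Value : string.Empty;
            });
        }
    }
    ```

    Because the pattern requires the full `&#...;` form, plain text such as an href containing `x1A` is left untouched, while `&#x1A;` and `&#0;` are removed and `&#65;` is kept.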

  • 0

    A modified version of the original answer by Neolisk above.
    Changes: if the \0 character is passed as the replacement, invalid characters are removed rather than replaced. It also uses the XmlConvert.IsXmlChar(char) method.

    /// <summary>
    /// Replaces invalid XML characters read from the input file. NOTE: if the replacement character is \0,
    /// invalid XML characters are removed instead of being replaced one-for-one.
    /// </summary>
    public class InvalidXmlCharacterReplacingStreamReader : StreamReader
    {
        private readonly char _replacementCharacter;

        public InvalidXmlCharacterReplacingStreamReader(string fileName, char replacementCharacter)
            : base(fileName)
        {
            _replacementCharacter = replacementCharacter;
        }

        public override int Peek()
        {
            int ch = base.Peek();
            if (ch != -1 && IsInvalidChar(ch))
            {
                if ('\0' == _replacementCharacter)
                {
                    base.Read();   // consume the invalid character, otherwise Peek() would recurse forever
                    return Peek(); // peek at the next one
                }

                return _replacementCharacter;
            }
            return ch;
        }

        public override int Read()
        {
            int ch = base.Read();
            if (ch != -1 && IsInvalidChar(ch))
            {
                if ('\0' == _replacementCharacter)
                    return Read(); // read next one

                return _replacementCharacter;
            }
            return ch;
        }

        public override int Read(char[] buffer, int index, int count)
        {
            int readCount = 0, ch;

            // go through Read() so that removed characters do not leave gaps in the buffer
            for (int i = 0; i < count && (ch = Read()) != -1; i++)
            {
                readCount++;
                buffer[index + i] = (char)ch;
            }

            return readCount;
        }

        private static bool IsInvalidChar(int ch)
        {
            return !XmlConvert.IsXmlChar((char)ch);
        }
    }
    
  • 59

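    Use this function to remove invalid XML characters.

    public static string CleanInvalidXmlChars(string text)
    {
        // .NET's \x escape takes exactly two hex digits, so larger values are written as \uXXXX.
        // U+10000-U+10FFFF appear as surrogate pairs in a .NET string: keep well-formed pairs, drop lone surrogates.
        string re = @"[^\x09\x0A\x0D\x20-\uFFFD]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
        return Regex.Replace(text, re, "");
    }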
    
