Detecting bad UTF-8 encoding: list of bad characters to sniff?

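I have a sql-server 2010 database shared between two applications. One application we control; the other is a third-party application that created the database in the first place. Our application is a CRM built on top of the third-party webmail application.

The database contains varchar columns and is latin-1 encoded. The third-party application is written in PHP and doesn't care about encoding the data correctly, so it stuffs UTF-8 encoded bytes into the varchar columns, where they are interpreted as latin-1 and look like garbage.

Our CRM application is written in .Net, which automatically detects that the database collation differs from the encoding of the string in memory, so when .Net writes to the database it converts the bytes to match the database encoding.

So... data written to the db from our app looks correct in the db, but data from the third-party app doesn't.

When our app writes FirstName = Céline, it is stored in the db as Céline.

When the webmail app writes FirstName = Céline, it is stored in the db as CÃ©line.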
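
The difference between the two rows can be reproduced in a few lines of C#. This is only a sketch of the mechanism described above, not the webmail application's actual code; the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Simulates the webmail path: UTF-8 bytes are stored unchanged
    // and later read back as if they were ISO-8859-1.
    public static string StoreUtf8ReadAsLatin1(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        // é is the two bytes C3 A9 in UTF-8, which ISO-8859-1 reads as Ã and ©.
        Console.WriteLine(StoreUtf8ReadAsLatin1("Céline")); // prints CÃ©line
    }
}
```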

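Our CRM app needs to display contacts that were created in either system. So I'm writing an EncodingSniffer class that looks for flagged characters indicating a badly encoded string and converts them.

Currently I have: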

private static string[] _flaggedChars = new string[] { 
            "Ã©" 
        };

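This works great for displaying CÃ©line as Céline, but I need to add to the list.

Does anyone know of a resource listing all the possible ways that UTF-8 special characters can be interpreted as ISO-8859-1?

Thanks

Clarification: I am working in .Net. When the string is loaded from the database into memory, it is converted to Unicode UTF-16. So regardless of whether it was encoded correctly in the database, it is now represented as UTF-16 bytes. I need to be able to analyze those UTF-16 bytes and determine whether they are messed up because UTF-8 bytes were stuffed into an ISO-8859-1 database... clear as mud, right?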

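Here is what I have so far. It has cleaned up the display of most misencoded characters, but I am still having trouble with É: Éric is stored in the db by webmail as Ã‰ric, but after detecting the bad encoding and changing it back, it displays as ?ric. Looking at a user who has 2500 contacts, hundreds of which had encoding issues, the É is the only thing that isn't displaying correctly...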

public static Regex CreateRegex()
    {
        string specials = "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö";

        List<string> flags = new List<string>();
        foreach (char c in specials)
        {
            string interpretedAsLatin1 = Encoding.GetEncoding("iso-8859-1").GetString(Encoding.UTF8.GetBytes(c.ToString())).Trim();//take the specials, treat them as utf-8, interpret them as latin-1
            if (interpretedAsLatin1.Length > 0)//utf-8 chars made up of 2 bytes, interpreted as two single byte latin-1 chars.
                flags.Add(interpretedAsLatin1);
        }

        string regex = string.Empty;
        foreach (string s in flags)
        {
            if (regex.Length > 0)
                regex += '|';
            regex += s;
        }
        return new Regex("(" + regex + ")");
    }

    public static string CheckUTF(string data)
    {
        Match match = CreateRegex().Match(data);
        if (match.Success)
            return Encoding.UTF8.GetString(Encoding.GetEncoding("iso-8859-1").GetBytes(data));//from iso-8859-1 (latin-1) to utf-8
        else
            return data;
    }

So: É is being converted to 195 'Ã', 8240 '‰'

2 Answers

  • 0

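    You should probably just try to decode the byte string as UTF-8, and if you get an error, assume it is ISO-8859-1 instead.

    Text that is encoded as ISO-8859-1 rarely "happens" to also be valid UTF-8... unless it is ISO-8859-1 that actually contains only ASCII, but in that case you don't have a problem at all, of course. So this approach is reasonably robust.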
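
    In C#, the try-then-fall-back idea could look roughly like the sketch below. It assumes you can get at the raw bytes; the class and method names are invented for illustration. The one detail that matters is constructing `UTF8Encoding` with `throwOnInvalidBytes: true`, because the default `Encoding.UTF8` silently substitutes U+FFFD instead of reporting an error.

    ```csharp
    using System;
    using System.Text;

    public static class Utf8OrLatin1
    {
        // throwOnInvalidBytes: true makes GetString throw instead of emitting U+FFFD.
        private static readonly Encoding StrictUtf8 =
            new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

        private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

        public static string Decode(byte[] raw)
        {
            try
            {
                return StrictUtf8.GetString(raw);
            }
            catch (DecoderFallbackException)
            {
                // Not valid UTF-8, so fall back to ISO-8859-1 (every byte value is valid there).
                return Latin1.GetString(raw);
            }
        }

        public static void Main()
        {
            Console.OutputEncoding = Encoding.UTF8;
            // "Céline" as UTF-8 bytes, then as ISO-8859-1 bytes; both lines print Céline.
            Console.WriteLine(Decode(new byte[] { 0x43, 0xC3, 0xA9, 0x6C, 0x69, 0x6E, 0x65 }));
            Console.WriteLine(Decode(new byte[] { 0x43, 0xE9, 0x6C, 0x69, 0x6E, 0x65 }));
        }
    }
    ```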

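    Ignoring which characters occur more frequently than others in real language, here is a naive analysis that assumes each character occurs with the same frequency. Let's try to find out how often valid ISO-8859-1 will be mistaken for UTF-8, resulting in mojibake. I also assume that C1 control characters (U+0080 through U+009F) don't occur.

    Consider any given byte in the byte string. If the byte is close to the end of the string, you are even more likely to detect malformed UTF-8, because some byte sequences will be known to be too short to be valid UTF-8. But assuming the byte is not near the end of the string:

    • p(byte decodes as ASCII) = 0.57. This gives no information about whether the string is ASCII, ISO-8859-1, or UTF-8.

    • If this byte is 0x80 through 0xc1 or 0xf8 through 0xff, it can't be UTF-8, so you'll detect that. p = 0.33

    • If this first byte is 0xc2 through 0xdf (p = 0.11), then it could be valid UTF-8, but only if it is followed by a byte with a value between 0x80 and 0xbf. The probability that the next byte fails to be in that range is 192/224 = 0.86. So the probability that UTF-8 fails here is 0.09

    • If this first byte is 0xe0 through 0xef, then it could be valid UTF-8, but only if it is followed by 2 continuation bytes. The probability that you will detect bad UTF-8 is thus (16/224)*(1-(0.14 * 0.14)) = 0.07

    • Similarly for 0xf0 through 0xf7, the probability is (8/224)*(1-(0.14 * 0.14 * 0.14)) = 0.04.

    At each byte in a long string, the probability of detecting bad UTF-8 is 0.33 + 0.09 + 0.07 + 0.04 = 0.53.

    So for a long string, the probability that ISO-8859-1 will pass silently through a UTF-8 decoder is very small: it approximately halves for each additional character!

    This analysis of course assumes random ISO-8859-1 characters. In practice the misdetection rate will not be quite as good as this (mostly because most bytes in real-world text are actually ASCII), but it will still be very good.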
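
    As a sanity check, the per-byte estimate can be recomputed in a few lines. The sketch below takes the stated assumption literally: 224 equally likely byte values (0x00-0x7F and 0xA0-0xFF), so 0x80-0x9F never appear as either lead or continuation bytes. The class name is made up. Under that strict reading the figure comes out somewhat lower than 0.53 (about 0.41), mainly because bytes 0x80-0x9F are left out of the first case, but the conclusion is the same: the chance of a silent pass shrinks geometrically with string length.

    ```csharp
    using System;
    using System.Globalization;

    public static class Utf8RejectionOdds
    {
        // Probability that a strict UTF-8 decoder rejects the sequence starting at a random byte,
        // assuming the 224 values 0x00-0x7F and 0xA0-0xFF are equally likely (no C1 controls).
        public static double PerByteRejection()
        {
            const double total = 224.0;
            double continuation = 32.0 / total;                   // 0xA0-0xBF, the only continuation bytes that occur
            double invalidLead = (34.0 + 8.0) / total;            // 0xA0-0xC1 and 0xF8-0xFF can never start a sequence
            double twoByte = (30.0 / total) * (1 - continuation); // 0xC2-0xDF followed by a non-continuation byte
            double threeByte = (16.0 / total) * (1 - Math.Pow(continuation, 2)); // 0xE0-0xEF
            double fourByte = (8.0 / total) * (1 - Math.Pow(continuation, 3));   // 0xF0-0xF7
            return invalidLead + twoByte + threeByte + fourByte;
        }

        public static void Main()
        {
            // Prints 0.41
            Console.WriteLine(PerByteRejection().ToString("F2", CultureInfo.InvariantCulture));
        }
    }
    ```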

  • 0

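    Thanks to @Michael for doing over 99% of the work!

    Here's a PowerShell version of Michael's script for anyone it helps. It also uses the Windows-1252 code page/encoding, as suggested by @Qubei, to resolve the É issue, whilst allowing you to amend these encodings in case your data became corrupted through a different combination of encodings.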

    #based on c# in question: https://stackoverflow.com/questions/10484833/detecting-bad-utf-8-encoding-list-of-bad-characters-to-sniff
    function Convert-CorruptCodePageString {
        [CmdletBinding(DefaultParameterSetName = 'ByInputText')]
        param (
            [Parameter(Mandatory = $true, ValueFromPipeline = $true, ParameterSetName = 'ByInputText')]
            [string]$InputText
            ,
            [Parameter(Mandatory = $true, ValueFromPipeline = $true, ParameterSetName = 'ByInputObject')]
            [PSObject]$InputObject
            ,
            [Parameter(Mandatory = $true, ParameterSetName = 'ByInputObject')]
            [string]$Property
            ,
            [Parameter()]
            [System.Text.Encoding]$SourceEncoding = [System.Text.Encoding]::GetEncoding('Windows-1252')
            ,
            [Parameter()]
            [System.Text.Encoding]$DestinationEncoding = [system.Text.Encoding]::UTF8
            ,
            [Parameter()]
            [string]$DodgyChars = 'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö'
        )
        begin {
            [string]$InvalidCharRegex = ($DodgyChars.ToCharArray() | %{
                [byte[]]$dodgyCharBytes = $DestinationEncoding.GetBytes($_.ToString())
                $SourceEncoding.GetString($dodgyCharBytes,0,$dodgyCharBytes.Length).Trim()
            })  -join '|'   
        }
        process {
            if ($PSCmdlet.ParameterSetName -eq 'ByInputText') {
                $InputObject = $null
            } else {
                $InputText = $InputObject."$Property"
            }
            [bool]$IsLikelyCorrupted = $InputText -match $InvalidCharRegex
            if ($IsLikelyCorrupted) { #only bother to repair the text if we think it's corrupted
                [byte[]]$bytes = $SourceEncoding.GetBytes($InputText)
                [string]$outputText = $DestinationEncoding.GetString($bytes,0,$bytes.Length)
            } else {
                [string]$outputText = $InputText
            }
            [pscustomobject]@{
                InputString = $InputText
                OutputString = $outputText
                InputObject = $InputObject
                IsLikelyCorrupted = $IsLikelyCorrupted
            }        
        }
    }
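
    For readers working in C# like the question, the effect of defaulting the source encoding to Windows-1252 can be sketched as below. The question reports É arriving as 'Ã' plus '‰' (U+2030), which is what byte 0x89 means in Windows-1252; in ISO-8859-1 that byte is an unprintable control character, so '‰' cannot be converted back to 0x89 through ISO-8859-1. The class and method names here are hypothetical, and instead of a character list this sketch relies on a strict round trip: if the text cannot be turned back into valid UTF-8 bytes, it is returned unchanged.

    ```csharp
    using System;
    using System.Text;

    public static class MojibakeRepair
    {
        private static readonly Encoding StrictUtf8 =
            new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

        // Returns the repaired text, or the input unchanged when it does not look like
        // UTF-8 bytes that were decoded with the given single-byte encoding.
        public static string Repair(string text, Encoding sourceEncoding)
        {
            // Same code page, but with exception fallbacks so unmappable characters throw
            // instead of quietly becoming '?'.
            Encoding strictSource = Encoding.GetEncoding(
                sourceEncoding.CodePage, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
            try
            {
                return StrictUtf8.GetString(strictSource.GetBytes(text));
            }
            catch (EncoderFallbackException) { return text; } // a character is not in the source code page
            catch (DecoderFallbackException) { return text; } // the bytes are not valid UTF-8
        }

        public static void Main()
        {
            // Needed on .NET Core / .NET 5+, where code page 1252 is not registered by default;
            // on .NET Framework this call can be dropped.
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            Console.OutputEncoding = Encoding.UTF8;

            Encoding cp1252 = Encoding.GetEncoding(1252);
            Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

            string broken = "\u00C3\u2030ric"; // "Ã‰ric": É (UTF-8 bytes C3 89) read as Windows-1252
            Console.WriteLine(Repair(broken, cp1252));        // Éric
            Console.WriteLine(Repair(broken, latin1));        // unchanged: '‰' has no ISO-8859-1 byte
            Console.WriteLine(Repair("C\u00E9line", cp1252)); // unchanged: already correct
        }
    }
    ```

    Throwing on unmappable input means a correctly stored value such as Céline is left alone without needing a list of suspicious characters, at the cost of an exception per clean non-ASCII string.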
    

    Demo

    #demo of using a simple string without the function (may cause corruption since this doesn't check if the characters being replaced are those likely to have been corrupted / thus is more likely to cause corruption in many strings).
    $x = 'StrÃ¸mmen'
    $bytes = [System.Text.Encoding]::GetEncoding('Windows-1252').GetBytes($x)
    [system.Text.Encoding]::UTF8.GetString($bytes,0,$bytes.Length)
    
    #demo using the function
    $x | Convert-CorruptCodePageString
    
    #demo of checking all records in a table for an issue / reporting those with issues
    #amend SQL Query, MyDatabaseInstance, and MyDatabaseCatalog to point to your DB / query the relevant table
    Invoke-SQLQuery -Query 'Select [Description], [RecId] from [DimensionFinancialTag] where [Description] is not null and [Description] > ''''' -DbInstance $MyDatabaseInstance -DbCatalog $MyDatabaseCatalog |
        Convert-CorruptCodePageString -Property 'Description' | 
        ?{$_.IsLikelyCorrupted} | 
        ft @{N='RecordId';E={$_.InputObject.RecId}}, InputString, OutputString
    

    Additional Function used in my Demo

    I'm not a fan of the Invoke-SqlCmd cmdlet, so I rolled my own.

    function Invoke-SQLQuery {
        [CmdletBinding(DefaultParameterSetName = 'ByQuery')]
        param (
            [Parameter(Mandatory = $true)]
            [string]$DbInstance
            ,
            [Parameter(Mandatory = $true)]
            [string]$DbCatalog
            ,
            [Parameter(Mandatory = $true, ParameterSetName = 'ByQuery')]
            [string]$Query
            ,
            [Parameter(Mandatory = $true, ParameterSetName = 'ByPath')]
            [string]$Path
            ,
            [Parameter(Mandatory = $false)]
            [hashtable]$Params = @{}
            ,
            [Parameter(Mandatory = $false)]
            [int]$CommandTimeoutSeconds = 30 #this is the SQL default
            ,
            [Parameter(Mandatory = $false)]
            [System.Management.Automation.Credential()]
            [System.Management.Automation.PSCredential]$Credential=[System.Management.Automation.PSCredential]::Empty 
        )
        begin {
            write-verbose "Call to 'Invoke-SQLQuery'"
            $connectionString = ("Server={0};Database={1}" -f $DbInstance,$DbCatalog)
            if ($Credential -eq [System.Management.Automation.PSCredential]::Empty) {
                $connectionString = ("{0};Integrated Security=True" -f $connectionString)
            } else {
                $connectionString = ("{0};User Id={1};Password={2}" -f $connectionString, $Credential.UserName, $Credential.GetNetworkCredential().Password)    
                $PSCmdlet.Name    
            }
            $connection = New-Object System.Data.SqlClient.SqlConnection
            $connection.ConnectionString = $connectionString
            $connection.Open()    
        }
        process {
            #create the command & assign the connection
            $cmd = new-object -TypeName 'System.Data.SqlClient.SqlCommand'
            $cmd.Connection = $connection
    
            #load in our query
            switch ($PSCmdlet.ParameterSetName) {
                'ByQuery' {$cmd.CommandText = $Query; break;}
                'ByPath' {$cmd.CommandText = Get-Content -Path $Path -Raw; break;}
                default {throw "ParameterSet $($PSCmdlet.ParameterSetName) not recognised by Invoke-SQLQuery"}
            }
            #assign parameters as required 
            #NB: these don't need declare statements in our query; so a query of 'select @demo myDemo' would be sufficient for us to pass in a parameter with name @demo and have it used
            #we can also pass in parameters that don't exist; they're simply ignored (sometimes useful if writing generic code that has optional params)
            $Params.Keys | %{$cmd.Parameters.AddWithValue("@$_", $Params[$_]) | out-null}
    
            $reader = $cmd.ExecuteReader()
            while (-not ($reader.IsClosed)) {
                $table = new-object 'System.Data.DataTable'
                $table.Load($reader)
                write-verbose "TableName: $($table.TableName)" #NB: table names aren't always available
                $table | Select-Object -ExcludeProperty RowError, RowState, Table, ItemArray, HasErrors
            }
    
        }
        end {
            $connection.Close()
        }
    }
    
