首页 文章

从sql server中的字符串中删除html标签的最佳方法?

提问于
浏览
99

我想把所有这些都删掉,只留下标签之间的文字 . 理想情况下也用 < 替换 &lt; 之类的东西等 .

有没有一种简单的方法可以做到这一点,或者有人已经有一些示例t-sql代码?

我没有能力添加扩展存储过程等,所以更喜欢纯t-sql方法(最好是一个向后兼容sql 2000) .

我只想用剥离的html检索数据,而不是更新它,所以理想情况下它会被写成用户定义的函数,以便于重用 .

所以例如转换这个:

<B>Some useful text</B>&nbsp;
<A onclick="return openInfo(this)"
   href="http://there.com/3ce984e88d0531bac5349"
   target=globalhelp>
   <IMG title="Source Description" height=15 alt="Source Description" 
        src="/ri/new_info.gif" width=15 align=top border=0>
</A>&gt;&nbsp;<b>more text</b></TD></TR>

对此:

Some useful text > more text

8 回答

  • 3

    有一个UDF将执行此处描述的操作:

    User Defined Function to Strip HTML

    CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
    RETURNS VARCHAR(MAX) AS
    BEGIN
        DECLARE @Start INT
        DECLARE @End INT
        DECLARE @Length INT
        SET @Start = CHARINDEX('<',@HTMLText)
        SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
        SET @Length = (@End - @Start) + 1
        WHILE @Start > 0 AND @End > 0 AND @Length > 0
        BEGIN
            SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
            SET @Start = CHARINDEX('<',@HTMLText)
            SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
            SET @Length = (@End - @Start) + 1
        END
        RETURN LTRIM(RTRIM(@HTMLText))
    END
    GO
    

    编辑:请注意这是针对SQL Server 2005的,但是如果将关键字MAX更改为类似4000,它也可以在SQL Server 2000中使用 .

  • 4

    源自@Goner Doug的回答,更新了一些内容:

    • 尽可能使用REPLACE
    • 转换预定义的实体,如 &eacute; (我选择了我需要的那些:-)
    • 列表标签的一些转换 <ul> and <li>
    ALTER FUNCTION [dbo].[udf_StripHTML]
    --by Patrick Honorez --- www.idevlop.com
    --inspired by http://stackoverflow.com/questions/457701/best-way-to-strip-html-tags-from-a-string-in-sql-server/39253602#39253602
    (
    @HTMLText varchar(MAX)
    )
    RETURNS varchar(MAX)
    AS
    BEGIN
    DECLARE @Start  int
    DECLARE @End    int
    DECLARE @Length int
    
    set @HTMLText = replace(@htmlText, '<br>',CHAR(13) + CHAR(10))
    set @HTMLText = replace(@htmlText, '
    ',CHAR(13) + CHAR(10)) set @HTMLText = replace(@htmlText, '
    ',CHAR(13) + CHAR(10)) set @HTMLText = replace(@htmlText, '<li>','- ') set @HTMLText = replace(@htmlText, '</li>',CHAR(13) + CHAR(10)) set @HTMLText = replace(@htmlText, '&rsquo;' collate Latin1_General_CS_AS, '''' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&quot;' collate Latin1_General_CS_AS, '"' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&amp;' collate Latin1_General_CS_AS, '&' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&euro;' collate Latin1_General_CS_AS, '€' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&lt;' collate Latin1_General_CS_AS, '<' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&gt;' collate Latin1_General_CS_AS, '>' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&oelig;' collate Latin1_General_CS_AS, 'oe' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&nbsp;' collate Latin1_General_CS_AS, ' ' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&copy;' collate Latin1_General_CS_AS, '©' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&laquo;' collate Latin1_General_CS_AS, '«' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&reg;' collate Latin1_General_CS_AS, '®' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&plusmn;' collate Latin1_General_CS_AS, '±' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&sup2;' collate Latin1_General_CS_AS, '²' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&sup3;' collate Latin1_General_CS_AS, '³' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&micro;' collate Latin1_General_CS_AS, 'µ' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&middot;' collate Latin1_General_CS_AS, '·' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&ordm;' collate Latin1_General_CS_AS, 'º' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&raquo;' collate Latin1_General_CS_AS, '»' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&frac14;' collate Latin1_General_CS_AS, '¼' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&frac12;' collate Latin1_General_CS_AS, '½' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&frac34;' collate Latin1_General_CS_AS, '¾' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&Aelig' collate Latin1_General_CS_AS, 'Æ' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&Ccedil;' collate Latin1_General_CS_AS, 'Ç' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&Egrave;' collate Latin1_General_CS_AS, 'È' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&Eacute;' collate Latin1_General_CS_AS, 'É' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&Ecirc;' collate Latin1_General_CS_AS, 'Ê' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&Ouml;' collate Latin1_General_CS_AS, 'Ö' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&agrave;' collate Latin1_General_CS_AS, 'à' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&acirc;' collate Latin1_General_CS_AS, 'â' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&auml;' collate Latin1_General_CS_AS, 'ä' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&aelig;' collate Latin1_General_CS_AS, 'æ' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&ccedil;' collate Latin1_General_CS_AS, 'ç' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&egrave;' collate Latin1_General_CS_AS, 'è' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&eacute;' collate Latin1_General_CS_AS, 'é' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&ecirc;' collate Latin1_General_CS_AS, 'ê' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&euml;' collate Latin1_General_CS_AS, 'ë' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&icirc;' collate Latin1_General_CS_AS, 'î' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&ocirc;' collate Latin1_General_CS_AS, 'ô' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&ouml;' collate Latin1_General_CS_AS, 'ö' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&divide;' collate Latin1_General_CS_AS, '÷' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&oslash;' collate Latin1_General_CS_AS, 'ø' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&ugrave;' collate Latin1_General_CS_AS, 'ù' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&uacute;' collate Latin1_General_CS_AS, 'ú' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&ucirc;' collate Latin1_General_CS_AS, 'û' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&uuml;' collate Latin1_General_CS_AS, 'ü' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&quot;' collate Latin1_General_CS_AS, '"' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&amp;' collate Latin1_General_CS_AS, '&' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&lsaquo;' collate Latin1_General_CS_AS, '<' collate Latin1_General_CS_AS) set @HTMLText = replace(@htmlText, '&rsaquo;' collate Latin1_General_CS_AS, '>' collate Latin1_General_CS_AS) -- Remove anything between <STYLE> tags SET @Start = CHARINDEX('<STYLE', @HTMLText) SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('<', @HTMLText)) + 7 SET @Length = (@End - @Start) + 1 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '') SET @Start = CHARINDEX('<STYLE', @HTMLText) SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('</STYLE>', @HTMLText)) + 7 SET @Length = (@End - @Start) + 1 END -- Remove anything between <whatever> tags SET @Start = CHARINDEX('<', @HTMLText) SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText)) SET @Length = (@End - @Start) + 1 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '') SET @Start = CHARINDEX('<', @HTMLText) SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText)) SET @Length = (@End - @Start) + 1 END RETURN LTRIM(RTRIM(@HTMLText)) END
  • 0

    如果您的HTML格式正确,我认为这是一个更好的解决方案:

    create function dbo.StripHTML( @text varchar(max) ) returns varchar(max) as
    begin
        declare @textXML xml
        declare @result varchar(max)
        set @textXML = REPLACE( @text, '&', '' );
        with doc(contents) as
        (
            select chunks.chunk.query('.') from @textXML.nodes('/') as chunks(chunk)
        )
        select @result = contents.value('.', 'varchar(max)') from doc
        return @result
    end
    go
    
    select dbo.StripHTML('This <i>is</i> an <b>html</b> test')
    
  • 140

    这里's an updated version of this function that incorporates the RedFilter answer (Pinal' s原创)与LazyCoders添加和goodeye拼写错误更正和我自己的添加,以处理HTML内的内嵌 <STYLE> 标签 .

    ALTER FUNCTION [dbo].[udf_StripHTML]
    (
    @HTMLText varchar(MAX)
    )
    RETURNS varchar(MAX)
    AS
    BEGIN
    DECLARE @Start  int
    DECLARE @End    int
    DECLARE @Length int
    
    -- Replace the HTML entity &amp; with the '&' character (this needs to be done first, as
    -- '&' might be double encoded as '&amp;amp;')
    SET @Start = CHARINDEX('&amp;', @HTMLText)
    SET @End = @Start + 4
    SET @Length = (@End - @Start) + 1
    
    WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
    SET @Start = CHARINDEX('&amp;', @HTMLText)
    SET @End = @Start + 4
    SET @Length = (@End - @Start) + 1
    END
    
    -- Replace the HTML entity &lt; with the '<' character
    SET @Start = CHARINDEX('&lt;', @HTMLText)
    SET @End = @Start + 3
    SET @Length = (@End - @Start) + 1
    
    WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '<')
    SET @Start = CHARINDEX('&lt;', @HTMLText)
    SET @End = @Start + 3
    SET @Length = (@End - @Start) + 1
    END
    
    -- Replace the HTML entity &gt; with the '>' character
    SET @Start = CHARINDEX('&gt;', @HTMLText)
    SET @End = @Start + 3
    SET @Length = (@End - @Start) + 1
    
    WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '>')
    SET @Start = CHARINDEX('&gt;', @HTMLText)
    SET @End = @Start + 3
    SET @Length = (@End - @Start) + 1
    END
    
    -- Replace the HTML entity &amp; with the '&' character
    SET @Start = CHARINDEX('&amp;amp;', @HTMLText)
    SET @End = @Start + 4
    SET @Length = (@End - @Start) + 1
    
    WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
    SET @Start = CHARINDEX('&amp;amp;', @HTMLText)
    SET @End = @Start + 4
    SET @Length = (@End - @Start) + 1
    END
    
    -- Replace the HTML entity &nbsp; with the ' ' character
    SET @Start = CHARINDEX('&nbsp;', @HTMLText)
    SET @End = @Start + 5
    SET @Length = (@End - @Start) + 1
    
    WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ' ')
    SET @Start = CHARINDEX('&nbsp;', @HTMLText)
    SET @End = @Start + 5
    SET @Length = (@End - @Start) + 1
    END
    
    -- Replace any <br> tags with a newline
    SET @Start = CHARINDEX('<br>', @HTMLText)
    SET @End = @Start + 3
    SET @Length = (@End - @Start) + 1
    
    WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
    SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
    SET @Start = CHARINDEX('<br>', @HTMLText)
    SET @End = @Start + 3
    SET @Length = (@End - @Start) + 1
    END
    
    -- Replace any 
    tags with a newline SET @Start = CHARINDEX('
    ', @HTMLText) SET @End = @Start + 4 SET @Length = (@End - @Start) + 1 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10)) SET @Start = CHARINDEX('
    ', @HTMLText) SET @End = @Start + 4 SET @Length = (@End - @Start) + 1 END -- Replace any
    tags with a newline SET @Start = CHARINDEX('
    ', @HTMLText) SET @End = @Start + 5 SET @Length = (@End - @Start) + 1 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10)) SET @Start = CHARINDEX('
    ', @HTMLText) SET @End = @Start + 5 SET @Length = (@End - @Start) + 1 END -- Remove anything between <STYLE> tags SET @Start = CHARINDEX('<STYLE', @HTMLText) SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('<', @HTMLText)) + 7 SET @Length = (@End - @Start) + 1 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '') SET @Start = CHARINDEX('<STYLE', @HTMLText) SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('</STYLE>', @HTMLText)) + 7 SET @Length = (@End - @Start) + 1 END -- Remove anything between <whatever> tags SET @Start = CHARINDEX('<', @HTMLText) SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText)) SET @Length = (@End - @Start) + 1 WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '') SET @Start = CHARINDEX('<', @HTMLText) SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText)) SET @Length = (@End - @Start) + 1 END RETURN LTRIM(RTRIM(@HTMLText)) END
  • 5

    这不是一个全新的解决方案,而是对维护者解决方案的修正:

    --note comments to see the corrections
    
    CREATE FUNCTION [dbo].[StripHTML] (@HTMLText VARCHAR(MAX))  
    RETURNS VARCHAR(MAX)  
    AS  
    BEGIN  
     DECLARE @Start  INT  
     DECLARE @End    INT  
     DECLARE @Length INT  
     --DECLARE @TempStr varchar(255) (this is not used)  
    
     SET @Start = CHARINDEX('<',@HTMLText)  
     SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))  
     SET @Length = (@End - @Start) + 1  
    
     WHILE @Start > 0 AND @End > 0 AND @Length > 0  
     BEGIN  
       IF (UPPER(SUBSTRING(@HTMLText, @Start, 4)) <> '<BR>') AND (UPPER(SUBSTRING(@HTMLText, @Start, 5)) <> '</BR>')  
        begin  
          SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')  
          end  
    -- this ELSE and SET is important
       ELSE  
          SET @Length = 0;  
    
    -- minus @Length here below is important
       SET @Start = CHARINDEX('<',@HTMLText, @End-@Length)  
       SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText, @Start))  
    -- instead of -1 it should be +1
       SET @Length = (@End - @Start) + 1  
     END  
    
     RETURN RTRIM(LTRIM(@HTMLText))  
    END
    
  • 2

    试试这个 . 它是RedFilter发布的版本的修改版本...此SQL删除除BR,B和P之外的所有标记以及任何附带的属性:

    CREATE FUNCTION [dbo].[StripHtml] (@HTMLText VARCHAR(MAX))
    RETURNS VARCHAR(MAX)
    AS
    BEGIN
     DECLARE @Start  INT
     DECLARE @End    INT
     DECLARE @Length INT
     DECLARE @TempStr varchar(255)
    
     SET @Start = CHARINDEX('<',@HTMLText)
     SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
     SET @Length = (@End - @Start) + 1
    
     WHILE @Start > 0 AND @End > 0 AND @Length > 0
     BEGIN
       IF (UPPER(SUBSTRING(@HTMLText, @Start, 3)) <> '<BR') AND (UPPER(SUBSTRING(@HTMLText, @Start, 2)) <> '<P') AND (UPPER(SUBSTRING(@HTMLText, @Start, 2)) <> '<B') AND (UPPER(SUBSTRING(@HTMLText, @Start, 3)) <> '</B')
       BEGIN
          SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
       END
    
       SET @Start = CHARINDEX('<',@HTMLText, @End)
       SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText, @Start))
       SET @Length = (@End - @Start) - 1
     END
    
     RETURN RTRIM(LTRIM(@HTMLText))
    END
    
  • 8

    如何使用XQuery与单行:

    DECLARE @MalformedXML xml, @StrippedText varchar(max)
    SET @MalformedXML = @xml.query('for $x in //. return ($x)//text()')
    SET @StrippedText = CAST(@MalformedXML as varchar(max))
    

    这循环遍历所有元素并仅返回text() .

    要避免在没有空格的情况下连接元素之间的文本,请使用:

    DECLARE @MalformedXML xml, @StrippedText varchar(max)
    SET @MalformedXML = @xml.query('for $x in //. return concat((($x)//text())[1]," ")')
    SET @StrippedText = CAST(@MalformedXML as varchar(max))
    

    并回答“你如何使用这个列:

    SELECT CAST(html_column.query('for $x in //. return concat((($x)//text()) as varchar(max))
      FROM table
    

    对于上面的代码,请确保 html_column 的数据类型为 xml ,否则,您需要将html的转换版本保存为 xml . 当你加载HTML数据时,我会将它作为一个单独的练习,因为如果发现格式错误的xml,SQL会抛出错误,例如:不匹配的开始/结束标记,无效字符 .

    当你想要构建seachh短语,剥离HTML等时,它们非常适合 .

    请注意,这会返回xml类型,因此CAST或COVERT会在适当的位置返回文本 . 此数据类型的xml版本是无用的,因为它不是格式良好的XML .

  • 12

    Patrick Honorez代码需要稍作改动 .

    它返回包含 &lt;&gt; 的html的不完整结果

    这是因为该部分下面的代码

    • 删除标签之间的任何内容

    实际上会将<>替换为空 . 修复是在末尾应用以下两行:

    set @HTMLText = replace(@htmlText, '&lt;' collate Latin1_General_CS_AS, '<'  collate Latin1_General_CS_AS)
    set @HTMLText = replace(@htmlText, '&gt;' collate Latin1_General_CS_AS, '>'  collate Latin1_General_CS_AS)
    

相关问题