
Remove parent element, keep all inner children in DOMDocument with saveHTML


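I am manipulating a short HTML snippet with XPath. When I output the changed snippet with $doc->saveHTML(), a DOCTYPE is added and HTML / BODY tags wrap the output. I want to remove them while keeping all the children inside, using only DOMDocument functions. For example: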

$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title...</strong></p>
<a href="http://www....."><img src="http://" alt=""></a>
<p>...to be one of those crowning achievements...</p>');
// manipulation goes here
echo htmlentities( $doc->saveHTML() );

This produces:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" ...>
<html><body>
<p><strong>Title...</strong></p>
<a href="http://www....."><img src="http://" alt=""></a>
<p>...to be one of those crowning achievements...</p>
</body></html>

I have tried a few simple tricks, such as:

# removes doctype
$doc->removeChild($doc->firstChild);

# <body> replaces <html>
$doc->replaceChild($doc->firstChild->firstChild, $doc->firstChild);

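So far, that only removes the DOCTYPE and replaces HTML with BODY. What remains at this point is body > a variable number of elements.

How do I remove the <body> tag but keep all of its children, given that their structure varies, in a clean way using PHP's DOM manipulation?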

5 Answers

  • 2

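    Update

    Here is a version that does not extend DOMDocument, though I think extending is the right approach, since you are trying to achieve functionality that isn't built into the DOM API.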

    Note: I'm interpreting "clean" and "without workarounds" as keeping all manipulation to the DOM API. As soon as you hit string manipulation, that's workaround territory.

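    As in the original answer, what I'm doing is leveraging DOMDocumentFragment to manipulate multiple nodes that all sit at the root level. There is no string manipulation, which to me qualifies it as not being a workaround.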

    $doc = new DOMDocument();
    $doc->loadHTML('<p><strong>Title...</strong></p><a href="http://www....."><img src="http://" alt=""></a><p>...to be one of those crowning achievements...</p>');
    
    // Remove doctype node
    $doc->doctype->parentNode->removeChild($doc->doctype);
    
    // Remove html element, preserving child nodes
    $html = $doc->getElementsByTagName("html")->item(0);
    $fragment = $doc->createDocumentFragment();
    while ($html->childNodes->length > 0) {
        $fragment->appendChild($html->childNodes->item(0));
    }
    $html->parentNode->replaceChild($fragment, $html);
    
    // Remove body element, preserving child nodes
    $body = $doc->getElementsByTagName("body")->item(0);
    $fragment = $doc->createDocumentFragment();
    while ($body->childNodes->length > 0) {
        $fragment->appendChild($body->childNodes->item(0));
    }
    $body->parentNode->replaceChild($fragment, $body);
    
    // Output results
    echo htmlentities($doc->saveHTML());
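
The two loops above repeat the same step, so it can be folded into one small function. The following is only a sketch: `unwrapElement` is a hypothetical name, not part of the DOM API, and the empty-fragment guard is an added assumption (some PHP versions warn when an empty fragment is inserted). It is demonstrated on an inner `<span>` to keep the example small; the same call could be applied to the html and body elements.

```php
<?php
// Hypothetical helper: replace an element with its own child nodes.
function unwrapElement(DOMElement $element): void
{
    $fragment = $element->ownerDocument->createDocumentFragment();

    // Appending a child to the fragment detaches it from $element,
    // so keep taking the first child until none remain.
    while ($element->firstChild !== null) {
        $fragment->appendChild($element->firstChild);
    }

    if ($fragment->hasChildNodes()) {
        $element->parentNode->replaceChild($fragment, $element);
    } else {
        // Nothing to keep: just drop the empty element.
        $element->parentNode->removeChild($element);
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<div><span><b>one</b> two</span></div>');

$span = $doc->getElementsByTagName('span')->item(0);
unwrapElement($span);

// Serialize only the <div> to inspect the result.
$result = $doc->saveHTML($doc->getElementsByTagName('div')->item(0));
echo $result, "\n";
```

After the call, the `<span>` wrapper is gone and its children sit directly inside the `<div>`.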
    

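    Original answer

    This solution is rather lengthy, but that is because it works by extending the DOM in order to keep your end code as short as possible.

    sliceOutNode is where the magic happens. Let me know if you have any questions: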

    <?php
    
    class DOMDocumentExtended extends DOMDocument
    {
        public function __construct( $version = "1.0", $encoding = "UTF-8" )
        {
            parent::__construct( $version, $encoding );
    
            $this->registerNodeClass( "DOMElement", "DOMElementExtended" );
        }
    
        // This method will need to be removed once PHP supports LIBXML_NOXMLDECL
        public function saveXML( DOMNode $node = NULL, $options = 0 )
        {
            $xml = parent::saveXML( $node, $options );
    
            if( $options & LIBXML_NOXMLDECL )
            {
                $xml = $this->stripXMLDeclaration( $xml );
            }
    
            return $xml;
        }
    
        public function stripXMLDeclaration( $xml )
        {
            return preg_replace( "|<\?xml(.+?)\?>[\n\r]?|i", "", $xml );
        }
    }
    
    class DOMElementExtended extends DOMElement
    {
        public function sliceOutNode()
        {
            $nodeList = new DOMNodeListExtended( $this->childNodes );
            $this->replaceNodeWithNode( $nodeList->toFragment( $this->ownerDocument ) );
        }
    
        public function replaceNodeWithNode( DOMNode $node )
        {
            return $this->parentNode->replaceChild( $node, $this );
        }
    }
    
    class DOMNodeListExtended extends ArrayObject
    {
        public function __construct( $mixedNodeList )
        {
            parent::__construct( array() );
    
            $this->setNodeList( $mixedNodeList );
        }
    
        private function setNodeList( $mixedNodeList )
        {
            if( $mixedNodeList instanceof DOMNodeList )
            {
                $this->exchangeArray( array() );
    
                foreach( $mixedNodeList as $node )
                {
                    $this->append( $node );
                }
            }
            elseif( is_array( $mixedNodeList ) )
            {
                $this->exchangeArray( $mixedNodeList );
            }
            else
            {
                throw new DOMException( "DOMNodeListExtended only supports a DOMNodeList or array as its constructor parameter." );
            }
        }
    
        public function toFragment( DOMDocument $contextDocument )
        {
            $fragment = $contextDocument->createDocumentFragment();
    
            foreach( $this as $node )
            {
                $fragment->appendChild( $contextDocument->importNode( $node, true ) );
            }
    
            return $fragment;
        }
    
        // Built-in methods of the original DOMNodeList
    
        public function item( $index )
        {
            return $this->offsetGet( $index );
        }
    
        public function __get( $name )
        {
            switch( $name )
            {
                case "length":
                    return $this->count();
                break;
            }
    
            return false;
        }
    }
    
    // Load HTML/XML using our fancy DOMDocumentExtended class
    $doc = new DOMDocumentExtended();
    $doc->loadHTML('<p><strong>Title...</strong></p><a href="http://www....."><img src="http://" alt=""></a><p>...to be one of those crowning achievements...</p>');
    
    // Remove doctype node
    $doc->doctype->parentNode->removeChild( $doc->doctype );
    
    // Slice out html node
    $html = $doc->getElementsByTagName("html")->item(0);
    $html->sliceOutNode();
    
    // Slice out body node
    $body = $doc->getElementsByTagName("body")->item(0);
    $body->sliceOutNode();
    
    // Pick your poison: XML or HTML output
    echo htmlentities( $doc->saveXML( NULL, LIBXML_NOXMLDECL ) );
    echo htmlentities( $doc->saveHTML() );
    
  • 0

    saveHTML can output a subset of the document, which means we can output each child node one by one by traversing the body.

    $doc = new DOMDocument();
    $doc->loadHTML('<p><strong>Title...</strong></p>
    <a href="http://google.com"><img src="http://google.com/img.jpeg" alt=""></a>
    <p>...to be one of those crowning achievements...</p>');
    // manipulation goes here
    
    // Let's traverse the body and output every child node
    $bodyNode = $doc->getElementsByTagName('body')->item(0);
    foreach ($bodyNode->childNodes as $childNode) {
      echo $doc->saveHTML($childNode);
    }
    

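    This may not be the most elegant solution, but it works. Alternatively, we could wrap all the child nodes in some container element (a div, for example) and output only that container (but the container tag would be included in the output).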
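
The traversal can be collected into a string instead of echoed directly, which is handy when the markup is needed for further processing. A minimal sketch, assuming PHP 5.3.6 or later (where saveHTML accepts a node argument); `innerHtml` is a hypothetical helper name, not a DOM method:

```php
<?php
// Hypothetical helper: serialize the children of an element, without the element itself.
function innerHtml(DOMElement $element): string
{
    $html = '';
    foreach ($element->childNodes as $child) {
        // saveHTML($node) serializes just that node and its subtree.
        $html .= $element->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title</strong></p><p>Second paragraph</p>');

$body = $doc->getElementsByTagName('body')->item(0);
$inner = innerHtml($body);
echo $inner, "\n";
```

The returned string contains the two paragraphs but no doctype, html, or body tags. Whitespace between sibling elements may differ from the input, since the serializer decides where to put line breaks.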

  • 11

    Here is how I did it:

    • A quick helper function that gives you the HTML contents of a specific DOM element
    function nodeContent($n, $outer=false) {
       $d = new DOMDocument('1.0');
       $b = $d->importNode($n->cloneNode(true),true);
       $d->appendChild($b); $h = $d->saveHTML();
       // remove outer tags
       if (!$outer) $h = substr($h,strpos($h,'>')+1,-(strlen($n->nodeName)+4));
       return $h;
    }
    
    • Find the body node in the document and get its contents
    $query = $xpath->query("//body")->item(0);
    if($query)
    {
        echo nodeContent($query);
    }
    

    UPDATE 1:

    Some extra info: as of PHP 5.3.6, DOMDocument->saveHTML() accepts an optional DOMNode parameter, similar to DOMDocument->saveXML(). You can do

    $xpath = new DOMXPath($doc);
    $query = $xpath->query("//body")->item(0);
    echo $doc->saveHTML($query);
    

    For everyone else, the helper function will help.
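
Note that passing the body node to saveHTML still includes the `<body>` tags themselves in the output. One possible variation, sketched here under the same PHP 5.3.6+ assumption, is to let XPath select the children of body directly and serialize each of them; the sample markup is made up for illustration:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title</strong></p><p>Second paragraph</p>');

$xpath = new DOMXPath($doc);
$inner = '';

// "//body/node()" matches every direct child of <body>, text nodes included.
foreach ($xpath->query('//body/node()') as $child) {
    $inner .= $doc->saveHTML($child);
}

echo $inner, "\n";
```

This avoids the substr arithmetic in the helper, at the cost of requiring a PHP version where saveHTML takes a node.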

  • -1

    tl;dr

    Requirements: PHP 5.4.0 and Libxml 2.6.0

    $doc->loadHTML("<p>test</p>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    explanation

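    From http://php.net/manual/en/domdocument.loadhtml.php: "Since PHP 5.4.0 and Libxml 2.6.0, you may also use the options parameter to specify additional Libxml parameters."

    LIBXML_HTML_NOIMPLIED sets the HTML_PARSE_NOIMPLIED flag, which turns off the automatic adding of implied html/body... elements.

    LIBXML_HTML_NODEFDTD sets the HTML_PARSE_NODEFDTD flag, which prevents a default doctype from being added when one is not found.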
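
A sketch of how these flags might be used on a fragment with several top-level elements. Two caveats, both worth verifying on your own setup: the PHP manual lists LIBXML_HTML_NOIMPLIED as needing Libxml 2.7.7 or later and LIBXML_HTML_NODEFDTD as needing 2.7.8 or later, and without an implied body the parser is commonly reported to treat the first element as the document root, which can rearrange its siblings. The `wrapper` div below is a hypothetical container added to sidestep that; it is not part of the original answer.

```php
<?php
$fragment = '<p><strong>Title</strong></p><p>Second paragraph</p>';

$doc = new DOMDocument();
// Wrap the fragment in one container so the parser sees a single root element.
$doc->loadHTML(
    '<div id="wrapper">' . $fragment . '</div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// manipulation goes here

// Serialize only the children of the wrapper, leaving the wrapper itself out.
$wrapper = $doc->documentElement;
$output = '';
foreach ($wrapper->childNodes as $child) {
    $output .= $doc->saveHTML($child);
}

echo $output, "\n";
```

With both flags set, no doctype, html, or body nodes are created in the first place, so nothing has to be removed afterwards.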

  • 15

    You have two ways to accomplish this:

    $content = substr($content, strpos($content, '<html><body>') + 12); // Remove Everything Before & Including The Opening HTML & Body Tags.
    $content = substr($content, 0, strrpos($content, '</body>')); // Remove Everything After & Including The Closing HTML & Body Tags (saveHTML() ends with a newline, so a fixed -14 offset is off by one).
    

    Or, better yet, this:

    $dom->normalizeDocument();
    $content = $dom->saveHTML();
    
