Can't parse img src from an RSS feed?

I'm trying to build an RSS reader based on this example:

http://www.w3schools.com/php/php_ajax_rss_reader.asp

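Specifically, I'm trying to modify the example so that the reader can fetch and display all available comic images (and nothing else) from any given webcomic RSS feed. I realize the code may need to be at least somewhat site-specific, but I'm trying to keep it as generic as possible. So far I've modified the original example into a reader that displays every comic from a given list of RSS feeds. However, it also displays other unwanted text that I'm trying to get rid of. Here is my code so far, with a few feeds that are giving me particular trouble:

index.php: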

<html>
<head>
    <script>
        function showRSS() 
        {
          if (window.XMLHttpRequest) 
          {
            // code for IE7+, Firefox, Chrome, Opera, Safari
            xmlhttp=new XMLHttpRequest();
          } else 
          {  // code for IE6, IE5
            xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");
          }
          xmlhttp.onreadystatechange=function() 
          {
            if (xmlhttp.readyState==4 && xmlhttp.status==200) 
            {
              document.getElementById("rssOutput").innerHTML=xmlhttp.responseText;
            }
          }
          xmlhttp.open("GET","logger.php",true);
          xmlhttp.send();
        }
    </script>
</head>
<body onload="showRSS()">
    <div id="rssOutput"></div>
</body>
</html>

(I'm fairly sure there is nothing wrong with this file; I think the problem is in the next one, but I've included this for completeness.)

logger.php:

<?php

//function to get all comics from an rss feed
function getComics($xml)
{
    $xmlDoc = new DOMDocument();
    $xmlDoc->load($xml);

    $x=$xmlDoc->getElementsByTagName('item');
    foreach ($x as $x)
    {
      $comic_image=$x->getElementsByTagName('description')->item(0)->childNodes->item(0)->nodeValue;
      //output the comic
      echo ($comic_image . "</p>");
      echo ("<br>");
    }

}

//create array of all RSS feed URLs
$URLs =
[
    "SMBC" => "http://www.smbc-comics.com/rss.php", 
    "garfieldMinusGarfield" => "http://garfieldminusgarfield.net/rss",
    "babyBlues" => "http://www.comicsyndicate.org/Feed/Baby%20Blues",
];

//Loop through all RSS feeds
foreach ($URLs as $xml)
{
    getComics($xml);
}

?>

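Because this approach includes extra text between the comic images (a lot of random text for SMBC, just a few ad links for gMg, and a copyright link for Baby Blues), I looked at the RSS feeds and concluded that the problem is the description tag itself, which contains the image source but also other things. Next, I tried modifying the getComics function to scan for image tags directly instead of looking up the description tag first. I replaced the part between the DOMDocument creation/loading and the URL list with: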

$images=$xmlDoc->getElementsByTagName('img');
    print_r($images);

    foreach ($images as $image)
    {
        //echo $image->item(0)->getAttribute('src');
        echo $image->item(0)->nodeValue;
        echo ("<br>");
    }

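But apparently getElementsByTagName does not pick up image tags embedded inside the description tags, because I got no comic images in the output, and the print_r statement printed the following: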

DOMNodeList Object ( [length] => 0 ) DOMNodeList Object ( [length] => 0 )

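Finally, I tried a combination of the two approaches, using getElementsByTagName('img') inside the code that parses the contents of the description tag. I replaced the line: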

$comic_image=$x->getElementsByTagName('description')->item(0)->childNodes->item(0)->nodeValue;

with:

$comic_image=$x->getElementsByTagName('description')->item(0)->getElementsByTagName('img');
      print_r($comic_image);

But this also found nothing, producing the output:

DOMNodeList Object ( [length] => 0 )

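Sorry for the long background, but I'd like to know whether there is a way to parse just the img src out of a given RSS feed, without the other text and links I don't want?

Any help is much appreciated.

2 Answers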

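Inside the feed, the description content is escaped, so the DOM parser sees it as plain text rather than as elements. Decoding it and parsing it separately as HTML should work: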

foreach ($x as $y) {
    $description = $y->getElementsByTagName('description')->item(0);
    //the description holds escaped HTML as text, so decode it and parse it on its own
    $decoded_description = htmlspecialchars_decode($description->nodeValue);
    $description_xml = new DOMDocument();
    $description_xml->loadHTML($decoded_description);
    $image = $description_xml->getElementsByTagName('img')->item(0);
    if (is_null($image)) {
        continue; //this item has no image in its description
    }
    $comic_image = $image->getAttribute('src');

    //output the comic
    echo ($comic_image);
    echo ("<br>");
}
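
To see why the earlier `getElementsByTagName('img')` calls came back empty, here is a minimal sketch using a made-up two-item feed. The feed string, the example.com URLs and the variable names are assumptions for illustration, not taken from the real feeds. It shows that the outer document contains no `img` elements, whether the markup is entity-escaped or wrapped in CDATA, and that re-parsing the description text as HTML exposes them. Since `textContent` already has the XML entities decoded, this sketch skips `htmlspecialchars_decode`; a feed that double-escapes its markup would probably still need it.

```php
<?php
// Hypothetical feed for illustration only; real feeds will differ.
$rss = <<<'XML'
<rss version="2.0">
  <channel>
    <title>Example comic</title>
    <item>
      <title>Escaped markup</title>
      <description>&lt;p&gt;Today:&lt;/p&gt;&lt;img src="http://example.com/a.png"&gt;</description>
    </item>
    <item>
      <title>CDATA markup</title>
      <description><![CDATA[<img src="http://example.com/b.png"> <a href="#">ad</a>]]></description>
    </item>
  </channel>
</rss>
XML;

$feed = new DOMDocument();
$feed->loadXML($rss);

// The outer document has no <img> elements: the markup lives inside text nodes.
$imgCountInFeed = $feed->getElementsByTagName('img')->length;

$sources = [];
foreach ($feed->getElementsByTagName('item') as $item) {
    $description = $item->getElementsByTagName('description')->item(0);
    if ($description === null) {
        continue; // item without a description
    }
    $html = trim($description->textContent);
    if ($html === '') {
        continue; // nothing to parse
    }

    // Parse the description text as its own HTML document.
    $fragment = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings out of the output
    $fragment->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $img = $fragment->getElementsByTagName('img')->item(0);
    if ($img !== null) {
        $sources[] = $img->getAttribute('src');
    }
}

echo "img elements in feed: $imgCountInFeed\n";
print_r($sources);
```

Only the `src` values are collected, so the surrounding paragraph text and the ad link in the sample descriptions are dropped.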

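Answered 2024-04-29T23:31:06+08:00

For the reference of anyone reading this later, here is the code I ended up with. I replaced everything inside the foreach loop with a call to a getImageSrc function, which in turn calls a getImageTag function: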

//function to find an image tag within a specific section if there is one
function getImageTag ($item,$tagName)
{
    //pull desired section from given item
    $section = $item->getElementsByTagName($tagName)->item(0);
    if (is_null($section) || trim($section->nodeValue) === '') //this item has no such section, or the section is empty
    {
        return null;
    }
    //reparse the section as if it were a string, because PHP won't let you go directly to the image with getElementsByTagName (the markup is stored as escaped text)
    $decoded_section = htmlspecialchars_decode($section->nodeValue);
    $section_xml = new DOMDocument();
    @$section_xml->loadHTML($decoded_section); //the @ is to suppress a bunch of warnings about characters this parser doesn't like
    //pull image tag from section if there
    $image_tag = $section_xml->getElementsByTagName('img')->item(0);
    return $image_tag;
}

//function to get the image source URL from a given item
function getImageSrc ($item)
{
    $image_tag = getImageTag($item,'description');
    if (is_null($image_tag)) //if there was no image tag in the description section
    {
        //check in content:encoded section, because that's the next most likely place
        $image_tag = getImageTag($item,'encoded');
        if (is_null($image_tag)) //if there was nothing with the tag name of image in the encoded content section
        {
            //if the program gets here, it's probably because the feed is crap and doesn't include images,
            //or it's because this particular item doesn't have a comic image in it
            $image_src = '';
            //THIS EXCEPTION WILL PROBABLY NEED TO BE HANDLED LATER TO AVOID POTENTIAL ERRORS
        } else
        {
            $image_src = $image_tag->getAttribute('src');
        }
    } else
    {
        $image_src = $image_tag->getAttribute('src');
    }
    return $image_src;
}
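
As a rough sketch of how a lookup like this might be wired into the feed loop, the example below condenses the same idea (description first, then content:encoded) into two hypothetical helpers, `firstImageSrc` and `itemImageSrc`, and runs them over a made-up inline feed. The helper names, the sample feed and the example.com URL are assumptions for illustration. It also swaps the `@` operator for `libxml_use_internal_errors` and skips items that have no image, which is one possible way to handle the empty-source case flagged above.

```php
<?php
// Hypothetical helper: first <img> src found in an HTML string, or '' if there is none.
function firstImageSrc(string $html): string
{
    if (trim($html) === '') {
        return '';
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings instead of using @
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $img = $doc->getElementsByTagName('img')->item(0);
    return $img === null ? '' : $img->getAttribute('src');
}

// Hypothetical helper: look in description first, then in content:encoded.
function itemImageSrc(DOMElement $item): string
{
    $contentNs = 'http://purl.org/rss/1.0/modules/content/';
    $candidates = [
        $item->getElementsByTagName('description')->item(0),
        $item->getElementsByTagNameNS($contentNs, 'encoded')->item(0),
    ];
    foreach ($candidates as $node) {
        if ($node === null) {
            continue; // this section is missing from the item
        }
        $src = firstImageSrc($node->textContent);
        if ($src !== '') {
            return $src;
        }
    }
    return '';
}

// Made-up feed: one item with the image in content:encoded, one with no image at all.
$rss = <<<'XML'
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <description>Text only, no picture here.</description>
      <content:encoded><![CDATA[<p><img src="http://example.com/comic-1.png" alt=""></p>]]></content:encoded>
    </item>
    <item>
      <description>Nothing to show.</description>
    </item>
  </channel>
</rss>
XML;

$feed = new DOMDocument();
$feed->loadXML($rss);

$output = '';
foreach ($feed->getElementsByTagName('item') as $item) {
    $src = itemImageSrc($item);
    if ($src === '') {
        continue; // skip items without a comic image
    }
    // Escape the URL before putting it back into HTML.
    $output .= '<img src="' . htmlspecialchars($src, ENT_QUOTES) . '"><br>' . "\n";
}
echo $output;
```

Building a fresh `img` tag from the extracted URL, rather than echoing the feed's own markup, keeps the unwanted text and links out of the page.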

Answered 2024-04-29T23:31:06+08:00
