
Looping through nodes created by HtmlAgilityPack


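I need to parse this HTML using HtmlAgilityPack and C#. I can get the div class="patent_bibdata" node, but I don't know how to loop through its child nodes.

This sample has 6 hrefs, but I need to split them into two groups: inventors and classifications. I'm not interested in the last two. The div can contain any number of hrefs.

As you can see, each of the two groups is preceded by a text label that says what the hrefs are.

Code snippet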

HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://www.google.com/patents/US3748943");
string xpath = "/html/body/table[@id='viewport_table']/tr/td[@id='viewport_td']/div[@class='vertical_module_list_row'][1]/div[@id='overview']/div[@id='overview_v']/table[@id='summarytable']/tr/td/div[@class='patent_bibdata']";
HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);

So how would you do it?

<div class="patent_bibdata">
    <b>Inventors</b>:&nbsp;
    <a href="http://www.google.com/search?tbo=p&amp;tbm=pts&amp;hl=en&amp;q=ininventor:%22Ronald+T.+Lashley%22">
    Ronald T. Lashley
    </a>, 
    <a href="http://www.google.com/search?tbo=p&amp;tbm=pts&amp;hl=en&amp;q=ininventor:%22Ronald+T.+Lashley%22">
    Ronald T. Lashley
    </a><br>
    <b>Current U.S. Classification</b>:&nbsp;
    <a href="http://www.google.com/url?id=3eF8AAAAEBAJ&amp;q=http://www.uspto.gov/web/patents/classification/uspc084/defs084.htm&amp;usg=AFQjCNEZRFtAyKTfNudgc-XVt2-VboD77Q#C084S31200P">84/312.00P</a>;
    <a href="http://www.google.com/url?id=3eF8AAAAEBAJ&amp;q=http://www.uspto.gov/web/patents/classification/uspc084/defs084.htm&amp;usg=AFQjCNEZRFtAyKTfNudgc-XVt2-VboD77Q#C084S31200R">84/312.00R</a><br>
    <br>
    <a href="http://www.google.com/url?id=3eF8AAAAEBAJ&q=http://patft.uspto.gov/netacgi/nph-Parser%3FSect2%3DPTO1%26Sect2%3DHITOFF%26p%3D1%26u%3D/netahtml/PTO/search-bool.html%26r%3D1%26f%3DG%26l%3D50%26d%3DPALL%26RefSrch%3Dyes%26Query%3DPN/3748943&usg=AFQjCNGKUic_9BaMHWdCZtCghtG5SYog-A">
    View patent at USPTO</a><br>
    <a href="http://www.google.com/url?id=3eF8AAAAEBAJ&q=http://assignments.uspto.gov/assignments/q%3Fdb%3Dpat%26pat%3D3748943&usg=AFQjCNGbD7fvsJjOib3GgdU1gCXKiVjQsw">
    Search USPTO Assignment Database
    </a><br>
</div>

Desired result: InventorGroup =

<a href="http://www.google.com/search?tbo=p&amp;tbm=pts&amp;hl=en&amp;q=ininventor:%22Ronald+T.+Lashley%22">
    Ronald T. Lashley
    </a>
    <a href="http://www.google.com/search?tbo=p&amp;tbm=pts&amp;hl=en&amp;q=ininventor:%22Ronald+T.+Lashley%22">
    Thomas R. Lashley
    </a>

ClassificationGroup

<a href="http://www.google.com/url?id=3eF8AAAAEBAJ&amp;q=http://www.uspto.gov/web/patents/classification/uspc084/defs084.htm&amp;usg=AFQjCNEZRFtAyKTfNudgc-XVt2-VboD77Q#C084S31200P">84/312.00P</a>;
    <a href="http://www.google.com/url?id=3eF8AAAAEBAJ&amp;q=http://www.uspto.gov/web/patents/classification/uspc084/defs084.htm&amp;usg=AFQjCNEZRFtAyKTfNudgc-XVt2-VboD77Q#C084S31200R">84/312.00R</a>

The page I'm trying to scrape: http://www.google.com/patents/US3748943

// Anders

PS: I know the inventor names are identical on this page, but in most cases they are different!

2 Answers

  • 2

    So clearly I don't understand XPath yet, so I came up with this solution. It may not be the smartest one, but it does work!

    // Anders

    List<string> inventorList = new List<string>();
    List<string> classificationList = new List<string>();
    
    string xpath = "/html/body/table[@id='viewport_table']/tr/td[@id='viewport_td']/div[@class='vertical_module_list_row'][1]/div[@id='overview']/div[@id='overview_v']/table[@id='summarytable']/tr/td/div[@class='patent_bibdata']";
    // m_doc is the HtmlDocument loaded earlier.
    HtmlNode nodes = m_doc.DocumentNode.SelectSingleNode(xpath);
    bool bInventors = false;
    bool bClassification = false;
    for (int i = 0; i < nodes.ChildNodes.Count; i++)
    {
        HtmlNode node = nodes.ChildNodes[i];
        string txt = node.InnerText;
        // The label text decides which group the following links belong to.
        if (txt.IndexOf("Inventor") > -1)
        {
            bClassification = false;
            bInventors = true;
        }
        if (txt.IndexOf("Classification") > -1)
        {
            bClassification = true;
            bInventors = false;
        }
        // The trailing USPTO links belong to neither group.
        if (txt.IndexOf("USPTO") > -1)
        {
            bClassification = false;
            bInventors = false;
        }
        // Compare the name exactly; IndexOf("a") would also match names such as "table".
        if (node.Name == "a")
        {
            if (bInventors)
            {
                string inventor = node.InnerText;
                inventorList.Add(inventor);
            }
            if (bClassification)
            {
                string classification = node.InnerText;
                classificationList.Add(classification);
            }
        }
    }
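
The flag-based loop above can also be written without hard-coding the label names. The following is only a sketch, not the answerer's code: it assumes each group starts with a `<b>` label and ends at the next `<br>`, as in the sample div. The class name `BibDataGrouper`, the method `GroupLinksByLabel`, and the shortened inline HTML (with placeholder `href='#'` values) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class BibDataGrouper
{
    // Collects the text of each <a> under the most recent <b> label.
    // Links that appear after a <br> and before the next <b> are ignored.
    public static Dictionary<string, List<string>> GroupLinksByLabel(HtmlNode container)
    {
        var groups = new Dictionary<string, List<string>>();
        string currentLabel = null;

        foreach (HtmlNode child in container.ChildNodes)
        {
            if (child.Name == "b")
            {
                // A <b> element starts a new group named after its text.
                currentLabel = child.InnerText.Trim();
                groups[currentLabel] = new List<string>();
            }
            else if (child.Name == "br")
            {
                // A <br> closes the current group.
                currentLabel = null;
            }
            else if (child.Name == "a" && currentLabel != null)
            {
                groups[currentLabel].Add(HtmlEntity.DeEntitize(child.InnerText).Trim());
            }
        }
        return groups;
    }

    public static void Main()
    {
        // Shortened stand-in for the div in the question (assumption, not the live page).
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div class='patent_bibdata'>" +
            "<b>Inventors</b>:&nbsp;<a href='#'>Ronald T. Lashley</a>, <a href='#'>Thomas R. Lashley</a><br>" +
            "<b>Current U.S. Classification</b>:&nbsp;<a href='#'>84/312.00P</a>; <a href='#'>84/312.00R</a><br>" +
            "<br><a href='#'>View patent at USPTO</a><br>" +
            "</div>");

        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div[@class='patent_bibdata']");
        Dictionary<string, List<string>> groups = GroupLinksByLabel(div);

        Console.WriteLine(string.Join(", ", groups["Inventors"]));
        Console.WriteLine(string.Join(", ", groups["Current U.S. Classification"]));
    }
}
```

Keying the result on the label text means a new labelled group on the page would be picked up without another boolean flag, at the cost of relying on the `<b>` ... `<br>` layout staying the same.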
    
  • 4

    XPath is your friend! Something like this will get you the inventor names:

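    HtmlWeb w = new HtmlWeb();
    HtmlDocument doc = w.Load("http://www.google.com/patents/US3748943");
    // The inventor links are the <a> siblings that precede the first <br> in the div.
    HtmlNodeCollection inventorNodes = doc.DocumentNode.SelectNodes("//div[@class='patent_bibdata']/br[1]/preceding-sibling::a");
    if (inventorNodes != null) // SelectNodes returns null when nothing matches
    {
        foreach (HtmlNode node in inventorNodes)
        {
            Console.WriteLine(node.InnerHtml);
        }
    }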
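
The same idea can probably be extended to the classification links. The sketch below is an assumption built on the sample markup, not part of the original answer: it selects the `<a>` elements that precede the second `<br>` and have at least one `<br>` before them, which in the sample div are exactly the classification links. The class name `ClassificationLinks`, the method `GetClassifications`, and the shortened inline HTML are hypothetical, and the expression depends on the inventors group coming first.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ClassificationLinks
{
    // Anchors that sit between the first and the second <br> of the div.
    private const string ClassificationXPath =
        "//div[@class='patent_bibdata']/br[2]/preceding-sibling::a[preceding-sibling::br]";

    public static List<string> GetClassifications(HtmlDocument doc)
    {
        var result = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(ClassificationXPath);
        if (nodes == null)
        {
            // SelectNodes returns null when nothing matches.
            return result;
        }
        foreach (HtmlNode node in nodes)
        {
            result.Add(node.InnerText.Trim());
        }
        return result;
    }

    public static void Main()
    {
        // Shortened stand-in for the div in the question (assumption, not the live page).
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div class='patent_bibdata'>" +
            "<b>Inventors</b>: <a href='#'>Ronald T. Lashley</a>, <a href='#'>Thomas R. Lashley</a><br>" +
            "<b>Current U.S. Classification</b>: <a href='#'>84/312.00P</a>; <a href='#'>84/312.00R</a><br>" +
            "<br><a href='#'>View patent at USPTO</a><br>" +
            "</div>");

        foreach (string classification in GetClassifications(doc))
        {
            Console.WriteLine(classification);
        }
    }
}
```

Because the expression counts `<br>` elements, it would need adjusting if a page lists the groups in a different order or omits the inventors line.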
    
