The HTML I'm scraping is below. It contains one original post and two replies:
<div class="share_buttons noprint">...</div>
<strong>Dan</strong> Says:
<span class="small soft"><time datetime="2009-10-05T02:27:38Z">Sun, Oct 04 '09, 7:27 PM</time></span>
<div class="quote_top"> </div>
<div class="quote_item">Hello all, this is my original post.
</div>
<form class="action_heading noprint">
<strong>Page</strong>
...
</form>
<div class="post_number" id="r_140626">1</div>
<strong>AnnieMae</strong> Says:
<span class="small soft"><time datetime="2009-10-05T02:30:27Z">Sun, Oct 04 '09, 7:30 PM</time></span>
<div class="quote_top clear_float"> </div>
<div class="quote_item">What do you think of it?
</div>
<div class="post_number" id="r_140627">2</div>
<strong>Thomas77</strong> Says:
<span class="small soft"><time datetime="2009-10-05T02:32:32Z">Sun, Oct 04 '09, 7:32 PM</time></span>
<div class="quote_top clear_float"> </div>
<div class="quote_item">Not really sure, can't see this pic?
</div>
So I've already worked out how to get the original post:
'get AUTHOR and DATE of original post
Dim divOriginalPostAuthor As HtmlNode = threadDoc.DocumentNode.SelectSingleNode("//div[@class='share_buttons noprint']/following-sibling::strong")
Dim divOriginalPostDate As HtmlNode = threadDoc.DocumentNode.SelectSingleNode("//div[@class='share_buttons noprint']/following-sibling::span/time")
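Dim strDate As String = divOriginalPostDate.InnerText.Trim
'drop the leading day name ("Sun,") so only "Oct 04 '09, 7:27 PM" remains
strDate = strDate.Remove(0, InStr(strDate, ", ")).Trim
'expand the two-digit year ('09 -> 2009); pass the replacement as a String, not an Integer
strDate = Replace(strDate, "'", "20")
Dim strAuthor As String = (divOriginalPostAuthor.InnerText).Trim
dtPosted = CDate(strDate)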
divOriginalPostText = threadDoc.DocumentNode.SelectSingleNode("//div[@class='share_buttons noprint']/following-sibling::div[@class='quote_item']")
Now I just need to figure out how to get the replies. I'm thinking of getting the current line position, like this:
Dim currentNodePosition As Integer = threadDoc.DocumentNode.SelectSingleNode("//form[@class='action_heading noprint']").Line
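Then I would use it to iterate over the replies, incrementing the current line position as I go. The tricky part is that the replies have no "container" HTML element I can grab in one go. Any ideas?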
1 Answer
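Just for the record, I figured this out and wanted to post the answer for anyone who needs it in the future.
The idea is to "grab" a node that marks one reply and then use it again in an XPath expression, so that you only get the elements that appear after the node you grabbed. I did this by using HtmlNode.XPath, which gives you the XPath string of any given HtmlAgilityPack.HtmlNode, and then appending "/following-sibling::" plus the element I wanted.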
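Here is a minimal sketch of that approach, not the exact code I ended up with. It assumes each reply starts at a `<div class="post_number">` marker and that the author, date, and text are flat siblings following it, as in the question's HTML. The module name, the `threadHtml` sample string, and the use of the `datetime` attribute for parsing are my own illustrative choices.

```vbnet
Imports System
Imports System.Globalization
Imports HtmlAgilityPack

Module ReplyScraperSketch
    Sub Main()
        ' Hypothetical sample markup modeled on the two replies in the question.
        Dim threadHtml As String =
            "<div class='post_number' id='r_140626'>1</div>" &
            "<strong>AnnieMae</strong> Says:" &
            "<span class='small soft'><time datetime='2009-10-05T02:30:27Z'>Sun, Oct 04 '09, 7:30 PM</time></span>" &
            "<div class='quote_top clear_float'> </div>" &
            "<div class='quote_item'>What do you think of it?</div>" &
            "<div class='post_number' id='r_140627'>2</div>" &
            "<strong>Thomas77</strong> Says:" &
            "<span class='small soft'><time datetime='2009-10-05T02:32:32Z'>Sun, Oct 04 '09, 7:32 PM</time></span>" &
            "<div class='quote_top clear_float'> </div>" &
            "<div class='quote_item'>Not really sure, can't see this pic?</div>"

        Dim threadDoc As New HtmlDocument()
        threadDoc.LoadHtml(threadHtml)

        ' Every reply begins with a post_number marker div.
        Dim markers As HtmlNodeCollection =
            threadDoc.DocumentNode.SelectNodes("//div[@class='post_number']")
        If markers Is Nothing Then Return ' SelectNodes returns Nothing when nothing matches

        For Each marker As HtmlNode In markers
            ' marker.XPath is the absolute path of this marker; appending
            ' following-sibling:: restricts the search to nodes after it.
            Dim basePath As String = marker.XPath & "/following-sibling::"

            ' [1] picks the nearest following sibling, i.e. the one belonging to this reply.
            Dim author As HtmlNode = threadDoc.DocumentNode.SelectSingleNode(basePath & "strong[1]")
            Dim posted As HtmlNode = threadDoc.DocumentNode.SelectSingleNode(basePath & "span[1]/time")
            Dim body As HtmlNode = threadDoc.DocumentNode.SelectSingleNode(basePath & "div[@class='quote_item'][1]")
            If author Is Nothing OrElse posted Is Nothing OrElse body Is Nothing Then Continue For

            ' Assumption: the machine-readable datetime attribute is easier to parse
            ' than the display text ("Sun, Oct 04 '09, 7:30 PM").
            Dim dtPosted As DateTime = DateTime.Parse(
                posted.GetAttributeValue("datetime", ""),
                CultureInfo.InvariantCulture,
                DateTimeStyles.RoundtripKind)

            Console.WriteLine("{0} | {1} | {2:u} | {3}",
                              marker.InnerText.Trim(),
                              author.InnerText.Trim(),
                              dtPosted,
                              body.InnerText.Trim())
        Next
    End Sub
End Module
```

The `[1]` predicates matter here: without them, `following-sibling::strong` would also match the authors of all later replies. Calling `marker.SelectSingleNode("following-sibling::strong[1]")` directly on the marker node should give the same result without building the path string, but I have only described the XPath-string variant above.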