首页 文章

从类VBA中提取 Span 值

提问于
浏览
2

经过大量搜索后,我正在努力使用VBA从下面的HTML中删除数据 . 具体来说,我试图从下面的HTML代码中的每个class =“_ Xnb _QJ”中提取值“DATA ONE”和“DATA THREE”:

<div class="results">
  <div class="_s2 _wPc">
    <div class="_fW _QJ">
    <div class="_Xnb _QJ _Z9b">
    <div class="_Xnb _QJ">
    <div class="_Xnb _QJ">
    <div class="_Xnb _QJ">
      <a href="//Extracted URL//">
        <span class="_fbb">
          <img id="uid_3" //Extracted// >
        </span>
        <span class="_PHb">
          <span class="_MHb">DATA ONE</span>
        </span>
        <span class="_B6e">
          <span class="_x2">DATA TWO</span>
          <span class="_Fs"> DATA THREE </span>

我一直在尝试使用getElementsByClassName来获取“_Xnb _QJ”类的集合,并且对于每个类,使用getElementsByTagName来搜索“_MHb”和“_FS” . 我不能按数字顺序挑选子项,因为这在“_Xnb ..”类之间发生变化,但我需要的数据总是附加相同的(_MHb / FS)类标记 .

我是VBA / HTML的完全新手所以这个代码主要是通过在stackoverflow上的其他地方编辑示例来组装的 . 我想知道我需要的类是否在“href”内而不是直接在_Xnb类之下,这是我无法提取正确数据的原因吗?

下面我的VBA代码的相关部分 - 当我运行它时,代码似乎运行正常但没有收集数据 .

Dim RowNumber As Long
Dim DataOne As String
Dim DataThree As String
Dim QuestionList As IHTMLElementCollection
Dim Question As IHTMLElement
Dim QuestionFields As IHTMLElementCollection
Dim QuestionField As IHTMLElement
RowNumber = 1

Set QuestionList = html.getElementsByClassName("_Xnb _QJ")

For Each Question In QuestionList
Set QuestionFields = Question.getElementsByTagName("SPAN")

For Each QuestionField In QuestionFields
If QuestionField.className = "_MHb" Then
DataOne= QuestionField.innerText
Cells(RowNumber, 1).Value = DataOne
End If

If QuestionField.className = "_Fs" Then
DataThree = QuestionField.innerText
Cells(RowNumber, 2).Value = DataThree
End If

Next QuestionField
RowNumber = RowNumber + 1
Next
Set html = Nothing
MsgBox "Done!"

End Sub

任何帮助将非常感激 .

非常感谢

1 回答

  • 0

    我建议你研究XPath - 一种基于标准的查询语言,用于处理XML文档 . 您也可以将它与HTML文档结合使用 . 它有点神秘,但超级有用,也可用于VBA .

    您的示例HTML看起来有点复杂,因为您有多个具有相同类的 <div> 标记 . 由于 <img> 标记中的 //Extracted// ,它也不是有效的XML . 此外,示例中没有结束标记 . 无论如何,我已经在下面的代码示例中整理了它 .

    我看了你的问题,并解释如下:

    从<span>标签中提取任何文本,其中包含_MHb或Fs类;它是_Xnb _QJ类的<div>标签的后代

    如果是这样,您的XPath查询可以由三部分构成:

    //div[@class='_Xnb _QJ']
    

    含义 - 获取类为 _Xnb _QJ 的任何div标签 .

    (//div[@class='_Xnb _QJ'])[last()]
    

    含义 - 从第一组中获取最里面的项目(记住你有多个嵌套的 <div> 标签与同一个类) .

    (//div[@class='_Xnb _QJ'])[last()]//span[@class='_MHb' or @class='_Fs']
    

    含义 - 为具有 _Mhb_Fs 类的 <span> 标记过滤最里面的 <div> .

    因此,如果包含MSXML库(我认为您已经完成),则可以在VBA中使用XPath . 代码如下所示:

    Option Explicit
    
    Sub Test()
    
        Dim strXml As String
        Dim objXml As New DOMDocument60
        Dim strXPath As String
        Dim objXmlNodeList As IXMLDOMNodeList
        Dim objXmlNode As IXMLDOMNode
    
        'get the sample XML
        strXml = GetXml
    
        'load xml to document
        If Not objXml.LoadXML(strXml) Then
            Debug.Print "Not parsed"
            Exit Sub
        End If
    
        'apply XPath
        'first just let's get the last <div> tag of class _Xnb _QJ
        strXPath = "(//div[@class='_Xnb _QJ'])[last()]"
        'test that query
        Set objXmlNodeList = objXml.SelectNodes(strXPath)
        For Each objXmlNode In objXmlNodeList
            Debug.Print objXmlNode.XML
        Next objXmlNode
    
        'now lets append a filter to only get the <span> texts
        strXPath = strXPath & "//span[@class='_MHb' or @class='_Fs']"
    
        'get output nodes by applying query to xml
        Set objXmlNodeList = objXml.SelectNodes(strXPath)
        For Each objXmlNode In objXmlNodeList
            Debug.Print objXmlNode.Text
        Next objXmlNode
    
    End Sub
    
    Function GetXml() As String
    
        Dim strXml As String
    
        strXml = ""
        strXml = strXml & "<div class=""results"">"
        strXml = strXml & "  <div class=""_s2 _wPc"">"
        strXml = strXml & "    <div class=""_fW _QJ"">"
        strXml = strXml & "      <div class=""_Xnb _QJ _Z9b"">"
        strXml = strXml & "        <div class=""_Xnb _QJ"">"
        strXml = strXml & "          <div class=""_Xnb _QJ"">"
        strXml = strXml & "            <div class=""_Xnb _QJ"">"
        strXml = strXml & "              <a href=""//Extracted URL//"">"
        strXml = strXml & "                <span class=""_fbb"">"
        strXml = strXml & "                  <img id=""uid_3"" />"
        strXml = strXml & "                </span>"
        strXml = strXml & "                <span class=""_PHb"">"
        strXml = strXml & "                  <span class=""_MHb"">DATA ONE</span>"
        strXml = strXml & "                </span>"
        strXml = strXml & "                <span class=""_B6e"">"
        strXml = strXml & "                  <span class=""_x2"">DATA TWO</span>"
        strXml = strXml & "                  <span class=""_Fs""> DATA THREE </span>"
        strXml = strXml & "                </span>"
        strXml = strXml & "              </a>"
        strXml = strXml & "            </div>"
        strXml = strXml & "          </div>"
        strXml = strXml & "        </div>"
        strXml = strXml & "      </div>"
        strXml = strXml & "    </div>"
        strXml = strXml & "  </div>"
        strXml = strXml & "</div>"
    
        GetXml = strXml
    
    End Function
    

    调试输出如下所示:

    <div class="_Xnb _QJ">
        <a href="//Extracted URL//">
            <span class="_fbb">
                <img id="uid_3"/>
            </span>
            <span class="_PHb">
                <span class="_MHb">DATA ONE</span>
            </span>
            <span class="_B6e">
                <span class="_x2">DATA TWO</span>
                <span class="_Fs"> DATA THREE </span>
            </span>
        </a>
    </div>
    DATA ONE
    DATA THREE
    

    这看起来有点复杂 - 但是一旦你尝试了几次就会没问题 .

相关问题