
Extracting text from span class tags using BeautifulSoup


I am trying to extract some text elements from spans with a particular class on a website.

Here is a snippet of the HTML:

<h1>2 Some address</h1>
                </div>
                <div id="smi-summary-items">
                    <div id="smi-price-string">&euro;230,000</div>
                    <span class="header_text"> Detached House</span><span class="bar">&nbsp;|&nbsp;</span><span class="header_text">3 Beds</span><span class="bar">&nbsp;|&nbsp;</span><span class="header_text">2 Baths</span>
                    <!-- Text_Link_Full_Ad_Unit -->
                    <div id='dfp-text_link_full_ad_unit' class='sale'>
                        <script type='text/javascript'>
                            googletag.cmd.push(function()
                                {
                                    googletag.display('dfp-text_link_full_ad_unit');
                                }
                            );
                        </script>
                    </div>

I want to extract the text "3 Beds" and "2 Baths".

I have tried a few approaches, but I mostly get errors or empty results.

Can anyone suggest a solution?

2 Answers

  • 2

    As I understand it, you can filter the elements you need by class with a CSS selector:

    [item.get_text() for item in soup.select("span.header_text")]
    

    Complete working example:

    from bs4 import BeautifulSoup
    
    data = """
    <div id="smi-summary-items">
        <div id="smi-price-string">&euro;230,000</div>
        <span class="header_text"> Detached House</span><span class="bar">&nbsp;|&nbsp;</span><span class="header_text">3 Beds</span><span class="bar">&nbsp;|&nbsp;</span><span class="header_text">2 Baths</span>
        <!-- Text_Link_Full_Ad_Unit -->
        <div id='dfp-text_link_full_ad_unit' class='sale'>
            <script type='text/javascript'>
                googletag.cmd.push(function()
                    {
                        googletag.display('dfp-text_link_full_ad_unit');
                    }
                );
            </script>
        </div>"""
    soup = BeautifulSoup(data, "html.parser")
    print([item.get_text(strip=True) for item in soup.select("span.header_text")])
    

    This prints:

    ['Detached House', '3 Beds', '2 Baths']
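
The question asks only for "3 Beds" and "2 Baths", while the selector above also returns the property type. A minimal sketch of one way to narrow the result follows. It assumes that the bed and bath entries are the only `header_text` spans whose text starts with a digit, which holds for this snippet but may not hold on every page; the variable names `texts` and `wanted` are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div id="smi-summary-items">
    <div id="smi-price-string">&euro;230,000</div>
    <span class="header_text"> Detached House</span><span class="bar">&nbsp;|&nbsp;</span><span class="header_text">3 Beds</span><span class="bar">&nbsp;|&nbsp;</span><span class="header_text">2 Baths</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the stripped text of every header_text span inside the summary block.
texts = [
    span.get_text(strip=True)
    for span in soup.select("#smi-summary-items span.header_text")
]

# Assumption: only the bed/bath entries begin with a digit.
wanted = [t for t in texts if t[:1].isdigit()]
print(wanted)  # ['3 Beds', '2 Baths']
```

If the order of the spans is known to be stable, slicing with `texts[1:]` would give the same result for this snippet, at the cost of breaking silently when the page layout changes.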
    
  • 0

    The following code also extracts the text from the spans with the given class:

    >>> from bs4 import BeautifulSoup
    >>> content = """<h1>2 Some address</h1>
    ...                 </div>
    ...                 <div id="smi-summary-items">
    ...                     <div id="smi-price-string">&euro;230,000</div>
    ...                     <span class="header_text"> Detached House</span><span class="bar">&nbsp;|&nbsp;</span><span class="header_text">3 Beds</span><span class="bar">&nbsp;|&nbsp;</span><span class="header_text">2 Baths</span>
    ...                     <!-- Text_Link_Full_Ad_Unit -->
    ...                     <div id='dfp-text_link_full_ad_unit' class='sale'>
    ...                         <script type='text/javascript'>
    ...                             googletag.cmd.push(function()
    ...                                 {
    ...                                     googletag.display('dfp-text_link_full_ad_unit');
    ...                                 }
    ...                             );
    ...                         </script>
    ...                     </div>"""
    
    >>> soup = BeautifulSoup(content, "html.parser")
    >>> req = soup.find_all("span", {"class":"header_text"})
    >>> print(req)
    [<span class="header_text"> Detached House</span>, <span class="header_text">3 Beds</span>, <span class="header_text">2 Baths</span>]
    >>> texts = []
    >>> for span in req:
    ...     texts.append(span.get_text())
    ...
    >>> print(texts)
    [' Detached House', '3 Beds', '2 Baths']
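
Note that the first entry above keeps its leading space because `get_text()` is called without `strip=True`. As a rough sketch of a possible next step, the same `find_all` lookup can be combined with a regular expression to turn the matching entries into numbers. The pattern and the `counts` dictionary are assumptions made for illustration, and they expect each entry to look like "<number> <word>".

```python
import re
from bs4 import BeautifulSoup

html = (
    '<span class="header_text"> Detached House</span>'
    '<span class="bar">&nbsp;|&nbsp;</span>'
    '<span class="header_text">3 Beds</span>'
    '<span class="bar">&nbsp;|&nbsp;</span>'
    '<span class="header_text">2 Baths</span>'
)

soup = BeautifulSoup(html, "html.parser")

counts = {}
for span in soup.find_all("span", class_="header_text"):
    text = span.get_text(strip=True)  # strip=True drops the leading space
    # Assumption: numeric entries have the form "<digits> <word>".
    match = re.fullmatch(r"(\d+)\s+(\w+)", text)
    if match:
        counts[match.group(2).lower()] = int(match.group(1))

print(counts)  # {'beds': 3, 'baths': 2}
```

Entries that do not match the pattern, such as "Detached House", are skipped rather than raising an error.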
    
