Extracting src from a BeautifulSoup tag

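I'm trying to scrape product names, descriptions, prices, and images from newegg using BeautifulSoup. I have the following bs4.element.Tag object, and I want to extract the "src" link from it. Here is my tag: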

df = <a class="itemImage" href="http://www.newegg.com/Product/Product.aspx?Item=N82E16875169194&amp;cm_re=Samsung_edge-_-75-169-194-_-Product" id="img_75-169-194" title='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty'>\n<img alt='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty' src="http://images10.newegg.com/ProductImageCompressAll200/75-169-194-04.jpg" title='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty'/>\n</a>

How can I extract

src="http://images10.newegg.com/ProductImageCompressAll200/75-169-194-04.jpg"

from this tag? I tried

df.attrs['src']

but I got a KeyError.

2 Answers

The src attribute is on the img tag nested inside the a tag, not on the a tag itself:

from bs4 import BeautifulSoup
tag = """<a class="itemImage" href="http://www.newegg.com/Product/Product.aspx?Item=N82E16875169194&amp;cm_re=Samsung_edge-_-75-169-194-_-Product" id="img_75-169-194" title='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty'>\n<img alt='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty' src="http://images10.newegg.com/ProductImageCompressAll200/75-169-194-04.jpg" title='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty'/>\n</a>"""

soup = BeautifulSoup(tag,"lxml")

src = soup.img["src"]

Which gives you:

http://images10.newegg.com/ProductImageCompressAll200/75-169-194-04.jpg
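If you already hold the a tag as a bs4.element.Tag (as in the question) rather than the raw HTML string, a minimal sketch of the same idea follows. It assumes a shortened stand-in for the original markup and uses the built-in "html.parser" instead of "lxml" so nothing extra has to be installed; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the tag in the question (attributes trimmed for brevity).
html = (
    '<a class="itemImage" '
    'href="http://www.newegg.com/Product/Product.aspx?Item=N82E16875169194" '
    'id="img_75-169-194">\n'
    '<img alt="Samsung Galaxy S7 Edge" '
    'src="http://images10.newegg.com/ProductImageCompressAll200/75-169-194-04.jpg"/>\n'
    '</a>'
)

# html.parser ships with Python, so no separate parser install is needed.
soup = BeautifulSoup(html, "html.parser")
df = soup.find("a", class_="itemImage")  # a bs4.element.Tag, as in the question

# The <a> tag's own attributes: there is no "src" among them, hence the KeyError.
a_attr_names = sorted(df.attrs)

# Step down to the nested <img>, then read its attribute.
img = df.find("img")
# .get() returns None instead of raising if the attribute is missing.
src = img.get("src") if img is not None else None

print(a_attr_names)
print(src)
```

Using .get("src") rather than ["src"] is a defensive choice for scraping many product tiles, where some anchors may have no image.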

Answered 2024-05-02T06:17:23+08:00

-1

Try a regular expression (see the Python re reference):
https://docs.python.org/2/library/re.html

import re
s = """
    <a class="itemImage" href="http://www.newegg.com/Product/Product.aspx?Item=N82E16875169194&amp;cm_re=Samsung_edge-_-75-169-194-_-Product" id="img_75-169-194" title='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty'>\n<img alt='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty' src="http://images10.newegg.com/ProductImageCompressAll200/75-169-194-04.jpg" title='Samsung Galaxy S7 Edge Dual SIM Unlocked Smart Phone, Dual Edge 5.5" AMOLED Display, black Color, 32GB Storage 4GB RAM International Version - No US Warranty'/>\n</a>
    """
# Raw string so that \s reaches the regex engine without an invalid-escape warning.
src_list = re.findall(r"src=[^\s]*", s)

Output:

src_list = ['src="http://images10.newegg.com/ProductImageCompressAll200/75-169-194-04.jpg"']
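Note that the match above includes the src= prefix and the surrounding quotes. If only the URL is wanted, a capture group can pull it out. The following is a sketch under the assumption that the attribute value is always double-quoted, as it is in this tag; it uses a shortened stand-in string, and regular expressions remain fragile on arbitrary HTML.

```python
import re

# Shortened stand-in for the tag in the question.
s = (
    '<img alt="Samsung Galaxy S7 Edge" '
    'src="http://images10.newegg.com/ProductImageCompressAll200/75-169-194-04.jpg" '
    'title="Samsung Galaxy S7 Edge"/>'
)

# The parentheses capture only the text between the double quotes after src=,
# so findall returns the bare URLs rather than the whole attribute.
src_urls = re.findall(r'src="([^"]*)"', s)

print(src_urls)
```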

Answered 2024-05-02T06:17:23+08:00
