
Scraping the links of an eBay Featured Collection's product pages


I'm attempting to build a web scraping tool using Python and BeautifulSoup that enters an eBay Featured Collection and retrieves the URLs of all the products within the collection (most collections have 17 products, although some have a few more or less). Here's the URL of the collection I'm trying to scrape in my code: http://www.ebay.com/cln/ebayhomeeditor/Surface-Study/324079803018

Here's my code so far:

import requests
from bs4 import BeautifulSoup

url = 'http://www.ebay.com/cln/ebayhomeeditor/Surface-Study/324079803018'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

product_links = []

item_thumb = soup.find_all('div', attrs={'class':'itemThumb'})
for link in item_thumb:
    product_links.append(link.find('a').get('href'))

print(product_links)

This scraper should append 17 links to the list product_links. However, it only works partway: it only ever scrapes the first 12 product links and leaves the remaining 5 behind, even though all 17 links sit in the same HTML tags with the same attributes. Looking more closely at the page's HTML, the only difference I can find is that the first 12 links and the last 5 are separated by the script I've included here:

<script escape-xml="true">
      if (typeof(collectionState) != 'object') {
          var collectionState = {
              itemImageSize: {sWidth: 280, sHeight: 280, lWidth: 580, lHeight: 620},
              page: 1,
              totalPages: 2,
              totalItems: 17,
              pageId: '2057253',
              currentUser: '',
              collectionId: '323101965012',
              serviceHost: 'svcs.ebay.com/buying/collections/v1',
              owner: 'ebaytecheditor',
              csrfToken: '',
              localeId: 'en-US',
              siteId: 'EBAY-US',
              countryId: 'US',
              collectionCosEnabled: 'true',
              collectionCosHostExternal: 'https://api.ebay.com/social/collection/v1',
              collectionCosEditEnabled: 'true',
              isCollectionReorderEnabled: 'false',
              isOwnerSignedIn: false || false,
              partiallySignedInUser: '@@__@@__@@',
              baseDomain: 'ebay.com',
              currentDomain: 'www.ebay.com',
              isTablet: false,
              isMobile: false,
              showViewCount: true
          };
      }
    </script>

What does this script do? Could it be the reason my scraper skips the last 5 links? Is there a way around this?

1 Answer


    The last few are generated with an ajax request to http://www.ebay.com/cln/_ajax/2/ebayhomeeditor/324079803018


    The url is constructed from ebayhomeeditor and what must be some sort of product id, 324079803018; both appear in the original url of the page you visited.

    The only parameter necessary to get the data is itemsPerPage, but you can play around with the rest and see what effect they have.

    params = {"itemsPerPage": "10"}
    soup = BeautifulSoup(requests.get("http://www.ebay.com/cln/_ajax/2/ebayhomeeditor/324079803018", params=params).content, "html.parser")
    print([a["href"] for a in soup.select("div.itemThumb div.itemImg.image.lazy-image a[href]")])
    

    Which will give you:

    ['http://www.ebay.com/itm/yamazaki-home-tower-book-end-white-stationary-holder-desktop-organizing-steel/171836462366?hash=item280240551e', 'http://www.ebay.com/itm/tetris-constructible-interlocking-desk-lamp-neon-light-nightlight-by-paladone/221571335719?hash=item3396ae4627', 'http://www.ebay.com/itm/iphone-docking-station-dock-native-union-new-in-box/222202878086?hash=item33bc52d886', 'http://www.ebay.com/itm/turnkey-pencil-sharpener-silver-office-home-school-desk-gift-peleg-design/201461359979?hash=item2ee808656b', 'http://www.ebay.com/itm/himori-weekly-times-desk-notepad-desktop-weekly-scheduler-30-weeks-planner/271985620013?hash=item3f539b342d']
    

    So putting it all together to get all the urls:

    In [23]: params = {"itemsPerPage": "10"}
    
    In [24]: with requests.Session() as s:
       ....:         soup1 = BeautifulSoup(s.get('http://www.ebay.com/cln/ebayhomeeditor/Surface-Study/324079803018').content,
       ....:                               "html.parser")
       ....:         main_urls = [a["href"] for a in soup1.select("div.itemThumb div.itemImg.image.lazy-image a[href]")]
       ....:         soup2 = BeautifulSoup(s.get("http://www.ebay.com/cln/_ajax/2/ebayhomeeditor/324079803018", params=params).content,
       ....:                               "html.parser")
       ....:         print(len(main_urls))
       ....:         main_urls.extend(a["href"] for a in soup2.select("div.itemThumb div.itemImg.image.lazy-image a[href]"))
       ....:         print(main_urls)
       ....:         print(len(main_urls))
       ....:     
    12
    ['http://www.ebay.com/itm/archi-desk-accessories-pen-cup-designed-by-hsunli-huang-for-moma/262435041373?hash=item3d1a58f05d', 'http://www.ebay.com/itm/moorea-seal-violet-light-crane-scissors/201600302323?hash=item2ef0507cf3', 'http://www.ebay.com/itm/kikkerland-photo-holder-with-6-magnetic-wooden-clothespin-mh69-cable-47-long/361394782932?hash=item5424cec2d4', 'http://www.ebay.com/itm/authentic-22-design-studio-merge-concrete-pen-holder-desk-office-pencil/331846509549?hash=item4d4397e3ed', 'http://www.ebay.com/itm/supergal-bookend-by-artori-design-ad103-metal-black/272273290322?hash=item3f64c0b452', 'http://www.ebay.com/itm/elago-p2-stand-for-ipad-tablet-pcchampagne-gold/191527567203?hash=item2c97eebf63', 'http://www.ebay.com/itm/this-is-ground-mouse-pad-pro-ruler-100-authentic-natural-retail-100/201628986934?hash=item2ef2062e36', 'http://www.ebay.com/itm/hot-fuut-foot-rest-hammock-under-desk-office-footrest-mini-stand-hanging-swing/152166878943?hash=item236dda4edf', 'http://www.ebay.com/itm/unido-silver-white-black-led-desk-office-lamp-adjustable-neck-brightness-level/351654910666?hash=item51e0441aca', 'http://www.ebay.com/itm/in-house-black-desk-office-organizer-paper-clips-memo-notes-monkey-business/201645856763?hash=item2ef30797fb', 'http://www.ebay.com/itm/rifle-paper-co-2017-maps-desk-calendar-illustrated-worldwide-cities/262547131670?hash=item3d21074d16', 'http://www.ebay.com/itm/muji-erasable-pen-black/262272348079?hash=item3d10a66faf', 'http://www.ebay.com/itm/rifle-paper-co-2017-maps-desk-calendar-illustrated-worldwide-cities/262547131670?hash=item3d21074d16', 'http://www.ebay.com/itm/muji-erasable-pen-black/262272348079?hash=item3d10a66faf', 'http://www.ebay.com/itm/yamazaki-home-tower-book-end-white-stationary-holder-desktop-organizing-steel/171836462366?hash=item280240551e', 'http://www.ebay.com/itm/tetris-constructible-interlocking-desk-lamp-neon-light-nightlight-by-paladone/221571335719?hash=item3396ae4627', 
'http://www.ebay.com/itm/iphone-docking-station-dock-native-union-new-in-box/222202878086?hash=item33bc52d886', 'http://www.ebay.com/itm/turnkey-pencil-sharpener-silver-office-home-school-desk-gift-peleg-design/201461359979?hash=item2ee808656b', 'http://www.ebay.com/itm/himori-weekly-times-desk-notepad-desktop-weekly-scheduler-30-weeks-planner/271985620013?hash=item3f539b342d']
    19
    
    In [25]:
    

    There is a little overlap in what is returned, so just use a set to store main_urls, or call set on the list:

    In [25]: len(set(main_urls))
    Out[25]: 17
    

    Not sure why that happens and I haven't really tried to figure it out. If it bothers you, you could parse "totalItems: 17" from the source returned by the first call, subtract the length of main_urls, and set {"itemsPerPage": str(int(parsed_total) - len(main_urls))} for the ajax request, but I wouldn't worry too much about it.
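    That last idea can be sketched as follows; the `html` string here is a trimmed stand-in for the real page source, and the variable names are illustrative:

```python
import re

# Sketch: parse "totalItems" out of the inline collectionState script,
# then request only the items the first page didn't include.
html = """
<script escape-xml="true">
      var collectionState = {
          page: 1,
          totalPages: 2,
          totalItems: 17,
      };
</script>
"""
match = re.search(r"totalItems:\s*(\d+)", html)
parsed_total = int(match.group(1)) if match else 0

main_urls = ["..."] * 12  # pretend these are the 12 links from the first page
params = {"itemsPerPage": str(parsed_total - len(main_urls))}
print(params)  # → {'itemsPerPage': '5'}
```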
