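I'm trying to build a web crawler that pulls the trending stocks from the TSX page. So far I have all the trending links, and now I'm trying to scrape the information on each individual page. With my code, when I print "quote_wrapper" in getStockDetails(), I get an empty list. I suspect that's because the JavaScript hasn't rendered on the page yet? Not sure if that's a thing. Anyway, I tried printing all of the page's HTML to debug, and I don't see the element there either. I read that the only way to get the "rendered" JavaScript is to use Selenium and call browser.execute_script("return document.documentElement.outerHTML"). That worked for the index page, so I tried using it on the other pages. I've also left comments about it in the code. Thanks for any help you can give.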
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup as soup
from urllib2 import urlopen as uReq
import time
import random
import requests

def getTrendingQuotes(source_code):
    # grabs all the trending quotes for that day
    links = []
    page_soup = soup(source_code, "lxml")
    trendingQuotes = page_soup.findAll("div", {"id": "trendingQuotes"})
    all_trendingQuotes = trendingQuotes[0].findAll('a')
    for link in all_trendingQuotes:
        url = link.get('href')
        name = link.text
        # print(name)
        links.append(url)
    return links

def getStockDetails(url, browser):
    print(url)
    source_code = browser.execute_script(
        "return document.documentElement.outerHTML")
    # What is the correct syntax here?
    # I'm trying to get the innerHTML of whole page in selenium driver
    # It seems I can only access the JavaScript for the entire page this way
    # source_code = browser.execute_script(
    #     "return" + url +".documentElement.outerHTML")
    page_soup = soup(source_code, "html.parser")
    # print(page_soup)
    quote_wrapper = page_soup.findAll("div", {"class": "quoteWrapper"})
    print(quote_wrapper)

def trendingBot(browser):
    while True:
        source_code = browser.execute_script(
            "return document.documentElement.outerHTML")
        trending = getTrendingQuotes(source_code)
        for trend in trending:
            browser.get(trend)
            getStockDetails(trend, browser)
        break
        # print(trend)

def Main():
    url = 'https://www.tmxmoney.com/en/index.html'
    browser = webdriver.Chrome(
        r"C:\Users\austi\OneDrive\Desktop\chromeDriver\chromedriver_win32\chromedriver.exe")
    browser.get(url)
    print("[+] Success! Bot Starting!")
    trendingBot(browser)
    browser.quit()

if __name__ == "__main__":
    Main()
1 Answer
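Please don't mix BeautifulSoup and Selenium. To scrape a page that is rendered with JavaScript, you need to wait until the element has been generated, using WebDriverWait. You can get the page source with browser.page_source, but it isn't needed here.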
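As a rough sketch of that advice, something like the following could replace getStockDetails(). It assumes the data really does sit in a div with class quoteWrapper, as in the question, and that the element is in the main document rather than inside an iframe. The function names element_texts and get_stock_details and the 10-second timeout are made up for illustration. Note that after browser.get(url) the browser is already on the new page, so there is no need to put the URL into the script passed to execute_script.

```python
def element_texts(elements):
    """Return the stripped, non-empty .text of each element."""
    texts = []
    for element in elements:
        text = element.text.strip()
        if text:
            texts.append(text)
    return texts


def get_stock_details(browser, url, timeout=10):
    """Load url and return the text of every div.quoteWrapper once present.

    `timeout` (seconds) is an arbitrary illustrative default.
    """
    # Imported inside the function so this sketch can be loaded even where
    # Selenium is not installed; normally these go at the top of the file.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser.get(url)
    # Poll the DOM until at least one matching element exists. If none shows
    # up within `timeout` seconds, Selenium raises TimeoutException.
    wrappers = WebDriverWait(browser, timeout).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.quoteWrapper"))
    )
    # Read the text straight from the WebElements; no BeautifulSoup needed.
    return element_texts(wrappers)
```

In the loop from the question this would be called as get_stock_details(browser, trend) for each trending link. If the wait times out on every page, the selector is probably wrong or the content lives in a frame, and that is worth checking in the browser's dev tools first.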