Configuring Google Custom Search to work like google.search()

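I have a relatively large project in which searching Google has returned the best results for our missing values. Using google search from Python gives me exactly the results I need. When I try Custom Search to lift the query limit, the results it returns are nowhere near what I need. I have the following code (suggested in "Searching in Google with Python"), which returns exactly what I need, identical to what I get when searching on the Google website, but it gets blocked for making too many HTTP requests...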

import http.cookiejar
import urllib.request

from bs4 import BeautifulSoup
from google import search

def google_scrape(url):
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    thepage = opener.open(url)
    soup = BeautifulSoup(thepage, "html.parser")
    return soup.title.text

i = 1
# queries = ['For. Policy Econ.','Int. J. Soc. For.','BMC Int Health Hum. Rights',
#            'Environ. Health Persp','Environ. Entomol.','Sociol. Rural.','Ecol. Soc.']

search_results = []    
abbrevs_searched = []   
url_results = []  

error_names = []
error = []

# Note: names_to_search is simply a longer version of the commented-out queries list.
for abbreviation in names_to_search:   
    query = abbreviation
    for url in search(query, num=2,stop=1):
        try:
            a = google_scrape(url)
            print(str(i) + ". " + a)
            search_results.append(a)
            abbrevs_searched.append(query)
            url_results.append(url)
            print(url)
            print(" ")
        except Exception as e:
            error_names.append(query)
            error.append(e)  # store the exception itself rather than the query a second time
            print("\n\n***************", " Exception: ", e)
        i += 1

I set up the Google Custom Search Engine code as follows...

import urllib.request
from bs4 import BeautifulSoup
import http.cookiejar
from googleapiclient.discovery import build  # apiclient is the legacy alias of this package
# List of names to search on Google
names_to_search = set(search_list_1+search_list)
service = build('customsearch', 'v1',developerKey="AIz**********************")
rse = service.cse().list(q="For. Policy Econ.",cx='*******************').execute()
rse

My Google Custom Search Engine is set up to search Google.com. For now, apart from the Google.com site, all other settings are at their defaults.
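For comparison with the `google.search()` loop, here is a minimal sketch of how the Custom Search response could be reduced to title/link pairs. It assumes the JSON layout of the Custom Search JSON API (an `items` list whose entries carry `title` and `link`); the helper names `build_cse_url` and `extract_hits`, the placeholder credentials, and the sample response are hypothetical and only illustrate the shape of the data.

```python
from urllib.parse import urlencode

CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_cse_url(query, api_key, cx, num=2):
    """Build the REST URL that corresponds to service.cse().list(q=..., cx=...)."""
    params = {"key": api_key, "cx": cx, "q": query, "num": num}
    return CSE_ENDPOINT + "?" + urlencode(params)


def extract_hits(response):
    """Return (title, link) pairs from a Custom Search JSON response dict."""
    # 'items' is missing when a query has no results, so fall back to an empty list.
    return [(item.get("title"), item.get("link")) for item in response.get("items", [])]


# Hypothetical response, trimmed to the two fields used above.
sample = {"items": [{"title": "Forest Policy and Economics",
                     "link": "https://example.org/journal"}]}

print(extract_hits(sample))
print(build_cse_url("For. Policy Econ.", "YOUR_API_KEY", "YOUR_CX"))
```

With the client library from the question, the same `extract_hits` helper could be applied to the dict returned by `.execute()`.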

1 Answer

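As far as I can tell, the problem is not a limitation of the Python module but the fact that Google does not allow scripts to scrape its pages. When I run your program (using the google module), I get HTTP Error 503. That happens because, after too many requests in a short time, Google asks for CAPTCHA confirmation, and there is no module that can bypass the CAPTCHA. So the other option is to use the Google Custom Search API. The catch is that it is designed for searching your own pages: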
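Spacing requests out does not get around the CAPTCHA, but it may reduce how often the 503 appears. The following is only a sketch of retrying with exponential backoff; `call_with_backoff` and `flaky_fetch` are hypothetical names, the delays are arbitrary, and the broad `except Exception` stands in for whatever error the real fetcher raises.

```python
import time


def call_with_backoff(fetch, arg, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(arg); after a failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries):
        try:
            return fetch(arg)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: let the caller see the last error
            sleep(base_delay * 2 ** attempt)


# Stand-in fetcher that fails twice before succeeding (no network involved).
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("HTTP Error 503")
    return "ok: " + url


waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_fetch, "https://example.org", sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the demo instant; in real use the default `time.sleep` would apply the delays.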

Google Custom Search enables you to create a search engine for your website, your blog, or a collection of websites.

There is a way to search the entire web, as Bangkokian explains in his answer:

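To create a Google Custom Search Engine that searches the entire web:

1. From the Google Custom Search homepage, click "Create a Custom Search Engine".
2. Type a name and description for your search engine.
3. Under "Define your search engine", in the "Sites to Search" box, enter at least one valid URL (for now, just put www.anyurl.com to get past this screen; more on this later).
4. Select the CSE edition you want and accept the Terms of Service, then click "Next".
5. Select the layout option you want, then click "Next".
6. Click any of the links under the "Next steps" section to navigate to your Control panel.
7. In the left-hand menu, under "Control Panel", click "Basics".
8. In the "Search Preferences" section, select "Search the entire web but emphasize included sites".
9. Click "Save Changes".
10. In the left-hand menu, under "Control Panel", click "Sites".
11. Delete the site you entered during the initial setup process.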

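You have already created a custom search engine, so in Google Custom Search you need to click the search engine you already have (it is probably named "Google").

Next, in the Search Preferences section, select "Search the entire web but emphasize included sites" (step 7), then click the Add button.

Enter http://www.example.org/, set it to include only specific pages, and click Save.

Then select the old site and click Delete.

Finally, click Update to save the changes.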

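Unfortunately, this will not give the same results as searching on the web:

Note that results may not match the results you get by searching on Google Web Search.

Also, you can only use the free edition:

This article applies only to free basic custom search engines. You can't set Google Site Search to search the entire web.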

And there is a limit of 100 search queries per day:

For CSE users, the API provides 100 search queries per day for free.
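One way to live with the 100-queries-per-day limit is to split the list of names into daily batches and run one batch per day. This is a sketch only; `daily_batches` is a hypothetical helper, the generated names are placeholders, and it assumes each name costs exactly one API query.

```python
def daily_batches(queries, per_day=100):
    """Deduplicate the queries and split them into chunks of at most per_day."""
    unique = sorted(set(queries))
    return [unique[i:i + per_day] for i in range(0, len(unique), per_day)]


# Placeholder names standing in for names_to_search.
names = ["Abbrev. {}".format(n) for n in range(250)]
batches = daily_batches(names)
print(len(batches), [len(b) for b in batches])
```

Deduplicating first matters here because every repeated abbreviation would otherwise use up part of the daily allowance.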

The only other option is to use another search engine's API. It seems the only free one is the FAROO API.

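Edit: You can use the selenium webdriver in Python to mimic browser usage. There are options to use the Firefox, Chrome, Edge, or Safari webdrivers (it will actually open Chrome and run the search), but that is annoying because you don't really want to see the browser. There is a solution, though: you can use PhantomJS.

PhantomJS is a headless WebKit scriptable with a JavaScript API.

Download PhantomJS, extract it, and see how to use it in the example below (I wrote a simple class you can use; you only need to change the path to PhantomJS):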
```
import time
from urllib.parse import quote_plus
from selenium import webdriver


class Browser:

    def __init__(self, path, initiate=True, implicit_wait_time = 10, explicit_wait_time = 2):
        self.path = path
        self.implicit_wait_time = implicit_wait_time    # http://www.aptuz.com/blog/selenium-implicit-vs-explicit-waits/
        self.explicit_wait_time = explicit_wait_time    # http://www.aptuz.com/blog/selenium-implicit-vs-explicit-waits/
        if initiate:
            self.start()
        return

    def start(self):
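        # Use the instance attribute, not the global `path`.
        # webdriver.PhantomJS exists in Selenium 3; it was removed in Selenium 4.
        self.driver = webdriver.PhantomJS(self.path)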
        self.driver.implicitly_wait(self.implicit_wait_time)
        return

    def end(self):
        self.driver.quit()
        return

    def go_to_url(self, url, wait_time = None):
        if wait_time is None:
            wait_time = self.explicit_wait_time
        self.driver.get(url)
        print('[*] Fetching results from: {}'.format(url))
        time.sleep(wait_time)
        return

    def get_search_url(self, query, page_num=0, per_page=10, lang='en'):
        query = quote_plus(query)
        # hl is Google's interface-language parameter (the original used 'nl')
        url = 'https://www.google.hr/search?q={}&num={}&start={}&hl={}'.format(query, per_page, page_num*per_page, lang)
        return url

    def scrape(self):
        # The XPath might change in the future
        links = self.driver.find_elements_by_xpath("//h3[@class='r']/a[@href]") # searches for all links inside h3 tags with class "r"
        results = []
        for link in links:
            d = {'url': link.get_attribute('href'),
                 'title': link.text}
            results.append(d)
        return results

    def search(self, query, page_num=0, per_page=10, lang='en', wait_time = None):
        if wait_time is None:
            wait_time = self.explicit_wait_time
        url = self.get_search_url(query, page_num, per_page, lang)
        self.go_to_url(url, wait_time)
        results = self.scrape()
        return results




path = '<YOUR PATH TO PHANTOMJS>/phantomjs-2.1.1-windows/bin/phantomjs.exe' ## SET YOUR PATH TO phantomjs
br = Browser(path)
results = br.search('For. Policy Econ.')
for r in results:
    print(r)

br.end()
```
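Because the comment in `scrape()` warns that the XPath may change, it can help to check the selector against saved HTML before running the browser. The sketch below imitates `//h3[@class='r']/a[@href]` with the standard library only; `ResultLinkParser` and the sample markup are hypothetical, and the class name `r` is taken from the code above, so it may not match what Google serves today.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect links inside <h3 class="r">, similar to //h3[@class='r']/a[@href]."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False    # True while inside an <h3 class="r">
        self.current = None   # link currently being read, if any
        self.results = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3" and attrs.get("class") == "r":
            self.in_h3 = True
        elif tag == "a" and self.in_h3 and "href" in attrs:
            # Unlike the XPath, this accepts any <a> nested in the h3, not only direct children.
            self.current = {"url": attrs["href"], "title": ""}

    def handle_data(self, data):
        if self.current is not None:
            self.current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self.current is not None:
            self.results.append(self.current)
            self.current = None
        elif tag == "h3":
            self.in_h3 = False


# Hypothetical markup: one matching heading and one without the class.
sample_html = ('<h3 class="r"><a href="https://example.org/a">First hit</a></h3>'
               '<h3><a href="https://example.org/b">Ignored</a></h3>')
parser = ResultLinkParser()
parser.feed(sample_html)
print(parser.results)
```

The result dictionaries use the same `url` and `title` keys as `Browser.scrape()`, so the two outputs can be compared directly.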
Answered on 2024-05-02T11:14:18+08:00
