
How to use Splash (a JS rendering service) with a proxy


It is configured automatically in Scrapy, but not in curl or in plain requests.

In curl, we can render a page without any proxy:

http://<server_ip>:8050/render.html?url=http://www.example.com/?timeout=10&wait=0.5
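Note that in this form the `url` parameter is not URL-encoded, so `?timeout=10` is glued onto example.com's own URL, and an unquoted `&wait=0.5` is swallowed by the shell. A minimal sketch (Python standard library only; the Splash host is a placeholder) of building the query so each option actually reaches Splash:

```python
from urllib.parse import urlencode

# Placeholder Splash endpoint; substitute your server's address.
SPLASH = "http://localhost:8050/render.html"

# Each render option is a separate, properly encoded query parameter,
# so timeout/wait go to Splash instead of being appended to the target URL.
params = {"url": "http://www.example.com/", "timeout": 10, "wait": 0.5}
render_url = SPLASH + "?" + urlencode(params)
print(render_url)
```

The same URL can then be passed to curl, quoted, without the shell splitting it at `&`.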

How can I use a proxy?

I tried this:

http://<server_ip>:8050/render.html?url=http://www.example.com/?timeout=10&wait=0.5 --proxy myproxy:port

But I got:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <title>Lightspeed Systems - Web Access</title>

  <style type="text/css">
    html {
      background: #13396b; /* Old browsers */
      /* IE9 SVG, needs conditional override of 'filter' to 'none' */
      background: url(data:image/svg+xml;base64,PD94bWwgdmVyc2lvbj0iMS4wIiA/Pgo8c3ZnIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIgd2lkdGg9IjEwMCUiIGhlaWdodD0iMTAwJSIgdmlld0JveD0iMCAwIDEgMSIgcHJlc2VydmVBc3BlY3RSYXRpbz0ibm9uZSI+CiAgPGxpbmVhckdyYWRpZW50IGlkPSJncmFkLXVjZ2ctZ2VuZXJhdGVkIiBncmFkaWVudFVuaXRzPSJ1c2VyU3BhY2VPblVzZSIgeDE9IjAlIiB5MT0iMCUiIHgyPSIwJSIgeTI9IjEwMCUiPgogICAgPHN0b3Agb2Zmc2V0PSIwJSIgc3RvcC1jb2xvcj0iIzEzMzk2YiIgc3RvcC1vcGFjaXR5PSIxIi8+CiAgICA8c3RvcCBvZmZzZXQ9IjEwMCUiIHN0b3AtY29sb3I9IiMzZTY1OTkiIHN0b3Atb3BhY2l0eT0iMSIvPgogIDwvbGluZWFyR3JhZGllbnQ+CiAgPHJlY3QgeD0iMCIgeT0iMCIgd2lkdGg9IjEiIGhlaWdodD0iMSIgZmlsbD0idXJsKCNncmFkLXVjZ2ctZ2VuZXJhdGVkKSIgLz4KPC9zdmc+);
      background: -moz-linear-gradient(top,  #13396b 0%, #3e6599 100%); /* FF3.6+ */
      background: -webkit-gradient(linear, left top, left bottom, color-stop(0%,#13396b), color-stop(100%,#3e6599)); /* Chrome,Safari4+ */
      background: -webkit-linear-gradient(top,  #13396b 0%,#3e6599 100%); /* Chrome10+,Safari5.1+ */
      background: -o-linear-gradient(top,  #13396b 0%,#3e6599 100%); /* Opera 11.10+ */
      background: -ms-linear-gradient(top,  #13396b 0%,#3e6599 100%); /* IE10+ */
      background: linear-gradient(to bottom,  #13396b 0%,#3e6599 100%); /* W3C */
      filter: progid:DXImageTransform.Microsoft.gradient( startColorstr='#13396b', endColorstr='#3e6599',GradientType=0 ); /* IE6-8 */
      height: 100%;
    }
    body {
      width: 960px;
      overflow: hidden;
      margin: 50px auto;
      font-family: "HelveticaNeue-Light", "Helvetica Neue Light", "Helvetica Neue", Helvetica, Arial, "Lucida Grande", sans-serif;
      font-size: 14px;
      color: #a2c3ef;
    }
    h1,h2 {
      color: #fff;
    }
    h1 {
      font-size: 32px;
      font-weight: normal;
    }
    h2 {
      font-size: 24px;
      font-weight: lighter;
    }
    a {
      color: #fff;
      font-weight: bold;
    }
    #content {
      margin: 20px 0 20px 30px;
    }
    blockquote#error, blockquote#data {
      color: #fff;
      font-size: 16px;
    }
    #footer p {
      font-size: 12px;
      padding: 7px 12px;
      margin-top: 10px;
      color: #fff;
      text-align: right;
    }
</style>

<!--[if gte IE 9]>
  <style type="text/css">
    .gradient {
      filter: none;
    }
  </style>
<![endif]-->
</head>

<body id=ERR_ACCESS_DENIED>
  <div id="titles">
    <h1>ERROR</h1>
    <h2>Unable to complete URL request</h2>
  </div>
  <hr>
  <div id="content">
    <p>An error has occurred while trying to access <a href="http://<server_ip>:8050/render.html?">http://<server_ip>:8050/render.html?</a>.</p>

    <blockquote id="error">
      <p><b>Access denied.</b></p>
    </blockquote>

    <p>Security permissions are not allowing the request attempt. Please contact your service provider if you feel this is incorrect.</p>
  </div>

  <hr>
  <div id="footer">
  </div>
</body>
</html>
C:\Users\Dr. Printer>curl "http://<server_ip>:8050/render.html?url=http://www.example.com/?timeout=30&wait=0.5"
{"description": "Timeout exceeded rendering page", "type": "GlobalTimeoutError", "info": {"timeout": 30.0}, "error": 504}
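Two things go wrong in the attempts above: curl's `--proxy` flag makes curl itself connect to the Splash server through the proxy (which the filtering appliance here then blocks); it does not tell Splash which proxy to use when rendering the page. For that, Splash's render endpoints accept a `proxy` argument, which can simply be another query parameter. A sketch (stdlib only; the Splash host and proxy URL are placeholders):

```python
from urllib.parse import urlencode

SPLASH = "http://localhost:8050/render.html"  # placeholder Splash server

params = {
    "url": "http://www.example.com/",
    "timeout": 10,
    "wait": 0.5,
    # Proxy used by Splash itself when fetching the page; placeholder value,
    # credentials go inline as http://user:pass@host:port if needed.
    "proxy": "http://myproxy:8080",
}
render_url = SPLASH + "?" + urlencode(params)
print(render_url)
```

From curl, the equivalent is a single quoted URL containing the encoded `proxy` parameter, with no `--proxy` flag at all.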

1 Answer

  • 0

    If we want to use Crawlera as the proxy, we can do it with this Lua script:

    function use_crawlera(splash)
        -- Make sure you pass your Crawlera API key in the 'crawlera_user' arg.
        -- Have a look at the file spiders/quotes-js.py to see how to do it.
        -- Find your Crawlera credentials in https://app.scrapinghub.com/
        local user = splash.args.crawlera_user
    
        local host = 'proxy.crawlera.com'
        local port = 8010
        local session_header = 'X-Crawlera-Session'
        local session_id = 'create'
    
        splash:on_request(function (request)
            -- The commented code below can be used to speed up the crawling
            -- process. They filter requests to undesired domains and useless
            -- resources. Uncomment the ones that make sense to your use case
            -- and add your own rules.
    
            -- Discard requests to advertising and tracking domains.
            if string.find(request.url, 'doubleclick%.net') or
               string.find(request.url, 'analytics%.google%.com') then
               request.abort()
               return
            end
    
            -- Avoid using Crawlera for subresources fetching to increase crawling
            -- speed. The example below avoids using Crawlera for URLS starting
            -- with 'static.' and the ones ending with '.png'.
            if string.find(request.url, '://static%.') ~= nil or
               string.find(request.url, '%.png$') ~= nil then
               return
            end
    
            request:set_header('X-Crawlera-Cookies', 'disable')
            request:set_header(session_header, session_id)
            request:set_proxy{host, port, username=user, password=''}
        end)
    
        splash:on_response_headers(function (response)
            if response.headers[session_header] ~= nil then
                session_id = response.headers[session_header]
            end
        end)
    end
    
    function main(splash)
        use_crawlera(splash)
        splash:init_cookies(splash.args.cookies)
        assert(splash:go{
            splash.args.url,
            headers=splash.args.headers,
            http_method=splash.args.http_method,
        })
        assert(splash:wait(0.5))
        return {
            html = splash:html(),
            cookies = splash:get_cookies(),
        }
    end
    

    Don't forget to install scrapy-crawlera and activate it in your settings. For more information, see https://support.scrapinghub.com/support/solutions/articles/22000188428-using-crawlera-with-splash-scrapy
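    Outside Scrapy, the script above can also be run by POSTing it to Splash's /execute endpoint with the Crawlera credentials in the JSON body. A hedged sketch (stdlib only; the Splash host and API key are placeholders, and `LUA_SCRIPT` stands in for the script shown above):

    ```python
    import json
    from urllib.request import Request

    # Placeholders: substitute your Splash host and Crawlera API key.
    SPLASH_EXECUTE = "http://localhost:8050/execute"
    CRAWLERA_APIKEY = "<your-crawlera-api-key>"
    LUA_SCRIPT = "-- paste the use_crawlera/main script from above"

    payload = json.dumps({
        "lua_source": LUA_SCRIPT,
        "url": "http://www.example.com/",
        "crawlera_user": CRAWLERA_APIKEY,  # read by splash.args.crawlera_user
        "cookies": [],                     # consumed by splash:init_cookies
        "headers": {},
        "http_method": "GET",
    }).encode()

    req = Request(SPLASH_EXECUTE, data=payload,
                  headers={"Content-Type": "application/json"})
    # urllib.request.urlopen(req) would return the script's JSON result
    # ({"html": ..., "cookies": ...}); omitted here since it needs a live server.
    ```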
