
asyncio web scraping 101: fetching multiple URLs with aiohttp


In an earlier question, one of the authors of aiohttp suggested a way to fetch multiple URLs with aiohttp using the new async with syntax from Python 3.5:

import aiohttp
import asyncio

async def fetch(session, url):
    with aiohttp.Timeout(10):
        async with session.get(url) as response:
            return await response.text()

async def fetch_all(session, urls, loop):
    results = await asyncio.wait([loop.create_task(fetch(session, url))
                                  for url in urls])
    return results

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    # breaks because of the first url
    urls = ['http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
            'http://google.com',
            'http://twitter.com']
    with aiohttp.ClientSession(loop=loop) as session:
        the_results = loop.run_until_complete(
            fetch_all(session, urls, loop))
        # do something with the the_results

However, when one of the session.get(url) requests fails (as above, because of http://SDFKHSKHGKLHSKLJHGSDFKSJH.com), the error is not handled and the whole thing breaks.

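I looked for ways to insert checks on the result of session.get(url), for instance finding the right place for a try ... except ... or an if response.status != 200:, but I just do not understand how to work with async with, await and the various objects involved.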

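Since async with is still very new, there are not many examples. It would be very helpful to many people if an asyncio wizard could show how to do this. After all, one of the first things most people want to try with asyncio is fetching multiple resources concurrently.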

Goal

The goal is to be able to inspect the_results and quickly see either:

  • this URL failed (and why: the status code, maybe the exception name), or

  • this URL worked, and here is a useful response object
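
One possible shape for such a per-URL result is sketched below. `FetchResult` and `summarize` are hypothetical names introduced only for illustration; they are not part of aiohttp or asyncio, and treating "no exception and HTTP 200" as success is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FetchResult:
    """Hypothetical per-URL record; not an aiohttp or asyncio type."""
    url: str
    status: Optional[int] = None           # HTTP status code, if a response arrived
    body: Optional[str] = None             # response text on success
    error: Optional[BaseException] = None  # exception raised while fetching, if any

    @property
    def ok(self) -> bool:
        # Assumption for this sketch: success means no exception and HTTP 200.
        return self.error is None and self.status == 200


def summarize(result: FetchResult) -> str:
    """Return a one-line description: why the URL failed, or that it worked."""
    if result.error is not None:
        return '{}: failed ({})'.format(result.url, type(result.error).__name__)
    if not result.ok:
        return '{}: failed (status {})'.format(result.url, result.status)
    return '{}: ok'.format(result.url)
```

A fetch coroutine could fill in one such record per URL, so that the_results becomes a plain list that is easy to scan.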

2 Answers

  • 4

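    I would use gather instead of wait; it can return exceptions as objects instead of raising them. You can then check each result to see whether it is an instance of some exception.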

    import aiohttp
    import asyncio
    
    async def fetch(session, url):
        with aiohttp.Timeout(10):
            async with session.get(url) as response:
                return await response.text()
    
    async def fetch_all(session, urls, loop):
        results = await asyncio.gather(
            *[fetch(session, url) for url in urls],
            return_exceptions=True  # default is false, that would raise
        )
    
        # for testing purposes only
        # gather returns results in the order of coros
        for idx, url in enumerate(urls):
            print('{}: {}'.format(url, 'ERR' if isinstance(results[idx], Exception) else 'OK'))
        return results
    
    if __name__ == '__main__':
        loop = asyncio.get_event_loop()
        # breaks because of the first url
        urls = [
            'http://SDFKHSKHGKLHSKLJHGSDFKSJH.com',
            'http://google.com',
            'http://twitter.com']
        with aiohttp.ClientSession(loop=loop) as session:
            the_results = loop.run_until_complete(
                fetch_all(session, urls, loop))
    

    Tests:

    $ python test.py
    http://SDFKHSKHGKLHSKLJHGSDFKSJH.com: ERR
    http://google.com: OK
    http://twitter.com: OK
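
To also see why a URL failed, each result can be paired with its URL and the exception's class name. The following is a minimal sketch that runs without network access: `fake_fetch` is a hypothetical stand-in for the real `fetch` coroutine, `report` is an illustrative helper, and `asyncio.run` assumes Python 3.7 or later.

```python
import asyncio


async def fake_fetch(url):
    """Stand-in for a real HTTP request (assumption: no network is used here)."""
    await asyncio.sleep(0)
    if 'bad' in url:
        raise OSError('cannot connect to ' + url)
    return '<html>' + url + '</html>'


async def fetch_all(urls):
    # return_exceptions=True makes gather hand back exceptions as values.
    results = await asyncio.gather(
        *(fake_fetch(url) for url in urls), return_exceptions=True)
    # gather preserves input order, so zip pairs each URL with its own result.
    return list(zip(urls, results))


def report(outcomes):
    """Pair each URL with 'OK' or with the name of the exception it raised."""
    return [(url, type(res).__name__ if isinstance(res, Exception) else 'OK')
            for url, res in outcomes]


if __name__ == '__main__':
    urls = ['http://bad.example', 'http://good.example']
    for url, state in report(asyncio.run(fetch_all(urls))):
        print('{}: {}'.format(url, state))
```

Keeping the exception object in the result list, rather than only printing it, leaves the caller free to retry or log it later.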
    
  • 13

    I am far from an asyncio expert, but to handle the failure you need to catch the socket error:

    import socket
    
    async def fetch(session, url):
        with aiohttp.Timeout(10):
            try:
                async with session.get(url) as response:
                    print(response.status == 200)
                    return await response.text()
            except socket.error as e:
                print(e.strerror)
    

    Running the code and printing the_results:

    Cannot connect to host sdfkhskhgklhskljhgsdfksjh.com:80 ssl:False [Can not connect to sdfkhskhgklhskljhgsdfksjh.com:80 [Name or service not known]]
    True
    True
    ({<Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result='<!DOCTYPE ht...y>\n</html>\n'>, <Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result=None>, <Task finished coro=<fetch() done, defined at <ipython-input-7-535a26aaaefe>:5> result='<!doctype ht.../body></html>'>}, set())
    

    You can see that we get the error, and the remaining calls still succeed and return HTML.

    We really should catch OSError, because socket.error has been a deprecated alias of OSError since Python 3.3:

    async def fetch(session, url):
        with aiohttp.Timeout(10):
            try:
                async with session.get(url) as response:
                    return await response.text()
            except OSError as e:
                print(e)
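
The alias relationship can be checked directly with the standard library. This small sketch uses no network; `describe` is a hypothetical helper added only for illustration.

```python
import socket

# Since Python 3.3, socket.error is the same class object as OSError,
# so `except OSError` also catches anything raised as socket.error.
assert socket.error is OSError

# An OSError built from a single message has no strerror attribute value,
# which is one reason printing the exception itself can be more informative.
assert OSError('only a message').strerror is None


def describe(exc):
    """Return the exception's class name and message for logging."""
    return '{}: {}'.format(type(exc).__name__, exc)


try:
    raise socket.error('Name or service not known')
except OSError as e:
    message = describe(e)
```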
    

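    If you also want to check that the response status is 200, put your if inside the try as well; you can use the reason attribute to get more information: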

    async def fetch(session, url):
        with aiohttp.Timeout(10):
            try:
                async with session.get(url) as response:
                    if response.status != 200:
                        print(response.reason)
                    return await response.text()
            except OSError as e:
                print(e.strerror)
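
The the_results printed earlier in this answer is the (done, pending) pair of task sets returned by asyncio.wait, and those sets are unordered. As an alternative to printing inside `fetch`, the outcome of each task can be read back with `task.exception()` and `task.result()`. The sketch below assumes a hypothetical `fake_fetch` stand-in so it runs without network access, and `asyncio.run` assumes Python 3.7 or later.

```python
import asyncio


async def fake_fetch(url):
    """Stand-in for a real HTTP request (assumption: no network is used here)."""
    await asyncio.sleep(0)
    if 'bad' in url:
        raise OSError('cannot connect to ' + url)
    return '<html>' + url + '</html>'


async def fetch_all(urls, timeout=10):
    # Remember which task belongs to which URL, because wait() returns sets.
    tasks = {asyncio.ensure_future(fake_fetch(url)): url for url in urls}
    done, pending = await asyncio.wait(list(tasks), timeout=timeout)

    outcomes = {}
    for task in done:
        exc = task.exception()  # None if the coroutine returned normally
        outcomes[tasks[task]] = exc if exc is not None else task.result()
    for task in pending:
        task.cancel()  # still running after the timeout, so give up on it
        outcomes[tasks[task]] = asyncio.TimeoutError()
    return outcomes


if __name__ == '__main__':
    urls = ['http://bad.example', 'http://good.example']
    for url, outcome in asyncio.run(fetch_all(urls)).items():
        print(url, type(outcome).__name__)
```

Calling `task.exception()` also marks the exception as retrieved, so failed tasks are handled explicitly rather than left on the task objects.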
    
