
Can't fetch a web page with "Content-Disposition: attachment;" using python-requests


Using my Firefox browser, I log in to a download site and click one of the query buttons. A small window pops up, titled "Opening report1.csv", where I can choose "Open with" or "Save File". I save the file.

For this action, Live HTTP headers shows me:

https://myserver/ReportPage?download&NAME=ALL&DATE=THISYEAR

GET /ReportPage?download&NAME=ALL&DATE=THISYEAR HTTP/1.1
Host: myserver
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.8,de-DE;q=0.5,de;q=0.3
Accept-Encoding: gzip, deflate, br
Referer: https://myserver/ReportPage?4&NAME=ALL&DATE=THISYEAR
Cookie: JSESSIONID=88DEDBC6880571FDB0E6E4112D71B7D6
Connection: keep-alive
Upgrade-Insecure-Requests: 1

HTTP/1.1 200 OK
Date: Sat, 30 Dec 2017 22:37:40 GMT
Server: Apache-Coyote/1.1
Last-Modified: Sat, 30 Dec 2017 22:37:40 GMT
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Pragma: no-cache
Cache-Control: no-cache, no-store
Content-Disposition: attachment; filename="report1.csv"; filename*=UTF-8''report1.csv
Content-Type: text/csv
Content-Length: 332369
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive

Now I try to simulate this with requests.

$ python3
>>> import requests
>>> from lxml import html
>>>
>>> s = requests.Session()
>>> s.verify = './myserver.crt'  # certificate of myserver for https
>>>
>>> # get the login web page to enter username and password
... r = s.get( 'https://myserver' )
>>>
>>> # Get the url for logging in. It's the action attribute of the login form;
... # we extract it with XPath.
... tree = html.fromstring(r.text)
>>> loginUrl = 'https://myserver/' + list(tree.xpath("//form[@id='id4']/@action"))[0]
>>> print( loginUrl )   # it contains a session-id
https://myserver/./;jsessionid=77EA70CB95252426439097E274286966?0-1.loginForm
>>>
>>> # logging in with username and password
... r = s.post( loginUrl, data = {'username':'ingo','password':'mypassword'} )
>>> print( r.status_code )
200
>>> # try to get the download file using url from Live HTTP headers
... downloadQueryUrl = 'https://myserver/ReportPage?download&NAME=ALL&DATE=THISYEAR'
>>> r = s.get( downloadQueryUrl )
>>> print( r.status_code)
200
>>> print( r.headers )
{'Connection': 'Keep-Alive',
'Date': 'Sun, 31 Dec 2017 14:46:03 GMT',
'Cache-Control': 'no-cache, no-store',
'Keep-Alive': 'timeout=5, max=94',
'Transfer-Encoding': 'chunked',
'Expires': 'Thu, 01 Jan 1970 00:00:00 GMT',
'Pragma': 'no-cache',
'Content-Encoding': 'gzip',
'Content-Type': 'text/html;charset=UTF-8',
'Server': 'Apache-Coyote/1.1',
'Vary': 'Accept-Encoding'}
>>> print( r.url )
https://myserver/ReportPage?4&NAME=ALL&DATE=THISYEAR
>>>
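The XPath step above can be checked offline. A minimal sketch using the stdlib's ElementTree instead of lxml; the HTML snippet, the form id `id4`, and the jsessionid value are made-up stand-ins for the real page:

```python
# Offline sketch of the action-attribute extraction from the session above.
import xml.etree.ElementTree as ET

page = """<html><body>
<form id="id4" action="./;jsessionid=ABC123?0-1.loginForm"></form>
</body></html>"""

tree = ET.fromstring(page)
action = tree.find(".//form[@id='id4']").get('action')
loginUrl = 'https://myserver/' + action
print(loginUrl)  # https://myserver/./;jsessionid=ABC123?0-1.loginForm
```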

The request succeeds, but I don't get the file download page. There is no "Content-Disposition: attachment;" entry in the headers. I only get the page the query was started from, i.e. the one from the Referer.

Does this have something to do with the session cookie? requests seems to manage that automatically. Do csv files need special treatment? Do I have to use streams? Is the download url shown by Live HTTP headers correct? Maybe it is created dynamically?

How can I get the web page with "Content-Disposition: attachment;" from myserver and download its file with requests?
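Once a response with the attachment header does arrive, the filename can be pulled out of it without hand-rolled string slicing, e.g. with the stdlib's email parser. A small sketch against a header value like the one from Live HTTP headers (simplified to the plain `filename` parameter):

```python
# Parse an attachment filename out of a Content-Disposition header value.
from email.message import Message

msg = Message()
msg['Content-Disposition'] = 'attachment; filename="report1.csv"'
print(msg.get_filename())  # report1.csv
```

With requests, the header value would come from `r.headers.get('Content-Disposition')`.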

1 Answer


    I got it. @Patrick Mevzek pointed me in the right direction. Many thanks for that.

    After logging in, I don't stay on the first login page and call the query from there. Instead I request the report page, extract the query url from it, and request that query url. Now I get a response with "Content-Disposition: attachment;" in its headers. It prints the text to stdout. I prefer it this way, because I can redirect the output to any file. Informational messages go to stderr, so they don't mess up the redirected output. A typical call is ./download >out.csv.

    For completeness, here is the script as a template, without any error checking, to make clear how it works.

    #!/usr/bin/python3
    
    import requests
    import sys
    from lxml import html
    
    s = requests.Session()
    s.verify = './myserver.crt'  # certificate of myserver for https
    
    # get the login web site to enter username and password
    r = s.get( 'https://myserver' )
    
    # Get the url for logging in. It's the action attribute of the login form;
    # we extract it with XPath.
    tree = html.fromstring(r.text)
    loginUrl = 'https://myserver/' + tree.xpath("//form[@id='id4']/@action")[0]
    
    # logging in with username and password and go to ReportPage with queries
    r = s.post( loginUrl, data = {'username':'ingo','password':'mypassword'} )
    queryUrl = 'https://myserver/ReportPage?NAME=ALL&DATE=THISYEAR'
    r = s.get( queryUrl )
    
    # Get the download link for this query from this page. It's the link
    # whose text is 'Download (UTF8)'.
    tree = html.fromstring( r.text )
    downloadUrl = 'https://myserver/' + tree.xpath("//a[.='Download (UTF8)']/@href")[0]
    
    # get the download file
    r = s.get( downloadUrl )
    if r.headers.get('Content-Disposition'):
        print( 'Downloading ...', file=sys.stderr )
        print( r.text )
    
    # log out
    r = s.get( 'https://myserver/logout' )
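The template above loads the whole report into memory via r.text. For large files, requests can stream the body instead (`stream=True` on the get, then `r.iter_content(chunk_size=...)`). Since the real url isn't reachable here, the copy loop below is exercised with an in-memory stand-in for the response body:

```python
import io

def save_chunks(chunks, out):
    # The same loop one would use with r = s.get(url, stream=True)
    # and chunks = r.iter_content(chunk_size=8192).
    for chunk in chunks:
        if chunk:  # skip keep-alive chunks
            out.write(chunk)

body = [b'NAME,DATE\n', b'ALL,2017\n']  # stand-in for a streamed csv body
buf = io.BytesIO()
save_chunks(body, buf)
print(buf.getvalue().decode())
```

Writing to an `open('report1.csv', 'wb')` file object works the same way and keeps the bytes exactly as sent, avoiding the text decoding that r.text performs.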
    
