
Can't fetch a web page with "Content-Disposition: attachment;" using python-requests


Using my Firefox browser, I log in to a download site and click one of the query buttons. A small window pops up, titled "Opening report1.csv", where I can choose "Open with" or "Save File". I save the file.

For this action, Live HTTP headers shows me:

https://myserver/ReportPage?download&NAME=ALL&DATE=THISYEAR

GET /ReportPage?download&NAME=ALL&DATE=THISYEAR HTTP/1.1
Host: myserver
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.8,de-DE;q=0.5,de;q=0.3
Accept-Encoding: gzip, deflate, br
Referer: https://myserver/ReportPage?4&NAME=ALL&DATE=THISYEAR
Cookie: JSESSIONID=88DEDBC6880571FDB0E6E4112D71B7D6
Connection: keep-alive
Upgrade-Insecure-Requests: 1

HTTP/1.1 200 OK
Date: Sat, 30 Dec 2017 22:37:40 GMT
Server: Apache-Coyote/1.1
Last-Modified: Sat, 30 Dec 2017 22:37:40 GMT
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Pragma: no-cache
Cache-Control: no-cache, no-store
Content-Disposition: attachment; filename="report1.csv"; filename*=UTF-8''report1.csv
Content-Type: text/csv
Content-Length: 332369
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive

Now I try to simulate this with requests.

$ python3
>>> import requests
>>> from lxml import html
>>>
>>> s = requests.Session()
>>> s.verify = './myserver.crt'  # certificate of myserver for https
>>>
>>> # get the login web page to enter username and password
... r = s.get( 'https://myserver' )
>>>
>>> # Get the url for logging in. It's the action attribute of the login form;
... # we extract it with XPath.
... tree = html.fromstring(r.text)
>>> loginUrl = 'https://myserver/' + list(tree.xpath("//form[@id='id4']/@action"))[0]
>>> print( loginUrl )   # it contains a session-id
https://myserver/./;jsessionid=77EA70CB95252426439097E274286966?0-1.loginForm
>>>
>>> # logging in with username and password
... r = s.post( loginUrl, data = {'username':'ingo','password':'mypassword'} )
>>> print( r.status_code )
200
>>> # try to get the download file using url from Live HTTP headers
... downloadQueryUrl = 'https://myserver/ReportPage?download&NAME=ALL&DATE=THISYEAR'
>>> r = s.get( downloadQueryUrl )
>>> print( r.status_code)
200
>>> print( r.headers )
{'Connection': 'Keep-Alive',
'Date': 'Sun, 31 Dec 2017 14:46:03 GMT',
'Cache-Control': 'no-cache, no-store',
'Keep-Alive': 'timeout=5, max=94',
'Transfer-Encoding': 'chunked',
'Expires': 'Thu, 01 Jan 1970 00:00:00 GMT',
'Pragma': 'no-cache',
'Content-Encoding': 'gzip',
'Content-Type': 'text/html;charset=UTF-8',
'Server': 'Apache-Coyote/1.1',
'Vary': 'Accept-Encoding'}
>>> print( r.url )
https://myserver/ReportPage?4&NAME=ALL&DATE=THISYEAR
>>>
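The XPath step above can be checked offline. A minimal sketch using the stdlib's ElementTree instead of lxml; the HTML snippet, the form id `id4`, and the jsessionid value are made-up stand-ins for the real page:

```python
# Offline sketch of the action-attribute extraction from the session above.
import xml.etree.ElementTree as ET

page = """<html><body>
<form id="id4" action="./;jsessionid=ABC123?0-1.loginForm"></form>
</body></html>"""

tree = ET.fromstring(page)
action = tree.find(".//form[@id='id4']").get('action')
loginUrl = 'https://myserver/' + action
print(loginUrl)  # https://myserver/./;jsessionid=ABC123?0-1.loginForm
```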

The request succeeds, but I don't get the file download page. There is no "Content-Disposition: attachment;" entry in the headers. I only get the page the query was started from, i.e. the one from the Referer.

Does this have something to do with the session cookie? requests seems to manage that automatically. Do csv files need special treatment? Do I have to use streams? Is the download url shown by Live HTTP headers correct? Maybe it is created dynamically?

How can I get the web page with "Content-Disposition: attachment;" from myserver and download its file with requests?
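Once a response with the attachment header does arrive, the filename can be pulled out of it without hand-rolled string slicing, e.g. with the stdlib's email parser. A small sketch against a header value like the one from Live HTTP headers (simplified to the plain `filename` parameter):

```python
# Parse an attachment filename out of a Content-Disposition header value.
from email.message import Message

msg = Message()
msg['Content-Disposition'] = 'attachment; filename="report1.csv"'
print(msg.get_filename())  # report1.csv
```

With requests, the header value would come from `r.headers.get('Content-Disposition')`.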

1 Answer


    I got it. @Patrick Mevzek pointed me in the right direction. Many thanks for that.

    After logging in, I don't stay on the first login page and call the query from there. Instead I request the report page, extract the query url from it, and request that query url. Now I get a response with "Content-Disposition: attachment;" in its headers. It prints the text to stdout. I prefer it this way, because I can redirect the output to any file. Informational messages go to stderr, so they don't mess up the redirected output. A typical call is ./download >out.csv.

    For completeness, here is the script as a template, without any error checking, to make clear how it works.

    #!/usr/bin/python3
    
    import requests
    import sys
    from lxml import html
    
    s = requests.Session()
    s.verify = './myserver.crt'  # certificate of myserver for https
    
    # get the login web site to enter username and password
    r = s.get( 'https://myserver' )
    
    # Get the url for logging in. It's the action attribute of the login form;
    # we extract it with XPath.
    tree = html.fromstring(r.text)
    loginUrl = 'https://myserver/' + tree.xpath("//form[@id='id4']/@action")[0]
    
    # logging in with username and password and go to ReportPage with queries
    r = s.post( loginUrl, data = {'username':'ingo','password':'mypassword'} )
    queryUrl = 'https://myserver/ReportPage?NAME=ALL&DATE=THISYEAR'
    r = s.get( queryUrl )
    
    # Get the download link for this query from this page. It's the link
    # whose text is 'Download (UTF8)'.
    tree = html.fromstring( r.text )
    downloadUrl = 'https://myserver/' + tree.xpath("//a[.='Download (UTF8)']/@href")[0]
    
    # get the download file
    r = s.get( downloadUrl )
    if r.headers.get('Content-Disposition'):
        print( 'Downloading ...', file=sys.stderr )
        print( r.text )
    
    # log out
    r = s.get( 'https://myserver/logout' )
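The template above loads the whole report into memory via r.text. For large files, requests can stream the body instead (`stream=True` on the get, then `r.iter_content(chunk_size=...)`). Since the real url isn't reachable here, the copy loop below is exercised with an in-memory stand-in for the response body:

```python
import io

def save_chunks(chunks, out):
    # The same loop one would use with r = s.get(url, stream=True)
    # and chunks = r.iter_content(chunk_size=8192).
    for chunk in chunks:
        if chunk:  # skip keep-alive chunks
            out.write(chunk)

body = [b'NAME,DATE\n', b'ALL,2017\n']  # stand-in for a streamed csv body
buf = io.BytesIO()
save_chunks(body, buf)
print(buf.getvalue().decode())
```

Writing to an `open('report1.csv', 'wb')` file object works the same way and keeps the bytes exactly as sent, avoiding the text decoding that r.text performs.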
    
