
Using Python requests to masquerade as a browser and download a file


I am trying to download a file from this link using the Python requests library: http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download

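Clicking this link gives you the file (nasdaq.csv) only when you use a browser. I used the Firefox Network Monitor (Ctrl-Shift-Q) to capture all the headers Firefox sends. Now I finally get a 200 response from the server, but still no file. The file this script produces contains parts of the Nasdaq website rather than the CSV data. I looked at similar questions on this site, and nothing there convinced me that this is impossible with the requests library.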

Code:

import requests

url = "http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download"

# Fake Firefox headers
headers = {
    "Host": "www.nasdaq.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "DNT": "1",
    "Cookie": "clientPrefs=||||lightg; userSymbolList=EOD+&DIT; userCookiePref=true; selectedsymbolindustry=EOD,; selectedsymboltype=EOD,EVERGREEN GLOBAL DIVIDEND OPPORTUNITY FUND COMMON SHARES OF BENEFICIAL INTEREST,NYSE; c_enabled$=true",
    "Connection": "keep-alive",
}

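# Get the list. Pass headers by keyword: the second positional
# argument of requests.get() is params, not headers.
response = requests.get(url, headers=headers, stream=True)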
print(response.status_code)

# Write server response to file
with open("nasdaq.csv", "wb") as f:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:  # filter out keep-alive chunks
            f.write(chunk)

3 Answers

  • 3

    You don't actually need those headers. You don't even need to save to a file.

    import requests
    import csv
    
    url = "http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download"
    response = requests.get(url)
    # response.text is a str, which the csv module requires on Python 3
    data = csv.DictReader(response.text.splitlines())
    for row in data:
        print(row)
    

    Sample output:

    {'Sector': 'Technology', 'LastSale': '2.46', 'Name': 'Zynga Inc.', '': '', 'Summary Quote': 'http://www.nasdaq.com/symbol/znga', 'Symbol': 'ZNGA', 'Industry': 'EDP Services', 'MarketCap': '2295110123.7', 'IPOyear': '2011', 'ADR TSO': 'n/a'}
    

    If you prefer, you can use csv.reader instead of DictReader.
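As a rough sketch of the csv.reader variant, the snippet below parses a few rows copied from the CSV excerpt in this thread instead of a live response, so it runs without network access. The helper name `parse_rows` and the `SAMPLE` constant are made up for illustration. It also shows where the empty `''` key in the DictReader output comes from: every line ends with a comma, which produces one extra empty column.

```python
import csv
import io

# Rows copied from the CSV excerpt in this thread; a real run would use
# response.text from the download instead of this constant.
SAMPLE = (
    '"Symbol","Name","LastSale","MarketCap","ADR TSO","IPOyear","Sector","Industry","Summary Quote",\n'
    '"TFSC","1347 Capital Corp.","9.79","58230920","n/a","2014","Finance","Business Services","http://www.nasdaq.com/symbol/tfsc",\n'
    '"PIH","1347 Property Insurance Holdings, Inc.","7.51","46441171.61","n/a","2014","Finance","Property-Casualty Insurers","http://www.nasdaq.com/symbol/pih",\n'
)


def parse_rows(text):
    """Return (header, rows) parsed with csv.reader (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # Each line ends with a comma, so csv yields one trailing empty field.
    if header and header[-1] == "":
        header = header[:-1]
    # Trim every data row to the header width and skip blank lines.
    rows = [row[:len(header)] for row in reader if row]
    return header, rows


header, rows = parse_rows(SAMPLE)
for row in rows:
    print(dict(zip(header, row)))
```

With a live response, `parse_rows(response.text)` would be the equivalent call, assuming the server still returns the same column layout.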

  • 0

    Another, shorter solution to this problem is:

    import urllib  # Python 2 module layout
    
    downloadFile = urllib.URLopener()
    downloadFile.retrieve("http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download", "companylist.csv")
    

    This code uses the urllib library to create a URL opener object ( downloadFile ), then retrieves the data from the NASDAQ link and saves it as companylist.csv .

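    According to the Python documentation, if you want to send a custom user agent (for example, the Firefox user agent), you can subclass URLopener and set its version attribute to the user agent you want to use.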

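    Note: According to the Python documentation, urllib.URLopener() is deprecated as of Python v3.3, so it may eventually be removed from the Python standard library. However, as of Python v3.6 (dev), urllib.URLopener() is still supported as a legacy interface.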
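Because URLopener is a deprecated legacy interface, one alternative on Python 3 (not the approach this answer uses) is `urllib.request.Request`, which accepts a headers dictionary. The sketch below is a minimal illustration under that assumption. The helper names `build_request` and `download` are made up for this example, and the user agent string is the one from the question.

```python
import shutil
import urllib.request

URL = "http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download"
# User agent string taken from the question's header list.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0"


def build_request(url, user_agent=USER_AGENT):
    """Create a GET request carrying a custom User-Agent (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, filename, user_agent=USER_AGENT):
    """Stream the response body to a file (hypothetical helper)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request) as response, open(filename, "wb") as f:
        shutil.copyfileobj(response, f)


# Building the request does not touch the network; only download() does.
request = build_request(URL)
print(request.get_header("User-agent"))
```

Calling `download(URL, "companylist.csv")` would then perform the actual transfer. Whether the server still serves the file at this URL is not something the sketch can guarantee.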

  • 0

    You don't need to supply any headers:

    import requests
    
    url = "http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download"
    
    response = requests.get(url, stream=True)
    print(response.status_code)
    
    # Write server response to file
    with open("nasdaq.csv", 'wb') as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
    

    You can also write the content directly:

    import requests
    
    # Write server response to file (url is defined as in the previous snippet)
    with open("nasdaq.csv", "wb") as f:
        f.write(requests.get(url).content)
    

    Or use urllib:

    import urllib  # Python 2; on Python 3 the function is urllib.request.urlretrieve

    urllib.urlretrieve("http://www.nasdaq.com/screening/companies-by-industry.aspx?exchange=NASDAQ&render=download", "nasdaq.csv")
    

    All of these methods give you a 3137-line CSV file:

    "Symbol","Name","LastSale","MarketCap","ADR TSO","IPOyear","Sector","Industry","Summary Quote",
    "TFSC","1347 Capital Corp.","9.79","58230920","n/a","2014","Finance","Business Services","http://www.nasdaq.com/symbol/tfsc",
    "TFSCR","1347 Capital Corp.","0.15","0","n/a","2014","Finance","Business Services","http://www.nasdaq.com/symbol/tfscr",
    "TFSCU","1347 Capital Corp.","10","41800000","n/a","2014","Finance","Business Services","http://www.nasdaq.com/symbol/tfscu",
    "TFSCW","1347 Capital Corp.","0.178","0","n/a","2014","Finance","Business Services","http://www.nasdaq.com/symbol/tfscw",
    "PIH","1347 Property Insurance Holdings, Inc.","7.51","46441171.61","n/a","2014","Finance","Property-Casualty Insurers","http://www.nasdaq.com/symbol/pih",
    "FLWS","1-800 FLOWERS.COM, Inc.","7.87","510463090.04","n/a","1999","Consumer Services","Other Specialty Stores","http://www.nasdaq.com/symbol/flws",
    "FCTY","1st Century Bancshares, Inc","7.81","80612492.62","n/a","n/a","Finance","Major Banks","http://www.nasdaq.com/symbol/fcty",
    "FCCY","1st Constitution Bancorp (NJ)","12.39","93508122.96","n/a","n/a","Finance","Savings Institutions","http://www.nasdaq.com/symbol/fccy",
    "SRCE","1st Source Corporation","30.54","796548769.38","n/a","n/a","Finance","Major Banks","http://www.nasdaq.com/symbol/srce",
    "VNET","21Vianet Group, Inc.","20.26","1035270865.78","51099253","2011","Technology","Computer Software: Programming, Data Processing","http://www.nasdaq.com/symbol/vnet",
       ...................................
    

    If for some reason it doesn't work for you, you may need to upgrade your version of requests.
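Since the question reports getting page HTML back with a 200 status, a quick check on the body before writing it to disk can help tell the two cases apart. This is only a sketch: the function name `looks_like_csv` is made up, and the expected header prefix is an assumption based on the CSV excerpt in this answer.

```python
# Prefix of the header row shown in the CSV excerpt above (assumed stable).
EXPECTED_HEADER_START = '"Symbol","Name"'


def looks_like_csv(content):
    """Return True if the response body (bytes) starts with the expected CSV header."""
    text = content.decode("utf-8", errors="replace")
    # Ignore a byte-order mark and leading whitespace before the header.
    text = text.lstrip("\ufeff \r\n\t")
    return text.startswith(EXPECTED_HEADER_START)


print(looks_like_csv(b'"Symbol","Name","LastSale",\n'))
print(looks_like_csv(b"<!DOCTYPE html><html><head></head></html>"))
```

In the snippets above, this could be called as `looks_like_csv(response.content)` before opening nasdaq.csv for writing.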
