
Downloading All Files from an S3 Bucket with Boto3

I am using boto3 to get files from an S3 bucket. I need functionality similar to aws s3 sync.

My current code is:

#!/usr/bin/python
import boto3
s3=boto3.client('s3')
list=s3.list_objects(Bucket='my_bucket_name')['Contents']
for key in list:
    s3.download_file('my_bucket_name', key['Key'], key['Key'])

This works fine as long as the bucket only contains files. If a folder is present in the bucket, it throws an error:

Traceback (most recent call last):
  File "./test", line 6, in <module>
    s3.download_file('my_bucket_name', key['Key'], key['Key'])
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 58, in download_file
    extra_args=ExtraArgs, callback=Callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 651, in download_file
    extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 666, in _download_file
    self._get_object(bucket, key, filename, extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 690, in _get_object
    extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 707, in _do_get_object
    with self._osutil.open(filename, 'wb') as f:
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 323, in open
    return open(filename, mode)
IOError: [Errno 2] No such file or directory: 'my_folder/.8Df54234'

Is this the correct way to download a complete S3 bucket using boto3? And how do I download folders?

8 Answers

  • 1
    import os
    import boto3
    
    # instantiate the s3 resource
    s3 = boto3.resource('s3')
    
    # select bucket
    my_bucket = s3.Bucket('my_bucket_name')
    
    # download each file into the current directory
    for s3_object in my_bucket.objects.all():
        # Split the object key into path and file name; otherwise download_file
        # raises a "file not found" error when the key contains a folder prefix.
        path, filename = os.path.split(s3_object.key)
        my_bucket.download_file(s3_object.key, filename)
    
  • 0

    I had the same need and created the following function to download the files recursively. The directories are only created locally if they contain files.

    import boto3
    import os
    
    def download_dir(client, resource, dist, local='/tmp', bucket='your_bucket'):
        paginator = client.get_paginator('list_objects')
        for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
            if result.get('CommonPrefixes') is not None:
                for subdir in result.get('CommonPrefixes'):
                    download_dir(client, resource, subdir.get('Prefix'), local, bucket)
            if result.get('Contents') is not None:
                for file in result.get('Contents'):
                    if not os.path.exists(os.path.dirname(local + os.sep + file.get('Key'))):
                         os.makedirs(os.path.dirname(local + os.sep + file.get('Key')))
                    resource.meta.client.download_file(bucket, file.get('Key'), local + os.sep + file.get('Key'))
    

    The function is called this way:

    def _start():
        client = boto3.client('s3')
        resource = boto3.resource('s3')
        download_dir(client, resource, 'clientconf/', '/tmp')
    
  • 29

    I am currently achieving the task by using the following:

    #!/usr/bin/python
    import os
    import boto3

    s3 = boto3.client('s3')
    contents = s3.list_objects(Bucket='bucket')['Contents']
    for s3_key in contents:
        s3_object = s3_key['Key']
        if not s3_object.endswith("/"):
            s3.download_file('bucket', s3_object, s3_object)
        else:
            if not os.path.exists(s3_object):
                os.makedirs(s3_object)
    

    While it does the job, I'm not sure it's advisable to do it this way. I'm leaving it here to help other users and to prompt further answers that achieve this in a better manner.

  • 46

    Better late than never :) The previous answer with the paginator is really good. However, it is recursive, and you could end up hitting Python's recursion limit. Here is an alternative approach, with a couple of extra checks.

    import os
    import errno
    import boto3
    
    
    def assert_dir_exists(path):
        """
        Checks if the directory tree in path exists. If not, it creates it.
        :param path: the path to check if it exists
        """
        try:
            os.makedirs(path)
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise
    
    
    def download_dir(client, bucket, path, target):
        """
        Downloads recursively the given S3 path to the target directory.
        :param client: S3 client to use.
        :param bucket: the name of the bucket to download from
        :param path: The S3 directory to download.
        :param target: the local directory to download the files to.
        """
    
        # Handle missing / at end of prefix
        if not path.endswith('/'):
            path += '/'
    
        paginator = client.get_paginator('list_objects_v2')
        for result in paginator.paginate(Bucket=bucket, Prefix=path):
            # Download each file individually
            for key in result['Contents']:
                # Calculate relative path
                rel_path = key['Key'][len(path):]
                # Skip paths ending in /
                if not key['Key'].endswith('/'):
                    local_file_path = os.path.join(target, rel_path)
                    # Make sure directories exist
                    local_file_dir = os.path.dirname(local_file_path)
                    assert_dir_exists(local_file_dir)
                    client.download_file(bucket, key['Key'], local_file_path)
    
    
    client = boto3.client('s3')
    
    download_dir(client, 'bucket-name', 'path/to/data', 'downloads')
    
  • 19

    Getting all the files in one go is a very bad idea; you should rather fetch them in batches, for example with a paginator as sketched below.
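
    For example, boto3's list_objects_v2 paginator can walk the bucket one page at a time. This is only a minimal sketch of batched listing; the bucket name and page size below are placeholders:

    import boto3

    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects_v2')

    # Each page holds at most 100 keys, so the listing is consumed in
    # batches rather than all at once.
    for page in paginator.paginate(Bucket='your_bucket',
                                   PaginationConfig={'PageSize': 100}):
        for obj in page.get('Contents', []):
            print(obj['Key'])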

    One implementation I use to fetch a particular folder (directory) from S3 is:

    from boto3.session import Session

    def get_directory(directory_path, download_path, exclude_file_names):
        # prepare session
        # (aws_access_key_id, aws_secret_access_key, region_name and bucket_name
        # are assumed to be defined elsewhere, as in the original snippet)
        session = Session(aws_access_key_id=aws_access_key_id,
                          aws_secret_access_key=aws_secret_access_key,
                          region_name=region_name)

        # get instances for resource and bucket
        resource = session.resource('s3')
        bucket = resource.Bucket(bucket_name)

        # list the keys under the prefix and download each one
        for s3_key in resource.meta.client.list_objects(Bucket=bucket_name, Prefix=directory_path)['Contents']:
            s3_object = s3_key['Key']
            if s3_object not in exclude_file_names:
                bucket.download_file(s3_object, download_path + str(s3_object.split('/')[-1]))
    

    If you want the whole bucket, download it via the CLI as @John Rotenstein mentioned below:

    aws s3 cp --recursive s3://bucket_name download_path
    
  • 4

    Amazon S3 does not have folders/directories. It is a flat file structure.

    To maintain the appearance of directories, path names are stored as part of the object Key (filename). For example:

    • images/foo.jpg

    In this case, the whole Key is images/foo.jpg, rather than just foo.jpg.

    I suspect that your problem is that boto is returning a file called my_folder/.8Df54234 and is attempting to save it to the local filesystem. However, your local filesystem interprets the my_folder/ portion as a directory name, and that directory does not exist on your local filesystem.

    You could either truncate the filename to only save the .8Df54234 portion, or you would have to create the necessary directories before writing files. Note that they could be multi-level nested directories.
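
    As a rough sketch of the second option (reusing the bucket name from the question), the directory part of each key can be created with os.makedirs before calling download_file:

    import os
    import boto3

    s3 = boto3.client('s3')
    for obj in s3.list_objects(Bucket='my_bucket_name')['Contents']:
        key = obj['Key']
        if key.endswith('/'):
            continue  # skip "folder" placeholder keys
        directory = os.path.dirname(key)
        if directory and not os.path.exists(directory):
            os.makedirs(directory)  # creates multi-level nested directories as needed
        s3.download_file('my_bucket_name', key, key)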

    An easier method would be to use the AWS Command-Line Interface (CLI), which will do all this work for you, e.g.:

    aws s3 cp --recursive s3://my_bucket_name local_folder
    

    There is also a sync option that will only copy new and modified files.
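
    Using the same names as the example above, that would look like:

    aws s3 sync s3://my_bucket_name local_folder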

  • 1

    I have a workaround for this that runs the AWS CLI in the same process.

    Install awscli as a Python lib:

    pip install awscli
    

    Then define this function:

    import os

    from awscli.clidriver import create_clidriver
    
    def aws_cli(*cmd):
        old_env = dict(os.environ)
        try:
    
            # Environment
            env = os.environ.copy()
            env['LC_CTYPE'] = u'en_US.UTF'
            os.environ.update(env)
    
            # Run awscli in the same process
            exit_code = create_clidriver().main(list(cmd))
    
            # Deal with problems
            if exit_code > 0:
                raise RuntimeError('AWS CLI exited with code {}'.format(exit_code))
        finally:
            os.environ.clear()
            os.environ.update(old_env)
    

    To execute:

    aws_cli('s3', 'sync', '/path/to/source', 's3://bucket/destination', '--delete')
    
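
    The same helper should also work in the other direction to pull a bucket down locally, which is what the question asks for (the paths below are placeholders):

    aws_cli('s3', 'sync', 's3://bucket/source', '/path/to/destination')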
  • 8
    import os
    import boto3

    # bucket name taken from the question
    s3 = boto3.resource('s3')
    my_bucket = s3.Bucket('my_bucket_name')

    for objs in my_bucket.objects.all():
        print(objs.key)
        # S3 keys always use '/' as the separator
        path = '/tmp/' + os.sep.join(objs.key.split('/')[:-1])
        try:
            if not os.path.exists(path):
                os.makedirs(path)
            my_bucket.download_file(objs.key, '/tmp/' + objs.key)
        except FileExistsError as fe:
            print(objs.key + ' exists')
    

    This code will download the content into the /tmp/ directory. You can change the directory if you want.
