
Downloading All Files from an S3 Bucket with Boto3

I am using boto3 to get files from an S3 bucket. I need functionality similar to aws s3 sync.

My current code is:

#!/usr/bin/python
import boto3
s3=boto3.client('s3')
list=s3.list_objects(Bucket='my_bucket_name')['Contents']
for key in list:
    s3.download_file('my_bucket_name', key['Key'], key['Key'])

This works fine as long as the bucket only contains files. If a folder is present in the bucket, it throws an error:

Traceback (most recent call last):
  File "./test", line 6, in <module>
    s3.download_file('my_bucket_name', key['Key'], key['Key'])
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/inject.py", line 58, in download_file
    extra_args=ExtraArgs, callback=Callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 651, in download_file
    extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 666, in _download_file
    self._get_object(bucket, key, filename, extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 690, in _get_object
    extra_args, callback)
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 707, in _do_get_object
    with self._osutil.open(filename, 'wb') as f:
  File "/usr/local/lib/python2.7/dist-packages/boto3/s3/transfer.py", line 323, in open
    return open(filename, mode)
IOError: [Errno 2] No such file or directory: 'my_folder/.8Df54234'

Is this the correct way to download a complete S3 bucket using boto3? And how do I download folders?

8 Answers

  • 1
    import os
    import boto3
    
    # instantiate the s3 resource
    s3 = boto3.resource('s3')
    
    # select bucket
    my_bucket = s3.Bucket('my_bucket_name')
    
    # download each file into the current directory
    for s3_object in my_bucket.objects.all():
        # Split the object key into path and file name; otherwise download_file
        # raises a "file not found" error when the key contains a folder prefix.
        path, filename = os.path.split(s3_object.key)
        my_bucket.download_file(s3_object.key, filename)
    
  • 0

    I had the same need and created the following function to download the files recursively. The directories are only created locally if they contain files.

    import boto3
    import os
    
    def download_dir(client, resource, dist, local='/tmp', bucket='your_bucket'):
        paginator = client.get_paginator('list_objects')
        for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
            if result.get('CommonPrefixes') is not None:
                for subdir in result.get('CommonPrefixes'):
                    download_dir(client, resource, subdir.get('Prefix'), local, bucket)
            if result.get('Contents') is not None:
                for file in result.get('Contents'):
                    if not os.path.exists(os.path.dirname(local + os.sep + file.get('Key'))):
                         os.makedirs(os.path.dirname(local + os.sep + file.get('Key')))
                    resource.meta.client.download_file(bucket, file.get('Key'), local + os.sep + file.get('Key'))
    

    The function is called this way:

    def _start():
        client = boto3.client('s3')
        resource = boto3.resource('s3')
        download_dir(client, resource, 'clientconf/', '/tmp')
    
  • 29

    I am currently achieving the task by using the following:

    #!/usr/bin/python
    import os
    import boto3

    s3 = boto3.client('s3')
    contents = s3.list_objects(Bucket='bucket')['Contents']
    for s3_key in contents:
        s3_object = s3_key['Key']
        if not s3_object.endswith("/"):
            s3.download_file('bucket', s3_object, s3_object)
        else:
            if not os.path.exists(s3_object):
                os.makedirs(s3_object)
    

    While it does the job, I'm not sure it's advisable to do it this way. I'm leaving it here to help other users and to prompt further answers that achieve this in a better manner.

  • 46

    Better late than never :) The previous answer with the paginator is really good. However, it is recursive, and you could end up hitting Python's recursion limit. Here is an alternative approach, with a couple of extra checks.

    import os
    import errno
    import boto3
    
    
    def assert_dir_exists(path):
        """
        Checks if the directory tree in path exists. If not, it creates it.
        :param path: the path to check if it exists
        """
        try:
            os.makedirs(path)
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise
    
    
    def download_dir(client, bucket, path, target):
        """
        Downloads recursively the given S3 path to the target directory.
        :param client: S3 client to use.
        :param bucket: the name of the bucket to download from
        :param path: The S3 directory to download.
        :param target: the local directory to download the files to.
        """
    
        # Handle missing / at end of prefix
        if not path.endswith('/'):
            path += '/'
    
        paginator = client.get_paginator('list_objects_v2')
        for result in paginator.paginate(Bucket=bucket, Prefix=path):
            # Download each file individually
            for key in result['Contents']:
                # Calculate relative path
                rel_path = key['Key'][len(path):]
                # Skip paths ending in /
                if not key['Key'].endswith('/'):
                    local_file_path = os.path.join(target, rel_path)
                    # Make sure directories exist
                    local_file_dir = os.path.dirname(local_file_path)
                    assert_dir_exists(local_file_dir)
                    client.download_file(bucket, key['Key'], local_file_path)
    
    
    client = boto3.client('s3')
    
    download_dir(client, 'bucket-name', 'path/to/data', 'downloads')
    
  • 19

    Getting all the files in one go is a very bad idea; you should rather fetch them in batches, for example with a paginator as sketched below.
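
    For example, boto3's list_objects_v2 paginator can walk the bucket one page at a time. This is only a minimal sketch of batched listing; the bucket name and page size below are placeholders:

    import boto3

    client = boto3.client('s3')
    paginator = client.get_paginator('list_objects_v2')

    # Each page holds at most 100 keys, so the listing is consumed in
    # batches rather than all at once.
    for page in paginator.paginate(Bucket='your_bucket',
                                   PaginationConfig={'PageSize': 100}):
        for obj in page.get('Contents', []):
            print(obj['Key'])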

    One implementation I use to fetch a particular folder (directory) from S3 is:

    from boto3.session import Session

    def get_directory(directory_path, download_path, exclude_file_names):
        # prepare session
        # (aws_access_key_id, aws_secret_access_key, region_name and bucket_name
        # are assumed to be defined elsewhere, as in the original snippet)
        session = Session(aws_access_key_id=aws_access_key_id,
                          aws_secret_access_key=aws_secret_access_key,
                          region_name=region_name)

        # get instances for resource and bucket
        resource = session.resource('s3')
        bucket = resource.Bucket(bucket_name)

        # list the keys under the prefix and download each one
        for s3_key in resource.meta.client.list_objects(Bucket=bucket_name, Prefix=directory_path)['Contents']:
            s3_object = s3_key['Key']
            if s3_object not in exclude_file_names:
                bucket.download_file(s3_object, download_path + str(s3_object.split('/')[-1]))
    

    If you want the whole bucket, download it via the CLI as @John Rotenstein mentioned below:

    aws s3 cp --recursive s3://bucket_name download_path
    
  • 4

    Amazon S3 does not have folders/directories. It is a flat file structure.

    To maintain the appearance of directories, path names are stored as part of the object Key (filename). For example:

    • images/foo.jpg

    In this case, the whole Key is images/foo.jpg, rather than just foo.jpg.

    I suspect that your problem is that boto is returning a file called my_folder/.8Df54234 and is attempting to save it to the local filesystem. However, your local filesystem interprets the my_folder/ portion as a directory name, and that directory does not exist on your local filesystem.

    You could either truncate the filename to only save the .8Df54234 portion, or you would have to create the necessary directories before writing files. Note that they could be multi-level nested directories.
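
    As a rough sketch of the second option (reusing the bucket name from the question), the directory part of each key can be created with os.makedirs before calling download_file:

    import os
    import boto3

    s3 = boto3.client('s3')
    for obj in s3.list_objects(Bucket='my_bucket_name')['Contents']:
        key = obj['Key']
        if key.endswith('/'):
            continue  # skip "folder" placeholder keys
        directory = os.path.dirname(key)
        if directory and not os.path.exists(directory):
            os.makedirs(directory)  # creates multi-level nested directories as needed
        s3.download_file('my_bucket_name', key, key)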

    An easier method would be to use the AWS Command-Line Interface (CLI), which will do all this work for you, e.g.:

    aws s3 cp --recursive s3://my_bucket_name local_folder
    

    There is also a sync option that will only copy new and modified files.
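
    Using the same names as the example above, that would look like:

    aws s3 sync s3://my_bucket_name local_folder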

  • 1

    I have a workaround for this that runs the AWS CLI in the same process.

    Install awscli as a Python lib:

    pip install awscli
    

    Then define this function:

    import os

    from awscli.clidriver import create_clidriver
    
    def aws_cli(*cmd):
        old_env = dict(os.environ)
        try:
    
            # Environment
            env = os.environ.copy()
            env['LC_CTYPE'] = u'en_US.UTF'
            os.environ.update(env)
    
            # Run awscli in the same process
            exit_code = create_clidriver().main(list(cmd))
    
            # Deal with problems
            if exit_code > 0:
                raise RuntimeError('AWS CLI exited with code {}'.format(exit_code))
        finally:
            os.environ.clear()
            os.environ.update(old_env)
    

    To execute:

    aws_cli('s3', 'sync', '/path/to/source', 's3://bucket/destination', '--delete')
    
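
    The same helper should also work in the other direction to pull a bucket down locally, which is what the question asks for (the paths below are placeholders):

    aws_cli('s3', 'sync', 's3://bucket/source', '/path/to/destination')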
  • 8
    import os
    import boto3

    # bucket name taken from the question
    s3 = boto3.resource('s3')
    my_bucket = s3.Bucket('my_bucket_name')

    for objs in my_bucket.objects.all():
        print(objs.key)
        # S3 keys always use '/' as the separator
        path = '/tmp/' + os.sep.join(objs.key.split('/')[:-1])
        try:
            if not os.path.exists(path):
                os.makedirs(path)
            my_bucket.download_file(objs.key, '/tmp/' + objs.key)
        except FileExistsError as fe:
            print(objs.key + ' exists')
    

    This code will download the content into the /tmp/ directory. You can change the directory if you want.
