Convert bytes to a string? - Java 学习之路

1429

I'm using this code to get standard output from an external program:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]

The communicate() method returns an array of bytes:

>>> command_stdout
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

However, I'd like to work with the output as a normal Python string, so that I could print it like this:

>>> print(command_stdout)
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

I thought that's what the binascii.b2a_qp() method is for, but when I tried it, I got the same byte array again:

>>> binascii.b2a_qp(command_stdout)
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

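Does anybody know how to convert the bytes value back to a string? I mean, using the "batteries" instead of doing it manually. And I'd like it to work with Python 3.

16 Answers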

45

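From http://docs.python.org/3/library/sys.html:

To write or read binary data from/to the standard streams, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').

Answered on 2024-05-06T05:39:35+08:00
2
If you get the following when trying decode():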

AttributeError: 'str' object has no attribute 'decode'

You can also specify the encoding directly in the cast:
```
>>> my_byte_str
b'Hello World'

>>> str(my_byte_str, 'utf-8')
'Hello World'
```
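Answered on 2024-05-06T05:39:35+08:00
5
If you don't know the encoding, then to read binary input into a string in a way that is compatible with both Python 3 and Python 2, use the ancient MS-DOS cp437 encoding: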
```
import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('cp437'))
```
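Because the encoding is unknown, expect non-English symbols to be translated to cp437 characters (English characters are not translated, because they match in most single-byte encodings and in UTF-8).

Decoding arbitrary binary input as UTF-8 is unsafe, because you may get this: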
```
>>> b'\x00\x01\xffsd'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte
```
The same applies to latin-1, which was popular (the default?) for Python 2. See the missing points in Codepage Layout: that is where Python chokes with the infamous "ordinal not in range" error.

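UPDATE 20150604: There are rumors that Python 3 has the surrogateescape error strategy for encoding content into binary data without data loss or crashes, but it needs conversion tests, [binary] -> [str] -> [binary], to validate both performance and reliability.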
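
The round trip described in this update can be checked directly. This is a minimal sketch that assumes Python 3 only; the sample bytes are arbitrary and were chosen to include values that are invalid in UTF-8. It says nothing about performance.

```python
# Arbitrary sample data; b'\xff' and b'\x80' are not valid UTF-8.
raw = b'\x00\x01\xffsd\x80abc'

# surrogateescape maps each undecodable byte to a lone surrogate
# code point (U+DC80..U+DCFF) instead of raising UnicodeDecodeError.
text = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same handler turns those surrogates back into the
# original bytes, so the round trip is lossless.
restored = text.encode('utf-8', 'surrogateescape')

print(repr(text))
print(restored == raw)
```

Note that a string containing such surrogates cannot be encoded with the default strict handler, so it should be encoded back with surrogateescape before being written out.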

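UPDATE 20170116: Thanks to a comment by Nearoo, there is also the possibility of slash-escaping all unknown bytes with the backslashreplace error handler. That works only in Python 3, so even with this workaround you will still get inconsistent output from different Python versions: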
```
import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('utf-8', 'backslashreplace'))
```
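See https://docs.python.org/3/howto/unicode.html#python-s-unicode-support for details.

UPDATE 20170119: I decided to implement a slash-escaping decode that works for both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.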
```
# --- preparation

import codecs

def slashescape(err):
    """Codecs error handler. err is a UnicodeDecodeError instance. Return
    a tuple with a replacement for the undecodable part of the input
    and the position where decoding should continue."""
    # bytearray yields integers on both Python 2 and Python 3, and the
    # undecodable span may be more than one byte long
    thebytes = bytearray(err.object[err.start:err.end])
    repl = u''.join(u'\\x%02x' % b for b in thebytes)
    return (repl, err.end)

codecs.register_error('slashescape', slashescape)

# --- processing

stream = [b'\x80abc']

lines = []
for line in stream:
    lines.append(line.decode('utf-8', 'slashescape'))
```
Answered on 2024-05-06T05:39:35+08:00
1
I think this way is easy:
```
bytes = [112, 52, 52]
"".join(map(chr, bytes))
>> p44
```
Answered on 2024-05-06T05:39:35+08:00
7
You need to decode the byte string to turn it into a character (Unicode) string.
```
b'hello'.decode(encoding)
```
or
```
str(b'hello', encoding)
```
Answered on 2024-05-06T05:39:35+08:00
17
In Python 3, the default encoding is "utf-8", so you can directly use:
```
b'hello'.decode()
```
which is equivalent to
```
b'hello'.decode(encoding="utf-8")
```
On the other hand, in Python 2, the encoding defaults to the default string encoding. Thus, you should use:
```
b'hello'.decode(encoding)
```
where encoding is the encoding you want.

Note: support for keyword arguments was added in Python 2.7.
Answered on 2024-05-06T05:39:35+08:00

def toString(string):
    try:
        return string.decode("utf-8")
    except (AttributeError, ValueError):
        # str has no decode() in Python 3; undecodable bytes are returned as-is
        return string

b = b'97.080.500'
s = '97.080.500'
print(toString(b))
print(toString(s))

Answered on 2024-05-06T05:39:35+08:00

130
I think what you actually want is this:
```
>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> command_text = command_stdout.decode(encoding='windows-1252')
```
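Aaron's answer was correct, except that you need to know WHICH encoding to use. I believe that Windows uses 'windows-1252'. It only matters if your content contains some unusual (non-ASCII) characters, but then it does make a difference.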

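By the way, the fact that it DOES matter is the reason Python moved to using two different types for binary and text data: it can't convert magically between them, because it doesn't know the encoding unless you tell it! The only way YOU would know is to read the Windows documentation (or read it here).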
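
To see why the encoding has to be stated, the same bytes can be decoded under two different assumptions. This is a small sketch with made-up sample data (the word 'café' encoded as UTF-8), not output from a real command:

```python
data = 'café'.encode('utf-8')          # b'caf\xc3\xa9'

as_utf8 = data.decode('utf-8')           # the encoding the data is really in
as_cp1252 = data.decode('windows-1252')  # a wrong guess that still "works"

# ascii() is used so the result prints the same on any terminal.
print(ascii(as_utf8))
print(ascii(as_cp1252))
print(as_utf8 == as_cp1252)
```

Both calls succeed, but only one of them returns the text that was originally written.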
Answered on 2024-05-06T05:39:35+08:00

For Python 3, this is a safer and more Pythonic approach to convert from bytes to string:

def byte_to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes): #check if its in bytes
        print(bytes_or_str.decode('utf-8'))
    else:
        print("Object not of byte type")

byte_to_str(b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n')

Output:

total 0
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

Answered on 2024-05-06T05:39:35+08:00

You need to decode the bytes object to produce a string:

>>> b"abcde"
b'abcde'

# utf-8 is used here because it is a very common encoding, but you
# need to use the encoding your data is actually in.
>>> b"abcde".decode("utf-8") 
'abcde'

Answered on 2024-05-06T05:39:35+08:00

2373

Set universal_newlines to True, i.e.

command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()[0]
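
A self-contained sketch of this option follows. It runs the current Python interpreter as the child process instead of ls, purely as an assumption so that the example behaves the same on any platform:

```python
import sys
from subprocess import Popen, PIPE

# Child command: a stand-in for `ls -l` that prints two lines.
cmd = [sys.executable, '-c', "print('file1'); print('file2')"]

# With universal_newlines=True the pipe is opened in text mode, so
# communicate() returns str (decoded with the locale's preferred
# encoding) and line endings are normalized to '\n'.
output = Popen(cmd, stdout=PIPE, universal_newlines=True).communicate()[0]

print(type(output))
print(output.splitlines())
```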

Answered on 2024-05-06T05:39:35+08:00

107
While @Aaron Maenpaa's answer just works, a user recently asked:

Is there any simpler way? 'fhand.read().decode("ASCII")' [...] It's so long!

You can use
```
command_stdout.decode()
```
decode() has a standard argument:

codecs.decode(obj, encoding='utf-8', errors='strict')
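
The errors argument decides what happens to bytes that are invalid in the chosen encoding. Here is a short sketch of the built-in handlers, using an arbitrary byte string that contains one byte (0xff) that is not valid UTF-8:

```python
data = b'abc\xffdef'

results = {}
for handler in ('ignore', 'replace', 'backslashreplace'):
    # Each handler deals with the invalid byte in a different way.
    results[handler] = data.decode('utf-8', errors=handler)

try:
    data.decode('utf-8')   # errors='strict' is the default
except UnicodeDecodeError as exc:
    results['strict'] = 'raised ' + type(exc).__name__

for handler, value in results.items():
    print(handler, ascii(value))
```

With 'ignore' the byte is dropped, with 'replace' it becomes U+FFFD, and with 'backslashreplace' it becomes the four characters \xff.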
Answered on 2024-05-06T05:39:35+08:00
-1
When working with data from Windows systems (with \r\n line endings), my answer is
```
String = Bytes.decode("utf-8").replace("\r\n", "\n")
```
Why? Try this with a multiline Input.txt:
```
Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8")
open("Output.txt", "w").write(String)
```
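All your line endings will be doubled (to \r\r\n), leading to extra empty lines. Python's text-read functions usually normalize line endings so that strings use only \n. If you receive binary data from a Windows system, Python does not get a chance to do that. Thus,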
```
Bytes = open("Input.txt", "rb").read()
String = Bytes.decode("utf-8").replace("\r\n", "\n")
open("Output.txt", "w").write(String)
```
will replicate your original file.
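
An alternative that leaves the string untouched is to disable newline translation when writing. The following is only a sketch: it uses a temporary directory and placeholder file contents, and it relies on the newline='' argument of open(), which writes line endings exactly as they appear in the string.

```python
import os
import tempfile

original = b'line one\r\nline two\r\n'   # placeholder Windows-style content

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'Input.txt')
    dst = os.path.join(tmp, 'Output.txt')
    with open(src, 'wb') as f:
        f.write(original)

    with open(src, 'rb') as f:
        text = f.read().decode('utf-8')

    # newline='' disables translation of '\n' on write, so the '\r\n'
    # already present in the text is written unchanged on every platform.
    with open(dst, 'w', encoding='utf-8', newline='') as f:
        f.write(text)

    with open(dst, 'rb') as f:
        copied = f.read()

print(copied == original)
```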
Answered on 2024-05-06T05:39:35+08:00

I made a function to clean a list:

def cleanLists(self, lista):
    lista = [x.strip() for x in lista]
    lista = [x.replace('\n', '') for x in lista]
    lista = [x.replace('\b', '') for x in lista]
    lista = [x.encode('utf8') for x in lista]
    lista = [x.decode('utf8') for x in lista]

    return lista

Answered on 2024-05-06T05:39:35+08:00

28
To interpret a byte sequence as text, you have to know the corresponding character encoding:
```
unicode_text = bytestring.decode(character_encoding)
```
Example:
```
>>> b'\xc2\xb5'.decode('utf-8')
'µ'
```
The ls command may produce output that can't be interpreted as text. File names on Unix may be any sequence of bytes except slash b'/' and zero b'\0':
```
>>> open(bytes(range(0x100)).translate(None, b'\0/'), 'w').close()
```
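Trying to decode such byte soup using the utf-8 encoding raises UnicodeDecodeError.

It can be worse. If you use a wrong, incompatible encoding, the decoding may fail silently and produce mojibake: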
```
>>> '—'.encode('utf-8').decode('cp1252')
'â€”'
```
The data is corrupted, but your program remains unaware that a failure has occurred.

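In general, the character encoding to use is not embedded in the byte sequence itself. You have to communicate this information out-of-band. Some outcomes are more likely than others, and therefore modules exist that can guess the character encoding. A single Python script may use multiple character encodings in different places.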
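
When the encoding cannot be communicated out-of-band, one crude fallback is to try a short list of candidates in order. This is only a sketch: decode_with_fallback and its candidate list are hypothetical, not a standard API, and a successful decode does not prove the guess was right (the mojibake case shown earlier in this answer decodes without any error).

```python
def decode_with_fallback(data, candidates=('utf-8', 'cp1252')):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper for illustration; the candidate list is an assumption.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a code point, so it never fails.
    return data.decode('latin-1'), 'latin-1'

print(decode_with_fallback(b'\xc2\xb5')[1])
print(decode_with_fallback(b'caf\xe9')[1])
print(decode_with_fallback(b'\x81')[1])
```

The order of the candidates matters: strict encodings such as UTF-8 should come first, because permissive single-byte encodings accept almost any input.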

ls output can be converted to a Python string using the os.fsdecode() function, which succeeds even for undecodable filenames (on Unix it uses sys.getfilesystemencoding() and the surrogateescape error handler):
```
import os
import subprocess

output = os.fsdecode(subprocess.check_output('ls'))
```
To get the original bytes back, you can use os.fsencode().
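
A sketch of that round trip, with a made-up file name containing a byte (0xe9) that is not valid UTF-8. It assumes a Unix system, where the surrogateescape handler is in effect; on Windows the error handler differs and this particular name may be rejected.

```python
import os

raw_name = b'caf\xe9.txt'        # hypothetical undecodable file name

name = os.fsdecode(raw_name)     # str, possibly containing lone surrogates
back = os.fsencode(name)         # the original bytes again

print(ascii(name))
print(back == raw_name)
```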

If you pass the universal_newlines=True parameter, subprocess uses locale.getpreferredencoding(False) to decode the bytes, which can be, for example, cp1252 on Windows.

To decode the byte stream on the fly, io.TextIOWrapper() can be used.
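
A minimal sketch of on-the-fly decoding with io.TextIOWrapper(). The child process here is the current Python interpreter printing two lines, which is an assumption made so that the sketch runs on any platform; the utf-8 encoding is likewise assumed and should be replaced by whatever the real command emits.

```python
import io
import sys
from subprocess import Popen, PIPE

# Stand-in child process (an assumption, so the sketch runs anywhere).
cmd = [sys.executable, '-c', "print('first'); print('second')"]

lines = []
with Popen(cmd, stdout=PIPE) as proc:
    # Wrap the binary pipe; lines are decoded as they arrive instead of
    # after the whole output has been collected.
    reader = io.TextIOWrapper(proc.stdout, encoding='utf-8')
    for line in reader:
        lines.append(line.rstrip('\n'))

print(lines)
```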

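Different commands may use different character encodings for their output; for example, the dir internal command (cmd) may use cp437. To decode its output, you can pass the encoding explicitly (Python 3.6+):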
```
output = subprocess.check_output('dir', shell=True, encoding='cp437')
```
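The filenames may differ from those returned by os.listdir() (which uses the Windows Unicode API). For example, '\xb6' can be substituted with '\x14': Python's cp437 codec maps b'\x14' to the control character U+0014 instead of U+00B6 (¶). To support filenames with arbitrary Unicode characters, see "Decode PowerShell output possibly containing non-ASCII Unicode characters into a Python string".
Answered on 2024-05-06T05:39:35+08:00
3
Since this question is actually asking about subprocess output, a more direct approach is available, because Popen accepts an encoding keyword (in Python 3.6+):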
```
>>> from subprocess import Popen, PIPE
>>> text = Popen(['ls', '-l'], stdout=PIPE, encoding='utf-8').communicate()[0]
>>> type(text)
str
>>> print(text)
total 0
-rw-r--r-- 1 wim badger 0 May 31 12:45 some_file.txt
```
The general answer for other users is to decode bytes to text:
```
>>> b'abcde'.decode()
'abcde'
```
With no argument, sys.getdefaultencoding() will be used. If your data is not in sys.getdefaultencoding(), then you must specify the encoding explicitly in the decode call:
```
>>> b'caf\xe9'.decode('cp1250')
'café'
```
Answered on 2024-05-06T05:39:35+08:00
