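Basic concepts

bit: the smallest unit of data in a computer.
byte: the basic unit of storage in a computer, normally 8 bits.
char (character): a symbol that humans can recognize.
string: a sequence of chars.
bytecode (byte sequence): a char or string stored in the form of bytes. Because computers only understand binary, every char in a string has to be represented by bytes. In this article "bytecode" always means such encoded bytes, not the compiled bytecode that the Python interpreter executes.
encode: convert a human-readable char or string into machine-readable bytecode, for example to store it on disk. Many conversion schemes exist, such as ASCII, UTF-8, UTF-16 and GBK; Unicode itself is the character set that UTF-8 and UTF-16 encode.
decode: the reverse of encode.

In short, encode translates a string into bytecode that the machine can store, and decode translates bytecode back into a string that humans can read.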
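As a minimal sketch of this round trip, the following Python 3 snippet encodes a string and decodes it back. The sample character and the choice of UTF-8 are illustrative assumptions, not something the definitions above prescribe.

```python
text = "一"                       # a str: a sequence of Unicode characters
data = text.encode("utf-8")       # encode: str -> bytes
restored = data.decode("utf-8")   # decode: bytes -> str

print(data)              # b'\xe4\xb8\x80'
print(restored == text)  # True
```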

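ASCII encoding

ASCII is one of the oldest encodings. It defines 7-bit codes, and each char is stored in one 8-bit byte whose highest bit is 0.

Pros: low space usage. Each char takes only 8 bits, which saves disk space.
Cons: small capacity. It can encode only 128 chars, which covers English but not non-English text such as Chinese (7000+ Chinese characters).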
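The 128-char limit can be checked directly in Python 3. This is only a sketch: `is_ascii` is a hypothetical helper written for illustration, not a function from the article.

```python
def is_ascii(text):
    """Return True if every character has a code point below 128."""
    return all(ord(ch) < 128 for ch in text)


print(ord("A"))           # 65
print(ord("一"))          # 19968, far outside the ASCII range
print(is_ascii("hello"))  # True
print(is_ascii("一"))     # False
```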

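Unicode encoding

Unicode, as the name suggests, is a unified encoding meant to cover every language, and it even includes kaomoji symbols and emoji. Unicode is therefore regarded as able to encode every symbol in the world.

Pros: large capacity; it can encode every symbol in the world.
Cons: high space usage. Code points go beyond what 2 bytes can hold, so a fixed-width representation needs 3-4 bytes per char (UTF-32 uses 4). If every char were stored at a fixed width of 3-4 bytes, a file would grow to 3-4 times its ASCII size, which is hard to accept.

Because of this space cost, fixed-width Unicode has not become the mainstream encoding for disk storage. In other words, Unicode bytecode is usually not stored to disk directly.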
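To put a number on the space cost, the sketch below compares encoded sizes of the same English text in Python 3. It assumes UTF-32 as a stand-in for "fixed-width Unicode" and uses the big-endian variant so that no byte order mark is counted.

```python
text = "hello"

ascii_size = len(text.encode("ascii"))      # 1 byte per char
utf8_size = len(text.encode("utf-8"))       # also 1 byte per char for ASCII text
utf32_size = len(text.encode("utf-32-be"))  # 4 bytes per char, no BOM

print(ascii_size, utf8_size, utf32_size)  # 5 5 20
```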

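UTF-8/UTF-16 encoding

Because of this obvious problem with Unicode, the UTF-8 and UTF-16 encodings were introduced, with UTF-8 as the mainstream choice.

In fact, UTF-8 and UTF-16 are both implementations of Unicode. The notable improvement is that UTF-8 is a variable-length encoding: chars with small code points take fewer bytes (1 byte for ASCII), and chars with larger code points take more (up to 4 bytes).

For example:

Single-byte char: the byte starts with 0 and the remaining 7 bits are the ASCII code.
n-byte char (n>1): the first byte starts with n 1 bits followed by a 0 as the prefix, and each of the following n-1 bytes starts with 10. All remaining bit positions hold the Unicode code point, padded with leading 0s where it is shorter.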
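The prefix rules above can be observed by printing the bits of each UTF-8 byte. The following Python 3 sketch uses a hypothetical helper, `utf8_bits`, and two sample characters chosen for illustration.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of one character as 8-bit binary strings."""
    return [format(byte, "08b") for byte in ch.encode("utf-8")]


print(utf8_bits("A"))   # ['01000001']
print(utf8_bits("一"))  # ['11100100', '10111000', '10000000']

# The code point of "一" is U+4E00; its 16 bits fill the free positions
# of the pattern 1110xxxx 10xxxxxx 10xxxxxx.
print(format(ord("一"), "016b"))  # 0100111000000000
```

Reading the second result left to right: `1110` announces a 3-byte char, both continuation bytes start with `10`, and the remaining bits `0100 111000 000000` are exactly the code point.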

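GB2312/GBK encoding

GB2312/GBK are national encoding standards defined by China to solve the problem of encoding Chinese.

GB2312 (the Chinese national standard encoding): ASCII chars remain single bytes, and each Chinese character is encoded in two bytes, both of which are greater than 127.
GBK (an extension of GB2312): a byte greater than 127 starts a two-byte char; the extension adds many more Chinese characters.

The main feature of the GB family is that one Chinese character takes two bytes. Digits and punctuation marks also got two-byte forms, which is what we usually call "full-width", so there are two kinds of comma: a single-byte one and a two-byte one.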
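The two kinds of comma can be compared with Python 3's built-in `gbk` codec. This is a small sketch; the variable names are illustrative.

```python
hanzi = "一".encode("gbk")        # a Chinese character: two bytes
half_comma = ",".encode("gbk")    # ASCII comma: one byte, same as in ASCII
full_comma = "，".encode("gbk")   # full-width comma: two bytes

print(hanzi)       # b'\xd2\xbb'
print(half_comma)  # b','
print(full_comma)  # b'\xa3\xac'
```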

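Strings in Python

When we open a .py file in an editor and write a line of code (e.g. a = 123), everything in that .py file is in fact encoded into bytecode and then stored in memory or on disk.

The encoding of the file may be specified by the OS, by the editor, by the Python interpreter, or by the user.

Specified by the OS: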

$ locale
LANG=zh_CN.UTF-8
LC_CTYPE="zh_CN.UTF-8"
LC_NUMERIC="zh_CN.UTF-8"
LC_TIME="zh_CN.UTF-8"
LC_COLLATE="zh_CN.UTF-8"
LC_MONETARY="zh_CN.UTF-8"
LC_MESSAGES="zh_CN.UTF-8"
LC_PAPER="zh_CN.UTF-8"
LC_NAME="zh_CN.UTF-8"
LC_ADDRESS="zh_CN.UTF-8"
LC_TELEPHONE="zh_CN.UTF-8"
LC_MEASUREMENT="zh_CN.UTF-8"
LC_IDENTIFICATION="zh_CN.UTF-8"
LC_ALL=

Specified by the editor (e.g. Vim):

set fileencodings=utf-8

Specified by the Python interpreter:

# P2
>>> import sys; sys.getdefaultencoding()
'ascii'

# P3
>>> import sys; sys.getdefaultencoding()
'utf-8'

Specified by the user:

#-*-coding:utf-8-*-
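The defaults that come from these different sources can be inspected from inside Python 3. The sketch below only reads standard-library settings; apart from the first value, the results depend on the OS and locale, so no fixed output is shown for them.

```python
import locale
import sys

default_encoding = sys.getdefaultencoding()             # used by str.encode() / bytes.decode()
preferred_encoding = locale.getpreferredencoding(False) # derived from the OS locale
filesystem_encoding = sys.getfilesystemencoding()       # used for file names
stdout_encoding = sys.stdout.encoding                   # used when printing to the terminal

print(default_encoding)     # utf-8 on Python 3
print(preferred_encoding)
print(filesystem_encoding)
print(stdout_encoding)
```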

What makes string encoding in Python complicated is that Python2 and Python3 handle strings in completely different ways.

# P2
>>> type('abc')
<type 'str'>
>>> type(u'abc')
<type 'unicode'>
>>> type(b'abc')
<type 'str'>

# P3
>>> type('abc')
<class 'str'>
>>> type(u'abc')
<class 'str'>
>>> type(b'abc')
<class 'bytes'>

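encode/decode in Python2

Built-in string types

For historical reasons, the default encoding of P2 is ASCII, and Unicode support was added later. Python2 strings therefore come in two main types, str and unicode. Bytecode in UTF-8, GBK and other encodings is all classified as str, i.e. b'abc' is of type str.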

# Encode the basestring `abc` as UTF-8 bytecode; Python treats UTF-8 bytecode as a str instance
>>> type('abc'.encode('utf-8'))
<type 'str'>

# Encode the basestring `abc` as GBK bytecode; Python treats GBK bytecode as a str instance
>>> type('abc'.encode('gbk'))
<type 'str'>

# All bytecode is treated by Python as a str instance
>>> type(b'abc')
<type 'str'>

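Converting between encodings

As mentioned earlier, unicode bytecode is usually not stored to disk directly. So when we enter a unicode string and want to store it, we first encode the unicode string into an encoding such as utf-8, and when reading it back we decode it into a unicode string again. This keeps the format consistent and avoids program errors.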

>>> c_char = u'一'          # assign a unicode string
>>> c_char                  # repr of the unicode string
u'\u4e00'
>>> print c_char            # print implicitly encodes the unicode string to the terminal's encoding for output
一

>>> c_char.encode('utf-8')  # encode the unicode string as utf-8 bytecode, ready to be stored
'\xe4\xb8\x80'
>>> c_char.encode('utf-8').decode('utf-8')       # decode the utf-8 bytecode back into a unicode string
u'\u4e00'
>>> print c_char.encode('utf-8').decode('utf-8') # print again encodes the unicode string implicitly for output
一

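Note that encode/decode only works if a mapping table exists between the two encodings; otherwise the conversion fails. For example, encoding a unicode string as ascii raises UnicodeEncodeError, which indicates that the unicode string falls outside the ASCII table, i.e. it contains a character that the target encoding does not have.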

>>> c_char.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u4e00' in position 0: ordinal not in range(128)

Similarly, forcing a type conversion with str() implicitly encodes with ASCII, so the same error can occur:

>>> str(c_char)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u4e00' in position 0: ordinal not in range(128)

Conversely, when the unicode string stays within the range of the ASCII table, the conversion works.

>>> e_char = u'A'
>>> e_char
u'A'
>>> e_char.encode('ascii')
'A'
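For comparison, here is a hedged Python 3 sketch of the same situation. It also shows the `errors` argument of `str.encode()`, which decides what happens to characters the target encoding lacks; the mixed sample string is an assumption made for illustration.

```python
text = "A一"  # one ASCII char followed by one non-ASCII char

try:
    text.encode("ascii")                 # default errors="strict"
except UnicodeEncodeError as exc:
    failed_at = (exc.start, exc.end)     # position of the offending char

replaced = text.encode("ascii", errors="replace")          # substitute "?"
ignored = text.encode("ascii", errors="ignore")            # drop the char
escaped = text.encode("ascii", errors="backslashreplace")  # keep an escape

print(failed_at)  # (1, 2)
print(replaced)   # b'A?'
print(ignored)    # b'A'
print(escaped)    # b'A\\u4e00'
```

Only `strict` and `backslashreplace` keep the information that a character was there; `replace` and `ignore` lose data, so they suit display or logging more than storage.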

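How data that is read in is encoded

In fact, data that we download from the network, read from a file, or fetch from a database normally arrives as bytecode, not as unicode strings. To turn it into a unicode string, it has to be decoded explicitly.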

>>> file = open('file.txt')
>>> context = file.read()
>>> context                   # my OS stores files as utf-8 by default
'\xe4\xb8\x80\n'
>>> context.decode('utf-8')   # decode the utf-8 bytecode into a unicode string, shown as a unicode literal
u'\u4e00\n'
>>> context.decode('gb2312')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 2-3: illegal multibyte sequence

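The UnicodeDecodeError above occurs because file.txt was encoded as utf-8, so it has to be decoded with the same encoding; otherwise this error is likely. By contrast, unicode can be encoded into any encoding that contains its characters, which is why unicode is commonly used as the intermediate form when converting between encodings.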

>>> context.decode('utf-8').encode('gb2312')
'\xd2\xbb\n'
>>> print context.decode('utf-8').encode('gb2312').decode('gb2312')
一
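The same idea of using unicode as the intermediate form can be written as a small Python 3 function. `transcode` is a hypothetical helper name, and the input bytes are the file content from the session above.

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode bytes by decoding to str (Unicode) and encoding again."""
    return data.decode(source_encoding).encode(target_encoding)


utf8_data = b"\xe4\xb8\x80\n"
gb_data = transcode(utf8_data, "utf-8", "gb2312")

print(gb_data)                   # b'\xd2\xbb\n'
print(gb_data.decode("gb2312"))  # 一
```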

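Encoding of script files

There is another scenario: a .py file contains a unicode string, but the default source encoding of a Python2 script is ASCII. If the file uses a string outside ASCII without declaring an encoding, it fails as early as compile time.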

$ cat file.py
print u'你好'

$ python2 file.py
  File "file.py", line 1
SyntaxError: Non-ASCII character '\xe4' in file file.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

The fix is to declare the encoding of the Python script.

# -*- coding: utf-8 -*-
print u'你好'

encode/decode in Python3

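P3 made a clean break and adopted Unicode uniformly, which is a great convenience for developers. String types in Python3 are therefore designed more precisely as two types: str (unicode string) and bytes (bytecode).

In other words, P3 stores a string in memory as unicode and uses UTF-8 by default as the encoding of source files and of str.encode(). When converting between encodings with encode/decode, unicode is therefore the intermediate form: first decode the other encoding into unicode, then encode unicode into the target encoding.

encode: convert a unicode string into bytecode in a specific encoding (UTF-8 by default), for example for storage.
decode: convert bytecode in a specific encoding into a unicode string.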

>>> type('一')
<class 'str'>
>>> type(u'一')
<class 'str'>
>>> type('一'.encode('utf-8'))
<class 'bytes'>
>>> type('一'.encode('gbk'))
<class 'bytes'>
>>> '一'.encode('utf-8').decode('utf-8')
'一'
>>> '一'.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
>>> '一'.encode('utf-8').encode('gbk')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'

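As shown, the unicode string type in P3 has no decode() method at all, and likewise bytecode has no encode() method. The rule is simple and clear: str is only ever encoded, and bytes is only ever decoded.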
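The same str/bytes split shows up in Python 3 file I/O: text mode encodes and decodes for you, while binary mode hands back the raw bytes. The sketch below assumes a temporary directory and a hypothetical file name, and passes the encoding explicitly instead of relying on the platform default.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "file.txt")  # hypothetical file name

    # Text mode with an explicit encoding: str is encoded on write.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("一\n")

    # Binary mode returns the stored bytes untouched.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode decodes the bytes back into str.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()

print(raw)         # b'\xe4\xb8\x80\n'
print(repr(text))  # '一\n'
```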

str()

Py2

Help on class str in module __builtin__:

class str(basestring)
 |  str(object='') -> string
 |
 |  Return a nice string representation of the object.
 |  If the argument is a string, the return value is the same object.

Py3

Help on class str in module builtins:

class str(object)
 |  str(object='') -> str
 |  str(bytes_or_buffer[, encoding[, errors]]) -> str
 |
 |  Create a new string object from the given object. If encoding or
 |  errors is specified, then the object must expose a data buffer
 |  that will be decoded using the given encoding and error handler.
 |  Otherwise, returns the result of object.__str__() (if defined)
 |  or repr(object).
 |  encoding defaults to sys.getdefaultencoding().
 |  errors defaults to 'strict'.
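The two Py3 signatures in the help text behave quite differently when the argument is a bytes object. A short sketch, with an illustrative sample value:

```python
data = "一".encode("utf-8")

decoded = str(data, encoding="utf-8")  # decodes, like data.decode("utf-8")
as_repr = str(data)                    # no encoding given: falls back to repr()

print(decoded)   # 一
print(as_repr)   # b'\xe4\xb8\x80'
print(str(123))  # 123
```

Calling `str()` on bytes without an encoding therefore does not raise an error; it quietly produces the textual form of the bytes literal, which is rarely what is wanted.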

future.builtins.str()

As the above shows, the built-in str() also changed between Py2 and Py3. For Py2 code that should stay compatible with Py3, we can use future.builtins.str(), i.e. use the Py3 str() in Py2.

Help on class newstr in module future.types.newstr:

class newstr(__builtin__.unicode)
 |  A backport of the Python 3 str object to Py2
 |
 |  Method resolution order:
 |      newstr
 |      __builtin__.unicode
 |      __builtin__.basestring
 |      __builtin__.object

EXAMPLE：

from future.builtins import str
# or the future.builtins module is also accessible as builtins on Py2.
from builtins import str

future.utils.native_str()

Conversely, if the Py3 str() has been imported globally and some places still need the Py2 str(), we can use future.utils.native_str(). Here native refers to the native platform, i.e. the str() that is native to Py2.

# ``native_str``: always equal to the native platform string object (because
# this may be shadowed by imports from future.builtins)

# The native platform string and bytes types. Useful because ``str`` and
# ``bytes`` are redefined on Py2 by ``from future.builtins import *``.
native_str = str

EXAMPLE：