Advanced Data Structure in Python

对于 Python 中内置的 list, tuple, set and dict 这样的简单数据结构,我们已经很熟悉了。
一般的应用场景,有这些简单的数据结构,也已经够用了。如果是比较复杂的场景,需要其他更高级的
数据结构,Python 也为我们提供了支持。本文先学习如下几种:

  • collections
  • array
  • heapq
  • bisect
  • weakref
  • copy
  • pprint
  • singly linked-list

Collections

collections 模块中,包含了一些非常有用的数据结构,比如 Counter, defaultdict, OrderedDict, deque, namedtuple 等,其中最常用的就是 Counter, deque, defaultdictCounter 顾名思义,是一个计数器,
继承自内置的 dict,常用的方法除了与 dict 相应的方法之外,就是 most_common 了。

1
2
3
4
5
6
>>> from collections import Counter
>>> Counter(['A', 'A', 'B', 'C', 'D', 'B', 'C', 'A'])
Counter({'A': 3, 'C': 2, 'B': 2, 'D': 1})
>>> d = Counter(['A', 'A', 'B', 'C', 'D', 'B', 'C', 'A'])
>>> d.most_common(3)
[('A', 3), ('C', 2), ('B', 2)]

下面这段代码,用于统计一段文本中每个单词的使用频率:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
>>> from collections import Counter
>>> string = """The Zen of Python, by Tim Peters
...
... Beautiful is better than ugly.
... Explicit is better than implicit.
... Simple is better than complex.
... Complex is better than complicated.
... Flat is better than nested.
... Sparse is better than dense.
... Readability counts.
... Special cases aren't special enough to break the rules.
... Although practicality beats purity.
... Errors should never pass silently.
... Unless explicitly silenced.
... In the face of ambiguity, refuse the temptation to guess.
... There should be one-- and preferably only one --obvious way to do it.
... Although that way may not be obvious at first unless you're Dutch.
... Now is better than never.
... Although never is often better than *right* now.
... If the implementation is hard to explain, it's a bad idea.
... If the implementation is easy to explain, it may be a good idea.
... Namespaces are one honking great idea -- let's do more of those!
... """

>>> import re
>>> words = [w for w in re.split(re.compile(r'[ \n,\.\*-]'), string) if w]
>>> rate = Counter(words)
>>> rate.most_common(10)
[('is', 10), ('better', 8), ('than', 8), ('to', 5), ('the', 5), ('idea', 3), ('be', 3), ('never', 3), ('of', 3), ('one', 3)]
>>> Counter({'is': 10, 'better': 8, 'than': 8, 'to': 5, 'the': 5, 'idea': 3, 'be': 3, 'never': 3, 'of': 3, 'one': 3, 'Although': 3, 'implementation': 2, 'explain': 2, 'should': 2, 'do': 2, 'way': 2, 'obvious': 2, 'it': 2, 'may': 2, 'a': 2, 'If': 2, 'Errors': 1, 'nested': 1, 'only': 1, 'explicitly': 1, 'easy': 1, 'Now': 1, 'good': 1, 'rules': 1, 'Explicit': 1, 'break': 1, 'not': 1, 'now': 1, 'Simple': 1, 'bad': 1, 'silently': 1, 'right': 1, 'often': 1, 'hard': 1, 'Complex': 1, 'are': 1, 'pass': 1, 'special': 1, 'Flat': 1, 'dense': 1, 'temptation': 1, 'enough': 1, 'Namespaces': 1, 'complicated': 1, 'Sparse': 1, "aren't": 1, 'by': 1, 'great': 1, 'those!': 1, 'first': 1, 'There': 1, 'counts': 1, "you're": 1, 'guess': 1, 'Unless': 1, "it's": 1, 'implicit': 1, 'ambiguity': 1, 'more': 1, 'that': 1, 'cases': 1, 'Beautiful': 1, 'honking': 1, 'practicality': 1, 'ugly': 1, 'Special': 1, 'at': 1, 'and': 1, 'Peters': 1, 'Tim': 1, 'beats': 1, 'purity': 1, 'Dutch': 1, 'preferably': 1, 'Python': 1, 'silenced': 1, 'complex': 1, 'Readability': 1, 'unless': 1, "let's": 1, 'The': 1, 'refuse': 1, 'Zen': 1, 'face': 1, 'In': 1})

Deque

Python 中 deque 是双端队列,可以从前后两端添加或删除元素,且都是线程安全的,时间复杂的都近似O(1)
虽然 list 也都支持类似的操作,但 list 内部为固定长度的列表操作做了优化,而且对 list 执行 pop(0)
insert(0, value) 会在内存中移动原来的整个 list

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
!/usr/bin/env python -tt
# -*- coding: utf-8 -*-

import time
import functools
from collections import deque

n = 10 ** 5

def timeit(func):
@functools.wraps(func)
def decorated(c):
t0 = time.time()
func(c)
t1 = time.time()
print('{0:<30} {1:<20} {2:<20}'.format(type(c), func.__name__, t1-t0))
return decorated


@timeit
def append(c):
for i in range(n):
c.append(i)


@timeit
def append_left(c):
if isinstance(c, deque):
for i in range(n):
c.appendleft(i)
else:
for i in range(n):
c.insert(0, i)


@timeit
def pop(c):
for __ in range(n):
c.pop()


@timeit
def pop_left(c):
if isinstance(c, deque):
for __ in range(n):
c.popleft()
else:
for __ in range(n):
c.pop(0)


def test():
print('{0:<30} {1:<20} {2:<20}'.format('type', 'operation', 'time(s)'))
print('-' * 70)
for container in [deque, list]:
for operation in [append, append_left, pop, pop_left]:
c = container(range(n))
operation(c)


if __name__ == '__main__':
test()

其输出结果:

1
2
3
4
5
6
7
8
9
10
type                           operation            time(s)             
----------------------------------------------------------------------
<type 'collections.deque'> append 0.0305550098419
<type 'collections.deque'> append_left 0.0282950401306
<type 'collections.deque'> pop 0.0276329517365
<type 'collections.deque'> pop_left 0.0309929847717
<type 'list'> append 0.0249011516571
<type 'list'> append_left 31.5822160244
<type 'list'> pop 0.0356750488281
<type 'list'> pop_left 5.80585002899

Defaultdict

除了对试图获取一个不存在的键时进行了异常处理(由函数 dafault_factory 提供一个默认值)外,defaultdict 几乎与 Python 内置的 dict 完全一致。

事实上,Python 内置的 dict 也可以通过 setdfault 方法实现类似功能。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
>>> from collections import defaultdict
>>> s = 'This is a test for defaultdict in Python'
>>> words = s.aplit()
>>> location = defaultdict(set)
for i, w in enumerate(words):
... location[w].add(i)
...
>>> location
defaultdict(<type 'set'>, {'a': set([2]), 'defaultdict': set([5]), 'for': set([4]), 'This': set([0]), 'is': set([1]), 'Python': set([7]), 'in': set([6]), 'test': set([3])}))

>>> # Use builtin dict
>>> d = {}
>>> for k, v in enumerate(words):
... d.setdefault(v, []).append(k)
...
>>> d
{'a': [2], 'defaultdict': [5], 'for': [4], 'This': [0], 'is': [1], 'Python': [7], 'in': [6], 'test': [3]}

通过,扩展内置 dict 的一种实现:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
>>> class MyDict(dict):
... def __getitem__(self, key):
... try:
... return super(MyDict, self).__getitem__(key)
... except KeyError:
... value = self[key] = type(self)()
... return value
...
>>> d = MyDict()
>>> d[1][2] = 3
>>> d[2][3] = 6
>>> d[3]['test'] = 'dict'
>>> d
{1: {2: 3}, 2: {3: 6}, 3: {'test': 'dict'}}

Array

array 模块里定义了一种新的数据结构 array,功能长得跟 list 很像,但保存在
其中的元素必须是相同类型的,因而在存储效率上大大高于 list。尽管如此,对 array
的操作都不会比 list 相应的操作快。另外,需要注意的是,当对 array 做列表推导
时,会自动将整个 array 转换成 list,而用 generator expression 则可以避免。

1
2
3
4
5
6
7
>>> from array import array
>>> a = array('i', [1,2,3,4,5])
>>> b = array(a.typecode, (x**2 for x in a))
>>> a
array('i', [1, 2, 3, 4, 5])
>>> b
array('i', [1, 4, 9, 16, 25])

array 的 size 非常大时,对其元素进行就地(in-plaace)操作,很有助于节省存储空间,
通常用 enumerate 来实现。

1
2
3
4
>>> from array import array
>>> a = array('i', range(10000))
>>> for i, x in enumerate(a):
... a[i] = x ** 2

对于大 array, 就地修改元素的值,比用 generator expression 创建一个新 array
大约能快 15% 。

那么,思考一下,何时需要用到 array 呢?

Heapq

Bisect

Weakref

Copy

Pprint

Singly Linked-List

参考资料: