forked from zpoint/CPython-Internals
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathset.md
More file actions
133 lines (91 loc) · 5.09 KB
/
Copy pathset.md
File metadata and controls
133 lines (91 loc) · 5.09 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
# set
# contents
* [related file](#related-file)
* [memory layout](#memory-layout)
* [method](#method)
* [new](#new)
* [add](#add)
* [hash collision](#hash-collision)
* [resize](#resize)
* [why LINEAR_PROBES?](#why-LINEAR_PROBES)
* [clear](#clear)
# related file
* cpython/Objects/setobject.c
* cpython/Include/setobject.h
# memory layout

# method
* ## **new**
* call stack
* static PyObject * set_new(PyTypeObject *type, PyObject *args, PyObject *kwds)
* static PyObject * make_new_set(PyTypeObject *type, PyObject *iterable)
the following picture shows value in each field in an empty set

* ## **add**
* call stack
* static PyObject *set_add(PySetObject *so, PyObject *key)
* static int set_add_key(PySetObject *so, PyObject *key)
* static int set_add_entry(PySetObject *so, PyObject *key, Py_hash_t hash)
initialize a new empty set, whenever an empty set created, the **setentry** field points to **smalltable** field inside the same PySetObject, default value of **PySet_MINSIZE** is 8, it means if there are less than 8 objects in PySetObject, they are stored in the **smalltable**, if there are more than 8 objects, the resize operation will make **setentry** points to a new malloced block
(Actually, the first time resize operation takes place, the smalltable won't be filled, there is a factor to trigger the resize operation, please keep reading)
s = set()

the value in **mask** field is the size of malloced entries - 1, currently it's 7
s.add(0) # hash(0) & mask == 0

add a value 5
s.add(5) # hash(5) & mask == 0

* ### hash collision
add a value 16, because the mask is 7, hash(16) & 7 ===> 0
cpython use **LINEAR_PROBES** to solve hash collision instead of **CHAIN**(linked list)
// pseudocode
#define LINEAR_PROBES 9
while (1)
{
// i is the hash result
// find the position according to the hash value
if (entry in i is empty)
return entry[i]
// reach here, means entry[i] already taken
if (i + LINEAR_PROBES <= mask) {
for (j = 0 ; j < LINEAR_PROBES ; j++) {
if (entry in j is empty)
return j
}
}
// reach here, means entry[i] - entry[i + LINEAR_PROBES] all are taken
// now, rehash take place
perturb >>= PERTURB_SHIFT;
i = (i * 5 + 1 + perturb) & mask;
}
the 0th position has been taken, so the **LINEAR_PROBES** take the next empty position, which is 1th position
s.add(16) # hash(16) & mask == 0

the 0th position has been taken, the first time **LINEAR_PROBES** find the 1th position, which also has been taken, the second time **LINEAR_PROBES** find the 2th position which is empty.
s.add(32) # hash(32) & mask == 0

* ## **resize**
let's insert one more item with value 64, still repeat the **LINEAR_PROBES** progress, insert to the position at index 3
s.add(64) # hash(64) & mask == 0

now, value in field **fill** is 5, **mask** is 7, it will trigger the resize operation
the trigger condition is **fill*5 !< mask * 3**
// from cpython/Objects/setobject.c
if ((size_t)so->fill*5 < mask*3)
return 0;
return set_table_resize(so, so->used>50000 ? so->used*2 : so->used*4);
after resize,value 32 and 64 still repeat the **LINEAR_PROBES** progress, inserted at index 1 and index 2, because the value in **mask** field becomes 31, hash(16) is less than the mask, so 16 can be inserted to index 15
And field **table** no longer points to **smalltable**, instead, it points to a new malloced block

* ## **why LINEAR_PROBES**
* improve cache locality
* reduces the cost of hash collisions
* ## **clear**
* call stack
* static PyObject *set_clear(PySetObject *so, PyObject *Py_UNUSED(ignored))
* static int set_clear_internal(PySetObject *so)
* static void set_empty_to_minsize(PySetObject *so)
call clear to restore the initial state
s.clear()
