@@ -270,8 +270,8 @@ result. This has two primary good effects:
270
270
271
271
Computing minrun
272
272
----------------
273
- If N < MAX_MINRUN, minrun is N. IOW, binary insertion sort is used for the
274
- whole array then; it's hard to beat that given the overheads of trying
273
+ If N < MAX_MINRUN, minrun is N. IOW, binary insertion sort is used for the
274
+ whole array then; it's hard to beat that given the overheads of trying
275
275
something fancier (see note BINSORT).
276
276
277
277
When N is a power of 2, testing on random data showed that minrun values of
@@ -288,7 +288,6 @@ that 32 isn't a good choice for the general case! Consider N=2112:
288
288
289
289
>>> divmod(2112, 32)
290
290
(66, 0)
291
- >>>
292
291
293
292
If the data is randomly ordered, we're very likely to end up with 66 runs
294
293
each of length 32. The first 64 of these trigger a sequence of perfectly
@@ -301,22 +300,94 @@ to get 64 elements into place).
301
300
If we take minrun=33 in this case, then we're very likely to end up with 64
302
301
runs each of length 33, and then all merges are perfectly balanced. Better!
303
302
304
- What we want to avoid is picking minrun such that in
303
+ The original code used a cheap heuristic to pick a minrun that avoided the
304
+ very worst cases of imbalance for the final merge, but "pretty bad" cases
305
+ still existed.
305
306
306
- q, r = divmod(N, minrun)
307
+ In 2025, Stefan Pochmann found a much better approach, based on letting minrun
308
+ vary a bit from one run to the next. Under his scheme, at _all_ levels of the
309
+ merge tree:
307
310
308
- q is a power of 2 and r>0 (then the last merge only gets r elements into
309
- place, and r < minrun is small compared to N), or q a little larger than a
310
- power of 2 regardless of r (then we've got a case similar to "2112", again
311
- leaving too little work for the last merge to do) .
311
+ - The number of runs is a power of 2.
312
+ - At most two different run lengths appear.
313
+ - When two do appear, the smaller is one less than the larger.
314
+ - The lengths of run pairs merged never differ by more than one .
312
315
313
- Instead we pick a minrun in range(MAX_MINRUN / 2, MAX_MINRUN + 1) such that
314
- N/minrun is exactly a power of 2, or if that isn't possible, is close to, but
315
- strictly less than, a power of 2. This is easier to do than it may sound:
316
- take the first log2(MAX_MINRUN) bits of N, and add 1 if any of the remaining
317
- bits are set. In fact, that rule covers every case in this section, including
318
- small N and exact powers of 2; merge_compute_minrun() is a deceptively simple
319
- function.
316
+ So, in all respects, as perfectly balanced as possible.
317
+
318
+ For the 2112 case, that also keeps minrun at 33, but we were lucky there
319
+ that 2112 is 33 times a power of 2. The new approach doesn't rely on luck.
320
+
321
+ For example, with 315 random elements, the old scheme uses fixed minrun=40 and
322
+ produces runs of length 40, except for the last. The new scheme produces a
323
+ mix of lengths 39 and 40:
324
+
325
+ old: 40 40 40 40 40 40 40 35
326
+ new: 39 39 40 39 39 40 39 40
327
+
328
+ Both schemes produce eight runs, a power of 2. That's good for a balanced
329
+ merge tree. But the new scheme allows merges where left and right length
330
+ never differ by more than 1:
331
+
332
+ 39 39 40 39 39 40 39 40
333
+ 78 79 79 79
334
+ 157 158
335
+ 315
336
+
337
+ (This shows merges downward, e.g., two runs of length 39 are merged and
338
+ become a run of length 78.)
339
+
340
+ With larger lists, the old scheme can get even more unbalanced. For example,
341
+ with 32769 elements (that's 2**15 + 1), it uses minrun=33 and produces 993
342
+ runs (of length 33). That's not even a power of 2. The new scheme instead
343
+ produces 1024 runs, all with length 32 except for the last one with length 33.
344
+
345
+ How does it work? Ideally, all runs would be exactly equally long. For the
346
+ above example, each run would have 315/8 = 39.375 elements. Which of course
347
+ doesn't work. But we can get close:
348
+
349
+ For the first run, we'd like 39.375 elements. Since that's impossible, we
350
+ instead use 39 (the floor) and remember the current leftover fraction 0.375.
351
+ For the second run, we add 0.375 + 39.375 = 39.75. Again impossible, so we
352
+ instead use 39 and remember 0.75. For the third run, we add 0.75 + 39.375 =
353
+ 40.125. This time we get 40 and remember 0.125. And so on. Here's a Python
354
+ generator doing that:
355
+
356
+ def gen_minruns_with_floats(n):
357
+ mr = n
358
+ while mr >= MAX_MINRUN:
359
+ mr /= 2
360
+
361
+ mr_current = 0
362
+ while True:
363
+ mr_current += mr
364
+ yield int(mr_current)
365
+ mr_current %= 1
366
+
367
+ But while all arithmetic here can be done exactly using binery floating point,
368
+ floats have less precision that a Py_ssize_t, and mixing floats with ints is
369
+ needlessly expensive anyway.
370
+
371
+ So here's an integer version, where the internal numbers are scaled up by
372
+ 2**e, or rather not divided by 2**e. Instead, only each yielded minrun gets
373
+ divided (by right-shifting). For example instead of adding 39.375 and
374
+ reducing modulo 1, it just adds 315 and reduces modulo 8. And always divides
375
+ by 8 to get each actual minrun value:
376
+
377
+ def gen_minruns_simpler(n):
378
+ e = 0
379
+ while (n >> e) >= MAX_MINRUN:
380
+ e += 1
381
+ mask = (1 << e) - 1
382
+
383
+ mr_current = 0
384
+ while True:
385
+ mr_current += n
386
+ yield mr_current >> e
387
+ mr_current &= mask
388
+
389
+ See note MINRUN CODE for a full implementation and a driver that exhaustively
390
+ verifies the claims above for all list lengths through 2 million.
320
391
321
392
322
393
The Merge Pattern
@@ -820,3 +891,75 @@ partially mitigated by pre-scanning the data to determine whether the data is
820
891
homogeneous with respect to type. If so, it is sometimes possible to
821
892
substitute faster type-specific comparisons for the slower, generic
822
893
PyObject_RichCompareBool.
894
+
895
+ MINRUN CODE
896
+ from itertools import accumulate
897
+ try:
898
+ from itertools import batched
899
+ except ImportError:
900
+ from itertools import islice
901
+ def batched(xs, k):
902
+ it = iter(xs)
903
+ while chunk := tuple(islice(it, k)):
904
+ yield chunk
905
+
906
+ MAX_MINRUN = 64
907
+
908
+ def gen_minruns(n):
909
+ # In listobject.c, initialization is done in merge_init(), and
910
+ # the body of the loop in minrun_next().
911
+ mr_e = 0
912
+ while (n >> mr_e) >= MAX_MINRUN:
913
+ mr_e += 1
914
+ mr_mask = (1 << mr_e) - 1
915
+
916
+ mr_current = 0
917
+ while True:
918
+ mr_current += n
919
+ yield mr_current >> mr_e
920
+ mr_current &= mr_mask
921
+
922
+ def chew(n, show=False):
923
+ if n < 1:
924
+ return
925
+
926
+ sizes = []
927
+ tot = 0
928
+ for size in gen_minruns(n):
929
+ sizes.append(size)
930
+ tot += size
931
+ if tot >= n:
932
+ break
933
+ assert tot == n
934
+ print(n, len(sizes))
935
+
936
+ small, large = MAX_MINRUN // 2, MAX_MINRUN
937
+ while len(sizes) > 1:
938
+ assert not len(sizes) & 1
939
+ assert len(sizes).bit_count() == 1 # i.e., power of 2
940
+ assert sum(sizes) == n
941
+ assert min(sizes) >= min(n, small)
942
+ assert max(sizes) <= large
943
+
944
+ d = set(sizes)
945
+ assert len(d) <= 2
946
+ if len(d) == 2:
947
+ lo, hi = sorted(d)
948
+ assert lo + 1 == hi
949
+
950
+ mr = n / len(sizes)
951
+ for i, s in enumerate(accumulate(sizes, initial=0)):
952
+ assert int(mr * i) == s
953
+
954
+ newsizes = []
955
+ for a, b in batched(sizes, 2):
956
+ assert abs(a - b) <= 1
957
+ newsizes.append(a + b)
958
+ sizes = newsizes
959
+ smsll = large
960
+ large *= 2
961
+
962
+ assert sizes[0] == n
963
+
964
+ for n in range(2_000_001):
965
+ chew(n)
0 commit comments