-
-
Notifications
You must be signed in to change notification settings - Fork 32.3k
gh-119109: improve functools.partial
vectorcall with keywords
#124584
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Perhaps @vstinner has the time and interest in looking at this. |
I think it is a good compromise between simplicity and performance now. One micro-optimization that I couldn't figure out how to do simply is pre-storing Not sure how much sense it makes yet, but I posted faster-cpython/ideas#699 in relation to this. Ready for review now. |
Was wandering if it might be worth factoring out macros for private use. |
Modules/_functoolsmodule.c
Outdated
} | ||
Py_XDECREF(pto_kw_merged); | ||
|
||
/* Resize Stack if the call has keywords */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it needed? The stack can be slightly overallocated, but is this a problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory this can grow to moderate size.
E.g.
len(pto_kwds) = 100
len(kwds) = 100
set(pto_kwds) == set(kwds)
len(stack) = 300
len(used_stack) = 100
say all keys and values are 1-character strings.
Then, size of used objects is 100 * 2 * 42 = 8400.
Over-allocated memory is 200 * 8 = 1600.
So that is 20% of extra memory.
Of course, this is both overestimation and an extreme case.
But still, maybe it is a good idea to eliminate 0.1% of extreme cases.
github search:
/\bpartial\(/
- 900K files/\bpartial\((.*=){7,}.*/
- 1K files
So I say maybe instead of completely removing it, could change if (nkwds && ...)
to if (nkwds > 6 && ...)
.
This would eliminate performance overhead for 99% of cases and would safeguard against unconventional usage.
Also, for say 1 keyword argument, reallocation can have a visible performance impact, but for 7 or more, it will be a negligible percentage of total.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am worrying more about the code complexity than performance and memory consumption. Although resizing the stack takes a time.
If you want to leave it and make it conditional, use something like nkwds + n_merges > init_stack_size/2
, so the stack will only be resized if this saves significant amount memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am in favour of keeping this for now. Given the whole block can be removed without breaking anything at any time, I think this is more like a "soft complexity". I will see to a bit better rule and add a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Py_ssize_t noveralloc = n_merges + nkwds
if (noveralloc > 6 && noveralloc * 10 > init_stack_size && ...) {
E.g.: For init_stack_size = 1000
overallocation of 100 or more would be trimmed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. 👍
No refleaks: ❯ ./python.exe -m test -R 3:3 test_functools
Using random seed: 458260461
0:00:00 load avg: 1.72 Run 1 test sequentially in a single process
0:00:00 load avg: 1.72 [1/1] test_functools
beginning 6 repetitions. Showing number of leaks (. for 0 or less, X for 10 or more)
123:456
XX. ...
0:00:02 load avg: 1.66 [1/1] test_functools passed
== Tests result: SUCCESS ==
1 test OK.
Total duration: 2.1 sec
Total tests: run=321 skipped=1
Total test files: run=1/1
Result: SUCCESS |
I get the same on And also the same if I remove all tests except Maybe not related to this PR? EDIT: I assumed this shows that there are refleaks? Or am I failing to interpret something? Never used |
(Potentially closes #128050)
This IMO is the best approach to resolve fallback "issue". It:
a) Eliminates the need for the fallback or any need to switch between implementation after initial construction
b) Delivers performance benefits for vectorcall when
partial
has keywordsBenchmark:
No penalty for calls without
pto_kwds
.Non negligible speed improvement for calls with
pto_kwds
: 27 - 55%