Skip to content

gh-119109: improve functools.partial vectorcall with keywords #124584

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 34 commits into
base: main
Choose a base branch
from

Conversation

dg-pb
Copy link
Contributor

@dg-pb dg-pb commented Sep 26, 2024

(Potentially closes #128050)

This IMO is the best approach to resolve fallback "issue". It:
a) Eliminates the need for the fallback or any need to switch between implementation after initial construction
b) Delivers performance benefits for vectorcall when partial has keywords

Benchmark:

# BENCH 2 ARGS
# ------------
S="
from functools import partial
f=lambda a, b: a - b
p1 = partial(f)
p2 = partial(f, b=2)
l = lambda a: f(a, b=2)
"

$PYCMD -c "${S}; print(p1(1, 2))"   # -1     | -1     |
$PYCMD -c "${S}; print(p2(1))"      # -1     | -1     |
                                    # BEFORE | AFTER  | %CHN | LAMBDA LB
$PYCMD -m timeit -s $S 'p1(1, 2)'   #  87 ns |  85 ns |      |
$PYCMD -m timeit -s $S 'p1(1, b=2)' # 100 ns |  96 ns |      |
$PYCMD -m timeit -s $S 'p2(1)'      # 240 ns | 135 ns | -45% |  94 ns
$PYCMD -m timeit -s $S 'p2(a=1)'    # 350 ns | 160 ns | -55% | 110 ns


# BENCH 10 ARGS
# -------------
S="
from functools import partial
func = lambda a, b, c, d, e, f, g, h, i, j: (a + b + c + d + e + f + g + h + i + j)
p = partial(func, f=5, g=6, h=7, i=8, j=9)
l = lambda a, b, c, d, e, f=5, g=6, h=7, i=8, j=9: func(a, b, c, d, e, f=f, g=g, h=h, i=i, j=j)
"

C0="${S}; print(p(0, 1, 2, 3, 4))"
C1='p(0, 1, 2, 3, 4)'
C2='p(a=0, b=1, c=2, d=3, e=4)'                             # disjoint kw and pto_kw
C3='p(a=0, b=1, c=2, d=3, e=4, f=5, g=6)'                   # kw partially overlaps pto_kw
C4='p(a=0, b=1, c=2, d=3, e=4, f=5, g=6, h=7, i=8, j=9)'    # kw overrides pto_kw


$PYCMD -c $C0               #  45     | 45     |
                            #  BEFORE | AFTER  | %CHN | LAMBDA LB
$PYCMD -m timeit -s $S $C1  #  440 ns | 320 ns | -28% | 240 ns
$PYCMD -m timeit -s $S $C2  #  890 ns | 440 ns | -50% | 260 ns
$PYCMD -m timeit -s $S $C3  # 1000 ns | 600 ns | -40% | 270 ns
$PYCMD -m timeit -s $S $C4  # 1250 ns | 700 ns | -44% | 300 ns

# FUNCTION CALL - 210 ms
$PYCMD -m timeit -s $S 'f(a=0, b=1, c=2, d=3, e=4, f=5, g=6, h=7, i=8, j=9)'

No penalty for calls without pto_kwds.
Non negligible speed improvement for calls with pto_kwds: 27 - 55%

@rhettinger rhettinger requested review from vstinner and removed request for rhettinger September 26, 2024 17:33
@rhettinger
Copy link
Contributor

Perhaps @vstinner has the time and interest in looking at this.

@dg-pb dg-pb marked this pull request as draft September 27, 2024 06:31
@dg-pb dg-pb marked this pull request as ready for review September 27, 2024 14:54
@dg-pb
Copy link
Contributor Author

dg-pb commented Sep 29, 2024

I think it is a good compromise between simplicity and performance now.

One micro-optimization that I couldn't figure out how to do simply is pre-storing kwnames tuple so it doesn't need to be created on every call. It would drop another ~50 ns.

Not sure how much sense it makes yet, but I posted faster-cpython/ideas#699 in relation to this.

Ready for review now.

@dg-pb
Copy link
Contributor Author

dg-pb commented Oct 17, 2024

Was wandering if it might be worth factoring out macros for private use.

@picnixz
Copy link
Member

picnixz commented Jan 5, 2025

Is the performance change a direct effect of the simplification or is this the other way around? (namely, can we decouple the performance gains from the simplification?) (not that the PR is too big but for bisecting commits, it's easier when we have atomic changes)

@dg-pb
Copy link
Contributor Author

dg-pb commented Jan 5, 2025

Your title change is incorrect, this has wider implications.

It removes dynamic switching, which causes issues with free threading. And generally results in more linear straight forward flow, so it is a design improvement as well.

The fact that vectorcall will be used in more cases, which is faster is only a benefit. Although from users perspective the performance gain is the only thing worth mentioning.

@picnixz picnixz changed the title gh-119109: improve performance for functools.partial vectorcall with keywords gh-119109: improve functools.partial vectorcall with keywords Jan 5, 2025
@dg-pb
Copy link
Contributor Author

dg-pb commented Jan 5, 2025

Is the performance change a direct effect of the simplification

yes, it is one atomic change. There is no way to split it.

}
Py_XDECREF(pto_kw_merged);

/* Resize Stack if the call has keywords */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it needed? The stack can be slightly overallocated, but is this a problem?

Copy link
Contributor Author

@dg-pb dg-pb Jul 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory this can grow to moderate size.

E.g.

len(pto_kwds) = 100
len(kwds) = 100
set(pto_kwds) == set(kwds)

len(stack) = 300
len(used_stack) = 100

say all keys and values are 1-character strings.
Then, size of used objects is 100 * 2 * 42 = 8400.
Over-allocated memory is 200 * 8 = 1600.

So that is 20% of extra memory.

Of course, this is both overestimation and an extreme case.

But still, maybe it is a good idea to eliminate 0.1% of extreme cases.

github search:

  • /\bpartial\(/ - 900K files
  • /\bpartial\((.*=){7,}.*/ - 1K files

So I say maybe instead of completely removing it, could change if (nkwds && ...) to if (nkwds > 6 && ...).

This would eliminate performance overhead for 99% of cases and would safeguard against unconventional usage.

Also, for say 1 keyword argument, reallocation can have a visible performance impact, but for 7 or more, it will be a negligible percentage of total.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am worrying more about the code complexity than performance and memory consumption. Although resizing the stack takes a time.

If you want to leave it and make it conditional, use something like nkwds + n_merges > init_stack_size/2, so the stack will only be resized if this saves significant amount memory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am in favour of keeping this for now. Given the whole block can be removed without breaking anything at any time, I think this is more like a "soft complexity". I will see to a bit better rule and add a comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting review performance Performance or resource usage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race between partial_vectorcall_fallback and _PyVectorcall_FunctionInline under free-threading
8 participants