fastcgi: Optimize FastCGI transport #4978

Merged
merged 10 commits on Sep 2, 2022

Conversation

WeidiDeng
Member

Trying to fix #3803.

I tried it on a small PHP site; no problems so far. Extensive testing is required.

@mholt
Member

mholt commented Aug 23, 2022

Oooh, I can't wait to review this!!

@mholt mholt added in progress 🏃‍♂️ Being actively worked on under review 🧐 Review is pending before merging optimization 📉 Performance or cost improvements labels Aug 23, 2022
@WeidiDeng
Member Author

Out of curiosity, I checked the stdlib fcgi server implementation and found that the original implementation is mostly a copy of fcgi.go: for example, writeRecord, writePairs, and the record reading.

Searching through Go issues, I found that nobody has mentioned anything about performance; I guess not many people use it anyway.
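For context, both implementations frame data as FastCGI records: a fixed 8-byte header (version, type, request ID, content length, padding length) followed by the content and padding. Here is a minimal sketch of writing one record, with hypothetical names; it is illustrative only, not the stdlib or Caddy code:

package fcgi

import (
    "bufio"
    "encoding/binary"
)

// header is the fixed 8-byte FastCGI record header defined by the spec.
type header struct {
    Version       uint8  // always 1
    Type          uint8  // e.g. params, stdin, stdout records
    ID            uint16 // request ID
    ContentLength uint16 // length of the record body
    PaddingLength uint8  // pads the body to an 8-byte boundary
    Reserved      uint8
}

// writeRecord frames content as a single FastCGI record and flushes it.
func writeRecord(w *bufio.Writer, recType uint8, reqID uint16, content []byte) error {
    h := header{
        Version:       1,
        Type:          recType,
        ID:            reqID,
        ContentLength: uint16(len(content)),
        PaddingLength: uint8(-len(content) & 7), // round up to a multiple of 8
    }
    if err := binary.Write(w, binary.BigEndian, h); err != nil {
        return err
    }
    if _, err := w.Write(content); err != nil {
        return err
    }
    if _, err := w.Write(make([]byte, h.PaddingLength)); err != nil {
        return err
    }
    return w.Flush()
}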

@francislavoie francislavoie added this to the 2.x milestone Aug 25, 2022
@mholt
Member

mholt commented Aug 25, 2022

@WeidiDeng I'd be interested in a profile to compare memory usage or a CPU benchmark to compare performance. I don't have a PHP backend installed though. Would you or someone like to do some performance testing?
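Besides a live wrk run, one lightweight way to get such numbers is a Go micro-benchmark with allocation reporting. A sketch building on the hypothetical writeRecord above (not part of the actual test suite):

package fcgi

import (
    "bufio"
    "bytes"
    "testing"
)

// BenchmarkWriteRecord reports allocations and ns/op for framing one
// 4 KiB stdin record; run with: go test -bench=WriteRecord -benchmem
func BenchmarkWriteRecord(b *testing.B) {
    var buf bytes.Buffer
    w := bufio.NewWriter(&buf)
    payload := make([]byte, 4096)
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        buf.Reset()
        w.Reset(&buf)
        if err := writeRecord(w, 5, 1, payload); err != nil {
            b.Fatal(err)
        }
    }
}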

@WeidiDeng
Member Author

@mholt I deployed Nextcloud with Docker Compose and tested there: large file uploads and a stress test with wrk only. If I remember correctly, with this patch the largest memory allocator is actually related to HTTP/2 frame writing. I'll upload those benchmarks when I find them.

@WeidiDeng
Member Author

The latest commit's changes are drastic, but in my testing environment there are no problems.

profile.tar.gz

Here is the benchmark using Nextcloud, separated into runs with and without this patch, and covering two use cases: large file upload and website browsing.

@mholt
Member

mholt commented Aug 26, 2022

@WeidiDeng Thanks for the profiles! I'm looking at them now. Interesting, so far the patch version allocates about 13% less heap memory (if I'm reading these right). Have you noticed any significant difference in req/sec?

@WeidiDeng
Member Author

[screenshot: wrk benchmark results, patched build vs. master]

The upper one is with this patch; the lower one is a build from master without it. Using Go 1.19. header.lua sets request cookies so as to simulate a user browsing Nextcloud directories.

Around 10% improvement.

@mholt
Member

mholt commented Aug 26, 2022

@WeidiDeng Am I reading the results wrong? The lower one is 3 req/sec higher than the upper one. (But they're so similar anyways.)

@WeidiDeng
Member Author

WeidiDeng commented Aug 29, 2022

I wrote a simple file server using Go's fastcgi package.

[screenshot: wrk benchmark results against the Go FastCGI file server]

The previous test showed no significant difference because its bottleneck was the MySQL server.

profile.tar.gz

In a longer test, however, the req/sec difference is smaller.

[screenshot: wrk results from the longer test run]

@WeidiDeng
Member Author

WeidiDeng commented Aug 30, 2022

@mholt Updated benchmark using php_info() as the output.

First, nginx in Docker as the baseline.

[screenshot: wrk results for the nginx baseline]

Then Caddy.

[screenshot: wrk results for Caddy]

profile.tar.gz

@mholt
Member

mholt commented Aug 31, 2022

@WeidiDeng That's awesome!! Way more useful information here I think.

The profile clearly shows about 10% less memory being allocated, and only ~5s spent on GC compared to 36s without the patch. Req/sec appear to be on-par with or higher than nginx. Very good!

Definitely interested in having more people test this and then getting this merged in 🎉

@tgulacsi

I've tested it with our main use case, the MantisBT issue tracker, and it works flawlessly.
The speed is the same as with 2.4.5; MantisBT is a very conservative PHP app with low request counts and low resource utilization...

@mholt
Member

mholt commented Aug 31, 2022

Thanks for testing, @tgulacsi! That's great to know. I will try to review this soon.

@mholt mholt changed the title reverse proxy: optimize fastcgi transport reverse_proxy: Optimize FastCGI transport Aug 31, 2022
@francislavoie francislavoie changed the title reverse_proxy: Optimize FastCGI transport fastcgi: Optimize FastCGI transport Aug 31, 2022
Member

@mholt mholt left a comment


@WeidiDeng This is really great, and I'm looking forward to merging this!

There's nothing obviously wrong that I can see. I'm not a PHP guy so I'm relying on those of you with PHP apps to test it out. (This will go into a beta release first.) But it LGTM and I'd be ready to merge it any time.

Have you run this with the race detector (go run -race or go build -race)?

Thanks so much for the exciting contribution!

@mholt mholt modified the milestones: 2.x, v2.6.0-beta.1 Sep 2, 2022
Also reduce some duplication
@mholt
Member

mholt commented Sep 2, 2022

If we could figure out how to set the logger when the clientCloser is created, we wouldn't have to make it allocate.

A cheap way is to pass CaptureStderr into the Dial() functions, but that's kind of ugly.

I wonder if a better way would be to create a FastCGIClient{} which has the CaptureStderr config. Then it has the Dial() functions as methods which can access fields on the Client. The Dial methods then return a Conn, more like how the Go standard library works.
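A rough sketch of that idea, with hypothetical type and field names (this is the shape of the suggestion, not the code that was eventually merged):

package fastcgi

import (
    "context"
    "net"

    "go.uber.org/zap"
)

// Conn is a minimal stand-in for the transport's FastCGI connection type.
type Conn struct {
    rwc    net.Conn
    stderr bool
    logger *zap.Logger
}

// FastCGIClient carries per-transport configuration, so the connection it
// dials is configured up front instead of being wrapped in a separately
// allocated closer later.
type FastCGIClient struct {
    CaptureStderr bool
    Logger        *zap.Logger
}

// Dial opens a connection that inherits the client's configuration,
// similar to how net.Dialer.DialContext returns a net.Conn.
func (c FastCGIClient) Dial(ctx context.Context, network, address string) (*Conn, error) {
    nc, err := (&net.Dialer{}).DialContext(ctx, network, address)
    if err != nil {
        return nil, err
    }
    return &Conn{rwc: nc, stderr: c.CaptureStderr, logger: c.Logger}, nil
}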

@mholt
Member

mholt commented Sep 2, 2022

Here's the delta/difference between the profiles you posted above for php_info(), @WeidiDeng:

$ go tool pprof -diff_base=master/heap patch/heap 
File: caddy
Type: inuse_space
Time: Aug 29, 2022 at 8:48pm (MDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for -1721.29kB, 6.29% of 27380.04kB total
Dropped 5 nodes (cum <= 136.90kB)
Showing top 10 nodes out of 130
      flat  flat%   sum%        cum   cum%
-3587.50kB 13.10% 13.10% -3587.50kB 13.10%  runtime.allocm
 2056.01kB  7.51%  5.59%  2056.01kB  7.51%  bufio.NewReaderSize
-1184.27kB  4.33%  9.92% -1184.27kB  4.33%  runtime/pprof.StartCPUProfile
 1024.02kB  3.74%  6.18%  1024.02kB  3.74%  net.JoinHostPort
 -544.67kB  1.99%  8.17%  -544.67kB  1.99%  github.com/caddyserver/caddy/v2/modules/caddyhttp/reverseproxy/fastcgi.(*record).read
 -516.01kB  1.88% 10.05%  -516.01kB  1.88%  runtime/pprof.(*profMap).lookup
     515kB  1.88%  8.17%      515kB  1.88%  sync.(*Pool).pinSlow
  514.63kB  1.88%  6.29%   514.63kB  1.88%  google.golang.org/genproto/googleapis/rpc/errdetails.init
     514kB  1.88%  4.41%      514kB  1.88%  hash/crc32.archInitCastagnoli
 -512.50kB  1.87%  6.29%  -512.50kB  1.87%  go.uber.org/zap/internal/bufferpool.init.func1


$ go tool pprof -diff_base=master/profile\?seconds\=30 patch/profile\?seconds\=30 
File: caddy
Type: cpu
Time: Aug 29, 2022 at 8:47pm (MDT)
Duration: 60.04s, Total samples = 70.11s (116.78%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for -25.11s, 35.82% of 70.11s total
Dropped 824 nodes (cum <= 0.35s)
Showing top 10 nodes out of 179
      flat  flat%   sum%        cum   cum%
    -5.53s  7.89%  7.89%    -31.07s 44.32%  runtime.gcDrain
    -4.35s  6.20% 14.09%     -4.62s  6.59%  runtime.(*lfstack).pop (inline)
    -3.19s  4.55% 18.64%     -9.58s 13.66%  runtime.scanobject
    -2.13s  3.04% 21.68%     -2.13s  3.04%  runtime.(*lfstack).push
    -2.06s  2.94% 24.62%     -2.10s  3.00%  runtime.pageIndexOf (inline)
    -1.86s  2.65% 27.27%     -1.86s  2.65%  runtime.markBits.isMarked (inline)
    -1.73s  2.47% 29.74%     -2.10s  3.00%  runtime.findObject
    -1.54s  2.20% 31.94%     -1.54s  2.20%  runtime.memclrNoHeapPointers
    -1.43s  2.04% 33.98%     -1.43s  2.04%  runtime/internal/syscall.Syscall6
    -1.29s  1.84% 35.82%     -8.90s 12.69%  runtime.gentraceback

Look at the green boxes (heap, then CPU):

[profile graph: heap diff]

[profile graph: CPU diff]

And for your file server profile:

$ go tool pprof -diff_base=master/heap patch/heap 
File: caddy
Type: inuse_space
Time: Aug 29, 2022 at 1:21am (MDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for -26.97MB, 49.32% of 54.68MB total
Dropped 10 nodes (cum <= 0.27MB)
Showing top 10 nodes out of 137
      flat  flat%   sum%        cum   cum%
  -27.04MB 49.45% 49.45%   -27.04MB 49.45%  bufio.NewWriterSize
      -2MB  3.66% 53.11%       -2MB  3.66%  runtime.allocm
    1.06MB  1.93% 51.18%     1.06MB  1.93%  github.com/yuin/goldmark/util.init
    1.03MB  1.89% 49.29%     1.03MB  1.89%  github.com/caddyserver/caddy/v2/modules/caddyhttp/reverseproxy.glob..func2
       1MB  1.84% 47.46%        1MB  1.84%  bufio.NewReaderSize
      -1MB  1.83% 49.29%       -1MB  1.83%  runtime.malg
       1MB  1.83% 47.46%        1MB  1.83%  github.com/caddyserver/caddy/v2/modules/caddyhttp/reverseproxy.cloneRequest
      -1MB  1.83% 49.29%       -1MB  1.83%  net/http.(*Server).newConn
   -0.55MB  1.00% 50.29%    -0.55MB  1.00%  html.populateMaps
    0.53MB  0.97% 49.32%     0.53MB  0.97%  github.com/alecthomas/chroma.TypeRemappingLexer


$ go tool pprof -diff_base=master/profile\?seconds\=30 patch/profile\?seconds\=30 
File: caddy
Type: cpu
Time: Aug 29, 2022 at 1:21am (MDT)
Duration: 60.32s, Total samples = 624.40s (1035.16%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for -127.22s, 20.37% of 624.40s total
Dropped 1119 nodes (cum <= 3.12s)
Showing top 10 nodes out of 267
      flat  flat%   sum%        cum   cum%
   -18.98s  3.04%  3.04%    -18.98s  3.04%  runtime.memclrNoHeapPointers
   -17.72s  2.84%  5.88%    -18.99s  3.04%  runtime.(*lfstack).pop (inline)
   -17.43s  2.79%  8.67%    -44.92s  7.19%  runtime.gcDrainN
   -15.56s  2.49% 11.16%    -42.02s  6.73%  runtime.scanobject
   -14.68s  2.35% 13.51%    -98.20s 15.73%  runtime.gcDrain
   -10.73s  1.72% 15.23%    -12.95s  2.07%  runtime.findObject
   -10.50s  1.68% 16.91%    -10.60s  1.70%  runtime.pageIndexOf (inline)
    -8.80s  1.41% 18.32%     -8.80s  1.41%  runtime.markBits.isMarked (inline)
    -7.29s  1.17% 19.49%     -7.28s  1.17%  runtime.(*lfstack).push
    -5.53s  0.89% 20.37%    -40.12s  6.43%  runtime.gentraceback

Heap, then CPU:

[profile graph: heap diff]

[profile graph: CPU diff]

Nice job 👏

(Lots of good info about using pprof tooling here: https://github.com/google/pprof/blob/main/doc/README.md#comparing-profiles)

@mholt
Member

mholt commented Sep 2, 2022

@WeidiDeng I pushed one more commit that refactors the client: it renames FCGIClient -> client and should have slightly fewer allocations (like, 2 fewer -- we don't allocate the client or the clientCloser, I think, since they're no longer pointers).

The tests pass on my machine, but it'd be good to have some more real-world PHP testing. Also run with the race detector (-race) and check for resource leaks. I think everything gets closed properly though. Thanks!

I'm about ready to merge this if you are. 🙂
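The allocation reasoning, in a nutshell: handing out a pointer usually forces the value onto the heap, while a plain value can live on the caller's stack or inline in a larger struct. A toy illustration, not the actual Caddy types:

package fastcgi

import "io"

type clientCloser struct{ r io.Reader }

// Returning a pointer typically makes the value escape to the heap,
// costing one allocation per request.
func newPointer(r io.Reader) *clientCloser { return &clientCloser{r: r} }

// Returning by value lets the compiler keep it on the caller's stack
// (or embed it directly in a containing struct), avoiding the allocation.
func newValue(r io.Reader) clientCloser { return clientCloser{r: r} }

Running go build -gcflags=-m on the package shows which values the escape analysis moves to the heap.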

@mholt mholt removed the in progress 🏃‍♂️ Being actively worked on label Sep 2, 2022
Member

@mholt mholt left a comment


LGTM. @WeidiDeng anything else you want to do before we merge this?

@WeidiDeng
Member Author

WeidiDeng commented Sep 2, 2022

@mholt I'm ready for it to be merged. As for the race detector, that's the one thing I haven't done. But that lock is unnecessary anyway, since the reverse proxy can't write to the body concurrently: the only time it's written is while sending the request, which isn't concurrent.

@mholt
Member

mholt commented Sep 2, 2022

@WeidiDeng Which lock are you referring to?

(I'll merge this in a few minutes then!)

@WeidiDeng
Member Author

@WeidiDeng Which lock are you referring to?

(I'll merge this in a few minutes then!)

The FCGIClient sync.Mutex is only used in writeRecord.
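For reference, this is roughly the pattern in question, simplified and with names assumed from the thread rather than copied from the source:

package fastcgi

import (
    "io"
    "sync"
)

// Sketch of the pre-patch client layout (simplified).
type FCGIClient struct {
    mutex sync.Mutex
    rwc   io.ReadWriteCloser
}

// writeRecord took the lock around every record write. The reverse proxy
// only writes the request from a single goroutine, so the mutex serialized
// nothing and just added overhead; removing it is safe as long as -race
// confirms no other writer exists.
func (c *FCGIClient) writeRecord(record []byte) error {
    c.mutex.Lock()
    defer c.mutex.Unlock()
    _, err := c.rwc.Write(record)
    return err
}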

@mholt
Member

mholt commented Sep 2, 2022

@WeidiDeng Looks like it was removed in this PR -- if there are no races, sounds good to me! Thank you! 😄 Please contribute again. (I will catch up to your other PRs shortly; I'm behind.)

@mholt mholt merged commit 83b2697 into caddyserver:master Sep 2, 2022
@mholt mholt removed the under review 🧐 Review is pending before merging label Sep 2, 2022
@mholt mholt modified the milestones: v2.6.0-beta.1, v2.6.0 Sep 13, 2022
eanavitarte added a commit to eanavitarte/caddy that referenced this pull request May 2, 2023
Hello, first of all, I love Caddy! I hope this helps.

THE PROBLEM:
In the current Caddy version ("83b2697"), the "capture_stderr" option of the "php_fastcgi" directive is not working.

HOW TO TEST THE PROBLEM:
If you want to test it, just create a PHP file with a syntax error of any kind and serve it over php_fastcgi.

WHEN THE PROBLEM STARTED:
The problem started in the transition between commit "f2a7e7c" and the current "83b2697".

WHY THE PROBLEM STARTED:
On Sept 2, 2022, commit "83b2697" landed, which changed some things in fastcgi.go.
These changes were titled: fastcgi: Optimize FastCGI transport (caddyserver#4978).
A lot of internal functions were changed and replaced with logic distributed across separate files (which is good).
In those changes, a conditional that evaluates "t.CaptureStderr" (line 164, f2a7e7c) was removed.
That conditional set the correct logger to be passed to the client instance (fcgiBackend in f2a7e7c).
So the logger was always set to "noopLogger" (a zap.NewNop()), and no matter whether you set capture_stderr, nothing was ever written to the log.

HOW THE PROBLEM OCCURRED TECHNICALLY:
The change to the "t.CaptureStderr" (line 164, f2a7e7c) behaviour was probably made on the assumption that the evaluation of "c.stderr" (line 244, "83b2697") in client.go (another change introduced in "83b2697") would behave the same way, and that is right in a sense.
I mean, if "c.stderr" is set, it works as expected.
The problem is that the value evaluated in "t.CaptureStderr" was never passed to the client instance being created, so it was never evaluated in "c.stderr" (line 244, "83b2697").

THE SOLUTION:
I must say that, although it took me quite some time to find, the solution is actually quite simple.
To fix it, I just pass the value when the client instance is created, as shown in this pull request.


I hope this little enhancement helps.
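Roughly, the shape of the fix described above looks like this (a hypothetical sketch; the type and field names are assumptions based on the commits referenced, not verified against the Caddy source):

package fastcgi

import (
    "net"

    "go.uber.org/zap"
)

// Minimal stand-ins for the real Transport and client types.
type Transport struct {
    CaptureStderr bool
}

type client struct {
    rwc    net.Conn
    reqID  uint16
    stderr bool
    logger *zap.Logger
}

// buildClient passes the transport's CaptureStderr value along when the
// client is created, so the later "if c.stderr" check can actually fire
// and PHP stderr output reaches the configured logger.
func (t Transport) buildClient(conn net.Conn, logger *zap.Logger) client {
    return client{
        rwc:    conn,
        reqID:  1,
        stderr: t.CaptureStderr, // previously never set, so always false
        logger: logger,
    }
}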
francislavoie pushed a commit to eanavitarte/caddy that referenced this pull request May 4, 2023
Labels
optimization 📉 Performance or cost improvements
Development

Successfully merging this pull request may close these issues.

Improve performance of FastCGI transport (and clean up code)
4 participants