Go HTTP Server Performance Issue


I am writing an event collector http server which would be under heavy load. Hence in the http handler I am just deserialising the event and then running the actual processing outside of the http request-response cycle in a goroutine.

With this, I see that if I am hitting the server at 400 requests per second, then the latency is under 20ms for 99 percentile. But as soon as I bump the request rate to 500 per second, latency shoots up to over 800ms.

Could anyone please help me with some ideas on what the reason could be so that I can explore more.

package controller import ( "net/http" "encoding/json" "event-server/service" "time" ) func CollectEvent() http.Handler { handleFunc := func(w http.ResponseWriter, r *http.Request) { startTime := time.Now() stats.Incr("TotalHttpRequests", nil, 1) decoder := json.NewDecoder(r.Body) var event service.Event err := decoder.Decode(&event) if err != nil { http.Error(w, "Invalid json: " + err.Error(), http.StatusBadRequest) return } go service.Collect(&event) w.Write([]byte("Accepted")) stats.Timing("HttpResponseDuration", time.Since(startTime), nil, 1) } return http.HandlerFunc(handleFunc) }

I ran a test with 1000 requests per second and profiled it. Following are the results.

(pprof) top20 Showing nodes accounting for 3.97s, 90.85% of 4.37s total Dropped 89 nodes (cum <= 0.02s) Showing top 20 nodes out of 162 flat flat% sum% cum cum% 0.72s 16.48% 16.48% 0.72s 16.48% runtime.mach_semaphore_signal 0.65s 14.87% 31.35% 0.66s 15.10% syscall.Syscall 0.54s 12.36% 43.71% 0.54s 12.36% runtime.usleep 0.46s 10.53% 54.23% 0.46s 10.53% runtime.cgocall 0.34s 7.78% 62.01% 0.34s 7.78% runtime.mach_semaphore_wait 0.33s 7.55% 69.57% 0.33s 7.55% runtime.kevent 0.30s 6.86% 76.43% 0.30s 6.86% syscall.RawSyscall 0.10s 2.29% 78.72% 0.10s 2.29% runtime.mach_semaphore_timedwait 0.07s 1.60% 80.32% 1.25s 28.60% net.dialSingle 0.06s 1.37% 81.69% 0.11s 2.52% runtime.notetsleep 0.06s 1.37% 83.07% 0.06s 1.37% runtime.scanobject 0.06s 1.37% 84.44% 0.06s 1.37% syscall.Syscall6 0.05s 1.14% 85.58% 0.05s 1.14% internal/poll.convertErr 0.05s 1.14% 86.73% 0.05s 1.14% runtime.memmove 0.05s 1.14% 87.87% 0.05s 1.14% runtime.step 0.04s 0.92% 88.79% 0.09s 2.06% runtime.mallocgc 0.03s 0.69% 89.47% 0.58s 13.27% net.(*netFD).connect 0.02s 0.46% 89.93% 0.40s 9.15% net.sysSocket 0.02s 0.46% 90.39% 0.03s 0.69% net/http.(*Transport).getIdleConn 0.02s 0.46% 90.85% 0.13s 2.97% runtime.gentraceback (pprof) top --cum Showing nodes accounting for 70ms, 1.60% of 4370ms total Dropped 89 nodes (cum <= 21.85ms) Showing top 10 nodes out of 162 flat flat% sum% cum cum% 0 0% 0% 1320ms 30.21% net/http.(*Transport).getConn.func4 0 0% 0% 1310ms 29.98% net.(*Dialer).Dial 0 0% 0% 1310ms 29.98% net.(*Dialer).Dial-fm 0 0% 0% 1310ms 29.98% net.(*Dialer).DialContext 0 0% 0% 1310ms 29.98% net/http.(*Transport).dial 0 0% 0% 1310ms 29.98% net/http.(*Transport).dialConn 0 0% 0% 1250ms 28.60% net.dialSerial 70ms 1.60% 1.60% 1250ms 28.60% net.dialSingle 0 0% 1.60% 1170ms 26.77% net.dialTCP 0 0% 1.60% 1170ms 26.77% net.doDialTCP (pprof)


<h2>The problem</h2> <blockquote>

I am using another goroutine because I dont want the processing to happen in the http request-response cycle.


That's a common fallacy (and hence trap). The line of reasoning appears to be sound: you're trying to process requests "somewhere else" in an attempt to handle ingress HTTP requests as fast as possible.

The problem is that that "somewhere else" is still some code which <em>runs concurrently</em> with the rest of your request-handling churn. Hence if that code runs <em>slower</em> than the rate of ingress requests, your processing goroutines will pile up essentially draining one or more resources. Which exactly—depends on the actual processing: if it's CPU-bound, it will create natural contention for the CPU between all those GOMAXPROCS hardware threads of execution; if it's bound to network I/O, it will create load on Go runtime scheruler which has to divide the available execution quanta it has on its hands between all those goroutines wanted to be executed; if it's bound to disk I/O or other syscalls you will have proliferation of OS threads created, and so on and so on…

Essentially, <em>you are <strong>queueing</strong> the work units</em> converted from the ingress HTTP requests, but <em>queues do not fix overload.</em> They might be used to absorb short spikes of overload, but this only works when such spikes are "surrounded" by the periods of load at least slightly below the maximum capacity provided by your system. The fact you're queueing is not directly seen in your case, but it's there, and it's exhibited by pressing your system past its natural capacity—your "queue" starts to grow indefinitely.

Please read <a href="https://ferd.ca/queues-don-t-fix-overload.html" rel="nofollow">this classic essay</a> carefully to understand why your approach is not going to work in realistic production setting. Pay close attention to those pictures of the kitchen sinks.

<h2>What to do about it?</h2>

Unfortunately, it's almost impossible to give your simple solution as we're not working with your code in your settings with your workload. Still, here are a couple of directions to explore.

On the most broad scale, try to see whether you have some easily discernible bottleneck in your system which you presently cannot see. For instance, if all those concurrent worker goroutines eventually talk to an RDBMs instance, its disk I/O may quite easily <em>serialize</em> all those goroutines which will merely wait for their turn to have their data accepted. The bottleneck may be simpler—say, in each worker goroutine you carelessly execute some long-running operation while holding a lock contended on by all those goroutines; this obviously serializes them all.

The next step would be to actually <em>measure</em> (I mean, by writing a benchmark) how many time does it take for a single worker to complete its unit of work. Then you need to measure how this number changes when increasing the concurrency factor. After collecting these data, you will be able to do educated projections about at what <em>realistic</em> rate your system is able to handle the requests.

The next step is to think through your strategy at making your system fulfil those calculated expectations. Usually this means <em>limiting the rate of ingress requests.</em> There are different approaches to achieve this. Look at <a href="https://godoc.org/golang.org/x/time/rate" rel="nofollow">golang.org/x/time/rate</a> for a time-based rate limiter but it's possible to start with lower-tech approaches such as using a buffered channel as a counting semaphore. The requests which would overflow your capacity may be rejected (typically with HTTP status code 429, see <a href="https://tools.ietf.org/html/rfc6585#section-4" rel="nofollow">this</a>). You might also consider queueing them briefly but I'd try this only to serve as a cherry on a pie—that is, when you have the rest sorted out completely.

The question of what to do with rejected requests depends on your setting. Typically you try to "scale horizontally" by deploying more than one service to process your requests and teach your clients to switch over available services. (I'd stress that it means several <em>independent</em> services—if they all share some target sink which collects their data, they might be limited by the ultimate capacity of that sink, and adding more systems won't gain you anything.)

Let me repeat that the general problem has no magic solutions: if your complete stystem (with this HTTP service you're writing being merely its front-end, gateway, part) is only able to handle N RPS of load, no amount of scattering go processRequest() is going to make it handle requests at a higher pace. The easy concurrency Go offers is not a <a href="https://en.wikipedia.org/wiki/No_Silver_Bullet" rel="nofollow">silver bullet</a>, it's a machine gun.