Like any complex production service, Thread experiences outages and issues in production that affect our users from time to time. In order to improve our service quality we run post-mortems of these incidents to identify ways to fix the root causes and improve service reliability moving forward. This post summarises one such analysis.
Typical production Django deployments usually consist of a WSGI server such as Gunicorn running your app, sitting behind a reverse proxy such as NGINX. Reverse proxies tend to scale better and are less susceptible to common Denial of Service (DoS) attacks, which is one of the main reasons not to expose your WSGI server directly.
Because WSGI servers like Gunicorn aren’t built to protect against DoS attacks, it’s easy to trigger DoS conditions against a Django application with even slightly malformed requests. For example, take a request whose body is smaller than the advertised Content-Length: if the view tries to read the body, it will hang and block the current worker. This issue has more details. With NGINX, the proxy_request_buffering setting is what protects us from this, since the request is buffered in full before it hits the backend.
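To make the failure mode concrete, here is a minimal, self-contained sketch using only Python's standard library. It is not Gunicorn's code: `build_request`, `read_body` and `demo` are hypothetical helpers, and a short socket timeout stands in for what would otherwise be an indefinite hang in a worker.

```python
import socket
from typing import Optional


def build_request(body: bytes, advertised_length: int) -> bytes:
    """Build a raw HTTP/1.1 request whose Content-Length may not match the body."""
    head = (
        "POST /example HTTP/1.1\r\n"
        "Host: localhost\r\n"
        f"Content-Length: {advertised_length}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body


def read_body(conn: socket.socket, timeout: float = 0.2) -> Optional[bytes]:
    """Read the headers, then the advertised number of body bytes.

    Returns None if the full body never arrives. Without the timeout,
    the recv() call below would simply block, as our workers did.
    """
    conn.settimeout(timeout)
    buf = b""
    try:
        # Read until the blank line that ends the header section.
        while b"\r\n\r\n" not in buf:
            chunk = conn.recv(4096)
            if not chunk:
                return None
            buf += chunk
        head, _, body = buf.partition(b"\r\n\r\n")

        # Find the advertised body length (skipping the request line).
        length = 0
        for line in head.split(b"\r\n")[1:]:
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())

        # Keep reading until we have as many bytes as the header promised.
        while len(body) < length:
            chunk = conn.recv(length - len(body))
            if not chunk:
                return None
            body += chunk
    except socket.timeout:
        return None
    return body[:length]


def demo(body: bytes, advertised_length: int) -> Optional[bytes]:
    """Send one request over a local socket pair and try to read its body."""
    client, server = socket.socketpair()
    try:
        client.sendall(build_request(body, advertised_length))
        return read_body(server)
    finally:
        client.close()
        server.close()
```

Calling `demo(b"abc", 3)` reads the body normally, while `demo(b"abc", 10)` waits for seven bytes that never arrive and gives up only because of the timeout.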
This specific situation should not have been a problem for us, given that we run behind a properly configured NGINX instance. We were therefore pretty surprised when we tracked a drop in Gunicorn capacity down to workers stuck on a read() call, waiting for a GET request’s body to become available [1].
We traced the original request coming into our infrastructure, and at the point of hitting NGINX it was well-formed (i.e. the Content-Length header matched its body). This pointed to the real cause: we’d recently introduced a Node.js-based server to handle server-side rendering of our React frontend. The architecture is as follows:
Requests from Node.js to Gunicorn do not go through NGINX for 2 main reasons:
The requests causing our Gunicorn workers to lock up were coming from the Node.js server. It strips the bodies from GET requests before forwarding them, and in doing so it was corrupting them and creating what was essentially a self-induced Denial of Service. We strip these bodies because the Fetch API does not allow sending GET requests with a body [2]. As we never send GET requests with bodies ourselves, we missed that the related headers (such as Content-Length) needed stripping as well, and this was never caught in normal operations (in fact, it took some malicious requests [1] to trigger this failure mode).
The fix for this was simple:
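In outline, the idea is to drop the headers that describe a body whenever the body itself is dropped. Below is a minimal sketch of that idea in Python. The real change lives in our Node.js server, so treat this as an illustration only: the function name `prepare_upstream_request` and the exact set of headers in `BODY_HEADERS` are assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

# Headers that only make sense when a body is present.
# This particular set is an assumption for the sketch.
BODY_HEADERS = frozenset(
    {"content-length", "content-type", "content-encoding", "transfer-encoding"}
)


def prepare_upstream_request(
    method: str, headers: Dict[str, str], body: Optional[bytes]
) -> Tuple[Dict[str, str], Optional[bytes]]:
    """Return the (headers, body) pair to forward to the backend.

    For GET and HEAD the body is dropped, and so is every header that
    describes it, so the backend is never promised bytes it won't get.
    """
    if method.upper() in {"GET", "HEAD"}:
        kept = {
            name: value
            for name, value in headers.items()
            if name.lower() not in BODY_HEADERS
        }
        return kept, None
    # Other methods are forwarded unchanged (copying to avoid aliasing).
    return dict(headers), body
```

Matching header names case-insensitively matters here, because HTTP header names are case-insensitive and clients do not all capitalise them the same way.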
[1] Specifically, we were receiving malformed GET requests sent to our GraphQL endpoint. The sender had reproduced the request we make from our frontend, replaced some variables with SQL injection payloads, and then sent it as a GET. The library we use attempts to read the body of the request regardless of the HTTP method. As we send all our GraphQL requests over POST, and GraphQL over GET usually uses query parameters, we had not encountered this particular failure mode before.