Musings after detecting and fixing a memory leak in Python
Is the reliability of your software important?
The ProjectX fsck-up from last week - from the technical side; that the script should never have been run on prod in the first place is indisputable, and there's little need to repeat it yet again - is closely related to automatic resource management in Python. It works pretty well, most of the time, until one day it doesn't and all hell breaks loose.
It was many years ago, but I fought a few battles against the Python garbage collector; I emerged - barely - victorious, though the losses on my side were many. I have since forgotten most of the lessons learned, and even if I hadn't, they would be close to useless today, as there have been at least two major GC overhauls since then. I thought it wouldn't hurt to get back on top of the game.
Furthermore, S. said that he doesn't know the details of the garbage collection scheme in Python. I had a feeling that he's not the only one - I mean, I was like this until yesterday, too - and so I thought it would benefit all of our programming efforts to learn the ropes. Hence, I decided to prepare a short (I'll pretend I don't hear you laughing...) write-up on the GC and how it relates to the problem at hand.
### CPython Memory Management Scheme
1. Python is a managed language, like Java, JS, PHP, Ruby, Go, PERL, TCL, Lisp, Smalltalk, and so on (and on, and on - this memory management scheme seems to have "won" in the '90s), and unlike C, C++, D, Pascal/Delphi, Cyclone, Fortran, Ada, Rust, Forth, Asm, and so on.
   - this means that, most of the time, there is no need to explicitly allocate or deallocate memory
   - this makes Python a *memory-safe language*: certain kinds of bugs (use after free, double free, out of bounds read, buffer overflow) cannot happen in pure Python code
2. Since CPython 1.6, Python uses two (only the first one prior to that) complementary automatic memory management techniques:
   - reference counting: each object has a counter, which gets incremented each time the object is referenced from a different context (other stack frame, other name in the same scope, inside of another container) and decremented once that reference stops being reachable (by being `del`eted, overwritten, going out of scope, being removed from a container, etc.). Once the counter drops to 0, the object is deallocated: *immediately, synchronously, and predictably*. This is a *CPython implementation detail* (ie. it may or may not be true in PyPy, Jython, IronPython, etc), but there's little chance it'll ever change, so it's ok to rely on it (unless you need to use another Python implementation).
   - cycle detection: there's a special (doubly-linked) list in the interpreter state which contains all "tracked" objects, that is, the ones allocated with `PyObject_GC_New` instead of `PyObject_New`. Most mutable objects, including class objects and dicts, are created this way. Every time `PyObject_GC_New` is called there's a *chance* (depending on the ratio of allocations to deallocations since the last collection) that the garbage collector will run. When that happens, the list is traversed and cycles - subgraphs of the object graph forming a closed loop, called "cyclic isolates" in the PEP - are found; the GC then traverses each cycle and successively decrements the refcounts ("breaks down the loop") until they reach 0; then the objects are deallocated. This happens - from the point of view of pure Python code - asynchronously.

   The two approaches are complementary: every object in CPython is refcounted, which means the GC-aware objects are refcounted, too. Depending on the interpreter compile-time flags and/or runtime configuration the GC can be completely disabled; then `PyObject_GC_New` is simply equivalent to `PyObject_New` (see the sketch below).
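   A minimal sketch of the difference between the two mechanisms (the class and names here are made up purely for illustration): reference counting frees an object the moment the last reference disappears, while a cycle survives until the collector runs.

   ```python
   import gc
   import sys

   class Node:
       """Toy object that can take part in a reference cycle."""
       def __init__(self):
           self.other = None

   # Reference counting: the object dies as soon as the last reference disappears.
   n = Node()
   print(sys.getrefcount(n))  # 2: the name `n` plus the temporary reference held by the call
   del n                      # refcount hits 0 -> deallocated immediately, synchronously

   # A cycle keeps both refcounts above 0 even after the names are gone,
   # so only the cyclic garbage collector can reclaim it.
   a, b = Node(), Node()
   a.other, b.other = b, a    # a <-> b form a "cyclic isolate"
   del a, b                   # not freed yet - the objects still reference each other
   print(gc.collect())        # force a collection; reports the number of unreachable objects found
   ```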
3. Memory deallocation is a specific case of a more general process called "object finalization" (also "object destruction" in C++ and related languages). In Python, objects are always finalized just before deallocation - in the case of cycles, all of them are finalized before the first one is deallocated. It is possible to provide a custom finalization procedure for objects by defining a `__del__` magic method on a class. The finalizers are not restricted in what they can do, so it's possible for a finalizer to _resurrect_ an object by creating a fresh reference to itself somewhere.

   Prior to Python 3.4 the `__del__` methods *were never called in case the object was a part of a cycle*. In other words, the cyclic garbage collector refused to finalize and deallocate objects with a `__del__` method defined. *This is no longer the case*, thanks to [PEP-442](https://www.python.org/dev/peps/pep-0442/): the custom finalizer will be called even if the object is part of a cycle. It's a very welcome change, BTW, which makes `__del__` methods much safer to use than in the past.
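   To make PEP-442's effect concrete, here's a small sketch (illustrative names only): on Python >= 3.4 both finalizers below run when the cycle is collected; on older versions the pair would have been left uncollected in `gc.garbage` instead.

   ```python
   import gc

   class Resource:
       def __init__(self, name):
           self.name = name
           self.partner = None

       def __del__(self):
           # Custom finalizer - runs just before deallocation.
           print("finalizing", self.name)

   a, b = Resource("a"), Resource("b")
   a.partner, b.partner = b, a   # a reference cycle whose members define __del__
   del a, b
   gc.collect()                  # on Python >= 3.4 this prints "finalizing a" and "finalizing b"
   ```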
4. One important use case for object finalization is releasing resources - graphical contexts (GDK/Cairo), external event loops, loaded libraries, mutexes, child processes, threads, sockets, files and so on. Many built-in objects provide a finalizer, either as a `__del__` method or (equivalently) a C function stored in the `tp_finalize` field of the `PyTypeObject` struct (`tp_del` prior to 3.4).
5. *Important note*: exceptions raised in finalizers are not propagated, and are silently swallowed. This is because there is no guarantee that, at the point in time the finalizer is called, a context where the exception should be raised still exists; an object cycle created inside a function call may get collected after the call already finished, for example.
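   For example (a hypothetical class, just to show the behaviour): the exception below never propagates to the caller - CPython only prints an "Exception ignored in: ..." message to stderr and carries on.

   ```python
   class Noisy:
       def __del__(self):
           raise RuntimeError("boom")   # raised during finalization...

   n = Noisy()
   del n                    # ...but not propagated here; it's only reported on stderr
   print("still running")   # execution continues normally
   ```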
6. The garbage collection is guaranteed to happen just before the interpreter state is deleted, which always happens before the `python` process exits. Thus, since 3.4 at least, it's safe to assume the finalizers, including custom ones, *will* be run at some point.
7. I'm not sure when exactly, but in recent Python versions the memory allocator was improved to be able to return the freed memory to the operating system. In Python 2.x, once the memory was requested from the OS, it was never given back - this could, in certain scenarios involving many long-running processes, lead to resource exhaustion even if the memory was not being actively used. This is *no longer the case*, which is another very welcome change.
---

These are the basics - the details are a bit more complicated due to various optimizations (how often the GC is called), the objects being split into 3 generations (the GC in CPython is a "generational" one), the GC maintaining separate lists for objects of certain built-in types, and so on, but in practice they don't matter much, unless you're working with a particularly twisted allocation profile. If you want to know more - just ask, I'll either answer or at least point in the general direction of the answer.
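For the curious, the `gc` module exposes most of these knobs and counters directly; a quick way to poke at the generations and thresholds mentioned above:

```python
import gc

print(gc.get_threshold())  # e.g. (700, 10, 10) - collection thresholds per generation
print(gc.get_count())      # current allocation counters for generations 0, 1, 2
print(gc.get_stats())      # per-generation statistics: collections, collected, uncollectable
gc.disable()               # the cyclic collector can be switched off entirely...
gc.enable()                # ...and back on; refcounting keeps working regardless
```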
---

### Memory leak in a script that ran on ProjectX prod
Moving on to the situation at hand: the code as used by @S. would normally be harmless, ie. it would consume at most `S3BotoStorage.max_memory_size` of memory. There are two things which would guarantee this:

1. During a single iteration of the loop, the `id_` reference is overwritten; no reference to the previous `id_` is visible in the code. The previous-iteration version of `id_` would have its reference counter decremented (just before entering the next iteration), which would trigger its finalization and finally deallocation. As mentioned, this is synchronous and deterministic; we can rely on it, though we should note that it's a CPython implementation detail.
2. The implementation of `tempfile.SpooledTemporaryFile` contains an `_io.BytesIO` object as its `_file` member, which in turn provides a finalizer, which releases both the object itself and all the memory it allocated for content (ie. its internal buffer). It gets more complex in the case of content overflow, where `SpooledTemporaryFile` changes its internal representation to a named file - it has to introduce an external finalizer then - but this didn't happen in our case (and it wouldn't ever happen with the default settings, see below; also, if it DID happen, there would be no such massive memory leak, actually).

Here's the `_io.BytesIO` instantiation: https://github.com/python/cpython/blob/47be7d0108b4021ede111dbd15a095c725be46b7/Lib/tempfile.py#L634 and here's its destructor (finalizer): https://github.com/python/cpython/blob/47be7d0108b4021ede111dbd15a095c725be46b7/Modules/_io/bytesio.c#L886-L900
Given that the leak happened, we can conclude that the finalizer was not called. Or, more precisely, it didn't get the chance to run due to the server running out of RAM; given sufficiently large available memory (>= 16 GB in this case, for 60k images), the finalizers would still run at the end of the loop, the memory would get reclaimed and everything would work just fine. Until one day we'd need to run the loop for 61k images, that is. (I'm assuming the images to be ~250 KB in size, based on how 16700 images were able to exhaust 4 GB of RAM.)
Aside: the fact that the `readable()` method loads the whole file and caches it (via `_get_file`) on the `S3BotoStorageFile` instance is unfortunate, but on its own it wouldn't be dangerous. It would waste some bandwidth - indeed, performing a `HEAD` request would be much better - but the memory consumption would not exceed the amount needed for the single biggest image. Also, there's a setting, namely `AWS_S3_MAX_MEMORY_SIZE`, which controls when the temporary file "rolls over", ie. changes its representation from an in-memory buffer to a named file. Unfortunately, the default value - `0` - means "no rollover, ever", which is kind of dangerous on its own.
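If we wanted to defend against this particular failure mode at the configuration level, something along these lines should do it (the 10 MB threshold is an arbitrary value I picked for illustration):

```python
# settings.py
# Make S3 file objects roll over from an in-memory buffer to a temporary
# file on disk once they exceed the threshold, instead of the default 0
# ("never roll over, keep everything in RAM").
AWS_S3_MAX_MEMORY_SIZE = 10 * 1024 * 1024  # 10 MB
```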
Now the question is where exactly the references to the `id_` objects were kept. Looking at the code, there is no such place - the `CA` list only keeps strings, which are considered atomic by the GC, ie. they cannot refer to any other objects; this means the `CA` list couldn't have kept the processed objects alive. The body of the loop never sees more than one `id_` instance, so it's also improbable to be the cause.
There is, however, one object which does see the whole set of `IDFrontData` instances: the `QuerySet`. That object, or parts of it at least, and all the objects referenced from there, have to be kept alive until the loop finishes - it only gets finalized and deallocated after the loop.
A quick look through the `QuerySet` code reveals that, indeed, there is a cache of objects returned: https://github.com/django/django/blob/227d0c7365cfd0a64d021cb9bdcf77bed2d3f170/django/db/models/query.py#L194 which, to add insult to injury, gets *fully* populated at the beginning of the loop; the `__iter__` method: https://github.com/django/django/blob/227d0c7365cfd0a64d021cb9bdcf77bed2d3f170/django/db/models/query.py#L273-L289 calls a `_fetch_all` method before returning the iterator over the `_result_cache`; `_fetch_all` in turn uses a `ModelIterable` class (by default) over `self` and calls `list` on that iterable: https://github.com/django/django/blob/227d0c7365cfd0a64d021cb9bdcf77bed2d3f170/django/db/models/query.py#L1284-L1288

The same thing happens when you call `len` on the `QuerySet`, by the way. The `_result_cache` is only cleared when the `delete` or `update` methods are called; there's no (or maybe I missed it?) public method for resetting the cache.
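A quick illustration (with `SomeModel` standing in for any model, as in the snippets below): the cache is populated by pretty much any operation that needs the full result set, and it stays attached to the queryset object:

```python
qs = SomeModel.objects.filter(...)   # lazy: no query has been run yet
len(qs)                              # triggers _fetch_all(): every row is now a model instance in memory
qs[0]; list(qs); bool(qs)            # all served from the same cache, no new queries
print(len(qs._result_cache))         # the instances stay alive as long as `qs` does
```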
To summarize, code like this:

```python
for obj in SomeModel.objects.filter(...):
    ...
```

is semantically equivalent to something like this:

```python
objs = list(SomeModel.objects.filter(...))
for obj in iter(objs):
    ...
del objs
```
So this is where the references were kept, which explains the memory leak: every `obj` is still reachable via the list, so they cannot be finalized and garbage collected. Normally it's not a big deal, and this is a sane performance optimization - the real problem comes from the model instances being *mutable*: whatever you do to the instance (or to objects referred to by the instance), even without calling `save()` on it, the changes are kept in the list until the whole list is collected.
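A sketch of why the mutability matters (the attribute and helper here are hypothetical): anything you hang off an instance stays in memory, because the instance itself stays in `_result_cache`:

```python
qs = SomeModel.objects.filter(...)
for obj in qs:
    # hypothetical: attach something big to the instance while processing it
    obj.decoded_image = load_image_somehow(obj)   # never saved to the DB...
# ...yet every instance - and everything attached to it - is still alive here,
# referenced by qs._result_cache, until `qs` itself is garbage collected.
```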
1. Looking more closely at the `QuerySet` code we see that the `_fetch_all` method uses something called `_iterable_class(self)` which returns a *generator* - hence the cast to `list` when assigning `_result_cache` in `_fetch_all`. There is another public method which exposes that generator - `iterator()`: https://github.com/django/django/blob/227d0c7365cfd0a64d021cb9bdcf77bed2d3f170/django/db/models/query.py#L357-L368

   That generator returns results one by one, without saving them anywhere - so, in this case, code like this:

   ```python
   for obj in SomeModel.objects.filter(...).iterator():
       ...
   ```

   would yield the expected results: each object would only be referred to from the body of the loop and it would get finalized and deallocated immediately before entering the next iteration.
2. It's of course possible to simply close the file, which would also clear the buffer and release the memory:

   ```python
   for id_ in IDFrontData.objects.filter(user_selected_state="CA"):
       if id_.source_image.file.readable():
           id_.source_image.file.close()
           CA.append(id_.source_image.url)
   ```

   That would however require us to know in advance that the `readable` method does what it does (ie. reads the whole file and caches it), which may or may not be explained in the docs (I didn't check, but I'd guess it isn't).
3. We could create a (deep) copy of each object before processing it in the body of the loop. It would waste some memory and time, but it would guarantee that nothing we do to the instance is saved in the original list:

   ```python
   import copy

   for id_ in IDFrontData.objects.filter(user_selected_state="CA"):
       id_copy = copy.deepcopy(id_)
       if id_copy.source_image.file.readable():
           CA.append(id_copy.source_image.url)
   ```

   Here the `id_copy` would get finalized and deallocated just before the next copy is created.
4. For fun: we could make use of the mutability of the list and "clean up" after each step:

   ```python
   qs = SomeModel.objects.filter(...)
   for i, obj in enumerate(qs):
       ...
       qs._result_cache[i] = None
   ```

   This assumes the `iter()` call doesn't copy the original list, in other words that `qs._result_cache` is exactly the same list we iterate over.
There are many more ways to handle the situation - once you know what happened it's easy to find a good workaround. The "once you know" part, though, is the problem, which leads to the final point:
### Further Recommendations

I've been thinking about the reliability of our products for a year now, but I just realized that I've never shared my thoughts on the matter, other than in private chats. I'll use this opportunity to summarize the most obvious ones.
1. The biggest problem (aside from running a script, any script, on a production server) in this situation was the fact that the exact failure reason was only discovered after many hours. It should have been discovered immediately: the RAM usage is being monitored in AWS and the exact problem is clearly visible on the graphs there. The reason for the delay is that it required someone to look at that graph: this is not reliable (humans are unreliable in general). Instead, once the RAM usage exceeds a sane value, that fact should be automatically reported everywhere, immediately. By "everywhere" I mean: sent to all Slack channels related to ProjectX and devops, sent on Slack in a private message to the person/people responsible for ProjectX, sent by mail to every dev in TS and to the ProjectX product owner, sent by SMS to devops and the devs responsible for ProjectX, displayed on a status page (we should have one, by the way) for the product, sent by FedEx and carrier pigeon to just about everyone...
...I mean, come on guys, the *production died*. Twice! This is not supposed to happen, and when it does, it's a RED ALERT (or it should be) across *the whole organization*. The details of a failure cannot be locked inside private conversations and an AWS console that no one ever opens. The more people get involved, and the more information they get from the get-go, the faster the situation will get resolved.
2. There should never be a reason for anyone to restart the server manually. Once the server gets unresponsive it should immediately be isolated, a fresh one should be created and the traffic should be rerouted there. The faulty server itself *should not be restarted*: restarting it could irretrievably destroy important clues needed for resolving the situation, which would mean that we wouldn't be able to fix the issue. The best we could do in that case would be to "wait for it to happen again". Which is wrong on many levels - I hope I don't need to explain why.
Doing this on Kubernetes is far easier, but it's perfectly doable with EC2 instances, too. Also, obviously, we should implement some limits in this logic to make sure it doesn't fall into an *infinite* loop. Another thing: the logic for handling failures like this *is also code*, and *untested code is broken by default* (if it isn't, that's just a lucky coincidence). This means that we have to do a full dry-run after implementing it - we don't want to test the solution for the first time during a *real* outage, because then we'd almost surely need to debug and fix both the server and the recovery logic at the same time. Hardly an appealing prospect.
3. The operating system should never allow a single process to eat all the RAM and get away with it without getting killed. In recent Linux kernels there is support for per-cgroup and per-process memory limits. It would not protect against a fork bomb (try running `:(){ :|:& };:` in Bash), but in this case the memory was all inside a single process. That process should have been brutally murdered long before bringing the whole server down. This is a bit contradictory with the previous point - some important clues for resolving the issue might have been lost with the process death - but it's a trade-off: ease of debugging vs. guaranteeing uptime. We're talking about production here, so (esp. in the absence of higher-level disaster recovery) it wouldn't hurt to err on the uptime side.
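   As a sketch of the per-process variant (the 2 GiB number is arbitrary): a script run on a server could cap its own address space up front, so a leak like ours kills the script with a `MemoryError` instead of taking the whole box down. The cgroup equivalent would be configured on the systemd/cgroups side rather than in Python.

   ```python
   import resource

   # Cap this process's virtual address space at 2 GiB (soft and hard limit).
   # Allocations beyond the limit fail, so a leaking loop raises MemoryError
   # long before the machine starts swapping or the OOM killer wakes up.
   limit = 2 * 1024 ** 3
   resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
   ```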
4. There should never be a need to run any script on a production server. On the other hand, it *cannot be impossible to run one-off tasks* in the context of the production deployment. The best way to achieve both goals would be to provide an on-demand copy of the server or of the whole environment. This would require some tweaks in the code and would be much easier on Kubernetes than on EC2, but it's possible on the latter too. Spinning up such an instance should be easy, ideally needing a single click or a single command - we need to disincentivize touching production as much as possible, to the point where *everything* you'd want to do on production is *easier done* outside of it - otherwise situations like the current one *will* happen, it's just a matter of time.
5. We should create a "disaster checklist" - a document outlining the exact steps to be taken by anyone related to the outage (and in case of production, again, we're *all* related). There should never be a need to think about what to do next: who to call, what data to gather, when to restart, when not to restart, what to tell the users, when to start panicking, and so on. People are unreliable even in the best of circumstances; in a stressful situation, possibly late at night, most likely while being dragged away from other tasks, that unreliability can reach astonishing levels. A checklist, while on its own it doesn't guarantee the swift resolution of a problem, helps people act sanely in all kinds of circumstances. The document should be public and should include sections for developers, devops, and managers alike.
6. People are unreliable in general, but especially so in situations they encounter for the first time. It's virtually impossible, even with the instructions in hand, to follow a process (any process) correctly on the first try. This is a truism: talk with any soldier (we've got one former soldier IIRC), or a diver, or a climber, and you'll hear exactly this. This means that, if we want to be good at handling outages, the only thing we can do is to handle outages many times. But we don't want that - ideally we'd like to see no outages at all. There's an obvious way around this: we need to *exercise* using servers which *can* die; if we don't have such servers, then we just need to create them. We can easily create a copy of production environment, bring it down, and perform a dry run of our recovery procedures. Ideally we would do this periodically and update the exercise content after every real problem we encounter.
---

Again, like with the GC above, these are the basics - a short summary, even - and there are many finer points to discuss; again, Ask Me Anything, I'll do my best to explain everything.
---

I think one important - THE most important - question regarding our reliability is this: *do we want to take reliability seriously*? Because as of right now - we don't; if we did, we'd look around and actively search for and implement solutions used successfully at Boeing, in the armed forces, on satellites and SpaceX rockets - basically everywhere where "mission critical" means "people die if we fsck up". The solutions are out there, described in detail, ready to implement - if we really wanted to, if we really cared about reliability.