Musings after detecting and fixing a memory leak in Python

Is the reliability of your software important?

The technical side of the ProjectX fsck-up from last week - that the script should've never been run on prod in the first place is indisputable and there's little need to repeat it yet again - is closely related to automatic resource management in Python. It works pretty well most of the time, until one day it doesn't and all hell breaks loose.

It was many years ago, but I fought a few battles against the Python garbage collector; I emerged - barely - victorious, though the losses on my side were many. I have since forgotten most of the lessons learned, and even if I hadn't, they would be close to useless today, as there have been at least two major GC overhauls since then. I thought it wouldn't hurt to get back on top of the game.

Furthermore, S. said that he doesn't know the details of the garbage collection scheme in Python. I had a feeling he's not the only one - I mean, I was like that until yesterday, too - and so I thought it would benefit all of our programming efforts to learn the ropes. Hence, I decided to prepare a short (I'll pretend I don't hear you laughing...) write-up on the GC and how it relates to the problem at hand.

### CPython Memory Management Scheme

1. Python is a managed language, like Java, JS, PHP, Ruby, Go, Perl, Tcl, Lisp, Smalltalk, and so on (and on, and on - this memory management scheme seems to have "won" in the '90s), and unlike C, C++, D, Pascal/Delphi, Cyclone, Fortran, Ada, Rust, Forth, Asm, and so on.

- this means that, most of the time, there is no need to explicitly allocate or deallocate memory

- this makes Python a *memory-safe language*: certain kinds of bugs (use after free, double free, out of bounds read, buffer overflow) cannot happen in pure Python code

2. Since CPython 1.6, Python has used two complementary automatic memory management techniques (only the first one prior to that):

- reference counting: each object has a counter, which gets incremented each time the object is referenced from a different context (another stack frame, another name in the same scope, the inside of another container) and decremented once that reference stops being reachable (by being deleted, overwritten, going out of scope, being removed from a container, etc.). Once the counter drops to 0, the object is deallocated: *immediately, synchronously, and predictably*. This is a *CPython implementation detail* (ie. it may or may not be true in PyPy, Jython, IronPython, etc.), but there's little chance it'll ever change, so it's ok to rely on it (unless you need to use another Python implementation).

- cycle detection: there's a special (doubly-linked) list in the interpreter state which contains all "tracked" objects, that is, the ones allocated with PyObject_GC_New instead of PyObject_New. Most mutable objects, including class objects and dicts, are created this way. Every time PyObject_GC_New is called there's a *chance* (depending on how far allocations have outpaced deallocations since the last collection) that the garbage collector will run. When that happens, the list is traversed and cycles - subgraphs of the object graph forming a closed loop, called "cyclic isolates" in the PEP - are found; the GC then traverses each cycle and successively decrements the refcounts ("breaks down the loop") until they reach 0; then the objects are deallocated. From the point of view of pure Python code, this happens asynchronously.

The two approaches are complementary: every object in CPython is refcounted, which means the GC-aware objects are refcounted, too. Depending on the interpreter compile-time flags and/or runtime configuration the GC can be completely disabled; then the PyObject_GC_New is simply equivalent to PyObject_New.
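
To make this more tangible, here's a minimal sketch (plain CPython 3.4+, names invented for the example) showing both mechanisms side by side: the lone object is finalized synchronously the moment its last reference disappears, while the cyclic pair has to wait for the collector:

```python
import gc

class Reporter:
    def __init__(self, name):
        self.name = name

    def __del__(self):
        print("finalizing", self.name)

# Reference counting: the last reference disappears and the object is
# finalized immediately and synchronously (a CPython implementation detail).
solo = Reporter("solo")
del solo                 # prints "finalizing solo" right here

# A reference cycle: the refcounts never drop to zero on their own,
# so only the cyclic garbage collector can reclaim the pair.
a, b = Reporter("a"), Reporter("b")
a.partner, b.partner = b, a
del a, b                 # prints nothing yet
gc.collect()             # now prints "finalizing a" and "finalizing b"
```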

3. Memory deallocation is a specific case of a more general process called "object finalization" (also "object destruction" in C++ and related languages). In Python, objects are always finalized just before deallocation - in case of cycles, all of them are finalized before the first one is deallocated. It is possible to provide a custom finalization procedure for objects by defining a __del__ magic method on a class. The finalizers are not restricted in what they can do, so it's possible for a finalizer to _resurrect_ an object by creating a fresh reference to itself somewhere.

Prior to Python 3.4 the __del__ methods *were never called in case the object was a part of a cycle*: the cyclic garbage collector refused to finalize and deallocate objects with a __del__ method defined. *This is no longer the case*, thanks to [PEP-442](https://www.python.org/dev/peps/pep-0442/): the custom finalizer will be called even if an object is a part of a cycle. It's a very welcome change, BTW, which makes __del__ methods much safer to use than in the past.
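
A toy illustration of resurrection (class and list names made up): the finalizer stores a new reference to the dying object, so it outlives its own "death". If I read PEP-442 correctly, the finalizer also won't run a second time when the resurrected object is eventually dropped for good.

```python
graveyard = []

class Phoenix:
    def __del__(self):
        # Create a fresh reference to self somewhere - the object survives
        # its own finalization and remains fully usable afterwards.
        graveyard.append(self)

p = Phoenix()
del p              # __del__ runs...
print(graveyard)   # ...but the instance is alive again, reachable via the list
```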

4. One important use case for object finalization is releasing resources - graphical contexts (GDK/Cairo), external event loops, loaded libraries, mutexes, child processes, threads, sockets, files and so on. Many built-in objects provide a finalizer, either as a __del__ method or (equivalently) as a C function stored in the tp_finalize field of the PyTypeObject struct (tp_del prior to 3.4).

5. *Important note*: exceptions raised in finalizers are not propagated, and are silently swallowed. This is because there is no guarantee that, at the point in time the finalizer is called, a context where the exception should be raised still exists; an object cycle created inside a function call may get collected after the call already finished, for example.
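
Easy to check for yourself - the exception from the finalizer below is merely reported on stderr ("Exception ignored in: ...") and never propagates to the caller:

```python
class Grumpy:
    def __del__(self):
        raise RuntimeError("boom")

g = Grumpy()
del g                   # CPython prints "Exception ignored in: ..." to stderr
print("still running")  # ...and execution continues as if nothing happened
```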

6. The garbage collection is guaranteed to happen just before the interpreter state is deleted, which always happens before the Python process exits. Thus, since 3.4 at least, it's safe to assume the finalizers, including custom ones, *will* be run at some point.

7. I'm not sure when exactly, but in recent Python versions the memory allocator was improved to be able to return the freed memory to the operating system. In Python 2.x, once the memory was requested from the OS, it was never given back - this could, in certain scenarios involving many long-running processes, lead to resource exhaustion even if the memory was not being actively used. This is *no longer the case*, which is another very welcome change.

---

These are the basics - the details are a bit more complicated due to various optimizations (how often the GC is called), the objects being split into 3 generations (the GC in CPython is a "generational" one), the GC maintaining separate lists for objects of certain built-in types and so on, but in practice they don't matter much, unless you're working with a particularly twisted allocation profile. If you want to know more - just ask, I'll either answer or at least point in the general direction of the answer.
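
For the curious, the gc module exposes most of those knobs, so you can inspect (and tweak) them from a REPL; a few read-only pokes:

```python
import gc

print(gc.isenabled())      # the cyclic collector can be toggled with gc.enable()/gc.disable()
print(gc.get_threshold())  # (700, 10, 10) by default - per-generation collection triggers
print(gc.get_count())      # current allocation counters for the three generations
print(gc.get_stats())      # per-generation stats: collections, collected, uncollectable
```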

---

### Memory leak in a script that ran on ProjectX prod

Moving on to the situation at hand: the code as used by @S. would normally be harmless, ie. it would consume at most S3BotoStorage.max_memory_size of memory. There are two things which would guarantee this:

1. During a single iteration of the loop, the id_ reference is overwritten; no reference to the previous id_ is visible in the code. The previous-iteration version of id_ would have its reference counter decremented (just before entering the next iteration), which would trigger its finalization and finally deallocation. As mentioned, this is synchronous and deterministic; we can rely on it, though we should note that it's a CPython implementation detail.

2. The implementation of tempfile.SpooledTemporaryFile contains a _io.BytesIO object as its _file member, which in turn provides a finalizer, which releases both the object itself and all the memory it allocated for content (ie. its internal buffer). It gets more complex in the case of content overflow, where SpooledTemporaryFile changes its internal representation to a named file - it has to introduce an external finalizer then - but this didn't happen in our case (and it wouldn't ever happen with the default settings, see below; also, if it DID happen, there would be no such massive memory leak, actually).

Here's the _io.BytesIO instantiation: https://github.com/python/cpython/blob/47be7d0108b4021ede111dbd15a095c725be46b7/Lib/tempfile.py#L634

and here's its destructor (finalizer): https://github.com/python/cpython/blob/47be7d0108b4021ede111dbd15a095c725be46b7/Modules/_io/bytesio.c#L886-L900

Given that the leak happened, we can conclude that the finalizer was not called. Or, more precisely, it didn't get the chance to run before the server ran out of RAM; given sufficiently large available memory (>= 16 GB in this case, for 60k images), the finalizers would still run at the end of the loop, the memory would get reclaimed and everything would work just fine. Until one day we'd need to run the loop for 61k images, that is. (I'm assuming the images to be ~250 KB in size, based on how 16700 images were able to exhaust 4 GB of RAM.)

Aside: the fact that the readable() method loads the whole file and caches it (via _get_file) on the S3BotoStorageFile instance is unfortunate, but on its own it wouldn't be dangerous. It would waste some bandwidth - indeed, performing a HEAD request would be much better - but the memory consumption would never exceed the amount needed for the single biggest image. Also, there's a setting, namely AWS_S3_MAX_MEMORY_SIZE, which controls when the temporary file "rolls over", ie. changes its representation from an in-memory buffer to a named file. Unfortunately, the default value - 0 - means "no rollover, ever", which is kind of dangerous on its own.
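
A quick way to see the rollover behaviour for yourself (note: _rolled is a private attribute, peeked at here purely for illustration):

```python
from tempfile import SpooledTemporaryFile

# With a positive max_size the in-memory buffer rolls over to a real
# temporary file on disk once the content grows past the limit...
small = SpooledTemporaryFile(max_size=1024)
small.write(b"x" * 2048)
print(small._rolled)       # True - backed by a named file now

# ...while max_size=0 (tempfile's default, and - as far as I can tell -
# what AWS_S3_MAX_MEMORY_SIZE=0 ends up passing down) means "never roll
# over": everything stays in the in-memory buffer, no matter how big.
unbounded = SpooledTemporaryFile(max_size=0)
unbounded.write(b"x" * 2048)
print(unbounded._rolled)   # False
```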

Now the question is where exactly the references to the id_ objects were kept. Looking at the code, there is no such place - the CA list only keeps strings, which are considered atomic by the GC, ie. they cannot refer to any other objects; this means the CA list couldn't have kept the processed objects alive. The body of the loop never sees more than one id_ instance at a time, so it's also unlikely to be the cause.

There is, however, one object which does see the whole set of IDFrontData instances: the QuerySet. That object, or parts of it at least, and all the objects referenced from there, have to be kept alive until the loop finishes - it only gets finalized and deallocated after the loop.

A quick look through the QuerySet code reveals that, indeed, there is a cache of objects returned:

https://github.com/django/django/blob/227d0c7365cfd0a64d021cb9bdcf77bed2d3f170/django/db/models/query.py#L194

which, to add insult to injury, gets *fully* populated at the beginning of the loop; the __iter__ method: https://github.com/django/django/blob/227d0c7365cfd0a64d021cb9bdcf77bed2d3f170/django/db/models/query.py#L273-L289

calls the _fetch_all method before returning an iterator over the _result_cache; _fetch_all in turn uses a ModelIterable class (by default) over self and calls list on that iterable: https://github.com/django/django/blob/227d0c7365cfd0a64d021cb9bdcf77bed2d3f170/django/db/models/query.py#L1284-L1288

The same thing happens when you call len on the QuerySet, by the way. The _result_cache is only cleared when the delete or update methods are called; there's no (or maybe I missed it?) public method for resetting the cache.

To summarize, code like this:

for obj in SomeModel.objects.filter(...):
    ...

is semantically equivalent to something like this:

objs = list(SomeModel.objects.filter(...))
for obj in iter(objs):
    ...
del objs

So this is where the references were kept, which explains the memory leak: every obj is still reachable via the list, so none of them can be finalized and garbage collected. Normally it's not a big deal - this is a sane performance optimization - the real problem comes from the model instances being *mutable*: whatever you do to an instance (or to objects referred to by the instance), even without calling save() on it, is kept in the list until the whole list is collected.
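
To illustrate (SomeModel and some_field are placeholders): mutate an instance obtained from the queryset, iterate again, and the mutated object comes straight back from the cache:

```python
qs = SomeModel.objects.filter(...)

first = next(iter(qs))        # iterating fills qs._result_cache with *all* rows
first.some_field = "changed"  # never save()d, yet it lives on in the cache

same = list(qs)[0]            # re-iterating serves objects straight from the cache
assert same is first                 # the very same Python object...
assert same.some_field == "changed"  # ...mutations included, until qs itself dies
```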

### Possible Fixes

1. Looking more closely at the QuerySet code we see that the _fetch_all method uses something called _iterable_class(self), which returns a *generator* - hence the cast to list when assigning _result_cache in _fetch_all. There is another public method which exposes that generator - iterator(): https://github.com/django/django/blob/227d0c7365cfd0a64d021cb9bdcf77bed2d3f170/django/db/models/query.py#L357-L368

That generator returns results one by one, without saving them anywhere - so, in this case, code like this:

for obj in SomeModel.objects.filter(...).iterator():
    ...

would yield the expected results: each object would be referred to only from the body of the loop, and it would get finalized and deallocated just before entering the next iteration.

2. It's of course possible to simply close the file, which would also clear the buffer and release the memory:

for id_ in IDFrontData.objects.filter(user_selected_state="CA"):
    if id_.source_image.file.readable():
        id_.source_image.file.close()
        CA.append(id_.source_image.url)

That would however require us to know in advance that the readable method does what it does (ie. reads the whole file and caches it), which may or may not be explained in the docs (I didn't check, but I'd guess it isn't).

3. We could create a (deep) copy of each object before processing it in the body of the loop. It would waste some memory and time, but it would guarantee that nothing we do to the instance is saved in the original list:

import copy

for id_ in IDFrontData.objects.filter(user_selected_state="CA"):
    id_copy = copy.deepcopy(id_)
    if id_copy.source_image.file.readable():
        CA.append(id_copy.source_image.url)

Here the id_copy would get finalized and deallocated just before the next copy is created.

4. For fun: we could make use of the mutability of the list and "clean up" after each step:

qs = SomeModel.objects.filter(...)
for i, obj in enumerate(qs):
    ...
    qs._result_cache[i] = None

This assumes the iter() call doesn't copy the original list, in other words that qs._result_cache is exactly the same list we iterate over - which, judging by the __iter__ implementation linked above, it is.

---

There are many more ways to handle the situation - once you know what happened it's easy to find a good workaround. The "once you know" part, though, is the problem, which leads to the final point:

### Further Recommendations

I've been thinking about the reliability of our products for a year now, but I just realized that I've never shared my thoughts on the matter, other than in private chats. I'll use this opportunity to summarize the most obvious ones.

1. The biggest problem (aside from running a script, any script, on a production server) in this situation was the fact that the exact failure reason was only discovered after many hours. It should have been discovered immediately: the RAM usage is being monitored in AWS and the exact problem is clearly visible on the graphs there. The reason for the delay is that it required someone to look at that graph: this is not reliable (humans are unreliable in general). Instead, once the RAM usage exceeds a sane value, that fact should be automatically reported everywhere, immediately. By "everywhere" I mean: sent to all Slack channels related to ProjectX and devops, sent on Slack in a private message to the person/people responsible for ProjectX, sent by mail to every dev in TS and to the ProjectX product owner, sent by SMS to devops and the devs responsible for ProjectX, displayed on a status page (we should have one, by the way) for the product, sent by FedEx and carrier pigeon to just about everyone...

...I mean, come on guys, the *production died*. Twice! This is not supposed to happen, and when it does, it's a RED ALERT (or it should be) across *the whole organization*. The details of a failure cannot stay locked inside private conversations and an AWS console that no one ever opens. The more people get involved, and the more information they get from the get-go, the faster the situation will get resolved.

2. There should never be a reason for anyone to restart the server manually. Once the server gets unresponsive it should immediately be isolated, a fresh one should be created and the traffic should be rerouted there. The faulty server itself *should not be restarted*: otherwise important clues needed for resolving the situation could be irretrievably lost, which would mean we wouldn't be able to fix the issue. The best we could do in that case would be to "wait for it to happen again". Which is wrong on many levels - I hope I don't need to explain why.

Doing this on Kubernetes is far easier, but it's perfectly doable with EC2 instances, too. Also, obviously, we should implement some limits in this logic to make sure it doesn't fall into an *infinite* loop. Another thing: the logic for handling failures like this *is also code*, and *untested code is broken by default* (if it isn't, that's just a lucky coincidence). This means that we have to do a full dry run after implementing it - we don't want to test the solution for the first time during a *real* outage, because then we'd almost surely need to debug and fix both the server and the recovery logic at the same time. Hardly an appealing prospect.

3. The operating system should never allow a single process to eat all the RAM and get away with it without getting killed. Recent Linux kernels support both per-group (cgroup) and per-process memory limits. That would not protect against a fork bomb (try running :(){ :|:& };: in Bash), but in this case the memory was all inside a single process. That process should have been brutally murdered long before it brought the whole server down. This is a bit contradictory with the previous point - some important clues for resolving the issue might be lost along with the process - but it's a trade-off: ease of debugging vs. guaranteed uptime. We're talking about production here, so (especially in the absence of higher-level disaster recovery) it wouldn't hurt to err on the side of uptime.
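
As a low-tech complement to cgroups, a process (or the wrapper that launches a one-off script) can also cap itself via setrlimit; a sketch, with an arbitrarily chosen limit:

```python
import resource

# Cap this process' address space at ~2 GiB (arbitrary figure); allocations
# beyond that fail, which in Python surfaces as a MemoryError long before
# the OOM killer - or a frozen server - enters the picture.
limit_bytes = 2 * 1024 ** 3
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
```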

4. There should never be a need to run any script on a production server. On the other hand, it *cannot be impossible to run one-off tasks* in the context of the production deployment. The best way to achieve both goals would be to provide an on-demand copy of the server or of the whole environment. This would require some tweaks in the code and would be much easier on Kubernetes than on EC2, but it's possible on the latter, too. Spinning up such an instance should be easy, ideally needing a single click or a single command - we need to disincentivize touching production as much as possible, to the point where *everything* you'd want to do on production is *easier done* outside of it - otherwise situations like the current one *will* happen again, it's just a matter of time.

5. We should create a "disaster checklist" - a document outlining the exact steps to be taken by everyone related to the outage (and in case of production, again, we're *all* related). There should never be a need to think about what to do next: who to call, what data to gather, when to restart, when not to restart, what to tell the users, when to start panicking and so on. People are unreliable even in the best of circumstances; in a stressful situation, possibly late at night, most likely while being dragged away from other tasks, that unreliability can reach astonishing levels. A checklist, while it doesn't on its own guarantee a swift resolution of the problem, helps people act sanely in all kinds of circumstances. The document should be public and should include sections for developers, devops, and managers alike.

6. People are unreliable in general, but especially so in situations they encounter for the first time. It's virtually impossible, even with the instructions in hand, to follow a process (any process) correctly on the first try. This is a truism: talk with any soldier (we've got one former soldier IIRC), or a diver, or a climber, and you'll hear exactly this. This means that, if we want to be good at handling outages, the only thing we can do is to handle outages many times. But we don't want that - ideally we'd like to see no outages at all. There's an obvious way around this: we need to *exercise* using servers which *can* die; if we don't have such servers, then we just need to create them. We can easily create a copy of production environment, bring it down, and perform a dry run of our recovery procedures. Ideally we would do this periodically and update the exercise content after every real problem we encounter.

---

Again, like with the GC above, these are the basics - a short summary, even - and there are many finer points to discuss; again, Ask Me Anything, I'll do my best to explain everything.

---

I think one important - THE most important - question regarding our reliability is this: *do we want to take reliability seriously*? Because as of right now - we don't; if we did, we'd look around and actively search for and implement solutions used successfully at Boeing, in the armed forces, on satellites and SpaceX rockets - basically everywhere where "mission critical" means "people die if we fsck up". The solutions are out there, described in detail, ready to implement - if we really wanted to, if we really cared about reliability.
