Expensive Explicit Deallocation: An Example

Here is an example in which explicit deallocation appears to be necessarily much more expensive in execution time than garbage collection. The scenario:
  1. We are implementing a cache used for reading a random-access file of bytes.
  2. We assume that the file accesses will exhibit a high degree of locality, i.e. it is very likely that the next character requested will be near a character that was recently requested. In particular, we assume that the total time required for all reads that hit the cache is at least on the same order as the total time required to process cache misses.
  3. The client of this cache is multithreaded.
  4. Acquiring and releasing a lock is expensive relative to a memory access.
  5. Writing a single pointer is an indivisible operation.

Note that this is a typical scenario encountered by CORD_from_file_lazy in the cord package distributed with our garbage collector. Similar observations are likely to apply to any hash-table-like data structure that must support fast read accesses by multiple threads.

We will cache a fixed number of file sections. A read operation from an uncached part of the file will usually replace one of these file sections.

Our goal is to reduce the cost of a read from a cached section of the file to a few memory operations. By analogy to a direct mapped CPU cache, one way to accomplish this is to make the section lengths fixed (corresponding to cache lines), and to use the low order bits of the file position to index into the section. The next higher group of file position bits is used to index into a table T, each entry of which is a pointer to a section. Each section also contains its full starting address in the file, so that we can validate that the appropriate file section is actually cached.

In the garbage collected version, this all works fine. A cache miss is handled by allocating a new section, reading the appropriate section from the file, filling in the newly allocated section, and then atomically replacing the entry in T. This works correctly even in the presence of another concurrent access. The second access will see either the old or the new section. Either will provide it with accurate (though perhaps useless) information. Note that if the second access retrieves the old section, the old section will not be deallocated until the second access completes, since it remains accessible. Thus read operations that hit the cache require no locking.

In the explicitly managed version, a cache section being replaced must be deallocated. This requires that there must be no concurrent read operations proceeding on the old section. There appears to be no way to ensure that without requiring every read operation, including cache hits, to acquire and release a lock. With a typical thread library, this is likely to increase the time required for a cache hit by something between a factor of 2 and 10, or possibly more. It is not hard to conceive of applications for which this is the dominant cost. The text editor supplied with the cord package is probably often in this category.

(To be fair, the explicitly managed version might well allocate all sections statically, since it needs to lock on read accesses anyway. This saves some allocation and collection cost relative to the garbage collected version. But that is likely to be dominated by IO costs, and only affects the cache miss time, not the cache hit time. The cache hit time will still be much lower in the garbage collected version.)

This is a revised copy of a page written while the author was at Xerox PARC. The original is here.