On the Powers and Dangers of Caching

Operating systems, the internet, video games, social media: none of them would exist without significant hardware and software infrastructure devoted to caching. The same is true for us: We make enormous demands on a host of distributed caching solutions, some open source, some commercial, all carefully optimized for a particular workload.

Our front line of defense is our content distribution system, which replicates statically cacheable content out to network edges for speedy delivery. Automated tasks in our system determine which pieces of content are globally usable and push them out to the edges. Our JavaScript infrastructure does the same in browsers, pulling in data from the correct places at the right times to get optimal performance. This allows us to handle enormous traffic spikes.

A second line of defense is our web-tier content cache, which serves content to authenticated/connected users. This is a large but, all in all, straightforward caching system.

More interesting is our query result cache system, a multi-level distributed caching system that lazily replicates cache data from centralized memcached servers to in-memory caches on our web tiers. This enables us to handle cache system failures gracefully, so that cache restarts do not result in massive load spikes on our database clusters. Indeed, this is one of the main dangers of caching: When caching servers go down, rewarming those caches can drive back-breaking load on back-end systems, and so managing cache hydration and persistence is a key task for systems with 24×7 uptime.
Also interesting is our approach to cache locality: Since modern databases can serve single-row queries on clustered-index data incredibly quickly, improving object fetch performance is best done by avoiding network round trips, rather than by reducing query overhead. Therefore, we store query results in remote caches but never objects; objects, if they are cached at all, are always cached on the web tiers.
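
The multi-level lookup described above can be sketched roughly as follows. This is a minimal illustration, not our production code: the names CacheTier, MapTier and TwoLevelQueryCache are hypothetical, the "remote" tier is an in-process map standing in for a memcached client, and the database is reduced to a function from query text to result.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical abstraction over one cache tier (web-tier memory or remote).
interface CacheTier {
    Optional<String> get(String key);
    void put(String key, String value);
}

// In-process tier backed by a concurrent map. In this sketch it also stands in
// for the remote memcached tier, which would really sit behind a network client.
class MapTier implements CacheTier {
    private final Map<String, String> entries = new ConcurrentHashMap<>();

    @Override
    public Optional<String> get(String key) {
        return Optional.ofNullable(entries.get(key));
    }

    @Override
    public void put(String key, String value) {
        entries.put(key, value);
    }
}

// Checks the local tier, then the remote tier, then the database.
// Results found further away are copied into the nearer tiers on read.
class TwoLevelQueryCache {
    private final CacheTier local;
    private final CacheTier remote;
    // Hypothetical query executor; assumed to return a non-null result.
    private final Function<String, String> database;

    TwoLevelQueryCache(CacheTier local, CacheTier remote, Function<String, String> database) {
        this.local = local;
        this.remote = remote;
        this.database = database;
    }

    String fetch(String query) {
        Optional<String> cached = local.get(query);
        if (cached.isPresent()) {
            return cached.get();
        }
        cached = remote.get(query);
        if (cached.isPresent()) {
            local.put(query, cached.get()); // lazy replication to the web tier
            return cached.get();
        }
        String result = database.apply(query);
        remote.put(query, result);
        local.put(query, result);
        return result;
    }
}
```

In this sketch, a query that is already warm on the web tier never reaches the remote tier or the database, so an emptied remote cache does not immediately translate into database load.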

Using these approaches, as well as many others, such as sophisticated ETag exchanges, we enable our infrastructure to scale to the tens of thousands of requests per second that our customer base drives at peak hours.


On SOLID Code: D

To close out the discussion on SOLID, the D in SOLID stands for “Dependency Inversion” and is currently a bit of a darling in the industry, with significant efforts from many communities and companies to provide infrastructure that allows developers to implement this practice.

Now, dependency inversion does not require an inversion of control container, nor does it even imply that one must use dependency injection. These patterns are natural outcomes of this principle, which states that components should not depend upon concrete implementations, but on abstractions. More concretely, it often means that components should not depend on classes, but on interfaces.

The most powerful example of how this principle can help companies scale is in unit test suites: At a company where I worked previously, which built one of the world’s biggest applications, running regression tests on even a tiny part of the system could take days, and entire labs and teams were responsible for test execution and reporting. Startups simply cannot afford this kind of overhead (arguably, these days, no one can). The requirement for tests to run fast is a matter of survival. So consider the example of a startup, such as Sociable Labs, that is hosted in the cloud (Amazon, in our case) and makes use of many cloud services – key stores, binary storage systems, databases, distributed queues, Facebook’s Open Graph, the Twitter API, and much more. Now imagine running a full regression suite that runs thousands of tests against all of these systems. It would, and does, take hours to days, and new tests are constantly being added.

In order to regression test our own systems independently of all of these services, we provide mock implementations of all of them: We have mock Facebook implementations, mock S3 implementations, mocks of the database repositories and of our configuration server. In order to tell our code to use these mocks, the code must not rely on any concrete implementations of S3 or Facebook clients. It must rely on interfaces, so that we can provide one implementation at unit test time, and another during full regression passes or in production.
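
As a rough sketch of what this looks like in code, assume a hypothetical BlobStore interface standing in for a binary storage client such as S3, an in-memory mock, and a hypothetical AvatarService that depends only on the interface. None of these names come from our codebase; they are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical abstraction over a binary storage service such as S3.
interface BlobStore {
    void put(String key, byte[] data);
    byte[] get(String key); // returns null when the key is absent
}

// In-memory mock for unit tests; needs no network access.
class InMemoryBlobStore implements BlobStore {
    private final Map<String, byte[]> blobs = new HashMap<>();

    @Override
    public void put(String key, byte[] data) {
        blobs.put(key, data.clone());
    }

    @Override
    public byte[] get(String key) {
        byte[] data = blobs.get(key);
        return data == null ? null : data.clone();
    }
}

// Depends only on the BlobStore interface, so a test can pass in the mock
// and production can pass in an implementation backed by the real service.
class AvatarService {
    private final BlobStore store;

    AvatarService(BlobStore store) {
        this.store = store;
    }

    void saveAvatar(String userId, byte[] image) {
        store.put("avatars/" + userId, image);
    }

    boolean hasAvatar(String userId) {
        return store.get("avatars/" + userId) != null;
    }
}
```

A unit test would construct `new AvatarService(new InMemoryBlobStore())` and run in milliseconds, while production wiring would supply a different BlobStore without any change to AvatarService.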

There are many systems that enable this sort of configuration, and most come down to some central authority, such as a service locator or container, which is configured via some mechanism so that it knows what concrete implementation to provide for a particular interface.
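
One very small form such a central authority can take is sketched below. The ServiceLocator, TimeSource, SystemTimeSource and FixedTimeSource names are hypothetical and chosen only for illustration; real containers add scoping, lifecycle management and configuration files on top of the same idea.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Minimal service locator: maps an interface to a provider of its implementation.
class ServiceLocator {
    private final Map<Class<?>, Supplier<?>> bindings = new ConcurrentHashMap<>();

    // Registers which implementation to hand out for the given interface.
    <T> void bind(Class<T> type, Supplier<? extends T> provider) {
        bindings.put(type, provider);
    }

    // Returns the configured implementation, or fails loudly if none is bound.
    <T> T resolve(Class<T> type) {
        Supplier<?> provider = bindings.get(type);
        if (provider == null) {
            throw new IllegalStateException("No binding for " + type.getName());
        }
        return type.cast(provider.get());
    }
}

// Hypothetical abstraction with a production and a test implementation.
interface TimeSource {
    long nowMillis();
}

class SystemTimeSource implements TimeSource {
    @Override
    public long nowMillis() {
        return System.currentTimeMillis();
    }
}

class FixedTimeSource implements TimeSource {
    private final long fixedMillis;

    FixedTimeSource(long fixedMillis) {
        this.fixedMillis = fixedMillis;
    }

    @Override
    public long nowMillis() {
        return fixedMillis;
    }
}
```

Test setup might call `locator.bind(TimeSource.class, () -> new FixedTimeSource(42L))`, while production setup binds SystemTimeSource; callers only ever ask for `locator.resolve(TimeSource.class)`.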

As an interesting aside, the classic Singleton pattern can easily become a massive source of violations of this principle. We use a combination of the service locator pattern and inversion of control to give our developers a very clean, intuitive view of the system that is actually more convenient and simple than the Singleton pattern, but without that pattern’s hard dependency code smell.

 


On SOLID Code: I

The I in SOLID stands for “Interface Segregation” and is a hallmark of high quality, disciplined framework design: As system requirements change over time, there is often the temptation to short-circuit codepaths, to add just one more API to some class to make things just work. This is one of the ways in which code rot enters systems: Once the boundaries and responsibilities of systems are no longer clear, there is momentum towards tighter coupling and more brittleness in the system.

Interface segregation is something of a special case of single responsibility: An interface tells us that an object supports a certain set of operations, and interface segregation adds that this set of operations must have a maximal degree of cohesion. For example, an interface such as IObjectCache should really only have three methods: add, get, remove. It is tempting to start growing this interface, to add methods such as queryKeys, getStatistics, increment, decrement, push and pop. As we add new methods, the interface becomes harder and harder for new implementations to support, which decreases the usefulness of the interface. Breaking that interface down into a hierarchy allows classes to implement more or fewer capabilities, and clients to more clearly declare which particular set of capabilities they need. Goodness all around.
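
A small sketch of such a hierarchy follows. IObjectCache and its three methods come from the example above; ICounterCache, SimpleCache, CounterCache and PageViews are hypothetical names added for illustration, and the method signatures are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// The core contract: three cohesive operations and nothing else.
interface IObjectCache<V> {
    void add(String key, V value);
    V get(String key); // returns null when the key is absent
    void remove(String key);
}

// Hypothetical extension for caches that can also count.
interface ICounterCache extends IObjectCache<Long> {
    long increment(String key);
}

// Supports only the core contract.
class SimpleCache<V> implements IObjectCache<V> {
    private final Map<String, V> entries = new HashMap<>();

    @Override
    public void add(String key, V value) {
        entries.put(key, value);
    }

    @Override
    public V get(String key) {
        return entries.get(key);
    }

    @Override
    public void remove(String key) {
        entries.remove(key);
    }
}

// Opts in to the additional counting capability.
class CounterCache extends SimpleCache<Long> implements ICounterCache {
    @Override
    public long increment(String key) {
        Long current = get(key);
        long next = (current == null ? 0L : current) + 1L;
        add(key, next);
        return next;
    }
}

// The client declares exactly the capability it needs: counting.
class PageViews {
    private final ICounterCache counters;

    PageViews(ICounterCache counters) {
        this.counters = counters;
    }

    long recordView(String page) {
        return counters.increment("views:" + page);
    }
}
```

A new cache implementation only has to supply add, get and remove to be useful, and PageViews makes it obvious from its constructor that it cannot work with a cache that lacks counters.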


Applying Minimal Viable Product to Software Design

You’ve probably seen this before: A team of well-funded, intelligent people, many of them engineers from top-flight companies, builds a product over the course of six months to a year, maybe more, and when they release it, the industry reacts with a yawn, or not at all, or, worse, negatively. Or the product utterly fails to scale, or is unusable by customers, or is so inflexible that customers who do buy it are unhappy.
As a lean, agile, hungry startup with both existing and new customers, and with engineering, product management, and sales teams who all make significant, intelligent demands on us, we have to make difficult choices every day about where we put our efforts. And once we do make those choices, we are faced with countless design problems, mostly coming down to the question of “how?”
How should a feature behave? How should customers interact with an application? How should we build a new caching or query system?
At every turn, we ask a question: “What is the minimal viable product (MVP) approach here?” This is neither laziness nor cutting corners: Our intent is to get to insight as quickly as possible. When building a caching system, we could take six months and roll out a complex system that caches queries in a wonderfully reusable, interesting way. Or we could take a sprint, do a quick prototype spike, roll it out in a limited way, and start learning very quickly how the system behaves under load in the real world. And then quickly roll that learning back into what we do.
When building a feature, we could implement it in a robust way, with a high degree of flexibility, lots of options, perhaps throw in a dash of extensibility to cover our butts: the kind of thing Really Large Companies do when they can only ship once every two years. Instead, we build the minimal feature needed to enable us to learn what our customers, and their customers, benefit from the most: what resonates and what gets no use. Then we can quickly take those learnings and use them to guide our feature development in the best way. The MVP approach can also help get you out of the analysis paralysis phase: rather than bouncing between N solutions, none of which we are sure is the best, we simply pick the simplest solution, move quickly, get it deployed, and feed the results back into our plans.
Try this in your next design meeting.

Process + Judgement

The word “process” can raise the hackles of good developers. Images are conjured up of endless forms and quality gates, of endless documentation and, worst of all, endless meetings.
And yet… we want to know where we are. We want to know where we’re going. We want to assess our risks and we want to ship with quality, on time. We want to know what went into our last code push and whether the latest set of feature enhancements caused any performance regressions.
We think of process as simply the things we do that record the past, capture our intentions for the future, and allow us to ask meaningful questions about the present. We take a minimal and iterative approach to process: create and use the smallest amount of process that allows us to answer meaningful questions.
- By “meaningful”, we mean those questions that tell us what our risks are for any action: Schedule risk, quality risk, customer risk.
- By “iterative”, we mean that processes are constantly being tuned to better answer questions.
Elements of Agile and Scrum are key here: code reviews, stand-ups, 2-week sprints, triage, rapid releases, unit and automation test suites, daily builds, and so on. We have a tuned system that runs two sprints in parallel, and we’ve created templates and systems that give us real-time visibility into the health of our sprints. The team is committed to this process because the team helped design it, and we have people in place who take on the project-management aspects of the process in order to free the developers to code, design, review, and otherwise engage in engineering.
We also bring to bear a number of somewhat new techniques that guide how we design software. More on this in the next post.

Hiring: Striving for a 1% success rate

On our wiki there is a list of our company priorities. It’s on the front page. Among other purposes, this list helps ensure that everyone knows how their work fits into our overall roadmap and helps people prioritize what they do. The contents of this list change every month as we knock big-ticket items off of it. One item, though, always remains at the number one slot, and that is hiring the best people.
For every 200 or so candidates that apply for a position, we will call about 40. Of those 40, about 15 make it through the initial call to a technical screen. This second line of filtering is carefully tuned – we read candidates’ resumes, and craft a technical interview whose intent is to create as realistic a picture as possible of the cultural, technical, and personal fit.
We look for exceptional individuals, but we are not dogmatic about what exceptional means: We ask questions about Java, about databases, about algorithms and distributed systems. For senior hires, we ask questions about team building, management, process and execution. We look for strong communication skills, clarity of thought, passion and interests. This is difficult to do in a phone interview that takes an hour, but we’ve gotten decent at it.
Of those 15 candidates, about 3 end up coming into our office for a face-to-face interview with the team. We make every effort to ensure that the significant investment in team time that an interview loop demands has a high likelihood of success. Even so, of the three people who come in, we may hire one. An interview loop takes not only the three or so hours in the office (an hour each with three developers) but also at least two additional hours of preparation, in which the team gets together, talks about the candidate, and selects non-repeated interview questions. Everyone on the interview loop is expected to review the resume carefully, and to get together after the loop and provide detailed feedback.
An interesting advantage of this approach is that new hires have an instant level of credibility when they come in: Everyone here knows how difficult the interview loop is, so anyone else who has passed it is given a certain level of credit. This helps new people integrate into the team much more easily, and helps us bring in the kinds of people that other engineers want to work with.
We invest a lot of time in hiring, and in our teammates while they are here. We see ourselves as the equivalent of a high-performance sports team, where everyone is expected to be able to play ball, and where success is only assured by all team members working together to kick ass.

On SOLID Code: L

The L in SOLID stands for “Liskov substitution”, and if you have been coding long enough, you have probably violated it and quickly found that the violation made your system fragile and your code less reusable and less stable.

Consider the case of a class that represents a vehicle: It has methods for getting velocity, checking the fuel level, adding fuel, and so on. Now imagine that the system is expanded to support solar-powered vehicles, and that those are written as subclasses of the vehicle class (SolarPoweredVehicle : Vehicle).
And now we have a problem: SolarPoweredVehicle has inherited the methods for checking fuel level and adding fuel, but these methods are meaningless for solar vehicles.

Liskov substitution states that subclasses of a class should not violate the contract that their base class provides. As another example, consider a subclass of a Circle class whose setRadius method interprets a value passed in as a size in inches, whereas the parent class interprets it as pixels. This would prevent the subclass from being passed to code that is expecting the base class.
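
The Circle example can be made concrete with a short sketch. The class names InchCircle and Layout, the getRadiusPx accessor, and the 96 pixels-per-inch conversion are assumptions made for illustration.

```java
class Circle {
    private double radiusPx;

    // Contract: the argument is a radius in pixels.
    void setRadius(double radius) {
        this.radiusPx = radius;
    }

    double getRadiusPx() {
        return radiusPx;
    }
}

// Violates the base class contract: the same argument is read as inches.
class InchCircle extends Circle {
    static final double PIXELS_PER_INCH = 96.0; // assumed display density

    @Override
    void setRadius(double inches) {
        super.setRadius(inches * PIXELS_PER_INCH);
    }
}

class Layout {
    // Written against Circle's contract: after setRadius(r), the
    // diameter is expected to be exactly 2 * r pixels.
    static double diameterAfterResize(Circle circle, double radiusPx) {
        circle.setRadius(radiusPx);
        return 2 * circle.getRadiusPx();
    }
}
```

Calling `Layout.diameterAfterResize(new Circle(), 10)` gives 20 pixels as the caller expects, while passing an InchCircle gives 1920, even though the code compiles and the types line up. The compiler cannot catch this kind of violation; only the behavioral contract can.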

Often, violations of Liskov substitution are a sign of improper modeling, of improper specification of abstractions. We do collaborative technical design sessions to try to tease out these issues and specify, together, the right architectures, built by composing the right sets of abstractions. Then we move fast and refactor continuously to drive the architecture to implementation while maintaining maximal correctness as the design meets the difficulties presented by the real world.


On SOLID Code: O

The O in SOLID stands for “Open/Closed Principle”, and in practice encompasses a number of programming practices that enable the construction of useful, stable class hierarchies. When building a class, every part of the class should be as closed as possible for modification, and as open as needed for extension. In practice, this means that all methods should be private unless absolutely necessary. All members should be both private and final, again, unless absolutely necessary.

Classes should be closed for modification in order to enable a number of things. First and foremost, members should be private so that operations on them are handled by the root class, which can then be modified and versioned in ways that can be guaranteed to be both backwards compatible and stable. Once encapsulation is broken, base classes become both unstable and difficult to version, since there is no clear contract as to how and when their state can be modified. The most stable, versionable class has immutable state and no public members. Such classes are often not particularly useful or interesting, although they do exist (for example, one could imagine a class responsible for doing periodic healthchecks against a cluster of machines and recording the results). When opening a class up to extension, such as declaring methods protected, there should be a clear, explicit contract as to the expected behavior of extensions. When opening methods up for public or package consumption, again, there should be a clear, well-understood behavioral contract, and subclasses should not be able to alter this contract, since that would undermine the ability of those subclasses to be reliably substituted for their base class.
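
As an illustrative sketch of these rules, loosely based on the healthcheck example above, consider a base class that keeps its state private and final, marks its public methods final, and exposes a single protected extension point with a stated contract. The HealthCheck and FixedAnswerCheck names and method signatures are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

abstract class HealthCheck {
    // Private and final: only the base class touches this state.
    private final String target;
    private final List<String> results = new ArrayList<>();

    protected HealthCheck(String target) {
        this.target = target;
    }

    // Closed for modification: final, so subclasses cannot change how a
    // check is run or how its outcome is recorded.
    public final boolean run() {
        boolean healthy;
        try {
            healthy = probe(target);
        } catch (RuntimeException e) {
            healthy = false; // a failing probe counts as unhealthy
        }
        results.add(target + (healthy ? " OK" : " FAILED"));
        return healthy;
    }

    // Read-only view; callers cannot modify the recorded history.
    public final List<String> results() {
        return Collections.unmodifiableList(results);
    }

    // Open for extension. Contract: return true only if the target
    // responded; must not modify any shared state.
    protected abstract boolean probe(String target);
}

// Trivial extension used for illustration and tests.
class FixedAnswerCheck extends HealthCheck {
    private final boolean answer;

    FixedAnswerCheck(String target, boolean answer) {
        super(target);
        this.answer = answer;
    }

    @Override
    protected boolean probe(String target) {
        return answer;
    }
}
```

Because run and results are final, every subclass records outcomes in the same way, and the base class can change its internal storage later without breaking any extension.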

The use of interfaces and abstract base classes where appropriate also enables the closing of classes for modification, since modifications that break the interface’s contract, or that of the abstract class, will be caught by the compiler. The intelligent use of interfaces and abstract classes is incredibly important: Successful framework design, which enables high degrees of reuse, often relies on dependencies on abstractions (interfaces, abstract classes) rather than concretions. We put a great deal of thought into what things in the system should be interfaces, and where abstract classes can be deployed. Such tools provide wonderfully useful constraints as well, ensuring that failures to write extensions to a framework in the correct way are caught as early and quickly as possible.


On SOLID Code: S

Robert Martin, one of the fathers of the Agile movement, introduced the SOLID principles about a decade ago, and it is somewhat surprising that so few of the engineers we interview have even heard of them.

We use SOLID as a lens, a way to think about our code and to mentor new engineers, especially during code reviews.

The S in SOLID stands for “single responsibility” and, appropriately, it is often the first thing we look for: Does a class clearly declare and then abstract away a single function of the system? Classes that take on more than one responsibility result in confusing APIs and brittle behaviors, and are difficult to version and reuse. The notion of “responsibility” can be a bit difficult for people to grasp, so we start with an example of a class that exhibits an extreme code smell called “coincidental cohesion”: In this smell, a class attempts to cover multiple, unrelated functions, such as providing both logging and XML parsing services. A less extreme smell, which we see more often, is naive code that exhibits only logical cohesion, where all functions that look the same from a code perspective are grouped together. In this case, the engineer has placed all code for, say, querying the database, in one class. Such classes exhibit rapid growth, and often have circular dependencies on other classes.
We look for classes that explicitly declare their dependencies, and which have a very clear name. Beware classes with names that end in “Service” or “Manager” or “Utils” – such general names can sometimes indicate that the class is taking on more than a class should.
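
To make the logging and XML parsing example concrete, here is a minimal sketch of the alternative: each responsibility gets its own clearly named class, and a class that needs both declares them explicitly in its constructor. MessageLog, XmlDocumentParser and OrderImporter are hypothetical names; the parsing itself uses the standard javax.xml.parsers API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Responsibility 1: recording log messages.
class MessageLog {
    private final List<String> messages = new ArrayList<>();

    void info(String message) {
        messages.add("INFO " + message);
    }

    List<String> messages() {
        return new ArrayList<>(messages); // defensive copy
    }
}

// Responsibility 2: turning XML text into a DOM document.
class XmlDocumentParser {
    Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Invalid XML", e);
        }
    }
}

// Declares both dependencies explicitly instead of reaching into a
// catch-all "Utils" class that logs and parses.
class OrderImporter {
    private final XmlDocumentParser parser;
    private final MessageLog log;

    OrderImporter(XmlDocumentParser parser, MessageLog log) {
        this.parser = parser;
        this.log = log;
    }

    // Parses the XML and returns the name of its root element.
    String importRootName(String xml) {
        String root = parser.parse(xml).getDocumentElement().getTagName();
        log.info("imported <" + root + ">");
        return root;
    }
}
```

Each class here can be named after the one thing it does, and either dependency of OrderImporter can be replaced or versioned without touching the other.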


On the Importance of Perfection

Our systems are not perfect; code gets checked in every once in a while that is not optimal – it may go against the system’s architectural grain, or it could be cleaner, more efficient, more forward-looking.
But, checking this code in is done thoughtfully – we know when we take shortcuts in our system and we do so when we know that a new system is coming, just not soon enough for our rapid release requirements.
When we set out to write our new data access layer, we wrote it three times, twice throwing away the entire system because we found anti-patterns emerging in the code. We did these iterations very quickly, in order to discover the best set of patterns for our particular set of needs. We did not check in any code until complete scenarios could be implemented end-to-end without code duplication, without dependency loops, without the code starting to smell. The current working system still isn’t perfect, but its imperfections are understood and represent conscious tradeoffs.
In contrast, many systems we’ve seen in the past do not represent such tradeoffs – they were not designed as frameworks on which higher-level systems could be built. Such systems work, but constantly accrete complexity until they collapse under their own weight: New changes are hard to make and cause regressions, unit tests are difficult to write and require pulling in multiple dependencies.
As a Software-as-a-Service company, we strive to move fast so that we can deliver new features to our customers, but at the same time we hold ourselves to many of the standards that a company that releases public APIs must hold itself to. We do this because we believe, and have learned from decades of combined experience, that doing so not only makes us happier as engineers, it makes us more successful as a company. It does this by giving our code a much longer lifespan, by making tests quicker to write and quicker to run, by enabling new engineers to ramp up more quickly.
While it can take many years to get an intuitive feel for whether code is going to scale with increased system complexity, there are some simple methods that can be used: Code must adhere strictly to package and class design best practices. In addition, our class designs are judged using something called “SOLID”. More on this in the next blog post.
