Startups are often kick-started with a prototype. Perhaps little more than tape and baling wire, this prototype can get the company users, attention, and funding.
Success is the moment of true danger, and the test of top-notch engineering and product leadership: success is when you need to grow and evolve your system.
If you have the luxury of throwing away your initial code and starting from scratch, you’re in an unusually good position. If you don’t, then the decisions made at the prototype stage have a profound impact on your forward velocity. A badly written prototype, full of cut corners and shortcuts, will severely impair your ability to write new features, to find and fix bugs, and to build a world-class test system.
The key is to know where to take shortcuts, and this applies even beyond the prototype stage: Never, ever take shortcuts on architectural correctness or code quality. Take shortcuts on features, on configurability, on performance. Don’t make every part of your system configurable, but architect it so that it can be. Don’t try to write a massively complex sharding system – rely on a single, robust database with a hot replica. Don’t attempt to give users three ways of doing anything in your system; focus on the path of least resistance.
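As a minimal sketch of what “architect it so that it can be” might look like in practice (the names here, SocialSettings and FixedSettings, are purely illustrative and not our code), the prototype hides a hard-coded value behind a tiny interface so that a real configuration source can be slotted in later without touching any callers:

```java
// Illustrative only: hide the setting behind an interface now, wire in real
// configuration later. SocialSettings and FixedSettings are hypothetical names.
public interface SocialSettings {
    int maxCommentsPerPage();
}

// Prototype-stage shortcut: a single hard-coded implementation, no config files yet.
final class FixedSettings implements SocialSettings {
    @Override
    public int maxCommentsPerPage() {
        return 25; // good enough to validate the idea; replace with a config-backed class later
    }
}
```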
At each step of the way, build just enough code to validate assumptions and test ideas, but build it right.
Many of us grow up with the myth that in order to “move up” you must become a manager. Perhaps this is true at some companies.
It certainly is not true here.
There are so many skills that we value – communication, design, problem solving, technical depth. People can also have terrific leadership skills but not want to manage people. They may enjoy writing code more, or influencing across different groups, or solving difficult design or marketing problems.
So we have two career paths at Sociable Labs. For managers there is a management path. For individual contributors there is an IC path. Thus, a principal engineer can be more highly compensated than their manager. This is not only a fair approach, it is also extremely flexible, allowing folks to grow and be recognized for that growth without running into a prescriptive career path roadblock.
Among our design mantras is that everything that can go wrong, will go wrong.
Is the concurrency model not perfect? Under the onslaught of 10,000 hits per second, it’ll break.
Is the caching model not thought through completely? It, too, will break.
Is there a single point of failure anywhere in the system? It will bring your system down.
We’ve learned to be, and to hire, perfectionists, because shortcuts do not work in a sophisticated, high volume environment. We’ve learned that every solution must be thought through and peer reviewed because any sufficiently complex system cannot be fully understood by a single person. We deploy at least two (typically four) instances of anything.
We are moving towards a model where the system is so loosely coupled that we’ll be able to shut down every single one of our database servers, and have the system continue running with no visible interruptions or loss of data. We can already shut down any web server with no visible effect to the system, as well as seamlessly launch new instances.
We’ve learned to rehearse, to “dry run” all significant operations, and to challenge each other to approach the theoretical limits of operational goodness – to have zero downtime even for significant system changes.
All this requires very different thinking from traditional enterprise systems. It is even different from running traditional high-volume websites, because we are responsible for very different websites across far-flung, geographically diverse domains, each with highly isolated data stores running on a shared fabric.
We love the enormous amount of learning this involves, and the need to constantly improve ourselves and our systems to handle constantly increasing loads. If you’ve built a high-scale SaaS system and evolved it over time into a best-of-breed solution that is truly stable under enormous load spikes, and if you’ve learned to be a real engineer under these conditions – to solve problems thoroughly and deeply, and to consider issues of monitoring, scale, failure, analytics, concurrency, availability, and consistency – then there are few systems you cannot confidently build.
Silicon Valley moves blindingly fast – Facebook releases new code every Tuesday. Mozilla is on an incredibly aggressive three-month release cycle. Open source projects appear like mushrooms after rain, and some spread like wildfire.
Sociable Labs release trains go out every two weeks. To maintain this sort of speed, we have a number of systems in place, from automated test suites, smoke testing, and regression prevention to architectural and cultural fixtures.
I’d like to cover a few of these fixtures, specifically versioning, complexity control, and decision making.
Code is easy to change – drop a new DLL, jar file, or code file into place and your CPU has a new set of instructions to run. Data, however, lives forever, and is much more difficult to change. A company can be hampered by the cost of changes that require data migrations. To get around this, thought must be given at the very beginning of system design to where data lives and how it can be quickly and efficiently versioned. The classic problem in this space is database schema migration: adding new columns is fairly easy, but renaming columns, splitting or joining columns, refactoring one table into many or many into few, and changing data types are all complex. Our deployment pipeline accounts for this versioning and has done so since our first prototypes. Configuration management is just as important, however: as we ramp up new customers, customer configuration data rapidly grows more complex, and refactoring and normalizing configuration parameters is critical if we are to keep increasing the power of the system without carrying a heavy debt of legacy complexity. For this, we drive knowledge of versions, and of how to migrate between them, into the core of the product, so that any configuration version can be loaded and quickly, transparently migrated to the latest schema without loss of fidelity.
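A minimal sketch of that last idea – driving version knowledge into the core of the product – might look like the following, where CustomerConfig, ConfigMigration, and ConfigLoader are hypothetical names rather than our actual classes: each migration step knows how to upgrade one schema version, and the loader walks a stored configuration forward until it reaches the latest version.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not our production code: a stored configuration of any
// version is walked forward, one migration step at a time, to the latest schema.
final class CustomerConfig {
    final int version;
    final Map<String, String> values;

    CustomerConfig(int version, Map<String, String> values) {
        this.version = version;
        this.values = values;
    }
}

interface ConfigMigration {
    int fromVersion();                       // the schema version this step upgrades from
    CustomerConfig apply(CustomerConfig c);  // returns the config at fromVersion() + 1
}

final class ConfigLoader {
    private final List<ConfigMigration> steps; // ordered: v1->v2, v2->v3, ...

    ConfigLoader(List<ConfigMigration> steps) {
        this.steps = steps;
    }

    /** Load any stored version and transparently migrate it to the latest schema. */
    CustomerConfig load(CustomerConfig stored) {
        CustomerConfig current = stored;
        for (ConfigMigration step : steps) {
            if (step.fromVersion() == current.version) {
                current = step.apply(current);
            }
        }
        return current;
    }
}
```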
It is a counterintuitive principle, and one not often seen in practice, that as systems grow they should become architecturally simpler, rather than more complex: while algorithms may grow in complexity – moving, for example, from linear algorithms to non-linear, feedback-driven systems – the execution environment for those algorithms needs to become simpler. Code that starts out as a big ball of mud must evolve into something more beautiful or it will die. That means putting in place not only best practices such as dependency injection, but also an engineering culture that understands at a gut level when a particular code change is going to make the system more brittle, and when it is time to refactor. By sharing a desire for simplicity, and a deep skill at making it happen – built on experience, judgement, and collaboration – we drive platform velocity by enabling new code to slot into the system in natural, elegant ways.
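On the dependency-injection point specifically, the discipline matters more than any framework: a class never constructs its own collaborators. A small hypothetical sketch (ShareStore and ShareService are illustrative names):

```java
// Illustrative only: the service receives its collaborator through the constructor,
// so a test double, a cached store, or a sharded store can be swapped in without
// touching this class.
interface ShareStore {
    void record(String userId, String url);
}

final class ShareService {
    private final ShareStore store;

    ShareService(ShareStore store) { // injected, never constructed here
        this.store = store;
    }

    void share(String userId, String url) {
        store.record(userId, url);
    }
}
```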
We’ll keep this one short and sweet: Decisions have to be made thoughtfully, but quickly. Good decision making is a direct result of clear communication, trust, and a culture of collaboration, where people actively seek out feedback and accept it with an open mind. Priorities are shared and clear: First our customers, then our company, then us.
What does it mean to be in control?
In our business, being in control means being aware of all of your risks and having a real plan for mitigating them. We define risks as uncertainties about the success or continued health of some process. So you have, for example, schedule risk (can we ship on time?), technical risk (will this system handle massive load spikes?), business risk (will we meet our numbers?), and so on.
We use a combination of techniques to mitigate risk: Schedule risk is managed by careful planning, historical analysis and collaborative assessments. For example, after each sprint, we analyze the amount of work done and average it against previous sprints. This allows us to calculate our capacity to deliver work. We feed this data into new sprints in order to ensure that we can commit to finishing all the work we sign up for. The team as a whole collaborates on costing and estimating items.
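The arithmetic behind that capacity number is deliberately simple. As a hypothetical illustration (the unit of work and the numbers below are made up), the commitment for the next sprint is just the average of what recent sprints actually delivered:

```java
import java.util.List;

// Hypothetical illustration of the capacity calculation: average the work
// completed in recent sprints and treat it as the ceiling for the next commitment.
final class SprintCapacity {
    /** Completed work per sprint, in whatever unit the team estimates in. */
    static double capacityForNextSprint(List<Integer> completedPerSprint) {
        return completedPerSprint.stream()
                .mapToInt(Integer::intValue)
                .average()
                .orElse(0.0);
    }

    public static void main(String[] args) {
        // e.g. the last four sprints delivered 21, 18, 24, and 19 units -> commit to about 20.
        System.out.println(capacityForNextSprint(List.of(21, 18, 24, 19))); // 20.5
    }
}
```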
Another set of tools is used to manage technical risk: code reviews are done for checkins, often by multiple people. This helps everyone understand how the various parts of the system work, allows more senior members of the team to provide mentorship and oversight, and gives us additional chances to catch bugs and to ensure that code is of the highest quality – with tests and documentation fully in place. Load testing with JMeter, unit testing with JUnit, black-box testing with browser automation frameworks, monitoring systems, and careful, clean logs are some of the additional things we have in place to manage technical risk.
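To make the JUnit point concrete, here is the shape of a test that would accompany a checkin – a hypothetical example exercising the illustrative SprintCapacity class from the sketch above, not one of our real tests:

```java
import static org.junit.Assert.assertEquals;

import java.util.List;

import org.junit.Test;

// Hypothetical example of a checkin-level unit test (JUnit 4 style).
public class SprintCapacityTest {
    @Test
    public void averagesCompletedWorkAcrossSprints() {
        assertEquals(20.5, SprintCapacity.capacityForNextSprint(List.of(21, 18, 24, 19)), 0.001);
    }
}
```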
When you are in control, you are more productive: you can move faster, commit to and deliver work on time, and have less stress and more fun. And ultimately, you create a more valuable, sustainable company.
Operating systems, the internet, video games, social media – none of them would exist without significant hardware and software infrastructure devoted to caching. The same is true for us: we make enormous demands on a host of distributed caching solutions, some open source, some commercial, all carefully optimized for a particular workload.
A second line of defense is our web-tier content cache, which serves content to authenticated/connected users. This is a large but, all in all, straightforward caching system.
More interesting is our query result cache system, a multi-level distributed caching system that lazily replicates cache data from centralized memcached servers to in-memory caches on our web tiers. This enables us to handle cache system failures gracefully, so that cache restarts do not result in massive load spikes on our database clusters. Indeed, this is one of the main dangers of caching: When caching servers go down, rewarming those caches can drive back-breaking load on back-end systems, and so managing cache hydration and persistence is a key task for systems with 24×7 uptime.
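A minimal sketch of that lazy replication, with all names hypothetical and RemoteCache standing in for the memcached client: a read is served from the in-memory cache on the web tier when possible, falls back to the central cache, and only hits the database as a last resort, repopulating the closer tiers on the way back so that a cache restart does not translate directly into database load.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical two-level look-aside cache; RemoteCache stands in for a memcached
// client and dbLoader for the query against the database cluster.
interface RemoteCache {
    Object get(String key);
    void set(String key, Object value);
}

final class TwoLevelCache {
    private final Map<String, Object> local = new ConcurrentHashMap<>(); // in-memory, per web tier
    private final RemoteCache remote;                                    // centralized memcached

    TwoLevelCache(RemoteCache remote) {
        this.remote = remote;
    }

    Object get(String key, Function<String, Object> dbLoader) {
        Object value = local.get(key);       // 1. this web tier's in-memory cache
        if (value != null) {
            return value;
        }
        value = remote.get(key);             // 2. the central cache
        if (value == null) {
            value = dbLoader.apply(key);     // 3. the database, only as a last resort
            if (value != null) {
                remote.set(key, value);
            }
        }
        if (value != null) {
            local.put(key, value);           // lazily replicate toward the web tier
        }
        return value;
    }
}
```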
Also interesting is our approach to cache locality: Since modern databases can serve single-row queries on clustered-index data incredibly quickly, improving object fetch performance is best done by avoiding network round trips, rather than query overhead. Therefore, we store query results in remote caches, but never objects, which, if they are cached at all, are always cached on the web tiers.
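Sketched alongside the cache above (again with hypothetical names, reusing the illustrative RemoteCache interface), the locality rule looks roughly like this: query results, meaning lists of IDs, live in the remote cache, while the objects themselves are cached only on the web tier or fetched by primary key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the locality rule: query results (ID lists) go to the
// remote cache; objects are cached only on the web tier, never remotely.
final class QueryResultCache {
    interface Database {                     // stand-in for the data access layer
        List<Long> idsFor(String queryKey);
        Object rowById(long id);
    }

    private final RemoteCache remote;                                         // central: query -> IDs
    private final Map<Long, Object> localObjects = new ConcurrentHashMap<>(); // web tier: id -> object
    private final Database db;

    QueryResultCache(RemoteCache remote, Database db) {
        this.remote = remote;
        this.db = db;
    }

    List<Object> run(String queryKey) {
        @SuppressWarnings("unchecked")
        List<Long> ids = (List<Long>) remote.get(queryKey);   // one round trip for the whole result
        if (ids == null) {
            ids = db.idsFor(queryKey);
            remote.set(queryKey, ids);
        }
        List<Object> rows = new ArrayList<>();
        for (Long id : ids) {
            // single-row, clustered-index lookups are cheap, so objects stay off the remote cache
            rows.add(localObjects.computeIfAbsent(id, db::rowById));
        }
        return rows;
    }
}
```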
Using these approaches, as well as many others, such as sophisticated ETag exchanges, we enable our infrastructure to scale to the tens of thousands of requests per second that our customer base drives at peak hours.
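The ETag piece is standard HTTP conditional-request machinery. A purely illustrative sketch using the servlet API (the content lookup and the hash-based tag are made up for the example):

```java
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Illustrative only: answer with 304 Not Modified when the client's cached copy is current.
public class ContentServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String body = loadContent(req.getPathInfo());                 // hypothetical content lookup
        String etag = "\"" + Integer.toHexString(body.hashCode()) + "\"";

        if (etag.equals(req.getHeader("If-None-Match"))) {
            resp.setStatus(HttpServletResponse.SC_NOT_MODIFIED);      // no body to send
            return;
        }
        resp.setHeader("ETag", etag);
        resp.getWriter().write(body);
    }

    private String loadContent(String path) {
        return "widget markup for " + path;                           // stand-in for the real lookup
    }
}
```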
The weather here is back to being beautiful and our plants are nearing the point where we are going to have to give them some structural support. Meanwhile, at work, we have just closed down another sprint and are getting ready to head out to Off the Grid in downtown San Mateo.