Architecting as a Service

From blogs, which centralized many personal websites on shared infrastructure, to early Application Service Provider business models pushing enterprise applications toward the edge, to Software as a Service, which is a big reason software is eating the world, providing X as a service is an increasingly large part of how we deliver value to customers.

I am approaching three decades in our industry. During that time I have worked in startups from tiny to large, as well as big enterprises. Across them all, responsibility for infrastructure services has been a constant... The exact focus has changed, whether core services (network, DNS, LDAP), Infrastructure as Code (version control, configuration management), monitoring and logging, or most recently CI/CD. Regardless of focus, there is a common theme – none of these are valuable in and of themselves. Service is the key word, and unless each of these areas is easily and reliably consumed, the value proposition is tenuous at best.

While the need to provide services has become a constant, the ways in which we can provide those services are almost infinite. I'm not one to believe in wrong and right when it comes to engineering. In a given context, there may be a few options we can agree are very bad or very good, but thanks to human creativity there is almost certainly a large spectrum of solutions in the middle which are good enough. In our aim not to let perfect be the enemy of the good, and to avoid premature optimization, these good-enough or right-sized solutions are often ideal in practice and much cheaper to attain in the real world.

Premature optimization is the root of all evil. –Donald Knuth

Along with as-a-service patterns, CI/CD has evolved from a dark art into a well-understood science during my career. Thanks to the work of countless practitioners and detailed guides such as Continuous Delivery and Release It!, we have hard-won expertise at our fingertips to inform our software development, test and delivery practices.

While these provide high-level patterns we can adopt, how we implement those patterns is constrained only by budget, time and creative thought. You will most certainly want to start small, being pragmatic and iterating toward the ideal. However, with a growing number of tools available to help us in this space, the best fit will depend on your context. Returning to the famous Donald Knuth quote, the full version is often a better teacher:

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil... –Donald Knuth

Aside from starting with the smallest piece of value you can provide and iterating, one important key to success as you think about providing anything as a service is how your architecture evolves to scale. This is not about right or wrong, but rather which parts of the system you choose to optimize and when you do so. Beyond approach, the architecture itself may need to morph over time to continue to provide the best possible service in your context.

With that in mind, I want to extract a few stories based on my personal experience over the past few decades. The first two will focus on providing CI/CD as a service. The last will require more extrapolative thought but hopefully serve as a useful device in sharing pain points felt in a past life while providing heavily centralized infrastructure services.

Centralized CI/CD

In my current role, I help Platform Engineering teams provide Platform as a Service to development teams. This starts with manual installation and configuration in a sandbox environment to learn the ins and outs of the platform, experiment with APIs, understand configuration options, etc. It very quickly becomes an automation exercise, so the platform can be consistently replicated in other environments and kept up to date.

The primary tool we use for automating platform deployment is Concourse, a lightweight, open-source, container-based continuous integration tool. Many tools solve this problem; feel free to pick the one you are most comfortable with. The important thing is incorporating a process which includes Infrastructure as Code and CI/CD best practices.
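
To make that concrete, here is a minimal sketch of what such a Concourse pipeline might look like. The repository URL, resource names and deploy script are hypothetical placeholders, not a description of any real setup:

```yaml
# Minimal Concourse pipeline sketch: watch a (hypothetical) platform-config
# repository and run a deployment task whenever it changes.
resources:
- name: platform-config
  type: git
  source:
    uri: https://git.example.com/platform/config.git  # hypothetical repo
    branch: main

jobs:
- name: deploy-sandbox
  plan:
  - get: platform-config
    trigger: true              # new commits trigger the job
  - task: deploy
    config:
      platform: linux
      image_resource:
        type: registry-image
        source: {repository: alpine}
      inputs:
      - name: platform-config
      run:
        path: sh
        args: ["-ec", "./platform-config/ci/deploy-sandbox.sh"]
```

The pipeline definition itself lives in version control and is pushed with something like fly set-pipeline, so changes to the automation go through the same review as any other change.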

These are industry norms such as having all code (including configuration!) in version control. Whether you have one or many repos is less important in my mind. The key is to have versioned and auditable change history for any code or configuration your product depends upon, as well as transparency and shared ownership of the code base (to facilitate troubleshooting and timely fixes).

For efficiency and repeatability, you also want to leverage an artifact repository. Again, many tools can solve this problem... in its simplest form, from a past life, this was RPM/APT repositories for product and configuration hosted on a simple HTTP server (you want to version the two separately, since a given product release typically has many configuration updates across environments and over time). More full-featured options like Artifactory and Nexus are also common.
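
As a rough illustration of versioning the two separately, a pipeline could track product and configuration as independent artifacts, for example with Concourse's s3 resource against any S3-compatible store (bucket names, paths and credential variables below are hypothetical):

```yaml
# Two independently versioned artifacts: product releases move on their own
# cadence, while environment configuration iterates much more frequently.
# Bucket, paths and ((variables)) are illustrative placeholders.
resources:
- name: product-release
  type: s3
  source:
    bucket: example-artifacts
    regexp: releases/platform-(.*).tgz
    access_key_id: ((artifact_key))
    secret_access_key: ((artifact_secret))
- name: env-config
  type: s3
  source:
    bucket: example-artifacts
    regexp: config/prod/settings-(.*).tgz
    access_key_id: ((artifact_key))
    secret_access_key: ((artifact_secret))
```

The same idea applies to an RPM/APT repository or Artifactory: a product version and a configuration version can be promoted independently of each other.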

A common theme over the past couple of years has been development teams hearing about the success of platform deployment and upgrade automation, and wanting to realize similar benefits. This is great! Cross-pollination of ideas and tooling convergence (the opposite of which is undesirable tool sprawl) is healthy and can reduce complexity over time. This often takes a form similar to the diagram below.

Centralized CI/CD

I've purposefully abstracted a lot of complexity and tooling choices from this diagram so we can focus on what I perceive as common pros and cons. Obvious pros include adoption of key patterns discussed above. We have common version control, an artifact repository, and a consistent approach to getting code into various environments. The platform team also maintains full control over the CI/CD infrastructure, which lets development teams focus on their apps.

In this variation, development teams manage pipelines responsible for building their components. This gives them autonomy during daily tasks, allowing them to iterate as needed without impacting the work of others (e.g. commits in version control trigger builds [a], and if tests pass generate artifacts [b], wash-rinse-repeat). This avoids tight coupling.

Deployment is managed via a master or deploy pipeline, which typically consumes approved artifacts [c], runs tests [d] and deploys to production-like environments [e]. Step [d] is worth clarifying to avoid confusion... The deployment pipeline does not re-build anything (which could cause inconsistency), but will re-run key tests (integration, acceptance, smoke, performance, etc.) that are co-owned by development teams and housed in version control like any other code. This also lends itself well to industries requiring separation of duties, since an operations or release team can control the final stages of production deployment.
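
As a sketch of steps [c] through [e], a Concourse deploy job might look something like the following; resource names, gates and task files are hypothetical, and a real pipeline would include more environments and approvals:

```yaml
# Deploy pipeline sketch: consume an approved artifact [c], re-run key tests
# [d], then deploy to a production-like environment [e]. Nothing is rebuilt;
# the same artifact moves through every stage. All names are illustrative.
jobs:
- name: deploy-staging
  plan:
  - get: product-release        # [c] approved artifact from the repository
    passed: [promote-artifact]  # only versions that cleared the earlier gate
    trigger: true
  - get: acceptance-tests       # test code co-owned by development teams
  - task: run-acceptance-tests  # [d] re-run tests, never re-build
    file: acceptance-tests/ci/acceptance.yml
  - task: deploy-to-staging     # [e] push the same artifact to staging
    file: acceptance-tests/ci/deploy-staging.yml
```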

This is often a natural starting point, because it allows development teams to quickly leverage CI/CD benefits without requiring a lot of time or money to deploy dedicated infrastructure. However, as the number of component pipelines grows, edge cases start to rear their heads. As the shared infrastructure does more work, it consumes more of the platform team's time, which is now spent managing something other than the PaaS itself.

Another problem is the diverse needs of development teams. While common patterns will emerge over time, some workloads will undoubtedly be more CPU- vs RAM- vs I/O-bound. Batch-like jobs may trigger infrequently, while other teams trigger complex deployments many times a day. Some of these workloads can be very spiky in nature, causing noisy neighbor problems.

As this growth and complexity evolves, one knock-on effect is often impact on the platform team's ability to deploy the platform itself. Perhaps a number of development teams are running complex builds when you need to patch a critical CVE in the platform, and you encounter slow builds, timeouts or other demons which slow progress and pose risk to production workloads. Maybe you see more flakes, which reduces confidence in automation. These types of issues can often be mitigated in the short-term by additional cross-team communication (good, but also a form of overhead) or deployment windows... but that can also lead to bad practices where critical work can only be done off-hours (there goes quality of life) or at scheduled times (hackers don't conveniently schedule their attacks).

A final challenge I want to call out is cultural. While development teams interact with their component pipelines regularly, after running this setup for a while (usually after many rounds of tuning) they can become increasingly detached from the deploy pipeline. Since breaking the component pipelines stops development, that breakage is typically very visible and easy to prioritize. It's important to maintain similar visibility and a sense of ownership of the master deployment pipeline. This often includes coordinated releases, and the ability to prioritize work relating to fixing failed deploys in lower environments. All teams should share ownership of the deployment pipeline, and be empowered to pull the andon cord. This ensures quality is a cross-cutting concern, and systemic vs local optimization is the focus over time.

For these and other reasons, we have started recommending customers continue to use similar patterns and tooling, but distribute their eggs across more baskets (Concourse instances)... Let's see what that looks like!

Decentralized CI/CD

With Kubernetes becoming ubiquitous as the infrastructure dial-tone, we have been putting a lot of effort into making it easier to consume (across cloud providers and on-prem) via a management API (simplifying many day-2 tasks) and re-thinking how we empower platform teams to provide services.

While not a silver bullet (Kubernetes still needs infrastructure to consume, and that still has to be managed as it always has), container-based infrastructure can make spinning up and scaling new services faster and easier. One way we've started to leverage this new paradigm is by providing a Helm chart for Concourse. Given a platform which can provide containers as a service, this allows the platform team to provide an easy way (helm init; helm install; helm upgrade) for development teams to deploy their own dedicated CI/CD instances. Development teams continue to have control of their component pipelines, and the master pipeline is still owned by the operations or release team.
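
As a sketch, a team-dedicated instance might boil down to a small values override plus a couple of Helm commands. The chart reference, release name and value keys below are illustrative; the exact schema depends on the chart version you use:

```yaml
# team-a-values.yaml -- hypothetical override for a team-dedicated Concourse
# instance. Keys shown here are illustrative; consult the chart's docs.
concourse:
  web:
    externalUrl: https://ci.team-a.example.com
worker:
  replicas: 3

# Deployed and updated with something like:
#   helm install team-a-ci <concourse-chart> -f team-a-values.yaml
#   helm upgrade team-a-ci <concourse-chart> -f team-a-values.yaml
```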

Decentralized CI/CD

We've maintained accepted patterns and simply changed how we provide the service. We've leveraged modern infrastructure to provide better encapsulation and separation of concerns. We shifted left and allowed development teams to shoulder more responsibility, but were thoughtful in our approach, which typically leads to feelings of empowerment. Helm allows us to completely abstract deployment details from the development team (so they can evolve over time), and prevents the size or priorities of the platform team from blocking requests for additional CI/CD infrastructure (with a bit of scripting, a single command could stand up a CI/CD instance and deploy any team-specific pipelines from source control). By continuing to leverage an artifact repository and deploy pipeline, we gained flexibility without relinquishing control.

In both of these cases, the platform team will need to be part practitioner and part consultant to scale. Similar to Google's SRE staffing model, you will want to provide good interfaces, documentation, patterns (often in the form of starter repositories) and guidance on how to use the service and get started with CI/CD. You want to strike a balance between providing reasonable guardrails (often in the form of opinionated wrappers or APIs to abstract away complexity) and getting out of the way so you are not a blocker.

One may observe this version includes more moving parts. If you are adopting containers as a service, much of the shared infrastructure will already exist. If you are at a point in the first version where you have sufficiently scaled your shared VM- or server-based infrastructure, you already have a lot of (different) parts. This is about economies of scale, and leveraging the right tools for the job.

The key difference in the latter approach is that you will need to invest more effort in how you plan and provision the CI/CD instances, to avoid snowflakes increasing complexity. In a sense, you've chosen to embrace a different set of tradeoffs rather than eliminating them entirely – such is the case in most projects. Making the right tradeoffs for your context is key.

With Kubernetes, you can more easily implement auto-scaling (up to some limit), self-healing and other service management tasks. Many infrastructure metrics you and the development teams care about will come out of the box, and integrate easily with upstream tooling. Spin up and tear down is lighter weight, and you have better isolation between tenants or builds.
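
For example, scaling a team's pool of CI workers can be as simple as pointing a HorizontalPodAutoscaler at the worker Deployment; the names and thresholds below are hypothetical:

```yaml
# Hypothetical HorizontalPodAutoscaler for a team's CI worker Deployment.
# Kubernetes scales the Deployment between the stated bounds based on
# observed CPU utilization; the Deployment itself handles replacing failed
# pods (self-healing).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: team-a-ci-workers
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: team-a-ci-worker
  minReplicas: 2
  maxReplicas: 10              # the "up to some limit" part
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```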

Worth noting, this does not solve cultural issues which can lend themselves to not my problem thinking. Regardless of technical architecture, it's important for leadership to encourage shared ownership of deployment pipeline failures as a stop the line event. With common tooling and automation across environments, it is hopefully an easy sell to management (including product managers) since any breakage in deployment pipelines is a threat to the value stream.

Once Upon a Time

While I've largely embraced the latter model above, I want to end with a story which hopefully demonstrates these choices are often difficult in practice and not specific to CI/CD. Much like "mono repo vs poly repo" holy wars (our industry would be boring without such entertainment), you can be successful with either approach. Choices should come down to a combination of context (your use cases, application environment, tooling preferences, etc.) and comfort level (it is usually better to start with the evil you know best, and start as simply as possible).

Several years ago I was on a team providing infrastructure services for a number of development teams. We were kicking off a datacenter refresh, and wanted to modernize our logging and metrics infrastructure. The point is not the technology choices, since there are a lot of legos you could use to solve these problems, but in case it helps to draw on similar experience, we decided to use ELK and Graphite, with producers and consumers interconnected via a message bus.

Hub and Spoke model for Logging and Metrics as a Service

I've again abstracted away a lot of detail and specific technology (if you are curious, feel free to reach out and we can chat) to focus on the larger patterns and learning opportunity. There were many spokes, some spanning miles and others continents. We were careful to select a message bus good at dealing with latency and lossiness, and ensured relatively loose coupling between individual components. While simplified as connected boxes, each tier was composed of multiple load-balanced instances.

Like most projects, we started with a limited budget and needed to accrue some technical debt. One place this occurred was networking: point-to-point links were prohibitively expensive, so we resorted to MPLS tunnels with less-than-stellar commit rates. Our initial testing worked fine, and there was a business commitment to fund more bandwidth in a later iteration.

A key requirement was centralized search for logs and metrics. This had been a historic pain point (with per-datacenter silos requiring a lot of UI hopping). Another primary requirement was reliable metric delivery, since the centralized metric store was used to report SLAs.

We primarily focused on end-to-end flow for each data source (logs and metrics) and a loosely coupled architecture. Very quickly, we realized we should have put more thought into guardrails, throttling, back-pressure and other features which would ensure a better experience across tenants. We initially included auto-discovery of most system and application metrics, and as more teams migrated to the updated infrastructure it took a lot of care and feeding to keep the centralized storage happy. We had metric-driven monitoring, but had to invest a lot of time, and often upstream bug fixes, in scaling the clusters. Automation for this formed over time, but should have been a consideration earlier on. Worse, business priority and funding did not grow nearly as fast as consumption, and our MPLS tunnel bandwidth became increasingly problematic.

To stem the tide, we had to start downsampling logs in order to preserve metric delivery and avoid impacting SLA analysis. We increased buffer and queue sizes, and spent a lot of time scaling and tuning the message bus. Even bringing in outside consultants (which might not be an option) for that piece in particular did not help. After more investment, we at least built enough automation that we could easily scale and mitigate failures on the log and metric storage tiers.

One option we discussed was optimizing locality of reference, and pushing more data into regional storage. This would decrease real-time bandwidth demands. It would also violate one of our key requirements (centralized search), but the idea was to mitigate that by federating at the UI vs data level. There was concern over user experience (slowness from UI searches spanning data stores), but that was seen as no worse and perhaps better than the current state.

While painful in some ways, this infrastructure supported a large number of development teams shipping mission-critical security services serving many of the Fortune 100 and Fortune 500. It provided real value, but was noisy to operate... particularly in the early stages, as we adjusted components, tweaked configuration and refined automation. Even then, realizing a better state required experiencing real-world pain and using that pain to inform our architecture.

We initially thought of things like loose coupling and resilience, but we still had all our eggs (logs and metrics) in one centralized basket. This was our Achilles heel, and it was obscured by one of our primary requirements (centralized search). With more thought we could have implemented this differently from the start, but that was instead a lesson hard learned.

We also found it easy to accept tech debt early on, but hard to pay down that debt later, even with commitment from business stakeholders. Despite on-call pain, it was too easy for product management to prioritize other features over infrastructure issues invisible to customers. Be mindful of the debt you incur!

I'll stop here since this is not meant as an exhaustive explanation of our architecture, its evolution over time, or the pain we encountered along the journey. The key points are that even when thinking carefully about architecture, you will often need to make decisions based on limited time and budget. Further, edge cases are hard to understand until you have run something at scale. Stay pragmatic, start small and iterate, and continue to evolve your architecture as you learn. Thinking too far ahead will result in analysis paralysis. Think about how you provide value in the short-term, and how you can pivot when needed. Think about growth, scaling and non-functional requirements just like features.

Conclusion

While the use cases and architecture choices above varied, certain patterns are constant: Infrastructure as Code, a consistent deployment pipeline composed of automated tasks, visibility into both application code and configuration, the use of artifact repositories, empowerment across team boundaries to spot and fix problems, being mindful of dependencies, iterating with a focus on value, and thinking about non-functional requirements.

We live in a time where many of these patterns have been proven at scale, succinctly captured for easy consumption, and are moving from black art to industry science. For that, we must thank those who have come before us...and out of respect, we should carefully apply these patterns in our daily work while living in the moment, adapting to our own contexts, and helping generate the next set of useful patterns for our community and posterity.
