When looking at architecture, it is always important to have a set of well-defined principles that can be used to assess and review an architecture, and that also give developers the rules of engagement they need to consider in their design and implementation. The principles must not be overwhelming, and they need to be constantly assessed to measure their effectiveness. So here it is: I will first give a summary of each principle, followed by a more detailed description. There are 11 principles, which seem to be quite useful when they are applied. There is no order of priority, but some of the principles should be considered mandatory for the success of the project.
Abbreviated principle list
- Managing Failure (Mandatory): Embrace failure: don't try to prevent it, but manage it.
The presence of failures is the rule; don't see them as exceptions. Assume everything fails, for example:
- Power faults: plan for them; they can happen with unexpected frequency.
- Double faults happen: expect them, and don't just assume single independent failures when creating scenarios of doom.
- Partial failure: handle it, and never assume partitioning can't happen in a data center.
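As a minimal sketch of what "manage it" can look like in code, here is a retry helper with exponential backoff and jitter. Everything in it (`call_with_retries`, `FlakyService`, the default values) is a hypothetical name or number chosen for illustration, not a prescribed implementation.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Run `operation`, retrying on failure with exponential backoff and jitter.

    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: surface the failure so the caller can degrade.
                raise
            # Delay doubles on each attempt and is scaled by jitter between 0.5 and 1.0
            # so that many clients do not retry in lockstep.
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.0)
            sleep(delay)


class FlakyService:
    """Hypothetical dependency that fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("simulated partial failure")
        return "ok"
```

Calling `call_with_retries(FlakyService(2).fetch)` succeeds on the third attempt. When every attempt fails, the last exception is re-raised so the caller can fall back to something else instead of hanging.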
- Instrumentation and Logging (Mandatory): Know what's happening, before you try to improve anything.
We will need a great deal of runtime information either in real time or batch mode:
- Systems designed to expose runtime information and measure performance (including latency).
- Data rich enough to provide meaningful information about systems and users.
- Events carry correlation information to provide a big picture view of high level and complex events. The correlation can be:
- Horizontal: between system elements of the same type: hardware or software.
- Vertical: going through the infrastructure, OS, application and service levels.
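To make the correlation idea concrete, here is a small sketch of structured events that carry a correlation ID, plus a helper that regroups them into one view per request. The field names (`correlation_id`, `layer`, `component`, `latency_ms`) and the function names are assumptions made for this example.

```python
import json
import time
import uuid


def new_correlation_id():
    """Create an opaque ID that is attached to every event of one request."""
    return uuid.uuid4().hex


def make_event(correlation_id, layer, component, message, latency_ms=None):
    """Build one structured log line. Field names are illustrative assumptions."""
    event = {
        "timestamp": time.time(),
        "correlation_id": correlation_id,
        "layer": layer,  # vertical axis: infrastructure, os, application, service
        "component": component,  # horizontal axis: which element of that type
        "message": message,
    }
    if latency_ms is not None:
        event["latency_ms"] = latency_ms
    return json.dumps(event)


def group_by_correlation(lines):
    """Reassemble raw log lines into a per-request list of events."""
    grouped = {}
    for line in lines:
        event = json.loads(line)
        grouped.setdefault(event["correlation_id"], []).append(event)
    return grouped
```

The same ID is passed down through every layer that handles a request, so a batch job or a real-time consumer can rebuild the big picture with `group_by_correlation`.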
- Design for the cloud (Mandatory): Design applications/software elements for a dynamic cloud environment, don't assume your infrastructure.
Most of these are classic internet design principles and are not specific to cloud development.
- Partition: avoid funnels or single points of failure. The only aggregation point should be the network itself.
- Plan on resources not being there for short periods of time: break the system apart into pieces that work together, but can keep working in isolation for at least several minutes.
- Plan on any machine going down at any time: build mechanisms for automated recovery and reconfiguration of the cluster (see Principle 1).
- Implement elasticity:
- Automate the deployment process and streamline the configuration and build process (ensure the system can scale without any human intervention).
- Every instance should have a role to play in the environment (DB, FE, BE, ...), which can be passed as an argument that tells the machine image what to do (e.g. grab the necessary resources).
- Rely on parallelization and multi-threading when accessing (retrieving/storing) data and consuming resources, and handle deadlocks (2-way, n-way, phantom) with proper detection, prevention and avoidance algorithms built on a wait-for graph.
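For the deadlock part, detection on a wait-for graph comes down to finding a cycle. The sketch below assumes a plain dictionary mapping each transaction to the set of transactions it is waiting on; the name `find_deadlock` and the transaction labels are hypothetical. It covers 2-way and n-way cycles only. Phantom deadlocks, which come from stale graph information in a distributed setup, are out of scope for this example.

```python
def find_deadlock(wait_for):
    """Return the transactions forming a cycle in the wait-for graph, or None.

    `wait_for` maps each transaction to the set of transactions it waits on.
    Uses a depth-first search with three colors to spot back edges.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in sorted(wait_for.get(node, ())):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # Back edge: everything from `nxt` to the end of the path is a cycle.
                return path[path.index(nxt):]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in sorted(wait_for):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

Once a cycle is returned, a resolution policy (not shown here) would pick a victim from it, for example the youngest transaction, and abort it so the others can proceed.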
- Divide and Conquer (Mandatory): It's all about boundaries and... Rome went bust when it became too big...
- Divide big, complex problems & systems into smaller, simpler components.
- Choose the best solution, tools & technology available for each component.
- Optimize each component for the most frequent tasks it will need to do.
- Aim for the minimum overlap of functionality between the components.
- Avoid tight dependencies between components.
- Latency is not zero: It exists! Embrace it: latency is the mother of interactivity. But try hard to reduce it. Latency hurts (customers AND our revenue!):
- Amazon: every 100ms of latency cost them 1% in sales.
- Google: an extra 0.5 seconds in search page generation dropped traffic by 20%.
- The less interactive a site becomes, the more likely users are to click away and do something else, e.g. use a competitor's site (in games, for example, latency makes or breaks success).
- Every API should be cacheable. The cache should sit closer to the end user.
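One simple way to picture a cacheable API is a cache with a time-to-live in front of a slow origin call. This is only a sketch: `TTLCache`, `get_or_load` and the injectable `clock` are names invented for the example, and a real deployment would more likely rely on HTTP caching or a CDN near the user.

```python
import time


class TTLCache:
    """Tiny in-memory cache with per-entry expiry, standing in for an edge cache."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: no round trip to the origin
        value = loader(key)  # miss or stale entry: pay the origin latency once
        self._entries[key] = (now + self.ttl_seconds, value)
        return value
```

With `cache = TTLCache(ttl_seconds=30)`, repeated calls to `cache.get_or_load(key, loader)` within 30 seconds invoke `loader` only once. The trade-off is that readers may see data up to 30 seconds old.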
- Almost infinite scale (Mandatory): Assume the number of objects will grow significantly, and more than you thought. Almost-infinite scaling is a deliberately loose way to motivate us to:
- Be clear about when and where we can know something fits on one machine.
- Decide what to do if we cannot ensure it fits on one, two or three machines.
- Scale almost linearly with the load (both data and computation).
- Develop a plan and documentation on how to extend capabilities ad hoc.
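When something no longer fits on one machine, one common answer is to partition it by key. The sketch below assumes string keys and a fixed number of shards; `shard_for` is a hypothetical helper, and simple modulo placement is only one option, since it moves most keys whenever the shard count changes.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a string key to a shard index in a stable, evenly spread way.

    A cryptographic hash is used instead of Python's built-in hash() because
    the built-in one is randomized per process for strings.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

Every node that calls `shard_for("user:42", 4)` gets the same answer, so no central lookup is needed to find where an object lives.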
- Relaxed Consistency: Trade some consistency for availability in partitioned databases.
- BASE is diametrically opposed to ACID:
- ACID (Atomicity, Consistency, Isolation, Durability): pessimistic, and forces consistency at the end of every operation.
- BASE (Basically Available, Soft state, Eventual consistency): optimistic, and accepts that consistency will be in a state of flux but will eventually be reached.
- While this sounds impossible to cope with, it is quite manageable and leads to levels of scalability that cannot be obtained with ACID.
- Basically Available: it will always be there, even if some nodes aren't, reads for all data likely still served by caches.
- Soft state: Writes and Updates return quickly but may take time to propagate, reads will never block but might return a previous version.
- Eventual consistency: if you wait long enough, the data will eventually be consistent.
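The three BASE properties can be seen in a toy model: writes land on one replica and return immediately, reads never block but may be stale, and a background sync brings replicas back together. The `Replica` class, the `anti_entropy` function and the last-write-wins rule based on a version number are all assumptions made for this sketch. Real systems often use vector clocks or other conflict resolution.

```python
class Replica:
    """One node holding versioned values; names are illustrative."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (version, value)

    def write(self, key, value, version):
        current = self.store.get(key)
        # Last-write-wins: only a strictly higher version replaces the entry.
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        # Never blocks; may return an older version than another replica holds.
        entry = self.store.get(key)
        return None if entry is None else entry[1]


def anti_entropy(replicas):
    """Push every replica's entries to all the others so they converge."""
    for source in replicas:
        for key, (version, value) in list(source.store.items()):
            for target in replicas:
                if target is not source:
                    target.write(key, value, version)
```

Between a write and the next `anti_entropy` pass, two replicas can disagree about a key (soft state). After the pass they hold the same highest version (eventual consistency).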
- Design for Security (Mandatory): Treat Security requirements as 1st class citizens in your design:
- Define ownership of data and take data life-cycle into account.
- Define multi-tenancy level and handle it within all system components:
- Infrastructure: separate network, separate blades/rack (up to either separate location, or locked cages) -> expensive at the whole stack level.
- OS: separate OS, with or without a virtualized environment/container, on shared infrastructure -> good level to start thinking about multi-tenancy.
- Application: Multiple instances on the same OS -> expensive: licensing model.
- Service: Logical separation at the database level within the same application -> dangerous: the application may not fully understand the logical separation via data.
- Handle delegated administration as part of the basic design:
- Individual user with self administration or administration on behalf.
- Cascading notion of groups of users (company, departments within a company, groups within a department or group, etc.) with administration policies at each level.
- Don't reinvent the wheel: Did somebody else already solve your problem? You're likely not trying to solve a totally new problem:
- Look at Open Source, the commercial market or SaaS solutions (the good ones follow similar architecture principles and expose an API) for things you can reuse.
- When using Open Source, use truly managed Open Source (check whether a company initiated the project; if so, there is a good chance that it offers an enterprise version).
- There's also a good chance it has already been used in a production environment. This will save you time that you can use to focus on proper business logic with all the architecture principles!
- Worse is better: Solve only 80% of a problem: that's usually good enough.
- Worse-is-better solutions not only get done faster, they actually get done at all.
- Think about a design that:
- is simple.
- is correct in all observable aspects and not overly inconsistent.
- covers as many important situations as is practical.
- Solid Design Patterns: Good design gives faster delivery, change and complexity management.
- Rely on well-proven patterns and anti-patterns that help you solve problems in an efficient and elegant way and make the software less complex and cheaper to maintain.
- Learning and understanding those patterns must be paramount for any developer that joins the project.
This list does not pretend to be exhaustive, but it covers the major pitfalls I have seen in different projects at the different companies I have worked for.
Detailed principle list
This section will cover each principle with the following structure:
- Statement: What has already been described in the previous list.
- Rationale: A description of why it is important, by looking at what happens if the principle is not followed.
- Implication: The consequences for what development and other teams should or need to do.
- Industry best practices: When applicable, a non-exhaustive list of examples of what has been done at the industry level.
- Detailed recommendations: When applicable, a non-exhaustive list of practical actions that cover the principle.
Conclusion
As usual with principle lists of this kind, it will always be possible to find missing items, and it should always be possible to add or remove a principle, or a detail of a principle, to better fit the demands and constraints of a specific environment.
It is also important to understand that there will always be exceptions. However, since the principles are documented, it is then easy to assess the risk of an exception that does not follow a specific principle.