My ramblings on the stuff that holds it all together
Applying “Agile” to Infrastructure…? Virtualization is Your Friend
I have been thinking about this for a while. In the traditional model of delivering an IT solution there is an extended phase of analysis and design, which leads through to build and hand-over stages. There are various formalised methodologies for this, but in general they all rely on good upfront requirements to deliver a successful project against – and in infrastructure terms this means you need to know exactly what is going to be built (typically a software product) before you can design and implement the required infrastructure to support it.
This has always been a point of contention between customers, development teams and infrastructure teams, because it’s hard to produce meaningful sizing data without a lot of up-front work and prototyping – unless you really are building something easily repeatable (in which case, is a SaaS provider a more appropriate model?).
In any case, the extended period these steps require on larger projects often doesn’t keep pace with the rate of technical and organisational change typical in modern business. The end result: the tech teams are looking after an infrastructure designed to outdated (or, at worst, made-up) requirements; the developers are retro-fitting changes to the code to support changing requirements; and the customer has something which is expensive to manage, wonders why they aren’t using the latest whizzy technology that is more cost-effective, and is looking at a refresh early in its life-cycle – which means more money thrown at the solution.
With the growing popularity of Agile-type methodologies to solve these sorts of issues for software projects, infrastructure teams are facing a much harder time. Even if they are integrated into the Agile process – which they should be (the attitude should be that you can’t deliver a service without infrastructure, and vice-versa) – they struggle to keep up with the rate of change because of the physical and operational constraints they work within.
Other than some basic training and some hands-on experience I’m definitely not an Agile expert – but to me “agile” means starting from an overall vision of what needs to be delivered, iteratively breaking the solution into bite-sized chunks and tackling them in small parts, delivering small incremental pieces of functionality through a series of “sprints” – for example, delivering the basic UI and customer details screen for an order entry application, letting people use it in production, then layering on further functionality through subsequent sprints and releases. A key part of this process is reviewing the work done and feeding that experience back into the subsequent sprints and the overall project.
Typically in Agile you would try to tackle the hardest parts of a solution from day one – these are the parts that make or break a project. If you can’t solve them in the 1st or 2nd iteration maybe the problem actually is impossible, and you can make a more informed decision on whether the project is feasible; at a minimum you carry forward the learning and practical experience of trying to solve the problem – what does and doesn’t work – and are able to produce better estimates.
This has another very important benefit: end-user involvement. Real user feedback makes it easier to get buy-in to the solution, and the feedback users give from using something tangible day to day – rather than from a bunch of upfront UI workflow diagrams or a finally delivered solution – is invaluable; you get it BEFORE it’s too late (or too expensive) to change. Fail early (cheaply) rather than at the end (costly).
For me, this is how Google have released their various “beta” products like Gmail over the last few years. I don’t know if they used “Agile” methodologies, but they set expectations that the product is still a work in progress, that it’s “good-enough” and “safe”, and that you (the user) have a feedback channel to get something changed to how you think it should be.
Imagine if Google had spent two years on an upfront design and build project for Gmail, only for it to become unpopular because it only supported a single font in an email – something they hadn’t captured in their upfront requirements, and which, for argument’s sake, could have been implemented in weeks during a sprint but would take months to implement post-release because it meant re-architecting all the dependent modules developed later on.
In application development terms this is fine – Agile is just a continual release/review cycle, and a release just means deploying application code to a bunch of servers. But how does that map to the underlying infrastructure platform, where you need to provide and run something more tangible and physical? Every incremental piece of functionality may need more server roles, or more capacity to service the load that functionality places on databases, web servers, firewalls etc.
With physical hardware, implementing this sort of change means physical intervention – people in data centres, server builds, purchase orders, deliveries, lead time, racking, cabling etc. every time there is a release. With typical sprints being 2–4 week iterations, traditional physical infrastructure quite often can’t keep up with the rate of change – or, at a basic level, can’t do so in a managed-risk fashion with planned changes.
What if a development sprint radically changes the amount of storage required by a host, needs a totally different firewall and network topology, or needs more CPU or RAM than you can physically support in the current hardware?
What if a release has an unexpected and undesirable effect on the platform as a whole? For example, a service places a heavy load on a CPU because of some inefficient coding that hadn’t shown up during testing and isn’t trivial to patch – you have two choices: roll back the change, or scale the production hardware to work around it until it can be resolved in a subsequent release.
Both of these examples mean you may need servers to be upgraded or replaced, and all of this adds up to increased time to deliver – the infrastructure becomes a roadblock, not a facility.
Add to this the complication of doing all this “online”, because the system this functionality is being delivered to is in production with real, live users – that makes things difficult to do with low risk or no downtime.
The traditional approach to this lack of accurate requirements and the resulting uncertainty has been to over-specify the infrastructure from day one, building in a lot of headroom and redundancy to deal with on-line maintenance. However, with traditional infrastructure you can’t easily and quickly move services (web services, applications, code) and capacity (compute, storage, network) from one host to another without downtime, engineering time, risk etc.
Rather than making developers or customers specify a raft of non-functional requirements before any detailed design work has started, what if you could start with some hardware (compute, network, storage) that you can scale out incrementally and horizontally?
If you abstract the underlying hardware from the server instance through virtualization it suddenly becomes much more agile – cloud-like, even.
You can start small, with a moderate investment in platform infrastructure, and scale it out as the incremental releases require more. Maintain a pragmatic amount of headroom within the infrastructure capacity and you can react straight away – as long as you are diligent about back-filling that capacity to maintain the headroom.
With virtualization – particularly, at the moment, with vMotion, DRS and Live Migration type technologies – you have an infrastructure capable of horizontal scaling far beyond anything you could achieve with physical platforms, even with the most advanced automated bare-metal server and application provisioning tools.
Virtualization also has a place where individual hosts need more CPU, memory etc. Even if you need to upgrade the underlying physical hardware to support more CPU cores, virtualization allows you to do most of this online by moving server instances to and from the upgraded hardware while they are running.
VMware vSphere, for example, supports up to 8 virtual CPUs and 256GB of RAM presented to an individual virtual machine. You can add new, higher-capacity servers to a VMware ESX/vSphere cluster and then present the increased resources to a virtual machine, sometimes without downtime to the server instance. This seamless upgrade capability will improve as modern operating systems become better adapted to virtualization; in any case, vMotion allows you to move server instances around online to support this kind of maintenance of the underlying platform in a way that was never possible before virtualization.
This approach allows you to right-size your infrastructure based on real-world usage: you run the service in production with some flex/headroom capacity, not only to deal with spikes and satisfy immediate demands but also with a view to capacity planning for the future – backed up with real statistics.
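As a rough illustration of the headroom idea – the 25% target, the function name and the cluster figures below are my own assumptions for the sake of example, not anything prescribed by a particular methodology or product – the “back-fill the capacity” decision can be sketched like this:

```python
# Toy capacity-headroom check - the 25% headroom target and host sizes
# are illustrative assumptions, not recommendations.

HEADROOM_TARGET = 0.25  # keep at least 25% of cluster capacity free


def hosts_to_backfill(total_ghz, used_ghz, host_ghz):
    """How many extra hosts are needed to restore the headroom target?"""
    # capacity needed so that usage stays at or below 75% of the total
    required_total = used_ghz / (1 - HEADROOM_TARGET)
    shortfall = required_total - total_ghz
    if shortfall <= 0:
        return 0  # already within the headroom target
    return int(-(-shortfall // host_ghz))  # round up to whole hosts


# e.g. an 8 x 20GHz cluster running at 130GHz after a sprint's release
print(hosts_to_backfill(total_ghz=160, used_ghz=130, host_ghz=20))  # 1
```

The point is that the trigger for buying hardware becomes a simple, statistics-driven rule fed by real production usage, rather than a best-guess made months before go-live.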
Maybe on day one you don’t even need to purchase any hardware or infrastructure to build your first couple of platform iterations – you could take advantage of a number of cloud solutions, like Amazon EC2 or VMware vCloud, to rent capacity to support the initial stages of your product development.
This avoids any upfront investment whilst you are still establishing the real feasibility of the project, and outsources the infrastructure pain to someone else for the initial phases. Once you are sure your project is going to succeed (or at least you have identified the major technical roadblocks and have a plan) you can design and specify a dedicated platform based on real-world usage rather than best guesses – and the abstraction virtualization offers makes that kind of transition, to a dedicated platform or even another service provider, much easier.
To tame the release/risk complexity, virtualization allows you to snapshot and roll back entire software and infrastructure stacks in their entirety – something that is almost impossible in the physical world – and you can also clone your production system off to an isolated network for staging, destructive testing or even disaster recovery.
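To make the snapshot/rollback idea concrete, here is a toy model of the semantics – this is purely illustrative (the class and its methods are invented for this sketch, not any vendor’s API), but it captures why whole-stack rollback changes the release-risk calculation:

```python
# Toy model of whole-stack snapshot/rollback semantics - illustrative
# only; hypervisors implement this at the disk/memory level, not like this.
import copy


class PlatformStack:
    """A pretend platform: VM configs, app versions, the lot."""

    def __init__(self, state):
        self.state = state
        self._snapshots = []  # saved known-good states, newest last

    def snapshot(self):
        # capture the entire stack state before a risky release
        self._snapshots.append(copy.deepcopy(self.state))

    def rollback(self):
        # restore the most recent snapshot if the release goes wrong
        self.state = self._snapshots.pop()


stack = PlatformStack({"web": "v1", "db": "v1"})
stack.snapshot()              # taken before the sprint's release
stack.state["web"] = "v2"     # the release changes the platform
stack.rollback()              # bad release? one step back to known-good
print(stack.state["web"])     # v1
```

The physical-world equivalent – rebuilding servers back to their pre-release state – is days of work; with virtualization it is effectively the `rollback()` call above.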
Hopefully this has given you some food for thought on how Agile can apply to infrastructure, and where virtualization can help you out. I only ever see the Agile topic discussed in relation to software development, but virtualization can help your infrastructure work with Agile methodologies too. However, it’s important to remember that neither Agile methodologies nor virtualization are a panacea – they are not a cure for all ills, and you will need to carefully evaluate your own needs; both are valuable tools in the architect’s toolbox.