My ramblings on the stuff that holds it all together
I have been watching the conversation about ITIL, virtualization and cloud play out over the last year; some very enthusiastic bloggers loudly bash ITIL for how unsuitable it is in the modern cloud world. Change control for vMotion? Do you update your CMDB when you vMotion? Lunacy? How? Tools? Methods? Consultancy? Snake-oil? £££?
I spent a good chunk of my role before VMware as a project-based architect with significant interfaces to an ITIL-heavy managed-services team. They embraced ITIL, took it to the core of the operations side of the business, and had the scars as well as the trophies to show for it.
I have seen it help, and I have seen it hinder – but I think the core problem people have with ITIL is that they just don't understand it, or they are afraid it's hard work.
Wake up: it is hard work, but it's hard work for a reason.
ITIL is not prescriptive; it doesn't tell you how to do things so that you have to change your business to fit around it. The truly successful ITIL organizations are those that understand it's a FRAMEWORK: you can pick and choose the parts that apply or deliver benefit to your business and discard those that don't.
ITIL also comes at a cost. ITIL is about best practice, information sharing, planning and auditability/accountability; that means systems, software and people time to make it happen. But that cost buys reduced risk and accountability: yes, it slows your reaction time by adding a layer of process and approval, but the trade-off is that when things do go wrong you know who did what, when, why and who said it was OK to do it (accountability), rather than being left with an unmanaged mess.
What does that deliver…?
- More expensive operations (people time = £££, tools = £££)
- More informed operations and business (reduced downtime, intellectual property retained = £££)
The two functions of ITIL that I see raise the most hackles are Change Control (and the notorious Change Advisory Board (CAB) meeting) and the Configuration Management Database (CMDB), which I will tackle in turn.
Change Control is about communications and planning. Those CAB meetings are there to disseminate information about what is going to happen and gain the buy-in of stakeholders; however obscure you may think your dependency on the change being implemented, you have had your opportunity to air an opinion and contribute to the go/no-go based on your service requirements. It's your responsibility to your service to engage with this process and not see it as a hindrance – neither IT nor the business stands still, nor should you.
ITIL also makes techies stand back and think about what they are about to do before they do it – because you make them document it and explain it in English (or $LOCALE) to the people that matter (the stakeholders), rather than just letting them get all Jackie Chan with the CLI. As techies, it's all too easy to believe in your own command-line fu and forget that you are fallible, and that you may have missed a critical dependency or failed to convey the gravity and risk of what you are about to do to that customer.
Sometimes, as a techie, the ITIL-induced CAB is your friend; this is your chance to convey the risk of something you have been asked to do. It's your way of saying "you won't spend £££ on redundant storage for this service migration, so if this goes wrong you will be down for X hours at a cost of £Y". That's a very useful and practical way to put things into perspective for the stake- and budget-holder and lubricate the flow of extra contingency budget to avert a potential disaster – and if it does go wrong, you've CYOA.
The CMDB is just a database (or in some cases many databases). So what if you don't have a single all-seeing and all-knowing CMDB? There may be very valid reasons to maintain multiple CMDBs – for example, some equipment may be owned/managed by service providers and some by internal IT. This isn't new; it's an age-old business IT problem, and in the real world (i.e. business) it's solved by building interfaces, APIs and views. So why not treat your mythical and so-hard-to-manage CMDB as a meta-database: an index of where to go and find the relevant info (or better still, build an API to do it for you).
And stop relying on people to populate the CMDB correctly – build tools to do it automatically. Leverage that API and have hosts check themselves in and out of the cloud, between clouds, or between clouds and internal infrastructure. This isn't a problem with ITIL; it's a problem with doing things manually.
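As a rough sketch of what that automatic population could look like – assuming a simple in-process CMDB rather than any particular product's API, with invented host names – hosts check themselves in with a heartbeat, and anything that stops heartbeating gets aged out rather than lingering as a stale record:

```python
import time

class SimpleCMDB:
    """Toy CMDB: hosts register themselves and heartbeat; stale hosts age out."""

    def __init__(self, max_age_seconds=60):
        self.max_age = max_age_seconds
        self.hosts = {}  # hostname -> timestamp of last heartbeat

    def check_in(self, hostname, now=None):
        # A host (or provisioning tool) calls this on boot and on every heartbeat.
        self.hosts[hostname] = now if now is not None else time.time()

    def check_out(self, hostname):
        # Called by an orderly decommission process.
        self.hosts.pop(hostname, None)

    def age_out(self, now=None):
        # Remove hosts that died or were removed outside the process.
        now = now if now is not None else time.time()
        stale = [h for h, seen in self.hosts.items() if now - seen > self.max_age]
        for h in stale:
            del self.hosts[h]
        return stale

cmdb = SimpleCMDB(max_age_seconds=60)
cmdb.check_in("web01", now=0)
cmdb.check_in("web02", now=0)
cmdb.check_in("web01", now=100)   # web01 keeps heartbeating; web02 goes silent
removed = cmdb.age_out(now=100)   # web02 exceeds the 60-second age-out window
```

In a real deployment the dict would be replaced by calls to your CMDB's API, but the principle is the same: the record of what exists is maintained by the systems themselves, not by someone remembering to update a spreadsheet.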
I honestly don't see ITIL as a blocker for cloud; systems and people just get smarter to support quicker change and deliver a lower cost of operations, for example:
- A list of pre-approved automated changes, with a notification list for when they are implemented – adding more storage, adding hosts, vMotion, storage tiering etc. – that keeps a detailed audit trail.
- A budget of pre-approved changes/actions based on typical usage – this allows systems to trap/manage explosions of requests that could be caused by a problem.
- Automated voting tools for change approval/veto, with an agreed escalation process, rather than CAB conference calls/meetings.
- Systems that register/de-register themselves in the CMDB when changes happen – rather than relying on someone to do it manually – with some sort of heartbeat to age out hosts that die or are removed outside the process.
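To make the first two ideas concrete, here is a minimal sketch (the change types and budget numbers are invented for illustration): changes on a pre-approved list are auto-approved up to a budget, everything is audit-logged, and a burst of requests beyond the budget gets held for human review rather than silently executed:

```python
import time

# Hypothetical list of change types the CAB has pre-approved for automation.
PRE_APPROVED = {"add_storage", "add_host", "vmotion", "storage_tiering"}

class ChangeGate:
    def __init__(self, budget_per_window=10):
        self.budget = budget_per_window  # pre-approved actions allowed per window
        self.used = 0
        self.audit_log = []              # detailed audit trail of every request

    def request(self, change_type, requester, detail=""):
        if change_type in PRE_APPROVED and self.used < self.budget:
            self.used += 1
            decision = "auto-approved"
        elif change_type in PRE_APPROVED:
            # Budget blown: could be a problem storm, so escalate instead.
            decision = "held: budget exhausted, escalate to CAB"
        else:
            decision = "held: needs CAB approval"
        self.audit_log.append((time.time(), requester, change_type, detail, decision))
        return decision

gate = ChangeGate(budget_per_window=2)
r1 = gate.request("vmotion", "drs", "rebalance cluster A")
r2 = gate.request("add_storage", "autoscaler", "+500GB to tier 2")
r3 = gate.request("vmotion", "drs", "rebalance cluster A")  # budget now exhausted
r4 = gate.request("decom_host", "ops")                      # never pre-approved
```

The point is that the audit trail and the escalation path survive the automation: the process is faster, not absent.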
Applications are changing for the cloud, and application frameworks are freeing code from the underlying infrastructure – great. Maybe this means you don't have to worry about infrastructure, servers, networks or storage in the great public cloud (it's SEP), but you still leverage ITIL for things like release management and change control within the bits you manage/care about.
It's still the same old ITIL in the cloud – just ITIL principles with better tools and enlightened people.
Speed of change and instant gratification are among the much-touted benefits of cloud, but let's put that into perspective: how often does your business really need a server/application NOW – i.e. in 3 minutes? And if you do, how well thought out is that deployment? How long before it becomes a critical but home-grown business app that you can't un-weave from the rest of the business (how often have you seen spreadsheet applications and Access DBs worm their way into your own business processes)?
If you implement the sort of lightweight change control/approval I discuss here, does it really matter if it takes an hour to go through an approval cycle while everyone knows what's going on? Approval could even be automated if you are given that level of pre-approved change.
With that I'll sign off with a simple warning: bear in mind that the more automated you make things, the easier it is for people to ignore them or feel disenfranchised from the activity. An electronic approval becomes a task, rather than a face-to-face decision for which they were accountable in a meeting/CAB – people are still human after all, and it's the stupid system's fault, isn't it?
I was passed a link to a very interesting online article about silent data corruption on very large data sets, where corruption creeps undetected into the data read and written by an application over time.
Errors are common when reading from any media, and they would normally be trapped by storage-subsystem logic and handled lower down the stack. But as these subsystems increase in complexity, and the data they store vastly increases in scale, this could become a serious problem: bit errors not trapped by disk/RAID subsystems are passed on, unknown, to the requesting application as a result of firmware bugs or faulty hardware. Typically these bugs manifest themselves in a random manner, or for edge-case users with unorthodox demands.
All hardware has an error/transaction rate. In systems up until now this hasn't really been too much of a practical concern, as you run a low chance of hitting an error – but as storage quantities increase into multiple TB of data, that chance increases dramatically. A quick scan round my home office tallies about 16TB of online SATA storage; by the article's extrapolation, this could mean I have 48 corrupt files already.
This corruption is likely to be single-bit in nature, and maybe that's not important for certain file formats – but you can't be sure; I can think of several file formats where flipping a single bit renders the file unreadable in the relevant application.
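To illustrate why detection is at least tractable: flipping a single bit in a payload produces a completely different SHA-256 digest, so even one-bit corruption is visible if you kept a checksum from write time (the payload here is just an example):

```python
import hashlib

original = b"policy covers illness X " * 1000   # pretend this is a stored record
corrupted = bytearray(original)
corrupted[5] ^= 0x01                            # flip a single bit

digest_before = hashlib.sha256(original).hexdigest()
digest_after = hashlib.sha256(bytes(corrupted)).hexdigest()
mismatch = digest_before != digest_after        # the one-bit flip is detectable
```

Of course this only detects the corruption after the fact – recovering the data still depends on having a clean copy somewhere.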
Thinking slightly wider: if you are the end-user "victim" of some undetected bit-flipping, what recourse do you have when a 1 flips to a 0 to say your life insurance policy doesn't cover that illness you have just found out you have – "computer says no"?
This isn't exclusively a "cloud problem" – it applies to any enterprise storing a significant amount of data without any application-level integrity checks – but it is compounded in the cloud world, which is all about centralised storage of data, applications and code: multi-tenanted, highly consolidated, and possibly de-duplicated and compressed where possible.
In a market where cost/GB is likely to be king, providers will be looking to keep storage costs low by using cheaper disk systems – while making multiple copies of data for resilience (note: resilience is different from integrity). This could introduce further silent bit corruptions that are propagated across multiple instances, as well as increasing the risk of exposure to a single-bit error due to the increased number of transactions involved.
In my view, storage hardware and software already does a good job of detecting and resolving these issues, and will scale the risks/ratios with the volumes stored. But if you are building cloud applications, maybe it's time to consider a checksumming method when storing/fetching data from your cloud data stores, just to be sure – that way you have a platform- (and provider-) independent method of providing integrity for your data.
Any such checksumming will carry a performance penalty, but that's the beauty of cloud – scale on demand. Maybe PaaS providers will start to offer a web service to offload data checksumming in future?
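A minimal sketch of that store/fetch pattern, with a plain dict standing in for the cloud object store (a real provider SDK would slot in the same way): compute a checksum at write time, verify it on every read, and fail loudly on a mismatch instead of silently returning corrupt data:

```python
import hashlib

class ChecksummedStore:
    """Wrap any key/value store with provider-independent integrity checks."""

    def __init__(self, backend=None):
        # 'backend' stands in for a cloud object store; a dict for this sketch.
        self.backend = backend if backend is not None else {}

    def put(self, key, data: bytes):
        # Store the data alongside a SHA-256 digest computed at write time.
        self.backend[key] = (data, hashlib.sha256(data).hexdigest())

    def get(self, key) -> bytes:
        data, stored_digest = self.backend[key]
        if hashlib.sha256(data).hexdigest() != stored_digest:
            raise IOError(f"silent corruption detected in {key!r}")
        return data

store = ChecksummedStore()
store.put("invoice-42", b"amount=100.00")
clean = store.get("invoice-42")     # verifies the digest and returns the data

# Simulate silent corruption in the provider's storage layer:
data, digest = store.backend["invoice-42"]
store.backend["invoice-42"] = (b"amount=100.01", digest)
try:
    store.get("invoice-42")
    detected = False
except IOError:
    detected = True                 # the corrupt read is caught, not returned
```

Because the checksum travels with your data (or lives in your own records), the integrity guarantee no longer depends on any one provider's firmware, RAID controller or replication logic.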
Checksumming is an approach for data reliability rather than security, but at a talk I saw at a Cloudcamp last year a group were suggesting building DB field-level encryption into your cloud application. Rather than relying on the infrastructure to protect your data via physical and logical security, or disk- or RDBMS-level encryption (as I see several vendors are touting), build it into your application and only ever store encrypted assets there. Then, even if your provider is compromised, all they hold (or leak) is already-encrypted database content – you as the end-user still retain full control of the keys and controls.
Combine this approach with data reliability methods and you have a good approach for data integrity in the cloud.
I noted with interest that Microsoft have announced some details of the Azure platform appliance, a way of running the components of their Azure cloud service in your own data centre.
From the article, it reads as though this will be based around a container/pod-type architectural unit of many servers, rather than a single hardware appliance:
“We call it an appliance because it is a turn-key cloud solution on highly standardized, preconfigured hardware. Think of it as hundreds of servers in pre-configured racks of networking, storage, and server hardware that are based on Microsoft-specified reference architecture.”
I mentioned this concept back in 2008; licensing out appliances of cloud IPR goodness (now known as PaaS/SaaS) to run on-site (see the comments on my post here) is a great way to build confidence and gain market penetration with cloud-sceptic organisations, or just to help those people that can't move their data and services into the public cloud to leverage highly scalable PaaS technologies.
Interesting times, will we see Amazon and Google start to offer EC2/AWS and AppEngine pods that you can run on-premise?
Of course, you can do this sort of thing at an IaaS level now with VMware and their vCloud partners – VMware are moving up the stack with their PaaS (SpringSource) and SaaS (Zimbra) acquisitions, and a hybrid of on- and off-premise would be easily achievable for them.
The Acadia proposition is interesting as a joint approach to delivering private cloud infrastructure – this sort of pre-packaged solution offering with good vendor support is a welcome addition to the industry, but other than tighter links to the product vendors I'm not sure what more they bring to the table over a traditional VAR.
As an aside, I worked on a very similar concept in 2008 for my current employer, although on a much smaller scale – we built a repeatable private cloud stack around a set of well-understood technologies.
I have deployed it a number of times and, working for a professional-services organisation, I have seen first-hand how this base-template approach has helped accelerate not only the pre-sales and design process but also the delivery of actual infrastructure to the end customer – particularly when building infrastructure for a new solution where current metrics and sizing information just isn't available.
You can read my original thoughts about my work on a private cloud platform here. I do, however, think the VCE coalition has some way to go yet around its software licensing before it's really workable on a true 'pay as you go' basis, rather than bundling everything up into a traditional commercial lease-purchase type agreement for hardware and software.
I have also yet to see more innovative commercial models for the procurement of the infrastructure itself – although the vBlock is designed to scale in a horizontal, modular fashion, if you need to scale down, how do you do that? The cost is "sunk" with the vendor/reseller, and I can't see them wanting to undo that traditional commercial model.
I've seen IBM start to bring some mainframe-style pay-as-you-go commercial models down into the x86 space, where they ship you a fully loaded system and you pay for the capacity you actually use. This kind of works for vendors: they don't have to pay margin to resellers and distribution if they sell direct, and the kit comes from their own factories at "cost" prices. A traditional VAR would find that this carries significant financial risk, so would usually seek to offset it via a contracted capacity and guaranteed capacity expansion.
I wonder if this could be a key selling tool of the Acadia proposition – at a guess I'd say EMC/VMware/Cisco still want to sell tin/software as a capital item, get it out of their warehouses and bank the outright sale, but they have a stake in the Acadia business; they are the shareholders.
What if the Acadia business were able to act as a financial intermediary – buying kit (hardware or software) from the VCE partners, leveraging volume and special pricing via its owners, handling logistics, and leasing infrastructure out to the end customer with professional services, rather than relying 100% on sales margin and professional/managed-services revenue?
In theory, Acadia could build a pool of customers diverse enough to weather storms in any specific market sector (financials, telco, media etc.) and keep an overall positive profit and market performance. Because the "product" is built around a standard set of components (the vBlock), managing and re-distributing inventory between customers is more feasible, as it's easier to keep "stock" of components or entire vBlocks. In this mode Acadia could act almost as a super-VAR in traditional terms, but with some more creative financial models enabled by access to better "raw" pricing (raw in the sense that there are fewer middle-men and commissions to pay).
If they were able to pull this off, I can see a significant advantage over the more traditional VARs – but do VCE risk treading on the toes of their traditional partners, distribution and resellers?