Virtualization, Cloud, Infrastructure and all that stuff in-between
My ramblings on the stuff that holds it all together
Presenting at Cloud Expo Europe 2011
I will be presenting with a fellow VMware colleague, Aidan Dalgleish, at Cloud Expo Europe 2011, which is being held in London on 2nd-3rd February.
Our session is on 2nd Feb at 11.30 – you can find the full schedule here, and there is more information about the event here. It's free if you register before 1st Feb, and you can do that here.
We will be demonstrating VMware vCloud Director and talking about hybrid-cloud use-cases, so if you're interested in seeing it in action, come along. We'll also be hanging around to answer any cloudy questions you may have.
Hope to see you there.
Silent Data Corruption in the Cloud and building in Data Integrity
I was passed a link to a very interesting article on-line about silent data corruption on very large data sets, where corruption creeps undetected into the data read and written by an application over time.
Errors are common when reading from any media and would normally be trapped by storage subsystem logic and handled lower down the stack. However, as these subsystems increase in complexity and the data they store grows vastly in scale, this could become a serious problem: bit-errors that aren't trapped by the disk/RAID subsystems get passed on, unnoticed, to the requesting application as a result of firmware bugs or faulty hardware. Typically these bugs manifest themselves in a random manner, or are hit by edge-case users with unorthodox demands.
All hardware has an error rate per transaction – in systems up until now this hasn't really been much of a practical concern, as you stand a low chance of hitting one, but as storage quantities increase into multiple TB of data that chance increases dramatically. A quick scan around my home office tallies about 16TB of online SATA storage; by the article's extrapolation, that could mean I have 48 corrupt files already.
This corruption is likely to be single-bit in nature, and maybe that doesn't matter for certain file formats – but you can't be sure: I can think of several file formats where flipping a single bit renders the file unreadable in the relevant application.
Thinking slightly wider: if you are the end-user "victim" of some undetected bit-flipping, what recourse do you have when that 1 flips to a 0 and says your life insurance policy doesn't cover the illness you've just found out you have – "computer says no"?
This isn't exclusively a "cloud problem" – it applies to any enterprise storing a significant amount of data without application-level integrity checks – but it is compounded in the cloud world, where everything revolves around centralised storage of data, applications and code: multi-tenanted, highly consolidated, and possibly de-duplicated and compressed where possible.
In a market where cost/GB is likely to be king, providers will be looking to keep storage costs low by using cheaper disk systems while making multiple copies of data for resilience (note: resilience is different from integrity). This could introduce further silent bit corruptions that are propagated across multiple instances, as well as increasing the risk of exposure to a single-bit error due to the increased number of transactions involved.
In my view, storage hardware and software already do a good job of detecting and resolving these issues, and will scale the risks/ratios with the volumes stored. But if you are building cloud applications, maybe it's time to consider a checksumming method when storing and fetching data from your cloud data stores, to be sure – that way you have a platform- (and provider-) independent method of providing data integrity for your data.
Any such checksumming will carry a performance penalty, but that's the beauty of cloud – scale on demand. Maybe PaaS providers will start to offer a web service to offload data checksumming in future?
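To make that concrete, here's a minimal sketch of the kind of application-level checksumming I have in mind, using Python's standard hashlib – the store object and its put/get calls are hypothetical stand-ins for whatever cloud storage API you actually use:

```python
import hashlib

def store_with_checksum(store, key, data):
    """Write the object plus a SHA-256 digest alongside it."""
    digest = hashlib.sha256(data).hexdigest()
    store.put(key, data)                         # hypothetical cloud storage call
    store.put(key + ".sha256", digest.encode())  # keep the checksum with the data

def fetch_with_checksum(store, key):
    """Read the object back and verify it before handing it to the app."""
    data = store.get(key)
    expected = store.get(key + ".sha256").decode()
    if hashlib.sha256(data).hexdigest() != expected:
        raise IOError("silent corruption detected for %s" % key)
    return data
```

The point is that the check travels with the data and is independent of whatever the disk, RAID or provider layers are (or aren't) doing underneath.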
Checksumming is an approach for data reliability rather than security, but at a talk I saw at a CloudCamp last year a group were suggesting building DB field-level encryption into your cloud application. Rather than relying on the infrastructure to protect your data with physical and logical security, or disk- or RDBMS-level encryption (as I see several vendors are touting), you build it into your application and only ever store encrypted assets there – then even if your provider is compromised, all they hold (or leak) is already-encrypted database content, and you as the end-user still retain full control of the keys and controls.
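As a sketch of that field-level idea (illustration only – it uses the third-party Python "cryptography" library and a made-up record, and real key management would live in your own key store, never with the provider):

```python
from cryptography.fernet import Fernet

# You generate and hold the key internally; the provider never sees it.
key = Fernet.generate_key()
cipher = Fernet(key)

record = {
    "policy_id": "12345",                               # non-sensitive, stored in the clear
    "diagnosis": cipher.encrypt(b"made-up condition"),  # sensitive field encrypted in the app
}

# Whatever the cloud database stores (or leaks) is already ciphertext...
stored = record["diagnosis"]

# ...and only your application, holding the key, can turn it back into plaintext.
plaintext = cipher.decrypt(stored)
```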
Combine this approach with data reliability methods and you have a good approach for data integrity in the cloud.
ACADIA…thoughts
Now that the VCE coalition has announced its new CEO, it has launched the acadia.com website with a blog.
The Acadia proposition is interesting as a joint approach to delivering private cloud infrastructure – this sort of pre-packaged solution offering with good vendor support is a welcome addition to the industry, but other than tighter links to the product vendors I'm not sure what more they bring to the table over a traditional VAR.
As an aside, I worked on a very similar concept in 2008 for my current employer, although on a much smaller scale – we built a repeatable private cloud stack around a set of well-understood technologies.
I have deployed it a number of times and, working in a professional services organisation, I have seen first-hand how this base-template approach has helped accelerate not only the pre-sales and design process but also the delivery of actual infrastructure to the end-customer – particularly when building infrastructure for a new solution where current metrics and sizing information just isn't available.
You can read my original thoughts about my work on a private cloud platform here – I do, however, think the VCE coalition has some way to go yet around its software licensing before it's really workable on a true 'pay as you go' basis, rather than bundling everything up into a traditional commercial lease-purchase type agreement for hardware and software.
I also have yet to see more innovative commercial models for the procurement of the infrastructure itself – although the vBlock is designed to scale out in a horizontal, modular fashion, how do you scale back down if you need to? The cost is "sunk" with the vendor/reseller, and I can't see them wanting to undo that traditional commercial model.
I've seen IBM start to bring some mainframe-style pay-as-you-go commercial models down into the x86 space, where they ship you a fully loaded system and you pay for the capacity you use. This kind of works for vendors because they don't have to pay margin to resellers and distribution if they sell direct, and the kit comes from their own factories at "cost" prices; a traditional VAR would find this carries significant financial risk, so would usually seek to offset it with a contracted capacity commitment and guaranteed capacity expansion.
I wonder if this could be a key selling tool of the ACADIA proposition – at a guess I'd say EMC/VMware/Cisco still want to sell tin/software as a capital item, get it out of their warehouses and bank the outright sale, but they have a stake in the ACADIA business; they are the shareholders.
What if the ACADIA business were able to act as a financial intermediary – buying kit (hardware or software) from the VCE partners, leveraging volume and special pricing via its owners, handling logistics and leasing infrastructure out to the end-customer with professional services – rather than relying 100% on sales margin and professional/managed services revenue?
In theory ACADIA could build a diverse enough pool of customers that it could weather storms in any specific market sector (financials, telco, media etc.) and keep overall profit and market performance positive. Because the "product" is built around a standard set of components (the vBlock), managing and re-distributing inventory between customers is more feasible, as it's easier to keep "stock" of components or entire vBlocks. In this mode ACADIA could act almost as a super-VAR in traditional terms, but with some more creative financial models enabled by access to better "raw" pricing (raw in the sense that there are fewer middle-men and commissions to pay).
If they were able to pull this off then I can see a significant advantage over more traditional VARs – but do VCE risk treading on the toes of their traditional partners, distribution and resellers?
Double-Take puts DR into the Cloud
A colleague passed me this link today: Double-Take have a new product offering that allows copies of app servers to be replicated to, and run on, Amazon's EC2 cloud service (Register article here), syncing disk writes in a delta fashion to an EC2-hosted AMI.
I suggested a similar architecture last year using PlateSpin; recent changes to EC2 allowing boot from Elastic Block Storage (i.e. persistent storage) and private networking make this a feasible solution, and as it's pay-per-use you only pay for the EC2 instance(s) while they are running (i.e. during a recovery situation).
You can read more about it here on the Double-Take site. Unfortunately their marketing department have coined another 'aaS-ism' in Recovery as a Service (RaaS), but we'll forgive them as it's a cool concept :).
There is a getting started guide here. It looks to operate on a many-to-one basis, with one EC2-hosted instance of their software receiving delta changes from the protected hosts over a VPN and writing them out to EBS volumes; if you need to recover a server, a new EC2 instance is invoked and boots from the EBS volume containing the replica of your data, presumably inserting the appropriate EC2 virtual h/w drivers into the image at boot time (essentially a P2V or V2V conversion).
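Purely to illustrate what that recovery step looks like against the EC2 API, here's a hedged sketch using the boto3 SDK – this is not how Double-Take actually implement it, and the volume ID, region and instance type are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Snapshot the EBS volume that has been receiving delta writes from the protected host.
snap = ec2.create_snapshot(VolumeId="vol-replica-placeholder",
                           Description="DR replica of app01")
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# Register a bootable image from that snapshot...
image = ec2.register_image(
    Name="app01-recovery",
    RootDeviceName="/dev/sda1",
    BlockDeviceMappings=[{"DeviceName": "/dev/sda1",
                          "Ebs": {"SnapshotId": snap["SnapshotId"]}}],
)

# ...and invoke a new EC2 instance from it to bring the server back up.
ec2.run_instances(ImageId=image["ImageId"], InstanceType="m5.large",
                  MinCount=1, MaxCount=1)
```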
My quick calculations: for a Windows 2008 server with a moderate amount of data (not factoring in any client-side de-dupe), the initial sync would transfer approx 15GB into EC2 – bandwidth charges are here and vary by region, so you can do your own figures, along with EBS storage charges – and, of course, the initial sync might take a while depending on your internet connection.
If you are a *NIX admin you are probably thinking "huh, so what? Copy the data to S3 and just start up a new AMI with the software and config you need and off you go." This solution seems targeted at Windows servers, where this sort of P2V/V2V recovery is very, very complicated due to the proprietary (i.e. non-text-file based) way Windows stores its application and system configuration in the registry.
In conclusion, they would seem to have pipped PlateSpin Protect to the post on this one – I had some good conversations with PlateSpin's CTO about this solution last year, but I have to say I've not seen significant new functionality out of the PlateSpin product range since Novell acquired it, which is a shame. Double-Take Cloud looks like an interesting solution – check it out, and being "cloud" it's easy to take it for a test drive. You would do well to consider whatever data protection laws your business is bound by, however (the curse of the cloud).
Cloud Camp London (21st Jan 2010) now open for registrations
You can register for the next CloudCamp London, on 21st Jan 2010, at this link.
If you don't know what CloudCamp is about, check out one of my previous posts – if you are available, I recommend it.
Private Connectivity to Amazon EC2 – your own Private Cloud, in the Cloud
VPN connectivity and private networking within EC2 are now available. This is great news – I mused on the possibilities of this sort of thing previously in this post.
This is a key step to gaining corporate acceptance, and proves that there is definitely still a use case and demand for a private cloud.
This new offering provides better opportunities for integrating internal systems with large-scale commodity services from providers like Amazon; extending your own address space into EC2 opens up interesting opportunities for selective offloading and "cloud-bursting" of services, as well as DR.
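Purely as an illustration of what "extending your own address space" can look like in practice, here's a rough sketch against the EC2/VPC API using the boto3 SDK – the CIDR ranges, region and gateway IP are invented:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Carve out a private network in EC2 from your own corporate address plan.
vpc = ec2.create_vpc(CidrBlock="10.20.0.0/16")
ec2.create_subnet(VpcId=vpc["Vpc"]["VpcId"], CidrBlock="10.20.1.0/24")

# Describe your on-premise VPN endpoint and build the IPSec tunnel back to it.
cgw = ec2.create_customer_gateway(Type="ipsec.1", PublicIp="203.0.113.10", BgpAsn=65000)
vgw = ec2.create_vpn_gateway(Type="ipsec.1")
ec2.attach_vpn_gateway(VpnGatewayId=vgw["VpnGateway"]["VpnGatewayId"],
                       VpcId=vpc["Vpc"]["VpcId"])
ec2.create_vpn_connection(Type="ipsec.1",
                          CustomerGatewayId=cgw["CustomerGateway"]["CustomerGatewayId"],
                          VpnGatewayId=vgw["VpnGateway"]["VpnGatewayId"])
```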
Private or shared/dedicated cloud infrastructures take the principles of public cloud computing (on-demand, pay-as-you-go, scalability) and apply them to private infrastructure (along these lines) through the adoption of virtualization technology. Some people see this as a bit of a cheat, or not "real" cloud computing… however, in the real world* they are very appealing where outsourcing to a commodity provider isn't an option due to regulatory, compliance or security issues, and they can provide extra assurance because you have the ability to "look the service provider in the eye" via a traditional business relationship, rather than dealing with an anonymous entity on the web.
I like the quote "virtualization is a technology, cloud computing is a business model", and to me that means you can apply that "cloud" business model internally or externally (chargeback/leasing/outsourcing) – it really doesn't matter; it's just how you do the sums, not the technology.
See this post from the AWS team for more details, and some analysis from the hoff here.
<flame>*I define real world as not in the land of whiteboards, workshops and architectural models, but in the non green-field land of doing business, making money and delivering service </flame>
Google opens up its DC so you can look inside.
Google are hosting a conference at the moment with a focus on energy-efficient DC design; because of their scale they have a vested interest in this sort of thing. Up until now they have been very protective of their "secret sauce", but they are now sharing their experiences with the wider community.
Key interesting points for me: Google have been using container-based DCs, with 4000 servers per container, since 2005 – pics and info here – and they are still building their own custom servers, but with built-in UPS batteries rather than relying on building-level UPS. This is interesting as it distributes the battery storage and de-centralises the impact/risk of UPS maintenance or problems; Google also say this design is actually more energy efficient.
There are some good close-up pictures of an older Google server here; posts have referred to the more recent revisions as using laptop-style PSUs, details of which I don't believe they are making public – this design is part of their competitive advantage, I guess.
Dave Ohara has a comprehensive list of links to bloggers covering the conference here, along with his own interesting posts about the information that has been shared here and here.
I believe the videos will be available on YouTube on Monday, so it will be interesting viewing – particularly seeing how Google have taken an entirely custom approach to their hardware and DC infrastructure rather than relying on off-the-shelf servers from the major vendors (Dell, HP, etc.).
On the subject of Google, I have heard rumours that the fabled GoogleOS is actually RHEL with heavy customisations for job management and distributed, autonomous control – at their scale the hardware needs to be just a utility; the "clever" bit is what their software does in managing horizontal scalability, rather than delivering high levels of raw compute power per box.
Whatever they can share with the community whilst maintaining their competitive edge can only benefit everyone – I’m sure Microsoft, Amazon and all the other cloud providers are watching closely 🙂
VMware Client Hypervisor (CVP) – Grid Application Thoughts
Today VMware announced the client hypervisor they are producing, along with a collaboration with Intel on hardware support (VT) and management (vPro); Citrix made a similar announcement last month (some analysis from the trusty Brian Madden here).
If the client-side device is now running a hypervisor, this presumably extends the same encapsulation principles from datacentre/server virtualization to the desktop, where more than one OS instance could run on a client – for example a Linux and a Windows VM side by side, sharing data or isolated for security/compliance reasons, with network traffic securely routed or encapsulated to keep it separate.
With most PC hardware, that's still a lot of computing horsepower around the estate sitting underused or idle while the user goes to lunch or does lightweight tasks.
Grid-based applications are much discussed in the banking/geophysical world, as they need to crunch vast amounts of data and are well suited to horizontal scaling. On an Internet scale, there are distributed grids like SETI or Folding@Home, crunching towards a common goal.
What if you had a centralised server that could stream down virtual appliances running such applications, and thus distributed services – isolated from the user through the hypervisor, and resource-controlled so that they process in the background, when the CPU is idle, or according to a central "resource policy"?
What if you could then sell this compute capacity back to a "grid" provider which federates and dispatches grid jobs? Of course, you can technically do this now – multi-tasking has been standard on most desktop operating systems since the late 80s – but security has always been a concern: what if that "grid" application contains malicious code, or a bug which can leak data from your machine or the corporate network? This problem hasn't really been solved to date; Java etc. provide sandboxes, but they depend on a lot of components from the core OS stack and don't address network isolation.
Now you have an option to provide a high level of instance and network isolation between business systems and grid/public applications by using a client hypervisor – much in the same way that VMware ESX is the foundation for a multi-tenant cloud through vSwitches, Private VLANs and so on.
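To make that a little more concrete – and this is only a toy sketch, with an invented dispatcher URL and job format – the appliance side of such a grid agent, running isolated in its own VM, could be as simple as polling a central service and only doing work when the local CPU is otherwise idle:

```python
import time
import psutil     # third-party: CPU utilisation
import requests   # third-party: talk to the (hypothetical) dispatcher

DISPATCHER = "https://grid.example.internal/jobs"   # invented endpoint
IDLE_THRESHOLD = 20                                 # percent CPU

def run_job(job):
    # Whatever the grid workload actually is: number-crunching, rendering, etc.
    print("processing job", job.get("id"))

while True:
    if psutil.cpu_percent(interval=5) < IDLE_THRESHOLD:
        # Host is idle - ask the dispatcher for a unit of work.
        response = requests.get(DISPATCHER, timeout=10)
        if response.ok:
            run_job(response.json())
    else:
        time.sleep(60)   # user is busy - back off and stay out of the way
```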
Take that idea to the next level: what if you could distribute your server workload around your desktop estate rather than maintaining a large central compute facility?
High availability through something like VMware FT and DRS/HA makes features of the underlying hardware such as RAID and redundant power supplies less of a focus point – arguably you are providing high availability at the hypervisor/software level rather than through big iron.
You could also do something like provide a peer-to-peer file system leveraging local storage on the device to give local LAN access to cached files – the hypervisor isolates the virtual appliance from the end-user, dividing administrative access to systems and services.
There is a lot of capacity in this "desktop cloud"… and maybe some smart ways to use it. Conventional IT thinking says this is a bit wacky, but I definitely think there is something in it… thoughts?
Could Skynet be a Cloud Application, and Should I be Scared?
Has the cloud been sent from the future to kill you?
It's Friday… so time for something completely different. SmugMug have already built Skynet on EC2 (here) – it decided it wanted more power and made a semi-autonomous decision to scale itself out to mammoth proportions. If you weren't as diligent as they are and didn't pay close attention, maybe your EC2 bills would bankrupt you by the time you saw the invoice (assuming no credit-control limit)… then you'd be out on the street, maybe lose your job, etc.
Or what if your EC2 instances picked up some kind of EC2-aware malware and suddenly became a botnet, harvesting people's credit card details to open up new EC2 accounts and spawn more parallel instances of itself, or spreading to other cloud providers, or opening up online loans, credit cards, gambling accounts, trade accounts and share-dealing accounts – which in turn bankrupted other people? What if it made a coordinated (and maliciously intended), distributed online run on a particular stock, sparking panic buying, which in turn caused credit crunch 2.0 and brought about the end of humanity? Oh, wait… that's going on now… maybe we know what caused it 🙂
What then if EC2 did provide IP connectivity back to your own networks and it started stealing and disseminating your internal commercial data (or entire virtual servers)? What if you ignore all that security best-practice stuff, start plugging your office HVAC system into the LAN (lots of that going on these days), and it decides it should brute-force access into, or DoS, your building UPS, resulting in overloads and fires?
Maybe virtualization is that chip they found, and VMware are really Cyberdyne systems?
Ok, a bit off the wall, but this thought came to me on the train home today… I've had a nasty dose of the flu, so maybe that paracetamol was a bit stronger than it said on the box 🙂
Best to remember those firewalls, sandboxes and policies are there for a reason… and people's natural impatience to embrace new things can always compromise that, especially in today's world of instant/on-demand gratification – why do I have to wait 7 days to sign my paper(!) credit card application form? Those checkpoints are there for a reason. The same security principles that apply to the physical world also apply to the cloud and virtualization – just because you can do something doesn't mean it's the right thing to do; you need to assess risk and mitigate accordingly.*
Normal service will be resumed shortly..
*Although I would expect a few eyebrows to be raised if your corporate risk register contained an essay on how to mitigate against a horde of cyborgs controlled by your HR department trying to exterminate you (oh, wait… :))
Workload Portability: Ultimate Cloud Edition
I like the PlateSpin range of products a lot – it really does let you take an OS instance + app stack (a workload) and move it between different physical machines, hypervisors etc. in a low-impact way; if you've not come across it before, read this post for more info. I see this portability as one of the key infrastructure components if you are looking to build or manage your own internal cloud infrastructure.
This isn't possible at present, but put your architect hat on and imagine if you could plug the PlateSpin Migrate (previously known as PlateSpin PowerConvert) tool into Amazon's EC2 cloud, or a VMware vCloud-based farm – then you could do whatever you like with your Windows and Linux servers.
By design, AWS and vCloud are both supposed to be automatable, with web services and APIs to control machine provisioning etc. EC2 seems to have all of this now (API docs and example) and vCloud is coming along (more real details at VMworld, I'm guessing).
Moving services between on- and off-premise cloud infrastructures is a key concept of vCloud, but I'm guessing this will only work between vCloud-based infrastructures. What if you wanted to take advantage of the capacity and scale/commodity pricing of big providers like EC2 (which is Xen-based under the hood) to offload some of your internal services? To my mind, there are a couple of scenarios here that PlateSpin could fulfil:
- Disaster Recovery – using the cloud (EC2 or other) for pay-per-use DR capacity; use PlateSpin Protect to sync your machine images off to Amazon S3 and have a "panic button" that converts the S3-hosted images into running AMIs (see the rough sketch after this list). Brent has a similar idea here around SQL; my proposition takes this to the next level and does it from the OS up. If you did have to fail over to the EC2-hosted DR cloud, you could use the same tooling to go back to physical hardware once you've repaired/rebuilt your internal infrastructure.
- Data centre moves or serious maintenance – use a cloud like EC2 as "swing" capacity to run services whilst you pick up your DC hardware and move it somewhere else (rather than doing a kit refresh).
- Test & Development; the ability to sandbox new apps in EC2 could be attractive to some organisations where corporate policies hinder or prevent this type of innovation taking place in-house; What if you could do this externally then just bring the machine instances back in-house to put into internal production use (I’ve seen this happening at several customers) – of course IT security teams would probably not be to happy about it.
- Short-term expansion capacity – if you experience an occasional surge of demand or load for an internal service. For example, if you have an internal application that you know will get hammered by a promotion or project, you could clone instances of the relevant web/application servers off to EC2 and use some kind of very clever load-balancing tech to selectively hand off load to the EC2-hosted instances when the internal servers start getting saturated – or vice-versa.
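Coming back to the DR scenario above: the "panic button" could, in principle, be little more than a script that walks a list of pre-converted AMIs and brings them up. Below is a hedged sketch using the boto3 SDK – the AMI IDs, region and instance type are placeholders, and the hard part (turning the synced images into bootable AMIs) is exactly what I'd want PlateSpin to do for me:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# AMIs previously converted from the machine images synced off to S3 (placeholders).
DR_IMAGES = {
    "web01": "ami-00000000000000001",
    "app01": "ami-00000000000000002",
    "db01":  "ami-00000000000000003",
}

def panic_button():
    """Bring up the whole DR estate in EC2, one instance per protected server."""
    for name, ami in DR_IMAGES.items():
        resp = ec2.run_instances(
            ImageId=ami, InstanceType="m5.large", MinCount=1, MaxCount=1,
            TagSpecifications=[{"ResourceType": "instance",
                                "Tags": [{"Key": "Name", "Value": name + "-dr"}]}],
        )
        print(name, "->", resp["Instances"][0]["InstanceId"])

if __name__ == "__main__":
    panic_button()
```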
Maybe PlateSpin could even position their product as a web service itself, with downloadable agents – a connector/conversion hub between clouds. Now that's an interesting proposition.
Hopefully this diagram explains some of this idea visually
Issues at present:
- PlateSpin doesn’t have an interface to EC2 (consider this my feature request :))
- There is no secure connectivity back to corp HQ – this is something that, as far as I can see, AWS has an issue with. Out of the box there is no way to have, say, an IPSec VPN or a dedicated private subnet managed and provided by EC2, and complicated networking scenarios don't seem to be possible. You could build your own using software-based routers and firewalls on EC2-hosted server instances, but that is host-based; it would be good if EC2 added this sort of service to the platform in future – that would definitely be a killer feature as far as I'm concerned (AWS team, consider this my feature request :))
- VM persistence is something of an issue with EC2, and I don't think the EC2 model currently deals with it. With EC2 you pay whilst an instance is running; if you terminate it (i.e. switch it off) it's gone – the data (and that includes OS/app configurations) you built into the instance is lost. There is no way to archive/suspend/freeze an instance to S3 and "spin it up" as required – I'm guessing this would be feasible for Amazon to build into EC2/S3 (you pay per GB stored on S3, so there is a cost model for it), and again this would be a killer feature for me. There are obviously ways to make your instances "vanilla" and have them auto-install the relevant code and data when they are created (examples here and here, plus a rough sketch after this list), but that takes a lot of work and isn't so simple for most corporate-type apps.
- You can attach an EBS (Elastic Block Storage) volume to an instance; this is persisted (as long as you keep paying for it) and you can mount it on a single host as a block disk device – but the issue remains that the actual OS instance is not persisted. If it's a Windows OS this is a particular problem, as the config is all held in the registry etc., which is part of the OS itself.
- None of this gets you past the concerns/issues over data ownership and cloud security; there is no magic bullet in that respect, just risk management/mitigation.
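On the "vanilla instance" workaround mentioned in the persistence point above: it usually boils down to handing the instance a bootstrap script as EC2 user-data at launch, so state is rebuilt from S3 each time rather than persisted in the instance. A rough sketch (boto3 syntax; the AMI ID, bucket name and package are invented, and it assumes the AWS CLI and suitable credentials are available on the instance):

```python
import boto3

# Bootstrap script handed to the instance at launch; it rebuilds state on every boot.
BOOTSTRAP = """#!/bin/bash
yum install -y httpd                               # reinstall the app stack
aws s3 sync s3://example-app-config /etc/myapp/    # pull config/data back from S3
service httpd start
"""

ec2 = boto3.client("ec2", region_name="eu-west-1")
ec2.run_instances(ImageId="ami-00000000000000000",  # a plain "vanilla" base AMI
                  InstanceType="t3.micro",
                  MinCount=1, MaxCount=1,
                  UserData=BOOTSTRAP)
```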
Anyway, just an idea – feel free to comment and give me your feedback.