My ramblings on the stuff that holds it all together
Silent Data Corruption in the Cloud and building in Data Integrity
I was passed a link to a very interesting article on-line about silent data corruption on very large data sets, where corruption creeps undetected into the data read and written by an application over time.
Errors are common in reading from all media and this would normally be trapped by storage subsystem logic and handled lower down the stack but as these increase in complexity and the data they store vastly increases in scale this could become a serious problem as there could be bit-errors not being trapped by disk/RAID subsystems that are passed on unknown to the requesting application as a result of firmware bugs or faulty hardware – typically these bugs manifest themselves in a random manner or by edge-case users with unorthodox demands.
All hardware has a error/transaction rate – in systems up until now this hasn’t really been too much of a practical concern as you run a low chance of hitting one, but – as storage quantities increase into multiple Tb of data this chance increases dramatically. A quick scan round my home office tallys about 16Tb of on-line SATA storage, by the article’s extrapolation on numbers this could mean I have 48 corrupt files already.
This corruption is likely to be single-bit in nature and maybe it’s not important for certain file formats – but you can’t be sure, I can think of several file formats where flipping a single bit renders them unreadable in the relevant application.
Thinking slightly wider, if you are the end-user “victim” of some undetected bit-flipping what recourse do you have when that 1 flips to a 0 to say your life insurance policy doesn’t cover that illness you have just found you have – “computer says no”?
This isn’t exclusively a “cloud problem” it applies to any enterprise storing a significant amount of data without any application level logic checks, but it is compounded in the cloud world where it’s all about a centralised storage of data, applications and code, multi-tenanted and highly consolidated, possibly de-duplicated and compressed where possible.
In a market where cost/Gb is likely to be king providers will be looking to keep storage costs low, using cheap-er disk systems – but making multiple copies of data for resilience (note, resilience is different from integrity) – this could introduce further silent bit corruptions that are propagated across multiple instances as well as increasing the risk of exposure to a single-bit error due to the increased number of transactions involved.
In my view, storage hardware and software already does a good job of detecting and resolving these issues and will scale the risks/ratios with volumes stored. But, if you are building cloud applications maybe it’s time to consider a check summing method when storing/fetching data from your cloud data stores to be sure – that way you have a platform (and provider)-independent method of providing data integrity for your data.
Any such check summing will carry a performance penalty, but that’s the beauty of cloud – scale on demand, maybe PaaS providers will start to offer a web-service to offload data check summing in future?
Check summing is an approach for data reliability, rather than security but at a talk I saw at a Cloudcamp last year; a group were suggesting building DB field-level encryption into your cloud application, rather than relying on infrastructure to protect your data by physical and logical security or disk or RDBMS-level encryption (as I see several vendors are touting) build it into your application and only ever store encrypted assets there – then even if your provider is compromised all they hold (or leak) is already encrypted database contents – you as the end-user still retain full control of the keys and controls.
Combine this approach with data reliability methods and you have a good approach for data integrity in the cloud.