My ramblings on the stuff that holds it all together
Category Archives: vSphere
Whilst we all await the “official” vSphere administration app for the iPad, as previewed at VMworld, I found myself needing something to control my home vSphere lab environment from my shiny new iPad.
The iPad has now integrated itself as the device of choice with my wife & kids and is in regular use as a web-browser and media-player in the living room at home rather than laptops, so this seemed like a logical extension.
A quick browse of the iTunes store turned up iDatacenter. Whilst not cheap at 8.99 GBP, it works well in my testing as a basic administration interface to my lab and allows me to reboot guests/hosts as well as kick off vMotion and storage vMotion tasks.
It doesn’t offer a remote console or any historical performance graphing but it is good for basic administration tasks and looking at current statistics like CPU, memory and disk space – which is handy as my home lab currently has 21 ESX hosts and 54 “production” virtual machines.
The following photo shows a quick view of the interface. My only minor gripes are that it doesn’t seem to recognise clusters as a management object (just individual ESX hosts or virtual machines) and that it can be a little bit slow at times, but those aside it’s worth checking out if you have this sort of requirement.
The application home-page is here http://nym.se/idatacenter/ and there is a video demonstrating the key features.
I love VMware Workstation. I have used it since about 1999, when I was first introduced to virtualization, and it totally revolutionised the way I did my home and work lab study and later production systems.
Since then every version has introduced new features that seemed to be back-ported into the server products. As I understand it, the record and replay features underpinned the code that became VMware Fault-Tolerance, and the same goes for linked-clones and thin-provisioning in vSphere.
There has been integration with developer environments for better debugging and I guess a lot of Workstation has gone into the Fusion product for the Mac, but it did get me thinking: where next for Workstation, beyond the usual performance tweaks that seem to get made in every version?
What I think would be great, and it ties into my previous post on vendor hardware emulators, is a pluggable hardware abstraction layer (HAL)/driver architecture for VMware Workstation.
Workstation does a brilliant job at virtualizing x86/64 hardware and to-date that has been its primary task but I wonder if it could be expanded into a more modular architecture product to support wider development and use of other hardware platforms on x86/64.
There are many emulators available out there for developers for mobile phone chipsets, custom ASICs etc. but these are often hard to configure for the end-user and are very bespoke to the devices they are developed for.
With the amount of spare horsepower and low price-point available to commodity x86/64 hardware all it needs is a common virtualization/emulation product to unify it together and you have a very powerful product with a huge market, not only for developers but for operations people who no longer need a huge lab of bespoke hardware, mobile phones and devices to support end-users – it’s all available in a virtual machine.
Thinking slightly wider in scope, if it were also back-ported and integrated into future versions of the vSphere product line you would have a very powerful back-end server product. VMware talk of the software mainframe; this is bringing what some mainframes currently do for virtualizing x86 servers, but making it a MUCH wider application.
Whilst the initial pay-off would be with developer licenses rather than enterprise/large scale licensing agreements, with Hyper-V and Xen rapidly catching up on the hypervisor front VMware need something cutting-edge to keep them ahead of the game. Consider the enterprise implications of this:
Lots of customers run workloads on SPARC hardware/OS, and porting to x86 Solaris isn’t simple. Is the cost/performance benefit still there for SPARC customers in the world of cheap and fast x64 hardware? Emulating/virtualizing SPARC CPU workloads onto x64 could be a big draw for Sun customers. With the Oracle acquisition, and VMware targeting vSphere at large scale Oracle customers, this could prove easier than porting legacy apps from SPARC, in the same way virtualization has revolutionised the x86 server space.
Or an ESX cluster running a mix of x86/64, SPARC, ARM, iPhone, Set-Top Box, AS/400 virtual machine workloads – either as a test and dev, support or even production solution.
Sure, emulation has an overhead – so does virtualization but x86/64 hardware is cheap and off-the-shelf, add in a distributed ESX processing cluster (my thoughts on that here) and you could probably build something with equivalent or even better performance for less.
Interesting concept (to me anyway)… thoughts?
Now that VMware are moving away from ESX classic (with service console) to the ESXi model, I have experienced a couple of issues recently that got me wondering if NFS will be a more appropriate model for VM storage going forward. In recent versions of ESX (3.5 and 4) NFS has moved away from just being recommended for .ISO/template storage and has some big names behind it for production VM storage.
I’m far from a storage expert, but I know enough to be dangerous… feel free to comment if you see it differently.
“out of band” Access speed
Because VMFS is a proprietary block-storage file system you are only able to access it via an ESX host; you can’t easily go direct (VCB… maybe, but it’s not easy). In the past this hasn’t been too much of an issue. However, whilst building a new ESXi lab environment on standard hardware I found excessive transfer times using the Datastore browser in the VI Client: 45mins+ to copy a 1.8GB .ISO file to a VMFS datastore, or to import virtual machines and appliances. Even using Veeam FastSCP didn’t make a significant difference.
I spent ages checking out network/duplex issues but in desperation I tried it against ESX classic (based on this blog post I found) installed on the same host and that transfer time was less than 1/2 (22mins) – which still wasn’t brilliant – but I cranked out Veeam FastSCP and did it in 6mins!
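To put those timings into perspective, here is a quick back-of-the-envelope calculation of the effective transfer rates. It is only a sketch: it assumes the ISO is exactly 1.8GB, that 1GB is 1024MB, and that the quoted times are exact (the first one was really "45mins+"). The function name and labels are mine, purely for illustration.

```python
# Rough effective throughput for the three transfers described above.
# Assumptions: 1.8 GB file, 1 GB = 1024 MB, quoted durations taken at face value.
ISO_SIZE_MB = 1.8 * 1024


def throughput_mb_per_s(size_mb: float, minutes: float) -> float:
    """Average transfer rate in MB/s for a copy that took `minutes`."""
    return size_mb / (minutes * 60)


transfers = {
    "ESXi, datastore browser": 45,
    "ESX classic, datastore browser": 22,
    "ESX classic, Veeam FastSCP": 6,
}

for label, minutes in transfers.items():
    rate = throughput_mb_per_s(ISO_SIZE_MB, minutes)
    print(f"{label}: {rate:.2f} MB/s")
```

That works out at roughly 0.7 MB/s, 1.4 MB/s and 5.1 MB/s respectively, all of which are a long way short of what even a 100Mb/s network link can carry.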
So, lesson learnt? Relying on the VI client/native interfaces to transfer large .ISO files or VMs into datastores is slow because you have to go via the hypervisor layer, which oddly doesn’t seem optimized for this sort of access. Veeam FastSCP fixes most of this, but only on ESX classic, as it has some service-console cleverness that just isn’t possible on ESXi.
With ESX classic going away in favour of ESXi, there will need to be an alternative for out of band access to datastores – either direct access or an improved network stack for datastore browsing
This is important where you manage standalone ESX hosts (SME), or want to perform a lot of P2V operations as all of those transfers use this method.
In the case of NFS, given appropriate permissions you can go direct to the volume holding the VMs using a standard network protocol which is entirely outside of ESX/vCenter. Upload/download transfers thus run at the speed of the data mover or server hosting the NFS mount point, so are not constrained by ESX.
To me, Fibre Channel was always more desirable for VM storage as it offered lossless bandwidth up to 4Gb/s (now 8Gb/s), but Ethernet (which is obviously required to serve NFS) now has 10Gb/s bandwidth and loss-less technology like FCoE. Some materials put NFS about 10% slower than VMFS; considering the vast cost difference between dedicated FC hardware and commodity Ethernet/NAS storage, I think that’s a pretty marginal difference when you factor in the simplicity of managing NFS vs. FC (VLANs, IPs vs. Zoning, Masking etc.).
FCoE maybe redresses the balance and provides the best solution to performance and complexity, but doesn’t really address the out of band access issue I’ve mentioned here as it’s a block-storage protocol.
If you have a problem with your vCenter/ESX installation you are essentially locked out of access to the virtual machines; it’s not easy to just mount up the VMFS volume on a host with a different operating system and pull out/recover the raw virtual machines.
With NFS you have more options in this situation, particularly in small environments.
Storage Host Based Replication
For smaller environments SAN-SAN replication is expensive, and using NFS presents some interesting options for data replication across multiple storage hosts using software solutions.
I’d love to hear your thoughts..
Following on from my recent blog posts about the various ways to configure ML115 G5 servers to run ESX, I thought I would do some further experimenting on some older hardware that I have.
I have a Dell D620 laptop with a dual-core CPU and 4GB of RAM which is no longer my day-to-day machine. Because of the success I had with SSD drives I installed a 64GB SSD in this machine.
I followed these instructions to install ESXi 4 Update 1 to a USB Lego brick flash drive (freebie from EMC a while ago and plays nicely to my Legogeekdom). I can then boot my laptop from this USB flash drive to run ESXi.
I am surprised to say it worked 1st time, booted fully and even supports the on-board NIC!
So, there you go – another low-cost ESXi server for your home lab that even comes with its own hot-swappable built-in battery UPS 🙂
The on-board SATA disk controller was also detected out of the box
A quick look on eBay and D620s are going for about £250, handy!
Here is a screenshot of the laptop running a nested copy of ESXi. Interestingly, I also told the VM it had 8GB of RAM when the laptop only has 4GB of physical RAM.
If you need to install vCenter 4 on Windows Server 2008 and want to be able to customize VMs running guest operating systems older than Windows 2008/Vista (i.e. Windows XP, 2003, 2000), you need to place the extracted deploy.cab files in a different location than you used with Windows 2003 (C:\documents and settings\all users … etc.) so that vCenter has access to the sysprep.exe files.
On Windows 2008 this location is now in C:\ProgramData\VMware\VMware VirtualCenter\sysprep
You can then extract the deploy.cab file to the appropriate folder and use the customization specification functionality (like this ESX 3.5 example)
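As a rough illustration of where each set of files ends up, here is a small sketch that builds the target folder for a given guest OS. The base path is the one given above; the per-OS subfolder names are an assumption based on what I recall of a default vCenter install, so check the folder on your own server before relying on them. The function and dictionary names are made up for this example.

```python
from pathlib import PureWindowsPath

# Base folder on Windows Server 2008, as given in the post.
SYSPREP_ROOT = PureWindowsPath(r"C:\ProgramData\VMware\VMware VirtualCenter\sysprep")

# Assumption: per-OS subfolder names as found in a default vCenter install.
# Verify these against the folders that actually exist on your server.
SUBFOLDERS = {
    "Windows 2000": "2k",
    "Windows XP": "xp",
    "Windows XP x64": "xp-64",
    "Windows Server 2003": "svr2003",
    "Windows Server 2003 x64": "svr2003-64",
}


def sysprep_target(guest_os: str) -> PureWindowsPath:
    """Return the folder where deploy.cab for `guest_os` should be extracted."""
    try:
        return SYSPREP_ROOT / SUBFOLDERS[guest_os]
    except KeyError:
        raise ValueError(f"no sysprep folder known for {guest_os!r}") from None


print(sysprep_target("Windows XP"))
```

Windows 2008/Vista and later guests are deliberately absent from the table because they ship with sysprep built in.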
There is a handy reference with download links for all versions here.
Note – as I posted previously you don’t need to worry about this if you are solely deploying Windows 2008/Vista and later VMs, as they have sysprep.exe built into the default OS build.
In my lab I have a virtualized vCenter installation, it works well and I’ve had no problems with this configuration in the last year.
I wanted to try to build a 2 node demo cluster for my VMUG session and needed vCenter to be protected by FT – so an individual host failure would not break vCenter during my demos.
My vCenter installation was thin-provisioned which isn’t compatible with FT so the quickest solution I found to this was to just clone it to a new VM with a fully provisioned (thick) disk.
Once completed I powered up the cloned vCenter installation whilst quickly switching off the old one to avoid any IP conflicts. This worked fine and the ESX hosts didn’t really notice; I just had to re-connect my vSphere client.
I then enabled the FT features and after doing its thing I have a fully protected ESX/vCenter installation using FT.
It’s worth noting that you can only enable FT when using a vSphere client connected to vCenter – you can’t enable it if you connect directly to the ESX host itself (which is why cloning was the easiest approach for me).
When trying to browse the performance overview tab in the vSphere client you may get this error;
“This program cannot display the webpage”
However, the advanced tab works ok and you can still build custom charts.
Luckily, this is pretty simple to fix. The cause of this problem is that the VMware VirtualCenter Management Webservices service is not running.
The VI client breaks out to an internal webservice to deliver the graphs on the performance overview page.
To fix this problem you can start the service manually.
I have seen this problem on virtualised VirtualCenter installations where the VC box cannot reach its back-end SQL server at start-up, either because of a network problem or delayed/out of sequence start-up.
You can set the service recovery options to try and work around this if you cannot fix the root cause.
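If you would rather script this than click through the Services console, the sketch below builds the two Windows commands involved. `net start` and `sc failure` are standard Windows tools; the short service name passed to `sc failure` is something you need to look up yourself (the Services console shows it as "Service name" on the General tab), so treat `key_name` as a placeholder. The 60 second restart delay and one day reset period are arbitrary values chosen for illustration.

```python
# Display name of the service that delivers the performance overview charts.
DISPLAY_NAME = "VMware VirtualCenter Management Webservices"


def start_command(display_name: str = DISPLAY_NAME) -> list:
    """Argument list to start the service; `net start` accepts a display name."""
    return ["net", "start", display_name]


def recovery_command(key_name: str, delay_ms: int = 60000) -> list:
    """Argument list to restart the service after each of the first three failures.

    key_name: the short service name (an assumption you must supply yourself).
    The failure counter is reset after 86400 seconds (one day).
    """
    actions = "/".join(["restart", str(delay_ms)] * 3)
    return ["sc", "failure", key_name, "reset=", "86400", "actions=", actions]


print(start_command())
print(recovery_command("YOUR_SERVICE_NAME"))
```

Each list can be handed to `subprocess.run(...)` from an elevated prompt on the vCenter server. This only papers over the start-up ordering problem; fixing the SQL connectivity is still the better answer.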
Once it’s working again you get all the following charty goodness again
If you are a VCP3 you’ll need to get a move on and upgrade your certification to VCP4 unless you have time to sit (and pay for) some classroom training next year – you need to have passed the exam before December 31st 2009 (i.e in 43 days time!)
Also bear in mind there might be a bit of a rush. Anyone else remember the NT4 MCSE –> one-shot Windows 2000 upgrade exam? There will be a lot of people in the same boat as you (and me!) and time is running out. This is especially a problem if you only have access to a limited number of testing centres where you live, as they will be getting booked up.
As some insurance VMware are also offering a free re-take at present; but there is a catch – you have to wait for a voucher to be emailed to you before you can book your exam with free re-take option – and it says the Friday following your registration – so bear this delay in mind if you want to take this option.
If you are a VCP3 you should have received an email from VMware with a link to register for
For the official word on what you need to do – go here – you’ll also need your VMware myLearn username and password (which is recoverable from the site if you’ve forgotten it)
You can register for the 2nd shot option here
It seems a bit odd, but you need to register for this “virtual class” to be issued the voucher (screen cap of successful registration below..)
I am now waiting for my voucher via email* so I can register for my VCP exam with free re-take option – check this post for links to my study materials and another plug for Simon Long’s excellent resources here, oh and I need some time to take the exam as well 🙂
*You may want to check the email address registered with your myLearn account is valid/correct
I encountered this situation in my home lab recently – to be honest I’m not exactly sure of the cause yet, but I think it was because of some excessive I/O from the large number of virtualized vSphere hosts and FT instances I have been using mixed with some scheduled storage vMotion – over the weekend all of my virtual machines seem to have died and crashed or become unresponsive.
Firstly, to be clear this is a lab setup; using a cheap/home PC type SATA disk and equipment not your typical production cluster so it’s already working pretty hard (and doing quite well, most of the time too)
The hosts could ping the Openfiler via the vmkernel interface using vmkping, so I knew there wasn’t an IP/VLAN problem, but access to the LUNs was very slow or intermittent – directory listings would be very slow, time out and eventually became non-responsive.
I couldn’t power off or restart VMs via the VI client, and starting them was very slow/unresponsive and eventually failed, I tried rebooting the vSphere 4 hosts, as well as the OpenFiler PC that runs the storage but that didn’t resolve the problem either.
At some point during this troubleshooting the 1TB iSCSI LUN I store my VMs on disappeared totally from the vSphere hosts and no amount of rescanning HBA’s would bring it back.
The path/LUN was visible down the iSCSI HBA, but the datastore was missing from the storage tab of the VI client.
Visible down the iSCSI path..
But the VMFS volume it contains is missing from the list of data stores
This is a command line representation of the same thing from the /vmfs/devices/disks directory.
OpenFiler and its LVM tools didn’t seem to report any disk/iSCSI problems and my thoughts turned to some kind of logical VMFS corruption, which reminded me of that long standing but never completed task to install some kind of VMFS backup utility!
At this point I powered down all of the ESX hosts, except one to eliminate any complications and set about researching VMFS repair/recovery tools.
I checked the VMKernel log file (/var/log/vmkernel) and found the following
[root@ml110-2 /]# tail /var/log/vmkernel
Oct 26 17:31:56 ml110-2 vmkernel: 0:00:06:48.323 cpu0:4096)VMNIX: VmkDev: 2249: Added SCSI device vml0:3:0 (t10.F405E46494C454009653D4361323D294E41744D217146765)
Oct 26 17:31:57 ml110-2 vmkernel: 0:00:06:49.244 cpu1:4097)NMP: nmp_CompleteCommandForPath: Command 0x12 (0x410004168500) to NMP device "mpx.vmhba0:C0:T0:L0" failed on physical path "vmhba0:C0:T0:L0" H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Oct 26 17:31:57 ml110-2 vmkernel: 0:00:06:49.244 cpu1:4097)ScsiDeviceIO: 747: Command 0x12 to device "mpx.vmhba0:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
Oct 26 17:32:00 ml110-2 vmkernel: 0:00:06:51.750 cpu0:4103)ScsiCore: 1179: Sync CR at 64
Oct 26 17:32:01 ml110-2 vmkernel: 0:00:06:52.702 cpu0:4103)ScsiCore: 1179: Sync CR at 48
Oct 26 17:32:02 ml110-2 vmkernel: 0:00:06:53.702 cpu0:4103)ScsiCore: 1179: Sync CR at 32
Oct 26 17:32:03 ml110-2 vmkernel: 0:00:06:54.690 cpu0:4103)ScsiCore: 1179: Sync CR at 16
Oct 26 17:32:04 ml110-2 vmkernel: 0:00:06:55.700 cpu0:4103)WARNING: ScsiDeviceIO: 1374: I/O failed due to too many reservation conflicts. t10.F405E46494C454009653D4361323D294E41744D217146765 (920 0 3)
Oct 26 17:32:04 ml110-2 vmkernel: 0:00:06:55.700 cpu0:4103)ScsiDeviceIO: 2348: Could not execute READ CAPACITY for Device "t10.F405E46494C454009653D4361323D294E41744D217146765" from Plugin "NMP" due to SCSI reservation. Using default values.
Oct 26 17:32:04 ml110-2 vmkernel: 0:00:06:55.881 cpu1:4103)FSS: 3647: No FS driver claimed device '4a531c32-1d468864-4515-0019bbcbc9ac': Not supported
The key line is the “I/O failed due to too many reservation conflicts” warning, so hopefully it wasn’t looking like corruption but a locked-out disk. A quick Google turned up this KB article, which reminded me that SATA disks can only do so much 🙂
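If you have a lot of log to wade through, something like the following can pull out the affected devices for you. It is a minimal sketch that only recognises the exact warning wording shown in the excerpt above; the function name is my own and other ESX versions may word the message differently.

```python
import re

# Matches the WARNING line from the log excerpt and captures the device ID
# that follows the message text.
CONFLICT_RE = re.compile(
    r"I/O failed due to too many reservation conflicts\.\s+(?P<device>\S+)"
)


def devices_with_reservation_conflicts(log_lines):
    """Return device IDs named in reservation-conflict warnings.

    IDs are returned in order of first appearance, without duplicates.
    """
    seen = []
    for line in log_lines:
        match = CONFLICT_RE.search(line)
        if match and match.group("device") not in seen:
            seen.append(match.group("device"))
    return seen


# Two lines taken from the excerpt above.
sample = [
    "Oct 26 17:32:03 ml110-2 vmkernel: 0:00:06:54.690 cpu0:4103)ScsiCore: 1179: Sync CR at 16",
    "Oct 26 17:32:04 ml110-2 vmkernel: 0:00:06:55.700 cpu0:4103)WARNING: ScsiDeviceIO: 1374: "
    "I/O failed due to too many reservation conflicts. "
    "t10.F405E46494C454009653D4361323D294E41744D217146765 (920 0 3)",
]

print(devices_with_reservation_conflicts(sample))
```

In practice you would copy /var/log/vmkernel off the host and pass the open file object to the function instead of the sample list. The device ID it returns is the one you need for the reset step.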
Multiple reboots of hosts and the OpenFiler hadn’t cleared this situation – so I had to use vmkfstools to reset the locks and get my LUN back, these are the steps I took..
You need to find the disk ID to pass to the vmkfstools -L targetreset command. To do this from the command line, look under /vmfs/devices/disks (top screenshot below).
You should be able to identify which one you want by matching up the disk identifier.
Then pass this identifier to the vmkfstools command as follows (your own disk identifier will be different) – hint: use cut & paste or tab-completion to put the disk identifier in.
vmkfstools -L targetreset /vmfs/devices/disks/t10.F405E46494C4540096(…)
You will then need to rescan the relevant HBA using the esxcfg-rescan command (in this instance the LUN is presented down the iSCSI HBA – which is vmhba34 in vSphere)
(you can also do this part via the vSphere client)
If you now look under /vmfs/volumes the VMFS volume should be back online; alternatively, do a refresh in the vSphere client storage pane.
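To recap the two commands in one place, here is a small sketch that assembles them from a disk identifier and an HBA name. It only builds the command strings and does not run anything. The identifier in the example is a deliberate placeholder and the function name is hypothetical; vmhba34 is simply the adapter from my setup.

```python
import shlex

DISKS_DIR = "/vmfs/devices/disks"


def recovery_commands(disk_id: str, hba: str) -> list:
    """Build the service-console commands for the reset and rescan steps.

    disk_id: device name exactly as listed under /vmfs/devices/disks
    hba:     adapter the LUN is presented on (vmhba34 in the post)
    """
    device = f"{DISKS_DIR}/{disk_id}"
    return [
        f"vmkfstools -L targetreset {shlex.quote(device)}",
        f"esxcfg-rescan {shlex.quote(hba)}",
    ]


# "t10.EXAMPLE" is a placeholder; substitute the identifier from your own host.
for cmd in recovery_commands("t10.EXAMPLE", "vmhba34"):
    print(cmd)
```

Quoting the arguments is belt and braces here, but it avoids surprises if an identifier ever contains characters the shell would interpret.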
All was now resolved and virtual machines started to change from (inaccessible) in the VM inventory back to the correct VM names.
One other complication was that my DC, DNS, SQL and vCenter server are all VMs on this platform and residing on that same LUN. So you can imagine the havoc that causes when none of them can run because the storage has disappeared; in this case it’s worth remembering that you can point the vSphere client directly at an ESX node, not just vCenter and start/stop VMs from there – to do this just put the hostname or IP address when you logon rather than the vCenter address (and remember the root password for your boxes!) – if you had DRS enabled it does mean you’ll have to go hunting for where the VM was running when it died.
In conclusion I guess there was gradual degradation of access as all the hosts fought with a single SATA disk and increased I/O traffic until the point all my troubleshooting/restarting of VMs overwhelmed what it could do. I might need to reconsider how many VMs I run from a single SATA disk as I’m probably pushing it too far – remember kids this is a lab/home setup; not production, so I can get away with it 🙂
In my case it was an inconvenience that it took the volume offline and prevented further access, I can only assume this mechanism is in-place to prevent disk activity being dropped/lost which would result in corruption of the VMFS or individual VMs.
With the mention of I/O DRS in upcoming versions of vSphere, that could be an interesting way of pre-emptively avoiding this situation if it does automated storage vMotion to less busy LUNs rather than just vMotion between hosts on the basis of IOPS.
I have a 2 node vSphere cluster running on a pair of ML115 G5 servers (cheap ESX nodes, FT compatible) and I was trying to put one into maintenance mode so I could update its host profile. However, it got stuck at 2% entering maintenance mode; it appeared to vMotion off the VMs it was running as expected but never passed the 2% mark.
After some investigation I noticed there were a pair of virtual machines still running on this host with FT enabled – the secondary was running on the other server ML115-1 (i.e not the one I wanted to switch to maintenance mode)
I was unable to use vMotion so that the primary and secondary VMs were temporarily running on the same ESX host (and that doesn’t make much sense anyway)
That makes sense: the client doesn’t let you deliberately do something to that host that would break the FT protection, as there would be no node to run the secondary copy. Incidentally this is good UI design (you have to opt-in to break something), so you just have to temporarily disable FT and should be able to proceed.
If I had a 3rd node in this cluster there wouldn’t be a problem as it would vMotion the secondary (or primary) to an alternative node automatically (shown below is how to do this manually)
However in my case all of the options to disable/turn-off FT were greyed out and you would appear to be stuck and unable to progress.
The fix is pretty simple: you just need to cancel the maintenance mode job by right-clicking in the recent tasks pane and choosing cancel, which then re-enables the menu options and allows you to proceed. Then turn off (not disable, that doesn’t work) fault tolerance for the problematic virtual machines.
The virtual machine now doesn’t have FT turned on. If you just disable FT it doesn’t resolve this problem as it leaves the secondary VM in-situ; you need to turn it off.
So, moral of the story is – if you’re stuck at 2% look for virtual machines that can’t be vMotioned off the host – if you want to use FT – a 3rd node would be a good idea to keep the VM FT’d during individual host maintenance; this is a lab environment rather than an enterprise grade production system but you could envision some 2-node clusters for some SMB users – worth bearing in mind if you work in that space.