Snapshots in Virtual Machines: A Useful Tool or a Hidden Risk?

Celia Catalán

Do you think a virtual machine snapshot is the same as a backup or a disk copy? Then this post is for you.

First of all, let's answer the main question: what exactly is a snapshot? As its name suggests, a snapshot is an instantaneous view of the state of a virtual machine (VM) at a specific moment. In other words, it's a "photo" that can be used to restore the VM to that exact point in time. And that's precisely where the misconception arises: many believe this makes it a backup, but it does not.

The following diagram shows an example of a single-disk VM with 5 data blocks:


What happens if we modify block A and add block F? Since the VM has no snapshots, the changes are written directly to Disk 1, turning block A into A' and growing the disk to 6 blocks with the new block F:

We reach the crucial point where we create a snapshot... what actually happens to the VM at a logical level? When generating a snapshot, we instruct the virtual machine to freeze its current disk (Disk 1) in read-only mode, so that if we ever need to revert, we can return to that exact state. From this point on, the VM creates a new disk (Disk 1') where all subsequent changes will be recorded. In other words, the snapshot doesn't duplicate all the information, but acts as a quick restore point:
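To make the freeze-and-redirect step concrete, here is a minimal Python sketch of the example so far. The `Disk` and `VM` classes are hypothetical teaching models, not any hypervisor's API, and blocks are simply dictionary entries.

```python
class Disk:
    """One layer of a virtual disk: a mapping from block name to contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.read_only = False

    def write(self, block, data):
        if self.read_only:
            raise PermissionError(f"{self.name} is frozen by a snapshot")
        self.blocks[block] = data


class VM:
    """Hypothetical VM model: a chain of disks, oldest first."""

    def __init__(self):
        self.chain = [Disk("Disk 1")]

    @property
    def active(self):
        # Writes always go to the newest disk in the chain.
        return self.chain[-1]

    def write(self, block, data):
        self.active.write(block, data)

    def snapshot(self):
        # Freeze the current disk and start a new, empty one.
        # No block is copied, which is why this step is so fast.
        new_name = self.active.name + "'"
        self.active.read_only = True
        self.chain.append(Disk(new_name))


vm = VM()
for block in "ABCDE":
    vm.write(block, block)
vm.write("A", "A'")   # modify block A
vm.write("F", "F")    # add block F
vm.snapshot()

print([d.name for d in vm.chain])   # ['Disk 1', "Disk 1'"]
print(len(vm.chain[0].blocks))      # 6 (frozen, untouched from now on)
print(len(vm.chain[1].blocks))      # 0 (empty until new changes arrive)
```

Note that the new disk starts empty: the snapshot is only a marker plus a redirect for future writes, not a second copy of the data.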


And here's the first clue as to why a snapshot is not a backup: both the creation and reversal of a snapshot happen in a matter of seconds, which is very different from the time it takes to generate and restore a complete backup.

In this case, after the first snapshot is created, blocks A' and B are modified, and a new block G is added. As can be seen in the following diagram, these changes are stored in Disk 1', which now has a size of 3 blocks, while Disk 1 remains unchanged with its 6 blocks intact:

One important detail should not be overlooked: the logical size of the VM is 7 blocks (A through G), but because of how snapshots work it currently occupies 9 blocks, since Disk 1' holds modified versions of blocks A and B that also still exist in Disk 1.
 
Now we create a second snapshot, and the process is similar to the previous one. Disk 1', which until now was recording changes, becomes frozen in read-only mode, and the virtual machine generates a new disk, Disk 1''. At this point, modifications to blocks A'', D, and E arrive, and a new block H appears:

After this second nested snapshot, we arrive at one of the main problems that arise from keeping old snapshots on a virtual machine. At this point, the original Disk 1 occupies 6 blocks, while the snapshot disks already total 7 blocks (3 in Disk 1' and 4 in Disk 1''), for a total of 13 blocks, even though the VM only contains 8 logical blocks. The result? The snapshots alone are consuming more space in the datastore than the original disk. This is a subtle but critical risk. Snapshots not only create dependencies between disks but can also lead to uncontrolled storage growth, affecting performance and even compromising system capacity.
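The block counts in this example can be checked with a few lines of Python. This is only a sketch of the arithmetic: each disk is modeled as a dictionary, and the `usage` helper and variable names are made up for illustration.

```python
def usage(layers):
    """Return (logical, physical) block counts for a disk chain.

    layers is ordered oldest first; each layer maps block name -> contents.
    """
    physical = sum(len(layer) for layer in layers)   # blocks stored on the datastore
    logical = len(set().union(*layers))              # distinct blocks the VM actually sees
    return logical, physical


disk1 = {"A": "A'", "B": "B", "C": "C", "D": "D", "E": "E", "F": "F"}
disk1_p = {"A": "A''", "B": "B'", "G": "G"}                 # Disk 1'
disk1_pp = {"A": "A'''", "D": "D'", "E": "E'", "H": "H"}    # Disk 1''

print(usage([disk1]))                       # (6, 6)   no snapshots
print(usage([disk1, disk1_p]))              # (7, 9)   after the first snapshot
print(usage([disk1, disk1_p, disk1_pp]))    # (8, 13)  after the second snapshot
```

The gap between the two numbers is the space spent on superseded versions of blocks, and it only grows while the snapshots remain.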

Furthermore, as observed, this mechanism demonstrates that snapshots are not complete copies, but rather chained dependencies. Each new snapshot relies on the previous one. And here lies a significant risk that is often overlooked: if one of these disks fails, the entire chain can become unusable, leaving the VM in an inconsistent state.
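A small sketch can show why the chain is only as strong as its weakest link. It models the three disks of the example as dictionaries and compares the VM's view with and without Disk 1'. The `view` helper is hypothetical, and real hypervisors typically refuse to open a broken chain rather than serve data like this; the model only illustrates why the upper disks are meaningless without the ones below them.

```python
def view(layers):
    """Build the disk contents the VM sees: newer layers override older ones."""
    merged = {}
    for layer in layers:   # oldest first
        merged.update(layer)
    return merged


disk1 = {"A": "A'", "B": "B", "C": "C", "D": "D", "E": "E", "F": "F"}
disk1_p = {"A": "A''", "B": "B'", "G": "G"}                 # Disk 1'
disk1_pp = {"A": "A'''", "D": "D'", "E": "E'", "H": "H"}    # Disk 1''

healthy = view([disk1, disk1_p, disk1_pp])
broken = view([disk1, disk1_pp])   # assume Disk 1' was lost or corrupted

stale = sorted(b for b in broken if broken[b] != healthy[b])
missing = sorted(healthy.keys() - broken.keys())

print(stale)     # ['B']  an old version of B would resurface
print(missing)   # ['G']  block G exists only in the lost disk
```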

Regarding performance, imagine we need to access block C. The VM starts searching in Disk 1'' (the most recent) in case the block has been modified. Since it's not there, it moves to the next disk in the chain: Disk 1'. It's not found there either. Finally, it reaches the original Disk 1, where the block has resided intact from the beginning. This journey implies that the read operation had to traverse the entire snapshot chain to locate the data, making the operation inefficient and slow.
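The lookup order just described can be written as a short Python sketch. The `read_block` function is a hypothetical model that reports which disks were probed; real implementations locate blocks through per-disk metadata and are more sophisticated, so treat this only as a mirror of the walk-through above.

```python
def read_block(chain, block):
    """Find a block by searching from the newest disk to the oldest.

    chain is a list of (disk_name, blocks) pairs, oldest first.
    Returns the block contents and the list of disks that were probed.
    """
    probed = []
    for name, blocks in reversed(chain):
        probed.append(name)
        if block in blocks:
            return blocks[block], probed
    raise KeyError(f"block {block!r} not found in any disk")


chain = [
    ("Disk 1", {"A": "A'", "B": "B", "C": "C", "D": "D", "E": "E", "F": "F"}),
    ("Disk 1'", {"A": "A''", "B": "B'", "G": "G"}),
    ("Disk 1''", {"A": "A'''", "D": "D'", "E": "E'", "H": "H"}),
]

print(read_block(chain, "H"))   # ('H', ["Disk 1''"])
print(read_block(chain, "C"))   # ('C', ["Disk 1''", "Disk 1'", 'Disk 1'])
```

A recently written block such as H is found immediately, while an untouched block such as C costs one probe per disk in the chain.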

Now let's extrapolate this scenario to a real environment: a virtual server with a disk of thousands or millions of blocks and multiple old snapshots. Every operation that has to traverse the entire disk chain takes noticeably longer. If this happens across many queries on a critical system, the impact on overall performance will be evident: increased latency, slower operations, and a degraded level of service.

Does everything we've explained seem exaggerated to you? Let me give you a real case: a company had its main database hosted on a virtual server with a 2TB disk. In the datastore, however, that VM occupied almost 5TB. The reason was a snapshot that had not been deleted for over 3 years, on a server that processed thousands of daily changes and requests. The result was uncontrolled storage growth and service degradation to the point of daily customer complaints about slowness and malfunction.

This is the great risk of treating snapshots as if they were backups and accumulating them for fear of "losing something." The reality is quite the opposite: the more old snapshots you keep, the greater the risk to your infrastructure.

There's a basic rule that never fails: if a snapshot is old, delete it. Think about it: would you really restore a production server from a snapshot that's months... or even years old? If the answer is no, then that snapshot serves no purpose.
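As a rough illustration of how this rule could be turned into a routine check, the following Python sketch flags snapshots older than a chosen limit. Everything in it is an assumption: the `SnapshotInfo` record, the VM and snapshot names, and the 7-day limit are invented for the example, and in practice the inventory would come from your virtualization platform's own tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SnapshotInfo:
    """Hypothetical inventory record for one snapshot."""
    vm: str
    name: str
    created: datetime


def stale_snapshots(snapshots, now, max_age):
    """Return the snapshots whose age exceeds max_age."""
    return [s for s in snapshots if now - s.created > max_age]


# Made-up inventory and reference date, for illustration only.
now = datetime(2024, 6, 1)
inventory = [
    SnapshotInfo("db01", "before-upgrade", datetime(2021, 5, 20)),
    SnapshotInfo("web02", "pre-patch", datetime(2024, 5, 30)),
]

# The 7-day limit is an arbitrary example, not a recommendation from this post.
for snap in stale_snapshots(inventory, now, max_age=timedelta(days=7)):
    print(f"{snap.vm}: snapshot '{snap.name}' is a candidate for deletion")
```

The right limit depends on your environment; the point is simply to make old snapshots visible before they are forgotten.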

Snapshots are very useful for specific changes and temporary operations, but they are not a substitute for a backup. To protect your data, you need complete, external, and verified backups.

And finally... What happens when we delete the snapshot chain and consolidate the VM disk? The process consists of merging all accumulated changes in the snapshots with their previous disk, one by one, until the chain is fully integrated into the base disk. Only then does the virtual machine regain its real size, eliminating the differential disks that had been created:

This procedure is very slow and delicate, especially if you have old snapshots with completely uncontrolled sizes, as it involves intensive read and write operations. Therefore, maintaining old snapshots not only consumes space and affects performance but also complicates and prolongs any consolidation task.
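The consolidation of the three-disk example can be sketched in Python as follows. This is a simplified model under stated assumptions: disks are dictionaries, the `consolidate` helper is hypothetical, and merging the newest disk into its parent first is just one possible order (the final contents are the same if the oldest is merged first).

```python
def consolidate(chain):
    """Merge each snapshot disk into its parent until only the base remains.

    chain is ordered oldest first. Returns the merged base disk and the
    number of block writes the merge required.
    """
    layers = [dict(layer) for layer in chain]   # work on copies
    writes = 0
    while len(layers) > 1:
        child = layers.pop()          # newest remaining disk
        layers[-1].update(child)      # its blocks overwrite the parent's
        writes += len(child)
    return layers[0], writes


disk1 = {"A": "A'", "B": "B", "C": "C", "D": "D", "E": "E", "F": "F"}
disk1_p = {"A": "A''", "B": "B'", "G": "G"}                 # Disk 1'
disk1_pp = {"A": "A'''", "D": "D'", "E": "E'", "H": "H"}    # Disk 1''

merged, writes = consolidate([disk1, disk1_p, disk1_pp])

print(len(merged))    # 8   the VM is back to its real size (was 13 blocks)
print(merged["A"])    # A'''  only the latest version of each block survives
print(writes)         # 10  block writes needed to fold the chain
```

Even in this toy case the merge rewrites more blocks than the VM contains, which hints at why consolidating large, long-lived snapshots is so I/O intensive.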
