Last weekend I re-read this paper on the internals of Azure storage. It’s an excellent deep dive into the architecture of Windows Azure storage. Its stream layer shows its similarity to Cosmos, although this paper goes into much more detail than I’ve seen in any paper that references Cosmos. Streams are append-only, which allows for much faster reads and writes; Google’s GFS and Hadoop’s HDFS are similar in this regard.
RangePartitions in the Object Tables use Log-Structured Merge-Trees, which work similarly to the SSTable architecture of Google’s BigTable; the HFile format used in Apache HBase has some similarities as well. Section 5 goes into a lot of detail about the in-memory data structures used to index and return results, as well as the triggers and algorithms involved in splitting and merging RangePartitions according to traffic.
The reason for the re-read was in preparation for learning more about Azure’s use of erasure codes for redundant storage. Replicated storage is becoming expensive as the amount of data we store grows, and erasure codes provide a more efficient solution. As an analogy, conventional triple replication is the distributed equivalent of RAID 1 disks (well, 0+1 really), while the use of erasure codes is like a distributed equivalent to RAID 6 drive arrays: instead of replicating each block in a file, parity blocks are created that can be used to recover any of the original blocks.

RAID 5 uses parity stripes as well, but its parity computation is simpler and easier to understand. In RAID 5, the parity block is constructed by computing the bitwise XOR (⊕) of all the other blocks. This takes advantage of the fact that if A ⊕ B ⊕ C = P, then P ⊕ B ⊕ C = A and A ⊕ B ⊕ P = C, etc. So if you lose a block, you can reconstruct it from all the others. The limitation of XOR as a parity computation is that it can only recover from a single failure. RAID 6 introduced a second parity block that uses an alternative function; the most common codes used for this are Reed-Solomon codes.
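The XOR identity above is easy to demonstrate in a few lines of Python. This is a toy sketch of the RAID 5 idea with three tiny data blocks (the values are made up for illustration): build the parity P = A ⊕ B ⊕ C, “lose” B, and rebuild it from the survivors.

```python
# RAID 5-style parity: P = A ^ B ^ C, so any single lost block
# can be rebuilt by XOR-ing the parity with the surviving blocks.
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

a = b"\x01\x02\x03\x04"
b_ = b"\x10\x20\x30\x40"
c = b"\xAA\xBB\xCC\xDD"

parity = xor_blocks([a, b_, c])           # P = A ^ B ^ C

# Simulate losing block B and recovering it from the rest:
recovered_b = xor_blocks([a, c, parity])  # A ^ C ^ P = B
assert recovered_b == b_
```

The same recovery works for any single missing block, but a second simultaneous failure leaves the XOR equation unsolvable, which is exactly why RAID 6 needs a second, independent parity function.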
The challenge with these codes is that they result in higher recovery costs: recovering a lost partition requires transmitting a large amount of data to a single location for reconstruction. Recovery is also often computationally expensive. Recent new approaches are helping change that.
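To make the recovery-cost point concrete, here is a back-of-the-envelope sketch. With a (k, m) Reed-Solomon layout, decoding one lost fragment requires gathering k surviving fragments, while replication just re-copies the one lost replica. The fragment size and k = 10 below are illustrative, not values from either paper.

```python
# Rough repair-traffic comparison: (k, m) Reed-Solomon vs. replication.
def rs_repair_bytes(block_size, k):
    # Decoding one lost fragment needs k surviving fragments.
    return block_size * k

def replication_repair_bytes(block_size):
    # Repairing a lost replica is a single copy.
    return block_size

block = 256 * 1024 * 1024  # a 256 MB fragment (illustrative size)
rs_cost = rs_repair_bytes(block, k=10)
rep_cost = replication_repair_bytes(block)
print(rs_cost // rep_cost)  # RS moves 10x the data for a single repair
```

So while Reed-Solomon wins heavily on storage overhead, every routine disk or node failure costs an order of magnitude more network traffic to repair, which is the pressure that motivates locality-aware codes.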
Windows Azure uses erasure codes to replicate fragments of “sealed” data (see section 4.4 of the first paper). By leveraging a new code they developed, called Local Reconstruction Codes, the Azure team has been able to achieve even more efficient storage. A key part of this approach is to generate parity blocks from subsets of the full collection of blocks; with this, any block in a subset can be recovered without needing to read all the blocks. LRCs achieve a higher mean time to failure than triple replication or Reed-Solomon codes and offer a configurable tradeoff between storage overhead and reconstruction cost. Azure can also use LRCs to deal with a network partition, temporarily using the parity blocks to serve reads for a block that is temporarily unavailable. For more, see this paper on “Erasure Coding in Windows Azure Storage”.
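The locality idea can be sketched in miniature. This is not Azure’s actual code (real LRCs use Galois-field arithmetic and add global parities for multi-failure protection); it just shows, with plain XOR and made-up group sizes, how a per-group parity lets a single lost block be rebuilt from its small group instead of the full stripe.

```python
# Sketch of the "local group" idea behind Local Reconstruction Codes:
# data blocks are split into small groups, each with its own parity,
# so one lost block is rebuilt from its group alone. XOR stands in
# for the real Galois-field parity math; sizes are illustrative.
from functools import reduce

def xor_blocks(blocks):
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))

data = [bytes([i] * 4) for i in range(6)]   # 6 data blocks
groups = [data[0:3], data[3:6]]             # two local groups of 3
local_parities = [xor_blocks(g) for g in groups]

# Lose data[4]; rebuild it from its 2 group-mates plus 1 local
# parity, reading 3 blocks instead of the whole 6-block stripe.
survivors = [data[3], data[5], local_parities[1]]
assert xor_blocks(survivors) == data[4]
```

The tradeoff the paper describes falls out directly: the extra local parities cost some storage overhead, but routine single-block repairs touch far fewer blocks and far less network.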
Facebook has been using a version of HDFS (HDFS RAID) that uses RS codes to do something similar for cold files. In a recent paper they also introduced a new flavor of HDFS (HDFS-Xorbas) that uses uncannily similar techniques; they even have the same acronym, LRC, for Locally Repairable Codes. Their solution focuses on reducing the I/O and network cost of repairs when using Reed-Solomon codes.
It’s exciting to see these kinds of cost-saving innovations in storage architecture. I wish I could learn more about how Amazon’s Glacier works under the covers, to find out whether they’re doing anything like this. I look forward to seeing (and trying to keep up with) future advances in this space.