A Short Overview of Distributed File Systems

Google vs. Hadoop File Systems

Cloud computing is changing how we store and access our data, in all its file formats, powered by advances in distributed, parallel, and grid computing. Instead of keeping data on a single local machine or server, we can now spread it across many computers. In this article we focus on the two most prominent distributed file systems: the Google File System (GFS) and the Hadoop Distributed File System (HDFS). We compare the two as precisely as possible, considering the aspects that make them similar and different.

Before going further, we first need to know what a file system is in the context of distributed systems. A file system is a subsystem of the operating system that performs file-management activities such as the organization, storage, retrieval, naming, sharing, and protection of files. It also frees programmers from the headache of space allocation and the layout of secondary storage devices.
In a distributed system, however, the implementation is more complex than that of a local file system because users and storage devices are physically dispersed. In return, a distributed file system provides remote information sharing, user mobility, worldwide availability, and support for diskless workstations through transparent remote-file access.

With the introduction out of the way, here is some technical background. GFS is Google's in-house implementation, designed to meet the rapidly growing demands of its processing needs. It is not open to the public: Google uses it for its own applications and workloads. That is not the case for HDFS, which was developed under Apache. HDFS is heavily inspired by GFS but serves as an open-source alternative that satisfies the needs of a wide range of clients. More broadly, Hadoop is an open-source implementation of the MapReduce framework.

Now, what is this MapReduce thing?

MapReduce is a programming model for processing large data sets with parallel, distributed algorithms. It was originally developed by Google researchers (the paper was published in 2004) and later adopted by Hadoop. In simple terms, a map step transforms each piece of input data into intermediate key-value tuples; these tuples are then grouped by key and passed to a reduce step, which aggregates them into the final result. Because the map and reduce steps are independent per key, the work can be spread efficiently across the machines of a cluster where the data resides.

[Figure: basic file-structure models of the Google File System (GFS) and the Hadoop Distributed File System (HDFS)]

In terms of file structure, the figures above show the basic models of GFS and HDFS. In GFS, files are divided into 64 MB chunks, each identified by a 64-bit chunk handle and replicated (three replicas by default) across chunkservers. Each chunk is further divided into 64 KB blocks, each carrying a 32-bit checksum for data integrity. In HDFS, files are divided into 128 MB blocks; the NameNode manages the namespace and block locations, while DataNodes (the HDFS counterpart of chunkservers) store each block replica as two files: one holding the data itself and one holding the checksum and generation stamp.
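The checksumming scheme described above can be sketched in a few lines. This is an illustrative toy, not actual GFS or HDFS code, but it shows the idea of a 32-bit CRC per 64 KB block, verified on every read:

```python
import zlib

BLOCK_SIZE = 64 * 1024  # checksum granularity: 64 KB blocks, as in GFS

def block_checksums(chunk: bytes):
    """Compute a 32-bit CRC for each 64 KB block of a chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]

def verify(chunk: bytes, checksums):
    """Re-compute checksums on read and compare, as a chunkserver would."""
    return block_checksums(chunk) == checksums

data = bytes(200 * 1024)           # a 200 KB chunk -> 4 blocks
sums = block_checksums(data)
print(len(sums))                   # 4
print(verify(data, sums))          # True
corrupted = data[:-1] + b"\x01"    # corrupt the final byte
print(verify(corrupted, sums))     # False
```

Because each block has its own checksum, a corrupted read can be detected and re-fetched from another replica without rescanning the whole chunk.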

A noticeable difference from GFS is that HDFS allows only a single writer per file, and writes are append-only, while the NameNode decides which DataNodes will hold each block; in GFS, random writes within a file are also possible. In other words, GFS follows a multiple-writer, multiple-reader model, while HDFS follows a single-writer, multiple-reader model. Another aspect of HDFS is that, being open source, it offers connectors and bindings for different storage systems and languages (such as S3 and KFS on the storage side, and C++ and Python on the language side).
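The single-writer, append-only model can be captured in a toy class. This is not the real HDFS client API, just a sketch of the semantics: one exclusive write lease per file, and appends only:

```python
class AppendOnlyFile:
    """Toy model of HDFS write semantics: one writer at a time, append only."""

    def __init__(self):
        self.blocks = []
        self.writer = None

    def open_for_append(self, client):
        if self.writer is not None:
            raise IOError(f"lease already held by {self.writer}")
        self.writer = client          # grant an exclusive write lease

    def append(self, client, data):
        if client != self.writer:
            raise IOError("only the lease holder may write")
        self.blocks.append(data)      # no random writes, only appends

    def close(self, client):
        if client == self.writer:
            self.writer = None        # release the lease

f = AppendOnlyFile()
f.open_for_append("client-A")
f.append("client-A", b"log line 1\n")
try:
    f.open_for_append("client-B")     # a second writer is rejected
except IOError as e:
    print("rejected:", e)
f.close("client-A")
```

Readers, by contrast, need no lease at all, which is what makes the single-writer, multiple-reader model simple to reason about.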

A real-world HDFS deployment is Yahoo's (as of 2010), with over 60 million files and 63 million blocks on clusters of about 3,500 nodes, handling about 9.8 PB of total storage. Tech giants such as Facebook have also built HDFS-based data grids to handle huge amounts of user data.
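A quick back-of-the-envelope check on these figures (using the article's numbers; note that the 9.8 PB is raw storage, so it includes replicas) shows the file population is dominated by single-block files:

```python
files = 60e6          # files in the cluster
blocks = 63e6         # blocks in the cluster
storage_pb = 9.8      # total raw storage, in petabytes

blocks_per_file = blocks / files
mb_per_block = storage_pb * 1e15 / blocks / 1e6   # raw MB per block

print(round(blocks_per_file, 2))   # 1.05 blocks per file on average
print(round(mb_per_block))         # ~156 MB of raw storage per block
```

An average of about 1.05 blocks per file means most files fit in a single 128 MB block, which is consistent with HDFS's design bias toward large files and large blocks.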

GFS is optimized for high availability and throughput, best suited to Google's own data-storage needs, while at the same time remaining simpler than most distributed systems. HDFS, in contrast, is tailored to a broader range of client needs and data-management requirements. Other notable distributed file systems deserve mention here as well, such as NFS (the Network File System, by Sun Microsystems), AFS (the Andrew File System), Coda (from Carnegie Mellon University), and many more.

For an in-depth study, you can refer to the original GFS, MapReduce, and HDFS research papers.
