
SCIENCE CHINA Information Sciences, Volume 62, Issue 7: 072101 (2019) https://doi.org/10.1007/s11432-017-9295-4

LCCFS: a lightweight distributed file system for cloud computing without journaling and metadata services

  • Received: May 15, 2017
  • Accepted: Nov 23, 2017
  • Published: Apr 4, 2019

Abstract


Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant No. 61370018).


  • Figure 1

    The integration of LCCFS with OpenStack.

  • Figure 2

    A namespace example.

  • Figure 3

    The usage of LCCFS.

  • Figure 4

Performance speedups of LCCFS over CephFS under three different record sizes for (a) sequential read, (b) random read, (c) sequential write, and (d) random write.

  • Figure 5

    Performance speedups of LCCFS over CephFS for read and write measured by dd.

  • Figure 6

    The time of directory creation and deletion for LCCFS and CephFS.

  • Figure 7

    Normalized IOPS under random (a) read and (b) write with a record size of 4 KB. Normalized bandwidth under sequential (c) read and (d) write with a record size of 4 MB. Normalized bandwidth under random (e) read and (f) write with a record size of 4 MB.

  • Algorithm 1 Create a file

    procedure create
    Require: inodeno_t pino, const char *name
        Generate a new inode number ino;
        Insert a reclaim entry (pino, ino);
        begin transaction
            Create the object Inode.(ino);
            Inode.(ino)->pino = pino;
        end transaction
        begin transaction
            exist = entry_exist(Inode.(pino), name);
            if exist = FALSE then
                Insert an entry name in the object Inode.(pino);
            end if
        end transaction
        if exist = TRUE then
            begin transaction
                Delete the object Inode.(ino);
            end transaction
        end if
    end procedure
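
    To make the crash-ordering concrete, here is a minimal Python sketch of Algorithm 1 that models the flat object store as a dict. The Inode.(ino) naming follows the algorithms, but the field layout, the inode-number allocator, and the reclaim_log list standing in for the reclaim objects of Algorithm 4 are assumptions for illustration, not the actual LCCFS implementation.

    import itertools

    store = {}                      # flat object store: name -> body (assumed)
    _ino_gen = itertools.count(1)   # inode number allocator (assumed)
    reclaim_log = []                # stands in for the reclaim objects (Alg. 4)

    def create(pino, name):
        """Create file `name` under directory inode `pino` (Algorithm 1)."""
        ino = next(_ino_gen)                  # generate a new inode number
        reclaim_log.append((pino, ino))       # reclaim entry first: whatever
                                              # happens next, the new inode can
                                              # always be garbage-collected
        # Atomic step 1: create the child inode, pointing back to its parent.
        store[f"Inode.{ino}"] = {"pino": pino, "type": "FILE",
                                 "size": 0, "entries": {}}
        # Atomic step 2: insert the dentry into the parent, unless taken.
        parent = store[f"Inode.{pino}"]
        exist = name in parent["entries"]
        if not exist:
            parent["entries"][name] = ino
        if exist:
            # Atomic step 3: roll back; the reclaim entry becomes a no-op.
            del store[f"Inode.{ino}"]
            return None
        return ino

    store["Inode.0"] = {"pino": 0, "type": "DIRECTORY",
                        "size": 0, "entries": {}}    # the root directory
    print(create(0, "hello.txt"))   # -> 1
    print(create(0, "hello.txt"))   # -> None (name already taken)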

  • Algorithm 2 Remove a file

    procedure unlink
    Require: inodeno_t pino, inodeno_t ino
        Insert a reclaim entry (pino, ino);
        begin transaction
            if entry_exist(Inode.(pino), ino) = TRUE then
                Remove the entry ino from the object Inode.(pino);
            end if
        end transaction
    end procedure
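
    A matching sketch of Algorithm 2 under the same assumed dict-based store: only the directory entry is removed synchronously, while the inode and data objects are left for the asynchronous reclaim path of Algorithm 5. The store layout and reclaim_log list are illustrative assumptions.

    store = {
        "Inode.0": {"pino": 0, "type": "DIRECTORY", "size": 0,
                    "entries": {"a.txt": 1}},
        "Inode.1": {"pino": 0, "type": "FILE", "size": 0, "entries": {}},
    }
    reclaim_log = []   # stands in for the reclaim objects (Alg. 4)

    def unlink(pino, ino):
        """Detach inode `ino` from directory `pino` (Algorithm 2)."""
        reclaim_log.append((pino, ino))   # log first: once the dentry removal
                                          # lands, the orphan is collectable
        entries = store[f"Inode.{pino}"]["entries"]
        for n in [n for n, e in entries.items() if e == ino]:
            del entries[n]                # atomic single-object update

    unlink(0, 1)
    print(store["Inode.0"]["entries"], reclaim_log)   # {} [(0, 1)]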

  • Algorithm 3 Rename a file

    procedure rename
    Require: inodeno_t pino, const char *oldname, const char *newname
        begin transaction
            if entry_exist(Inode.(pino), oldname) = TRUE then
                if entry_exist(Inode.(pino), newname) = FALSE then
                    Remove the entry oldname from the object Inode.(pino);
                    Insert an entry newname to the object Inode.(pino);
                end if
            end if
        end transaction
    end procedure
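
    Rename touches only the parent directory object, so a single atomic object update suffices and no reclaim entry is needed. A sketch under the same assumed dict-based store:

    store = {
        "Inode.0": {"pino": 0, "type": "DIRECTORY", "size": 0,
                    "entries": {"old.txt": 1}},
        "Inode.1": {"pino": 0, "type": "FILE", "size": 0, "entries": {}},
    }

    def rename(pino, oldname, newname):
        """Rename `oldname` to `newname` in directory `pino` (Algorithm 3)."""
        entries = store[f"Inode.{pino}"]["entries"]
        if oldname in entries and newname not in entries:
            entries[newname] = entries.pop(oldname)   # one atomic object write
            return True
        return False

    print(rename(0, "old.txt", "new.txt"))    # -> True
    print(store["Inode.0"]["entries"])        # {'new.txt': 1}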

  • Algorithm 4 Add reclaim item

    procedure add_reclaim_item
    Require: (pino, ino)
    again:
        wp = Reclaim.counter->wp;
        if object_exist(Reclaim.(wp)) = FALSE then
            begin transaction
                if create_object(Reclaim.(wp)) = SUCCESS then
                    Reclaim.(wp)->state = WRITE;
                    Reclaim.(wp)->timestamp = current_time;
                end if
            end transaction
        end if
        begin transaction
            if Reclaim.(wp)->state = WRITE and Reclaim.(wp) is not full then
                Insert (pino, ino) into Reclaim.(wp);
                Reclaim.(wp)->timestamp = current_time;
                Return;
            end if
        end transaction
        wp = wp + 1;
        begin transaction
            if Reclaim.counter->wp < wp then
                Reclaim.counter->wp = Reclaim.counter->wp + 1;
            end if
        end transaction
        goto again;
    end procedure
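
    Below is a minimal Python sketch of Algorithm 4 on the same assumed dict-based store. The reclaim log is a sequence of small objects Reclaim.(wp) plus a counter object holding the write pointer; the field names, the RECLAIM_CAPACITY value, and the fullness test are illustrative assumptions.

    import time

    RECLAIM_CAPACITY = 4    # max items per reclaim object (assumed value)
    store = {"Reclaim.counter": {"wp": 0, "rp": 0}}   # assumed layout

    def add_reclaim_item(pino, ino):
        """Append (pino, ino) to the current reclaim object (Algorithm 4)."""
        while True:                               # the 'again:' retry loop
            wp = store["Reclaim.counter"]["wp"]
            name = f"Reclaim.{wp}"
            if name not in store:                 # create the object lazily
                store[name] = {"state": "WRITE", "items": [],
                               "timestamp": time.time()}
            obj = store[name]
            if obj["state"] == "WRITE" and len(obj["items"]) < RECLAIM_CAPACITY:
                obj["items"].append((pino, ino))  # atomic object append
                obj["timestamp"] = time.time()
                return
            # Object is full or sealed: advance the shared write pointer,
            # unless another client already advanced it, then retry.
            if store["Reclaim.counter"]["wp"] < wp + 1:
                store["Reclaim.counter"]["wp"] = wp + 1

    for i in range(6):
        add_reclaim_item(0, 100 + i)
    print(store["Reclaim.counter"]["wp"],         # 1: the log rolled over
          len(store["Reclaim.1"]["items"]))       # 2 items in the new object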

  • Algorithm 5 Consume reclaim item

    procedure consume_reclaim_item
    again:
        rp = Reclaim.counter->rp;
        begin transaction
            result = (Reclaim.(rp)->state = READ);
            if result = FALSE then
                if (current_time - Reclaim.(rp)->timestamp) > REC_THRESHOLD then
                    Reclaim.(rp)->state = READ;
                    result = TRUE;
                end if
            end if
        end transaction
        if result = FALSE then
            Return;
        end if
        for each item m in Reclaim.(rp) do
            process_reclaim_item(m);
            Remove the item m from the object Reclaim.(rp);
        end for
        Delete the object Reclaim.(rp);
        Reclaim.counter->rp = Reclaim.counter->rp + 1;
        goto again;
    end procedure

    procedure process_reclaim_item
    Require: (pino, ino)
        if object_exist(Inode.(pino)) = TRUE and object_exist(Inode.(ino)) = TRUE and entry_exist(Inode.(pino), ino) = TRUE then
            Return; // still linked, not garbage
        end if
        if object_exist(Inode.(ino)) = FALSE then
            Return; // already reclaimed
        end if
        if file_type(Inode.(ino)) = FILE then
            count = Inode.(ino)->size / OBJECT_SIZE;
            for m = 0 to count do
                Delete the object Data.(ino).(m);
            end for
        end if
        if file_type(Inode.(ino)) = DIRECTORY then
            for each entry cino in Inode.(ino) do
                Insert a reclaim entry (ino, cino);
                Remove the entry cino from the object Inode.(ino);
            end for
        end if
        Delete the object Inode.(ino);
    end procedure
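
    A minimal Python sketch of Algorithm 5 on the same assumed dict-based store: it seals stale reclaim objects, drains them, and garbage-collects orphans. The OBJECT_SIZE and REC_THRESHOLD values are illustrative, and re-queuing directory children back into the object being drained is a simplification of the real add_reclaim_item path.

    import time

    OBJECT_SIZE = 4 * 1024 * 1024   # bytes per data object (assumed value)
    REC_THRESHOLD = 0.0             # seal immediately, to keep the demo short

    store = {
        "Reclaim.counter": {"wp": 0, "rp": 0},
        "Reclaim.0": {"state": "WRITE", "timestamp": 0.0,
                      "items": [(0, 1)]},              # orphan file inode 1
        "Inode.0": {"pino": 0, "type": "DIRECTORY", "size": 0, "entries": {}},
        "Inode.1": {"pino": 0, "type": "FILE", "size": OBJECT_SIZE + 1,
                    "entries": {}},
        "Data.1.0": b"...", "Data.1.1": b"...",
    }

    def process_reclaim_item(obj, pino, ino):
        """Reclaim (pino, ino) if it is garbage (Algorithm 5's helper)."""
        parent = store.get(f"Inode.{pino}")
        inode = store.get(f"Inode.{ino}")
        if parent is not None and inode is not None \
                and ino in parent["entries"].values():
            return                            # still linked: not garbage
        if inode is None:
            return                            # already reclaimed
        if inode["type"] == "FILE":           # drop every data object
            for m in range(inode["size"] // OBJECT_SIZE + 1):
                store.pop(f"Data.{ino}.{m}", None)
        elif inode["type"] == "DIRECTORY":    # re-queue the children
            for name, cino in list(inode["entries"].items()):
                obj["items"].append((ino, cino))   # simplified re-queue
                del inode["entries"][name]
        del store[f"Inode.{ino}"]

    def consume_reclaim_items():
        while True:                           # the 'again:' loop
            rp = store["Reclaim.counter"]["rp"]
            obj = store.get(f"Reclaim.{rp}")
            if obj is None:
                return
            if obj["state"] != "READ":        # seal a stale WRITE object
                if time.time() - obj["timestamp"] > REC_THRESHOLD:
                    obj["state"] = "READ"
                else:
                    return
            while obj["items"]:               # drain, then delete the object
                pino, ino = obj["items"].pop(0)
                process_reclaim_item(obj, pino, ino)
            del store[f"Reclaim.{rp}"]
            store["Reclaim.counter"]["rp"] = rp + 1

    consume_reclaim_items()
    print(sorted(store))   # only Inode.0 and Reclaim.counter remain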

  • Algorithm 6 Offline file system check

    procedure fsck
        for each object o in the object storage do
            if o is a data object named Data.(ino).(index) then
                if object_exist(Inode.(ino)) = FALSE or index * OBJECT_SIZE > Inode.(ino)->size then
                    Delete the object o; // orphan object
                end if
            end if
            if o is an inode object named Inode.(ino) then
                pino = o->pino; // the inode number of the parent inode
                if object_exist(Inode.(pino)) = FALSE or entry_exist(Inode.(pino), ino) = FALSE then
                    Delete the object o; // orphan object
                    continue;
                end if
                if file_type(Inode.(ino)) = DIRECTORY then
                    for each entry m in Inode.(ino) do
                        if object_exist(Inode.(m)) = FALSE then
                            Remove the entry m from the object Inode.(ino); // orphan entry
                        end if
                    end for
                end if
            end if
        end for
    end procedure
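
    Finally, a minimal Python sketch of Algorithm 6 over the same assumed dict-based store: one scan that deletes orphan data objects, orphan inode objects, and dangling directory entries. The OBJECT_SIZE value, the object-name parsing, and the root guard (treating an inode that is its own parent as always linked) are assumptions for illustration.

    OBJECT_SIZE = 4 * 1024 * 1024   # bytes per data object (assumed value)

    store = {
        "Inode.0": {"pino": 0, "type": "DIRECTORY", "size": 0,
                    "entries": {"a": 1, "gone": 9}},  # 'gone' dangles
        "Inode.1": {"pino": 0, "type": "FILE", "size": 10, "entries": {}},
        "Data.1.0": b"live",
        "Data.7.0": b"orphan",                        # no Inode.7 exists
    }

    def fsck():
        """One scan over all objects, as in Algorithm 6."""
        for name in sorted(store):                    # snapshot of names
            kind, *rest = name.split(".")
            if kind == "Data":
                ino, index = int(rest[0]), int(rest[1])
                inode = store.get(f"Inode.{ino}")
                if inode is None or index * OBJECT_SIZE > inode["size"]:
                    del store[name]                   # orphan data object
            elif kind == "Inode":
                ino, inode = int(rest[0]), store[name]
                parent = store.get(f"Inode.{inode['pino']}")
                linked = parent is not None and (
                    ino in parent["entries"].values()
                    or ino == inode["pino"])          # root guard (assumed)
                if not linked:
                    del store[name]                   # orphan inode object
                    continue
                if inode["type"] == "DIRECTORY":
                    for n, cino in list(inode["entries"].items()):
                        if f"Inode.{cino}" not in store:
                            del inode["entries"][n]   # dangling entry

    fsck()
    print(sorted(store))   # Data.1.0, Inode.0, Inode.1 survive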