Finding a needle in Haystack: Facebook’s photo storage (1)
Abstract
This paper describes Haystack, an object storage system optimized for Facebook's Photos application. Facebook currently stores over 260 billion images, which translates to over 20 petabytes of data. Users upload one billion new photos (~60 terabytes) each week and Facebook serves over one million images per second at peak. Haystack provides a less expensive and higher performing solution than our previous approach, which leveraged network attached storage appliances over NFS. Our key observation is that this traditional design incurs an excessive number of disk operations because of metadata lookups. We carefully reduce this per-photo metadata so that Haystack storage machines can perform all metadata lookups in main memory. This choice conserves disk operations for reading actual data and thus increases overall throughput.
Introduction
This paper presents the design and implementation of Haystack, Facebook's photo storage system that has been in production for the past 24 months. Haystack is an object store that we designed for sharing photos on Facebook where data is written once, read often, never modified, and rarely deleted. We engineered our own storage system for photos because traditional filesystems perform poorly under our workload.
POSIX: tree-like directory structure, based on Unix specifications.
In our experience, we find that the disadvantages of a traditional POSIX-based filesystem are directories and per-file metadata. For the Photos application most of this metadata, such as permissions, is unused and thereby wastes storage capacity. Yet the more significant cost is that the file's metadata must be read from disk into memory in order to find the file itself. While insignificant on a small scale, multiplied over billions of photos and petabytes of data, accessing metadata is the throughput bottleneck. We found this to be our key problem in using network attached storage (NAS) appliances mounted over NFS. Several disk operations were necessary to read a single photo: one (or typically more) to translate the filename to an inode number, another to read the inode from disk, and a final one to read the file itself. In short, using disk IOs for metadata was the limiting factor for our read throughput. Observe that in practice this problem introduces an additional cost as we have to rely on content delivery networks (CDNs), such as Akamai, to serve the majority of read traffic.
Given the disadvantages of the traditional approach, we designed Haystack to achieve four main goals:
- High throughput and low latency: We accomplish this by keeping all metadata in main memory, which we make practical by dramatically reducing the per-photo metadata necessary to find a photo on disk.
- Fault-tolerant: Haystack replicates each photo in geographically distinct locations. If we lose a machine we introduce another one to take its place, copying data for redundancy as necessary.
- Cost-effective: Haystack performs better and is less expensive than our previous NFS-based approach.
- Simple: We kept the design straightforward to implement and maintain. That simplicity let us build and deploy a working system in a few months instead of a few years.
Our three main contributions are:
- Haystack, an object storage system optimized for the efficient storage and retrieval of billions of photos.
- Lessons learned in building and scaling an inexpensive, reliable, and available photo storage system.
- A characterization of the requests made to Facebook's photo sharing application.
Background & previous design
2.1 Background
We begin with a brief overview of the typical design for how web servers, content delivery networks (CDNs), and storage systems interact to serve photos on a popular site. Figure 1 depicts the steps from the moment when a user visits a page containing an image until she downloads that image from its location on disk. When visiting a page the user's browser first sends an HTTP request to a web server which is responsible for generating the markup for the browser to render. For each image the web server constructs a URL directing the browser to a location from which to download the data. For popular sites this URL often points to a CDN. If the CDN has the image cached then the CDN responds immediately with the data. Otherwise, the CDN examines the URL, which has enough information embedded to retrieve the photo from the site's storage system. The CDN then updates its cached data and sends the image to the user's browser.
2.2 NFS-based design
In our first design we implemented the photo storage system using an NFS-based approach. While the rest of this subsection provides more detail on that design, the major lesson we learned is that CDNs by themselves do not offer a practical solution to serving photos on a social networking site. CDNs do effectively serve the hottest photos (profile pictures and photos that have been recently uploaded), but a social networking site like Facebook also generates a large number of requests for less popular (often older) content, which we refer to as the long tail.
Our NFS-based design stores each photo in its own file on a set of commercial NAS appliances. A set of machines, Photo Store servers, then mount all the volumes exported by these NAS appliances over NFS. Figure 2 illustrates this architecture and shows Photo Store servers processing HTTP requests for images. From an image's URL a Photo Store server extracts the volume and full path to the file, reads the data over NFS, and returns the result to the CDN.
We initially stored thousands of files in each directory of an NFS volume which led to an excessive number of disk operations to read even a single image. Because of how the NAS appliances manage directory metadata, placing thousands of files in a directory was extremely inefficient as the directory's blockmap was too large to be cached effectively by the appliance. Consequently it was common to incur more than 10 disk operations to retrieve a single image. After reducing directory sizes to hundreds of images per directory, the resulting system would still generally incur 3 disk operations to fetch an image: one to read the directory metadata into memory, a second to load the inode into memory, and a third to read the file contents.
To further reduce disk operations we let the Photo Store servers explicitly cache file handles returned by the NAS appliances. When reading a file for the first time a Photo Store server opens the file normally but also caches the filename-to-file-handle mapping in memcache. When requesting a file whose file handle is cached, a Photo Store server opens the file directly using a custom system call, open_by_filehandle, that we added to the kernel. Regrettably, this file handle cache provides only a minor improvement as less popular photos are less likely to be cached to begin with. One could argue that an approach in which all file handles are stored in memcache might be a workable solution. However, that only addresses part of the problem as it relies on the NAS appliance having all of its inodes in main memory, an expensive requirement for traditional filesystems. The major lesson we learned from the NAS approach is that focusing only on caching (whether the NAS appliance's cache or an external cache like memcache) has limited impact for reducing disk operations. The storage system ends up processing the long tail of requests for less popular photos, which are not available in the CDN and are thus likely to miss in our caches.
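The paper gives no code for this cache; the sketch below only illustrates the idea, with a plain dictionary standing in for memcache and a cached open file descriptor standing in for the NAS file handle returned by the real open_by_filehandle call. The class and method names are hypothetical.

```python
import os


class FileHandleCache:
    """Sketch of the filename-to-file-handle cache described above.

    A plain dict stands in for memcache, and a cached open file descriptor
    stands in for the handle returned by the NAS appliance; the production
    system used a custom open_by_filehandle kernel call instead.
    """

    def __init__(self):
        self._handles = {}  # filename -> open file descriptor

    def read_photo(self, filename: str) -> bytes:
        fd = self._handles.get(filename)
        if fd is None:
            # First access: pay the full path-resolution cost once, then
            # cache the resulting handle for later requests.
            fd = os.open(filename, os.O_RDONLY)
            self._handles[filename] = fd
        # Later reads skip the filename-to-inode translation entirely.
        size = os.fstat(fd).st_size
        return os.pread(fd, size, 0)
```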
2.3 Discussion
It would be difficult for us to offer precise guidelines for when or when not to build a custom storage system. However, we believe it is still helpful for the community to gain insight into why we decided to build Haystack.
Faced with the bottlenecks in our NFS-based design, we explored whether it would be useful to build a system similar to GFS. Since we store most of our user data in MySQL databases, the main use cases for files in our system were the directories engineers use for development work, log data, and photos. NAS appliances offer a very good price/performance point for development work and for log data. Furthermore, we leverage Hadoop for the extremely large log data. Serving photo requests in the long tail represents a problem for which neither MySQL, NAS appliances, nor Hadoop is well suited.
One could phrase the dilemma we faced as follows: existing storage systems lacked the right RAM-to-disk ratio. The system just needs enough main memory so that all of the filesystem metadata can be cached at once. In our NAS-based approach, one photo corresponds to one file and each file requires at least one inode, which is hundreds of bytes in size. Having enough main memory in this approach is no more cost-effective than buying more NAS appliances.
Design & implementation
When a web site has an I/O bottleneck serving static content the traditional solution is to use a CDN. The CDN shoulders enough of the burden so that the storage system can process the remaining tail. At Facebook a CDN would have to cache an unreasonably large amount of the static content in order for traditional (and inexpensive) storage approaches not to be I/O bound.
We accept that requests for less popular photos may require disk operations, but aim to limit the number of such operations to only the ones necessary for reading actual photo data. Haystack achieves this goal by dramatically reducing the memory used for filesystem metadata, thereby making it practical to keep all this metadata in main memory.
Recall that storing a single photo per file resulted in more filesystem metadata than could be reasonably cached. Haystack takes a straightforward approach: it stores multiple photos in a single file and therefore maintains very large files. We show that this straightforward approach is remarkably effective. Moreover, we argue that its simplicity is its strength, facilitating rapid implementation and deployment. We now discuss how this core technique and the architectural components surrounding it provide a reliable and available storage system. In the following description of Haystack, we distinguish between two kinds of metadata. Application metadata describes the information needed to construct a URL that a browser can use to retrieve a photo. Filesystem metadata identifies the data necessary for a host to retrieve the photos that reside on that host's disk.
3.1 Overview
The Haystack architecture consists of 3 core components:
- Haystack Store
- Haystack Directory
- Haystack Cache
For brevity we refer to these components with Haystack elided. The Store encapsulates the persistent storage system for photos and is the only component that manages the filesystem metadata for photos. We organize the Store's capacity by physical volumes. For example, we can organize a server's 10 terabytes of capacity into 100 physical volumes, each of which provides 100 gigabytes of storage. We further group physical volumes on different machines into logical volumes. When Haystack stores a photo on a logical volume, the photo is written to all corresponding physical volumes. This redundancy allows us to mitigate data loss due to hard drive failures, disk controller bugs, etc. The Directory maintains the logical-to-physical mapping along with other application metadata, such as the logical volume where each photo resides and the logical volumes with free space. The Cache functions as our internal CDN, which shelters the Store from requests for the most popular photos and provides insulation if upstream CDN nodes fail and need to refetch content.
Figure 3 illustrates how the Store, Directory, and Cache components fit into the canonical interactions between a user's browser, web server, CDN, and storage system. In the Haystack architecture the browser can be directed to either the CDN or the Cache. Note that while the Cache is essentially a CDN, to avoid confusion we use CDN to refer to external systems and Cache to refer to our internal one that caches photos. Having an internal caching infrastructure gives us the ability to reduce our dependence on external CDNs.
When a user visits a page the web server uses the Directory to construct a URL for each photo. The URL contains several pieces of information, each piece corresponding to the sequence of steps from when a user's browser contacts the CDN (or Cache) to ultimately retrieving a photo from a machine in the Store. A typical URL that directs the browser to the CDN looks like the following:
$$ \texttt{http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>} $$
The first part of the URL specifies from which CDN to request the photo. The CDN can look up the photo internally using only the last part of the URL: the logical volume and the photo id. If the CDN cannot locate the photo then it strips the CDN address from the URL and contacts the Cache. The Cache does a similar lookup to find the photo and, on a miss, strips the Cache address from the URL and requests the photo from the specified Store machine. Photo requests that go directly to the Cache have a similar workflow except that the URL is missing the CDN-specific information.
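As an illustration of this stripping workflow, here is a small sketch that parses a URL of the form shown above and removes one component at each miss. The field and function names are assumptions for illustration; the actual URL encoding is not specified beyond the template.

```python
from typing import NamedTuple


class PhotoURL(NamedTuple):
    """Pieces of a Haystack photo URL; field names are illustrative."""
    cdn: str
    cache: str
    machine_id: str
    logical_volume: str
    photo_id: str


def parse_photo_url(url: str) -> PhotoURL:
    # http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>
    path = url.removeprefix("http://")
    cdn, cache, machine_id, volume_and_photo = path.split("/", 3)
    logical_volume, photo_id = volume_and_photo.split(",")
    return PhotoURL(cdn, cache, machine_id, logical_volume, photo_id)


def strip_cdn(u: PhotoURL) -> str:
    # On a CDN miss, the CDN strips its own address and contacts the Cache.
    return f"http://{u.cache}/{u.machine_id}/{u.logical_volume},{u.photo_id}"


def strip_cache(u: PhotoURL) -> str:
    # On a Cache miss, the Cache asks the specified Store machine directly.
    return f"http://{u.machine_id}/{u.logical_volume},{u.photo_id}"
```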
Figure 4 illustrates the upload path in Haystack. When a user uploads a photo she first sends the data to a web server. Next, that server requests a write-enabled logical volume from the Directory. Finally, the web server assigns a unique id to the photo and uploads it to each of the physical volumes mapped to the assigned logical volume.
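A minimal sketch of this upload path, with in-memory dictionaries standing in for the Directory mapping and the Store machines (all names are illustrative):

```python
import itertools
import random

# Hypothetical in-memory stand-ins for the Directory mapping and the Store
# machines; the real system talks to these components over the network.
logical_to_physical = {"lv1": [("store-a", "pv7"), ("store-b", "pv3")]}
write_enabled = ["lv1"]
store_machines = {"store-a": {}, "store-b": {}}
photo_ids = itertools.count(1)


def upload_photo(data: bytes) -> int:
    """Sketch of Figure 4's upload path (all names are illustrative)."""
    # 1. Ask the Directory for a write-enabled logical volume.
    logical_volume = random.choice(write_enabled)
    # 2. Assign a unique id to the photo.
    photo_id = next(photo_ids)
    # 3. Write the photo to every physical volume mapped to that logical
    #    volume; this replication guards against drive and machine failures.
    for machine, physical_volume in logical_to_physical[logical_volume]:
        store_machines[machine][(physical_volume, photo_id)] = data
    return photo_id
```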
3.2 Haystack Directory
The Directory serves four main functions:
- First, it provides a mapping from logical volumes to physical volumes. Web servers use this mapping when uploading photos and when constructing the image URLs for a page request.
- Second, the Directory load balances writes across logical volumes and reads across physical volumes.
- Third, the Directory determines whether a photo request should be handled by the CDN or by the Cache. This functionality lets us adjust our dependence on CDNs.
- Fourth, the Directory identifies those logical volumes that are read-only either because of operational reasons or because those volumes have reached their storage capacity. We mark volumes as read-only at the granularity of machines for operational ease.
When we increase the capacity of the Store by adding new machines, those machines are write-enabled; only write-enabled machines receive uploads. Over time the available capacity on these machines decreases. When a machine exhausts its capacity, we mark it as read-only. In the next subsection we discuss how this distinction has subtle consequences for Cache and Store.
The Directory is a relatively straightforward component that stores its information in a replicated database accessed via a PHP interface that leverages memcache to reduce latency. In the event that we lose the data on a Store machine we remove the corresponding entry in the mapping and replace it when a new Store machine is brought online.
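The following is a condensed sketch of the Directory's bookkeeping as described above, with a plain in-memory mapping standing in for the replicated database and memcache layer; the class and method names are assumptions, not Facebook's actual interface.

```python
import random


class Directory:
    """Minimal sketch of the Haystack Directory's bookkeeping.

    A plain dict stands in for the replicated database fronted by memcache;
    the method names are illustrative, not the production API.
    """

    def __init__(self):
        self.logical_to_physical = {}  # logical volume -> [(machine, physical volume)]
        self.read_only = set()         # logical volumes no longer accepting writes

    def add_logical_volume(self, logical, replicas):
        self.logical_to_physical[logical] = list(replicas)

    def pick_write_volume(self):
        # Load-balance writes across write-enabled logical volumes.
        writable = [lv for lv in self.logical_to_physical if lv not in self.read_only]
        return random.choice(writable)

    def pick_read_replica(self, logical):
        # Load-balance reads across the physical volumes of one logical volume.
        return random.choice(self.logical_to_physical[logical])

    def mark_read_only(self, logical):
        # Volumes that reach capacity (or are taken down for operational
        # reasons) stop receiving uploads but keep serving reads.
        self.read_only.add(logical)

    def remove_failed_machine(self, machine):
        # When a Store machine's data is lost, drop its entries from the
        # mapping and replace them once a new machine is brought online.
        for lv, replicas in self.logical_to_physical.items():
            self.logical_to_physical[lv] = [r for r in replicas if r[0] != machine]
```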
3.4 Haystack Store
The interface to Store machines is intentionally basic. Reads make very specific and well-contained requests asking for a photo with a given id, for a certain logical volume, and from a particular physical Store machine. The machine returns the photo if it is found. Otherwise, the machine returns an error.
Each Store machine manages multiple physical volumes. Each volume holds millions of photos. For concreteness, the reader can think of a physical volume as simply a very large file (100 GB) saved as /hay/haystack_<logical volume id>. A Store machine can access a photo quickly using only the id of the corresponding logical volume and the file offset at which the photo resides. This knowledge is the keystone of the Haystack design: retrieving the filename, offset, and size for a particular photo without needing disk operations. A Store machine keeps open file descriptors for each physical volume that it manages and also an in-memory mapping of photo ids to the filesystem metadata (i.e., file, offset, and size in bytes) critical for retrieving that photo.
We now describe the layout of each physical volume and how to derive the in-memory mapping from that volume. A Store machine represents a physical volume as a large file consisting of a superblock followed by a sequence of needles. Each needle represents a photo stored in Haystack. Figure 5 illustrates a volume file and the format of each needle, and Table 1 describes the fields in each needle.
To retrieve needles quickly, each Store machine maintains an in-memory data structure for each of its volumes. That data structure maps pairs of (key, alternate key) to the corresponding needle's flags, size in bytes, and volume offset. After a crash, a Store machine can reconstruct this mapping directly from the volume file before processing requests. We now describe how a Store machine maintains its volumes and in-memory mapping while responding to read, write, and delete requests (the only operations supported by the Store).
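To make the mapping concrete, here is a sketch that scans a simplified volume file and rebuilds the in-memory (key, alternate key) index. The needle layout is deliberately reduced to a few fields with arbitrary widths; the real format in Figure 5 and Table 1 also carries magic numbers, a checksum, and padding.

```python
import struct
from collections import namedtuple

# Simplified needle header: cookie, key, alternate key, flags, data size.
# Field widths are illustrative, not the layout from Figure 5 / Table 1.
NEEDLE_HEADER = struct.Struct("<QQIBI")
FLAG_DELETED = 0x1

NeedleMeta = namedtuple("NeedleMeta", "flags offset size")


def build_index(volume_path: str, superblock_size: int = 0) -> dict:
    """Scan a volume file and rebuild the in-memory (key, alt key) mapping."""
    index = {}
    with open(volume_path, "rb") as volume:
        offset = superblock_size
        volume.seek(offset)
        while True:
            header = volume.read(NEEDLE_HEADER.size)
            if len(header) < NEEDLE_HEADER.size:
                break
            cookie, key, alt_key, flags, size = NEEDLE_HEADER.unpack(header)
            # Record where this needle starts and how large its data is.
            index[(key, alt_key)] = NeedleMeta(flags, offset, size)
            offset += NEEDLE_HEADER.size + size
            volume.seek(offset)
    return index
```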
3.4.1 Photo Read
When a Cache machine requests a photo it supplies the logical volume id, key, alternate key, and cookie to the Store machine. The cookie is a number embedded in the URL for a photo. The cookie's value is randomly assigned by and stored in the Directory at the time that the photo is uploaded. The cookie effectively eliminates attacks aimed at guessing valid URLs for photos.
When a Store machine receives a photo request from a Cache machine, the Store machine looks up the relevant metadata in its in-memory mappings. If the photo has not been deleted the Store machine seeks to the appropriate offset in the volume file, reads the entire needle from disk (whose size it can calculate ahead of time), and verifies the cookie and the integrity of the data. If these checks pass then the Store machine returns the photo to the Cache machine.
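A sketch of that read path under the same simplified needle layout as above (cookie, key, alternate key, flags, and data size in a fixed header); the real needle additionally carries a checksum, which is approximated here by a simple length check.

```python
import os
import struct
from collections import namedtuple

NEEDLE_HEADER = struct.Struct("<QQIBI")   # cookie, key, alt key, flags, data size
NeedleMeta = namedtuple("NeedleMeta", "flags offset size")
FLAG_DELETED = 0x1


def read_photo(volume_fd: int, index: dict, key: int, alt_key: int,
               cookie: int) -> bytes | None:
    """Sketch of the Store read path using the simplified needle layout."""
    meta = index.get((key, alt_key))
    if meta is None or meta.flags & FLAG_DELETED:
        return None                                  # unknown or deleted photo
    # One disk operation: the needle's total length is known ahead of time.
    needle = os.pread(volume_fd, NEEDLE_HEADER.size + meta.size, meta.offset)
    stored_cookie, _, _, flags, size = NEEDLE_HEADER.unpack(needle[:NEEDLE_HEADER.size])
    data = needle[NEEDLE_HEADER.size:]
    if stored_cookie != cookie:                      # thwart guessed URLs
        return None
    if flags & FLAG_DELETED or len(data) != size:    # basic integrity checks
        return None                                  # (the real needle also has a checksum)
    return data
```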
3.4.2 Photo Write
When uploading a photo into Haystack, web servers provide the logical volume id, key, alternate key, cookie, and data to Store machines. Each machine synchronously appends needle images to its physical volume files and updates in-memory mappings as needed. While simple, this append-only restriction complicates some operations that modify photos, such as rotations. As Haystack disallows overwriting needles, photos can only be modified by adding an updated needle with the same key and alternate key. If the new needle is written to a different logical volume than the original, the Directory updates its application metadata and future requests will never fetch the older version. If the new needle is written to the same logical volume, then Store machines append the new needle to the same corresponding physical volumes. Haystack distinguishes such duplicate needles based on their offsets. That is, the latest version of a needle within a physical volume is the one at the highest offset.
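A sketch of the append-only write path under the same simplified needle layout; note how overwriting the in-memory entry naturally implements the highest-offset-wins rule for duplicate needles.

```python
import os
import struct
from collections import namedtuple

# Simplified needle header (cookie, key, alt key, flags, data size); the
# real layout in Figure 5 also carries magic numbers, a checksum, and padding.
NEEDLE_HEADER = struct.Struct("<QQIBI")
NeedleMeta = namedtuple("NeedleMeta", "flags offset size")


def append_photo(volume_fd: int, index: dict, key: int, alt_key: int,
                 cookie: int, data: bytes) -> None:
    """Sketch of the append-only write path."""
    offset = os.lseek(volume_fd, 0, os.SEEK_END)     # needles are only appended
    header = NEEDLE_HEADER.pack(cookie, key, alt_key, 0, len(data))
    os.write(volume_fd, header + data)
    os.fsync(volume_fd)                              # synchronous append
    # The latest needle for a (key, alt key) pair is the one at the highest
    # offset, so simply overwriting the mapping entry keeps the index
    # pointing at the newest version of a modified photo.
    index[(key, alt_key)] = NeedleMeta(0, offset, len(data))
```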
3.4.3 Photo Delete
Deleting a photo is straightforward. A Store machine sets the delete flag in both the in-memory mapping and synchronously in the volume file. Requests to get deleted photos first check the in-memory flag and return errors if that flag is enabled. Note that the space occupied by deleted needles is for the moment lost. Later, we discuss how to reclaim deleted needle space by compacting volume files.
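A sketch of the delete path, assuming the simplified header layout used above (the byte position of the flags field is illustrative):

```python
import os
from collections import namedtuple

NeedleMeta = namedtuple("NeedleMeta", "flags offset size")
FLAG_DELETED = 0x1
FLAGS_FIELD_OFFSET = 20   # illustrative: position of the flags byte inside
                          # the simplified needle header sketched earlier


def delete_photo(volume_fd: int, index: dict, key: int, alt_key: int) -> None:
    """Sketch of delete: mark the needle both in memory and on disk."""
    meta = index[(key, alt_key)]
    # Synchronously set the delete flag inside the needle in the volume file ...
    os.pwrite(volume_fd, bytes([FLAG_DELETED]), meta.offset + FLAGS_FIELD_OFFSET)
    os.fsync(volume_fd)
    # ... and in the in-memory mapping; the needle's space is only reclaimed
    # later, when the volume file is compacted.
    index[(key, alt_key)] = meta._replace(flags=meta.flags | FLAG_DELETED)
```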
3.4.4 The Index File
Store machines use an important optimization, the index file, when rebooting. While in theory a machine can reconstruct its in-memory mappings by reading all of its physical volumes, doing so is time-consuming as the amount of data (terabytes worth) all has to be read from disk. Index files allow a Store machine to build its in-memory mappings quickly, shortening restart time.
Store machines maintain an index file for each of their volumes. The index file is a checkpoint of the in-memory data structures used to locate needles efficiently on disk. An index file's layout is similar to a volume file's, containing a superblock followed by a sequence of index records corresponding to each needle in the superblock. These records must appear in the same order as the corresponding needles appear in the volume file. Figure 6 illustrates the layout of the index file and Table 2 explains the different fields in each record.
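A sketch of reading and appending index records, again with illustrative field widths rather than the real layout from Table 2:

```python
import struct
from collections import namedtuple

# Simplified index record: key, alternate key, flags, offset, size.
# Field widths are illustrative; Table 2 gives the real layout.
INDEX_RECORD = struct.Struct("<QIBQI")
IndexRecord = namedtuple("IndexRecord", "key alt_key flags offset size")


def append_index_record(index_file, rec: IndexRecord) -> None:
    # Records are appended asynchronously, in the same order as the
    # corresponding needles appear in the volume file.
    index_file.write(INDEX_RECORD.pack(*rec))


def load_index_records(index_path: str, superblock_size: int = 0):
    """Read back the checkpoint of the in-memory mapping."""
    records = []
    with open(index_path, "rb") as f:
        f.seek(superblock_size)
        while chunk := f.read(INDEX_RECORD.size):
            if len(chunk) < INDEX_RECORD.size:
                break                      # truncated trailing record
            records.append(IndexRecord(*INDEX_RECORD.unpack(chunk)))
    return records
```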
Restarting using the index is slightly more complicated than just reading the indices and initializing the in-memory mappings. The complications arise because index files are updated asynchronously, meaning that index files may represent stale checkpoints. When we write a new photo the Store machine synchronously appends a needle to the end of the volume file and asynchronously appends a record to the index file. When we delete a photo, the Store machine synchronously sets the flag in that photo's needle without updating the index file. These design decisions allow write and delete operations to return faster because they avoid additional synchronous disk writes. They also cause two side effects we must address: needles can exist without corresponding index records and index records do not reflect deleted photos.
We refer to needles without corresponding index records as orphans. During restarts, a Store machine sequentially examines each orphan, creates a matching index record, and appends that record to the index file.
Note that we can quickly identify orphans because the last record in the index file corresponds to the last non-orphan needle in the volume file. To complete the restart, the Store machine initializes its in-memory mappings using only the index files.
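A sketch of this restart procedure; scan_needles_after is a hypothetical helper standing in for the sequential scan of the volume file's tail:

```python
from collections import namedtuple

IndexRecord = namedtuple("IndexRecord", "key alt_key flags offset size")
Needle = namedtuple("Needle", "key alt_key offset size")
NeedleMeta = namedtuple("NeedleMeta", "flags offset size")


def recover_mapping(index_records, scan_needles_after):
    """Sketch of restart: checkpointed index records plus a short orphan scan.

    `index_records` is the list read from the index file; `scan_needles_after(offset)`
    is a stand-in that yields Needle tuples found in the volume file past the
    given offset (both names are assumptions, not the production interface).
    """
    # The last index record corresponds to the last non-orphan needle, so
    # only the tail of the volume file needs to be scanned.
    last_offset = index_records[-1].offset if index_records else 0
    orphans = [IndexRecord(n.key, n.alt_key, 0, n.offset, n.size)
               for n in scan_needles_after(last_offset)]
    index_records = index_records + orphans        # append matching records
    # Initialize the in-memory mapping from index records alone; deletions
    # are discovered lazily, on the first read of each affected needle.
    return {(r.key, r.alt_key): NeedleMeta(r.flags, r.offset, r.size)
            for r in index_records}
```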
Since index records do not reflect deleted photos, a Store machine may retrieve a photo that has in fact been deleted. To address this issue:
- After reading the entire needle for a photo, the Store machine inspects the deleted flag.
- If a needle is marked as deleted, the machine updates its in-memory mapping accordingly.
- It then notifies the Cache that the object was not found.
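A sketch of this lazy delete check; the names are illustrative and the in-memory entry uses the same simplified metadata tuple as the earlier sketches.

```python
from collections import namedtuple

NeedleMeta = namedtuple("NeedleMeta", "flags offset size")
FLAG_DELETED = 0x1


def handle_possibly_stale_read(index: dict, key: int, alt_key: int,
                               needle_flags: int, data: bytes):
    """Sketch of the lazy delete check listed above.

    `needle_flags` is the flags field just read from the needle on disk;
    right after a restart from the index file the in-memory entry may not
    yet reflect the deletion.
    """
    if needle_flags & FLAG_DELETED:
        meta = index[(key, alt_key)]
        # Bring the in-memory mapping up to date ...
        index[(key, alt_key)] = meta._replace(flags=meta.flags | FLAG_DELETED)
        # ... and report to the Cache that the photo was not found.
        return None
    return data   # otherwise the caller returns the photo normally
```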