Wednesday, January 18, 2017

Diving into SDCardFS: How Google’s FUSE Replacement Will Reduce I/O Overhead

Several months ago, Google added something called “SDCardFS” to the official AOSP branches for the Linux kernel. At the time, the move was noticed only by some kernel developers, but otherwise flew under the radar of most users. No surprise there considering the fact that most users, including myself, do not really know what goes on under the hood of the Android OS and its kernel.


However, the most recent episode of the Android Developers Backstage podcast renewed interest in this topic. The podcast, hosted by Chet Haase (a senior software engineer at Google), explored recent and upcoming changes made to the kernel. On the show was a Linux kernel developer working on the Android team – Rom Lemarchand. The duo primarily discussed what changes were made to accommodate A/B updates, but in the last 5 minutes of the episode Mr. Lemarchand talked about “the next big thing” that his team was working on – SDCardFS.


I must admit that I only learned about the existence of SDCardFS after listening to this podcast. Of course, I wasn’t the only one to take interest in this topic, as a recent Reddit thread has shown. However, I wasn’t satisfied with the basic explanation that was offered in the podcast, and in an effort to dispel some of the misinformation being spread around, I did some research of my own and talked to a few experts with relevant knowledge on the matter.


Major thanks to software developer Michal Kowalczyk for contributing his knowledge to this article and for taking the time to answer my questions.



“External” Is Really Internal


Right off the bat, there are bound to be some misconceptions we have to clear up – otherwise the rest of the article will be very confusing. It’s helpful to discuss the history of SD cards and Android phones.


In the early days of Android phones, nearly every device relied on its microSD card for storage. This was due to the fact that phones at the time shipped with minuscule internal storage capacities. However, SD cards used for storing applications often did not provide a stellar user experience, at least compared to the speed with which internal flash memory can read/write data. Therefore, the increasing use of SD cards for external data storage was becoming a user experience concern for Google.


Due to the early proliferation of SD cards as external storage devices, Android’s storage naming conventions were based around the fact that every device had an actual, physical microSD card slot. But even on devices that didn’t contain an SD Card slot, the /sdcard label was still used to point to the actual internal storage chip. More confusing is the fact that devices that utilized both a physical SD card as well as a high capacity storage chip for storage would often name their partitions based around the SD Card. For instance, in these devices the /sdcard mount point would refer to the actual internal storage chip, whereas something like /storage/sdcard1 would refer to the physical external card.


Thus, even though the microSD card is practically considered to be external storage, the naming convention resulted in “SDCard” sticking around long past any actual use of a physical card. This confusion with storage also caused headaches for application developers, because application data and its media were segregated between the two partitions.


Because early internal storage chips offered so little space, users were frustrated to find that they could no longer install applications (due to the /data partition being full). Meanwhile, their larger capacity microSD cards were relegated to holding only media (such as photos, music, and movies). Users who browsed our forums back in the day might remember these names: Link2SD and Apps2SD. These were (root) solutions that enabled users to install their applications and their data on the physical SD card. But these were far from perfect solutions, so Google had to step in.


Famously, Google pulled the plug on SD cards very early on. The Nexus One remains the only Nexus device with a microSD card slot (and it forever will be, since the Nexus brand is effectively dead). With the Nexus S, there was now only one, unified partition for storing all application data and media – the /data partition. What was once known as the /sdcard mount point now simply referred to a virtual filesystem (implemented under the FUSE protocol as discussed below) located in the data partition – /data/media/0.


In order to maintain compatibility and reduce confusion, Google still used this now-virtual “sdcard” partition to hold media. But now that this “sdcard” virtual partition was actually located within /data, anything stored within it would count against the storage space of the internal storage chip. Thus, it was up to OEMs to consider how much space to allocate to applications (/data) versus media (/data/media).



Two Very Different “SD Cards”



Google was hoping for manufacturers to follow their example and get rid of SD cards. Thankfully, over time phone manufacturers were able to source these components at higher capacities while remaining cost effective, so the need for SD cards was beginning to run thin. But the naming conventions have persisted to reduce the amount of effort developers and OEMs would have to make to adjust. Currently, when we refer to “external storage” we are referring to either one of two things: the actual removable microSD card or the virtual “SDCard” partition located in /data/media. The latter of these, practically speaking, is actually internal storage, but Google’s naming convention differentiates it due to the fact that this data is accessible to the user (such as when plugged into the computer).







The History of Android’s Virtual Filesystems


Now that “sdcard” was treated as a virtual filesystem, it could be formatted as any filesystem that Google wanted. Starting with the Nexus S and Android 2.3, Google chose to format “sdcard” as VFAT (virtual FAT). This move made sense at the time, since mounting VFAT would allow nearly any computer to access the data stored on your phone. However, there were two major issues with this initial implementation.


The first primarily concerns the end user (you). In order to connect your device to your computer, you would be using USB Mass Storage Mode to transfer data. This, however, required the Android device to unmount the virtual partition before the computer could access the data. If a user wanted to use their device while plugged in, many things would show as unavailable.



The introduction of the Media Transfer Protocol (MTP) solved this first issue. When plugged in, your computer sees your device as a “media storage” device. It requests a list of files from your phone, and MTP returns a list of files that the computer can download from the device. When a file is requested to be deleted, MTP sends a command to remove the requested file from storage. Unlike USB Mass Storage Mode which actually mounts the “sdcard”, MTP allows the user to continue using their device while plugged in. Furthermore, the file system present on the Android phone no longer matters for the computer to recognize the files on the device.


Secondly, there was the fact that VFAT did not provide the kind of robust permission management that Google needed. Early on, many application developers would treat the “sdcard” as a dumping ground for their application’s data, with no unified sense of where to store their files. Many applications would simply create a folder named after the app and store their files there.


Nearly every application out there at the time required the WRITE_EXTERNAL_STORAGE permission to write their application files to the external storage. However, what was more troubling was the fact that nearly every application also required the READ_EXTERNAL_STORAGE permission – just to read their own data files! This meant that applications could easily have access to data stored anywhere on the external storage, and such a permission was often granted by the user because it was required for many apps to even function.


Google clearly saw this as problematic. The whole idea behind permission management is to segregate what apps can and cannot have access to. If nearly every app is being granted read access to potentially sensitive user data, then the permission is meaningless. Thus, Google decided they needed a new approach. That is where FUSE comes in.



Filesystem in Userspace (FUSE)


Starting with Android 4.4, Google decided to no longer mount the virtual “sdcard” partition as VFAT. Instead, Google began to use FUSE to emulate FAT32 on the “sdcard” virtual partition. With the sdcard program calling FUSE to emulate FAT-on-sdcard style directory permissions, applications could begin accessing their data stored on external storage without requiring any permissions. Indeed, starting with API Level 19, READ_EXTERNAL_STORAGE was no longer required to access files located on external storage, provided the data folder created by the FUSE daemon matches the app’s package name. FUSE would handle synthesizing the owner, group, and modes of files on external storage when an application is installed.
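To make the rule above a little more concrete, here is a minimal C sketch of the kind of check involved: an app may use paths under its own Android/data/<package> folder (the location that getExternalFilesDir() points into) without holding a storage permission. The function name app_owns_path is hypothetical and this is not the sdcard daemon's actual code; a real implementation would also have to resolve things like ".." in paths and handle multiple users.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: does `rel` (a path relative to the root of
 * /sdcard) fall inside the app-specific folder Android/data/<package>
 * that belongs to `package`? Illustrates the idea only; it is not the
 * sdcard daemon's real permission code and does not normalize "..". */
bool app_owns_path(const char *package, const char *rel)
{
    static const char prefix[] = "Android/data/";
    size_t plen = sizeof prefix - 1;       /* length without the NUL */
    size_t pkglen = strlen(package);

    if (strncmp(rel, prefix, plen) != 0)
        return false;                      /* outside Android/data */
    rel += plen;
    if (strncmp(rel, package, pkglen) != 0)
        return false;                      /* some other package's folder */
    /* Reject look-alikes such as "com.example.appx" for "com.example.app". */
    return rel[pkglen] == '\0' || rel[pkglen] == '/';
}
```

For example, com.example.app would be let into Android/data/com.example.app/files, but not into Android/data/com.example.appx or DCIM/Camera.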


FUSE differs from in-kernel modules as it allows for non-privileged users to write virtual filesystems. The reason Google implemented FUSE is rather simple – it did what they wanted and was already well understood and documented in the world of Linux. To quote a Google developer on the matter:



“Because FUSE is a nice stable API, there is essentially zero maintenance work required when moving between kernel versions. If we migrated to an in-kernel solution, we’d be signing up for maintaining a set of patches for each stable kernel version.” -Jeff Sharkey, Software Engineer at Google



However, it was becoming quite clear that FUSE’s overhead was causing a performance hit, among other issues. The developer I spoke to regarding this matter, Michal Kowalczyk, penned an excellent blog post over a year ago detailing the current issues with FUSE. More technical details can be read on his blog, but I will describe his findings (with his permission) in more layman’s terms.



The Problem with FUSE


In Android, the “sdcard” userspace daemon opens /dev/fuse and uses it to mount a FUSE filesystem on the emulated external storage directory at boot. After that, the sdcard daemon polls the FUSE device for any pending messages from the kernel. If you listened to the podcast, you might have heard Mr. Lemarchand refer to FUSE introducing overhead during I/O operations – here is essentially what happens.






Problem #1 – I/O Overhead


Let us say that we create a simple text file, called “test.txt”, and store it in /sdcard/test.txt (which, let me remind you, is actually /data/media/0/test.txt assuming the current user is the primary user on the device). If we wanted to read this file with the cat command, we would expect the system to issue 3 system calls: open, read, then close. Indeed, as Mr. Kowalczyk demonstrates using strace, that is what happens.
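For reference, this is roughly what those three system calls look like from a program's point of view. It is a minimal C sketch, not Mr. Kowalczyk's test code, and the function name read_with_three_syscalls is made up for illustration; note that read may be called more than once until it reports end of file.

```c
#include <fcntl.h>      /* open(), O_RDONLY */
#include <stdio.h>      /* perror() */
#include <sys/types.h>  /* ssize_t */
#include <unistd.h>     /* read(), close() */

/* Read up to `cap` bytes of `path` into `buf` using the same system
 * calls strace shows for cat: open, read (repeated until EOF or the
 * buffer is full), close. Returns bytes read, or -1 on error. */
ssize_t read_with_three_syscalls(const char *path, char *buf, size_t cap)
{
    int fd = open(path, O_RDONLY);                       /* 1. open */
    if (fd < 0) {
        perror("open");
        return -1;
    }
    size_t total = 0;
    while (total < cap) {
        ssize_t n = read(fd, buf + total, cap - total);  /* 2. read */
        if (n < 0) {
            perror("read");
            close(fd);
            return -1;
        }
        if (n == 0)
            break;                                       /* end of file */
        total += (size_t)n;
    }
    close(fd);                                           /* 3. close */
    return (ssize_t)total;
}
```

When the path points into /sdcard, every one of these calls takes the detour through the sdcard daemon.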



But because the file is located on the external storage which is managed by the sdcard daemon, there are many additional operations that need to be performed. According to Mr. Kowalczyk, there are essentially 8 additional steps needed for each of these 3 individual commands:


  1. Userspace application issues system call that will be handled by FUSE driver in kernel (we see it in the first strace output)
  2. FUSE driver in kernel notifies userspace daemon (sdcard) about new request
  3. Userspace daemon reads /dev/fuse
  4. Userspace daemon parses command and recognizes file operation (ex. open)
  5. Userspace daemon issues system call to the actual filesystem (EXT4)
  6. Kernel handles physical data access and sends data back to the userspace
  7. Userspace modifies (or not) data and passes it through /dev/fuse to kernel again
  8. Kernel completes original system call and moves data to the actual userspace application (in our example cat)
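Steps 3 through 7 happen inside the sdcard daemon. The sketch below models them in C under some loudly stated simplifications: the request and reply layouts (toy_request, toy_reply) are hypothetical and far simpler than the real struct fuse_in_header and struct fuse_read_in from <linux/fuse.h>, only a read request is handled, and the raw request bytes are passed in directly instead of being read from /dev/fuse. The opcode value 15 does match FUSE_READ in the Linux FUSE headers.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical, simplified request layout (not the real FUSE structs). */
enum { OP_READ = 15 };     /* 15 matches FUSE_READ in <linux/fuse.h> */

struct toy_request {
    uint32_t opcode;       /* which file operation the kernel wants */
    uint64_t unique;       /* request id, echoed back in the reply */
    uint64_t offset;       /* where to read in the file */
    uint32_t size;         /* how many bytes the application asked for */
};

struct toy_reply {
    uint64_t unique;
    int32_t  error;        /* 0 or a negative errno, as in FUSE replies */
    uint32_t len;          /* bytes of payload in data[] */
    char     data[4096];
};

/* Steps 4-7: parse the request, make a second system call against the
 * lower filesystem (backing_fd is a file on EXT4), and build the reply
 * that would be written back to /dev/fuse. Step 3, reading the raw
 * bytes from /dev/fuse, is represented by the raw/raw_len arguments. */
int handle_request(const void *raw, size_t raw_len, int backing_fd,
                   struct toy_reply *out)
{
    struct toy_request req;
    if (raw_len < sizeof req)
        return -1;                          /* truncated request */
    memcpy(&req, raw, sizeof req);          /* step 4: parse */

    memset(out, 0, sizeof *out);
    out->unique = req.unique;
    if (req.opcode != OP_READ) {
        out->error = -ENOSYS;               /* not handled in this sketch */
        return 0;
    }
    size_t want = req.size < sizeof out->data ? req.size : sizeof out->data;
    /* step 5: the daemon's own system call into the real filesystem */
    ssize_t n = pread(backing_fd, out->data, want, (off_t)req.offset);
    if (n < 0) {
        out->error = -errno;
        return 0;
    }
    out->len = (uint32_t)n;                 /* step 7: payload for the kernel */
    return 0;
}
```

The thing to notice is the pread() call: the daemon has to make a second, ordinary system call into the kernel to fetch the data and then hand the result back, which is where the extra user/kernel round trips come from.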

This seems like a lot of overhead just for a single I/O command to run. And you would be right. To demonstrate this, Mr. Kowalczyk attempted two different I/O tests: one involving copying a large file and the other copying lots of small files. He compared the speed of FUSE (on the virtual partition mounted as FAT32) handling these operations versus the kernel (on the data partition formatted as EXT4), and he found that FUSE was indeed contributing significant overhead.



In the first test, he copied a 725MB file under both test conditions. He found that the FUSE implementation transferred large files 17% more slowly.



In the second test, he copied 10,000 files, each of them 5KB in size. In this scenario, the FUSE implementation was over 40 seconds slower to copy roughly 50MB worth of data.
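Putting the second test in per-file terms helps explain why small files suffer most. Assuming the extra time is spread evenly across the files (an averaging assumption, not something the test measured directly):

```latex
\begin{aligned}
\text{data copied} &= 10\,000 \times 5\ \text{KB} = 50\,000\ \text{KB} \approx 50\ \text{MB} \\
\text{extra time per file} &> \frac{40\ \text{s}}{10\,000\ \text{files}} = 4\ \text{ms}
\end{aligned}
```

Each file pays for its own set of FUSE round trips, so the cost grows with the number of files rather than with the amount of data.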


In the real world, this performance hit affects any file stored on external storage. This means apps such as Maps storing large files on /sdcard, Music apps storing tons of music files, Camera apps and photos, etc. Any I/O operation being carried out that involves the external storage is affected by FUSE’s overhead. But I/O overhead isn’t the only issue with FUSE.


Problem #2 – Double Caching


Caching of data is important in improving data access performance. By storing essential pieces of data in memory, the Linux kernel is able to quickly recall that data when needed. But due to the way FUSE is implemented, Android stores double the amount of cache that is needed.


As Mr. Kowalczyk demonstrates, a 10MB file is expected to take up exactly 10MB in the cache, but instead increases the cache size by around 20MB. This is problematic on devices with less RAM, as the Linux kernel uses the page cache to store data in memory, and every duplicated page is memory that could otherwise hold something else. Mr. Kowalczyk tested this double caching issue using this approach:


  1. Create a file with a known size (for testing, 10MBs)
  2. Copy it to /sdcard
  3. Drop the page cache
  4. Take a snapshot of the page cache use
  5. Read the test file
  6. Take another snapshot of the page cache use
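Steps 4 and 6 amount to reading the "Cached:" figure from /proc/meminfo. Below is a minimal C sketch of that snapshot; the helper names are hypothetical and this is not Mr. Kowalczyk's script. Step 3 is usually done as root with echo 3 > /proc/sys/vm/drop_caches (after a sync), which the sketch does not attempt.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the "Cached:" value (in kB) from text in /proc/meminfo format,
 * or -1 if the line is missing. Only a line that starts with "Cached:"
 * counts, so "SwapCached:" is not mistaken for it. */
long cached_kb_from_text(const char *meminfo)
{
    const char *line = meminfo;
    while (line && *line) {
        if (strncmp(line, "Cached:", 7) == 0)
            return strtol(line + 7, NULL, 10);  /* strtol skips the spaces */
        line = strchr(line, '\n');
        if (line)
            line++;                             /* move to the next line */
    }
    return -1;
}

/* Snapshot of the live system's page cache use (steps 4 and 6). */
long cached_kb_now(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f)
        return -1;
    char buf[8192];
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    return cached_kb_from_text(buf);
}
```

Calling cached_kb_now() before and after reading the test file and subtracting gives the growth of the page cache in kB; without double caching, that difference should be close to the file's size.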

What he found was that prior to his test, 241MB were being used by the kernel for page cache. Once he read his test file, he expected to see 251MB utilized for page cache. Instead, he found that the kernel was using 263MB for page cache, an increase of 22MB rather than 10MB, or about twice what was expected. This occurs because the kernel caches the data twice: once for the FUSE file that the application read, and again for the underlying EXT4 file that the sdcard daemon read on its behalf.


Problem #3 – Incomplete Implementation of FAT32


There are two more issues stemming from the use of FUSE emulating FAT32 that are less widely known in the Android community.


The first involves incorrect timestamps. If you’ve ever transferred a file (such as a photo) and noticed that the timestamp is incorrect, it’s because of Android’s implementation of FUSE. This issue has existed for years. To be more specific, the issue involves the utime() system call which allows you to change the access and modification time of a file. Unfortunately, calls made to the sdcard daemon as a standard user do not have the proper permission to execute this system call. There are workarounds for this, but they require you to have root access.
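The call in question is easy to show. This minimal C sketch sets a file's access and modification times the way a file manager or copy tool might when preserving a photo's original date; preserve_timestamp is a hypothetical name. On an ordinary file owned by the caller it succeeds, while on the FUSE-backed /sdcard, per the issue described above, a standard user is not permitted to make the change.

```c
#include <errno.h>      /* errno */
#include <stdio.h>      /* fprintf() */
#include <string.h>     /* strerror() */
#include <sys/stat.h>   /* stat(), used by callers to verify */
#include <time.h>       /* time_t */
#include <utime.h>      /* utime(), struct utimbuf */

/* Set both the access and modification time of `path` to `when`.
 * Returns 0 on success, -1 (after printing the reason) on failure,
 * which is what a standard user would hit on the FUSE-backed /sdcard. */
int preserve_timestamp(const char *path, time_t when)
{
    struct utimbuf times = { .actime = when, .modtime = when };
    if (utime(path, &times) != 0) {
        fprintf(stderr, "utime(%s): %s\n", path, strerror(errno));
        return -1;
    }
    return 0;
}
```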






The next problem is more concerning for businesses using something like a smartSD card. Prior to FUSE, app makers could open files with the O_DIRECT flag, which bypasses the page cache, in order to communicate with an embedded microcontroller in the card. With FUSE, developers can only access the cached version of a file, and are unable to see any responses sent by the microcontroller. This is problematic for some enterprise/government/banking apps which communicate with value-added microSD cards.
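For readers unfamiliar with O_DIRECT, it asks the kernel to transfer data directly between the device and the application's buffer instead of going through the page cache. The following C sketch shows only that general technique; read_first_block_direct and the 4096-byte alignment are assumptions for illustration, and whatever protocol smartSD vendors run on top of this is not described in the article and is not shown.

```c
#define _GNU_SOURCE     /* exposes O_DIRECT in <fcntl.h> on glibc */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK 4096      /* assumed alignment; real devices report their own */

/* Read the first block of `path` with O_DIRECT so the data comes from
 * the device rather than the page cache. On success, stores a buffer
 * the caller must free() in *out and returns the byte count; otherwise
 * returns a negative errno (some file systems reject O_DIRECT with
 * -EINVAL). */
ssize_t read_first_block_direct(const char *path, void **out)
{
    void *buf = NULL;
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0)  /* O_DIRECT needs aligned buffers */
        return -ENOMEM;
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        int err = errno;
        free(buf);
        return -err;
    }
    ssize_t n = read(fd, buf, BLOCK);
    int err = errno;                              /* save before close() */
    close(fd);
    if (n < 0) {
        free(buf);
        return -err;
    }
    *out = buf;
    return n;
}
```

O_DIRECT comes with strict rules about buffer, offset, and size alignment, which is why the sketch allocates an aligned buffer and reads a whole block from offset 0.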



Dumping FUSE for SDCardFS


Some OEMs recognized these problems early on, and began looking for an in-kernel solution to replace FUSE. Samsung, for example, developed SDCardFS, which is based on WrapFS. This in-kernel solution emulates FAT32 just like FUSE does, but forgoes the I/O overhead, double caching, and other issues I’ve mentioned above. (Yes, let me reiterate that point: this solution that Google is now implementing is based on Samsung’s work.)


Google themselves have finally acknowledged the drawbacks associated with FUSE, which is why they have begun moving towards the in-kernel FAT32 emulation layer developed by Samsung. The company, as mentioned in the Android Developers Backstage podcast, has been working on making SDCardFS available for all devices in an upcoming version of the kernel. You can currently see the progress of their work in AOSP.


As a Google developer explained earlier, the biggest challenge with implementing an in-kernel solution is how to map the package name to application ID necessary for a package to access its own data in external storage without requiring any permissions. But that statement was made a year ago, and we’ve reached the point where the team is calling SDCardFS their “next big thing.” They’ve already confirmed that the dreaded timestamp error has been fixed, thanks to moving away from FUSE, so we can look forward to seeing all of the changes brought forth with the abandonment of FUSE.



Fact-Checking Misconceptions


If you’ve reached this far into the article, then kudos for keeping up with everything so far! I wanted to clarify a few questions I had myself when writing this article:


  • SDCardFS has nothing to do with actual SD Cards. It is just named as such because it handles I/O access for /sdcard. And as you might recall, /sdcard is an outdated label referring to the “external” storage of your device (where apps store their media).

  • SDCardFS is not a traditional file system like FAT32, EXT4, or F2FS. It is a stackable wrapper file system that passes commands down to the lower file system that actually holds /data/media (typically EXT4), while presenting FAT32-style behavior on /sdcard.

  • Nothing will change with respect to MTP. You will continue using MTP to transfer files to/from your computer (until Google settles on a better protocol). But at least the timestamp error will be fixed!

  • As mentioned before, when Google refers to “External Storage” they are either talking about the (for all intents and purposes) internal /sdcard virtual FAT32 partition OR they are talking about an actual, physical, removable microSD card. The terminology is confusing, but it’s what we’re stuck with.
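The second point above, that SDCardFS is a stackable wrapper, can be modeled in a few lines of C. This is a hypothetical userspace toy in the spirit of WrapFS, not kernel code: the wrapper keeps a pointer to the lower file system's operations, applies its own policy, and forwards the call. Path translation from /sdcard to /data/media, real permission logic, and every other operation are left out.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* A "file system" in this toy is just a table of operations plus state. */
struct fs_ops {
    ssize_t (*read)(void *self, const char *path, char *buf, size_t len);
    void *self;                       /* per-file-system state */
};

/* Lower layer: stands in for the real file system under /data/media. */
struct mem_fs { const char *path; const char *contents; };

static ssize_t mem_read(void *self, const char *path, char *buf, size_t len)
{
    struct mem_fs *fs = self;
    if (strcmp(path, fs->path) != 0)
        return -1;                    /* not found */
    size_t n = strlen(fs->contents);
    if (n > len)
        n = len;
    memcpy(buf, fs->contents, n);
    return (ssize_t)n;
}

/* Upper layer: the wrapper adds its own policy (a toy permission flag)
 * and otherwise passes the call straight down to the lower layer. */
struct wrap_fs { struct fs_ops lower; int allow_read; };

static ssize_t wrap_read(void *self, const char *path, char *buf, size_t len)
{
    struct wrap_fs *w = self;
    if (!w->allow_read)
        return -1;                    /* policy check done by the wrapper */
    return w->lower.read(w->lower.self, path, buf, len);  /* forward */
}
```

In SDCardFS the forwarding happens inside the kernel, so there is no trip out to a userspace daemon and back for each operation.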


Conclusion


By moving away from FUSE and implementing an in-kernel FAT32 emulation layer (SDCardFS), Google will be reducing significant I/O overhead, eliminating double caching, and solving some obscure issues related to its FUSE-based emulation of FAT32.


Since these changes will be made to the kernel, they can be rolled out without a major new version of Android alongside them. Some users are expecting to see these changes officially implemented in Android 8, but it’s possible for any future OTA on a Pixel device to bring the Linux kernel version 4.1 that Google has been working on.


For some of you, SDCardFS is not a new concept. In fact, Samsung devices have been utilizing it for years (they were the ones to develop it after all). Ever since SDCardFS was introduced in AOSP last year, some custom ROM and kernel developers have chosen to implement it into their work. CyanogenMod at one point considered implementing it, but rolled it back when users encountered issues with their photos. But hopefully with Google taking the reins on this project, Android users on all future devices can take advantage of the improvements introduced with abandoning FUSE.
