My NFS story: In my first job, we used NFS to maintain the developer desktops. They were all FreeBSD and remote mounted /usr/local. This worked great! Everyone worked in the office with fast local internet, and it made it easy for us to add or update apps and have everyone magically get it. And when the NFS server had a glitch, our devs could usually just reboot and fix it, or wait a bit. Since they were all systems developers they all understood the problems with NFS and the workarounds.
What I learned though was that NFS was great until it wasn't. If the server hung, all work stopped.
When I got to reddit, solving code distribution was one of the first tasks I had to take care of. Steve wanted to use NFS to distribute the app code. He wanted to have all the app servers mount an NFS mount, and then just update the code there and have them all automatically pick up the changes.
This sounded great in theory, but I told him about all the gotchas. He didn't believe me, so I pulled up a bunch of papers and blog posts, and actually set up a small cluster to show him what happens when the server goes offline, and how none of the app servers could keep running as soon as they had to get anything off disk.
To his great credit, he trusted me after that when I said something was a bad idea based on my experience. It was an important lesson for me that even with experience, trust must be earned when you work with a new team.
I set up a system where app servers would pull fresh code on boot and we could also remotely trigger a pull or just push to them, and that system was reddit's deployment tool for about a decade (and it was written in Perl!)
I was at Apple around 15 years ago working as a sysadmin in their hardware engineering org, and everything - and I mean everything - was stored on NFS. We ran a ton of hardware simulation, all the tools and code were on NFS as well as the actual designs and results.
At some point a new system came around that was able to make really good use of the hardware we had, and it didn’t use NFS at all. It was more “docker” like, where jobs ran in containers and had to pre-download all the tools they needed before running. It was surprisingly robust, and worked really well.
The designers wanted to support all of our use cases in the new system, and came to us about how to mount our NFS clusters within their containers. My answer was basically: let’s not. Our way was the old way, and their way was the new way, and we shouldn’t “infect” their system with our legacy NFS baggage. If engineers wanted to use their system they should reformulate their jobs to declare their dependencies up front and use a local cache, and accept all the other reasonable constraints their system had. They were surprised by my answer, but I think it worked out well in the end: it was the impetus for things to finally move off the legacy infrastructure.
NFS volumes (for home dirs, SCM repos, tools, and data) were a godsend for workstations with not enough disk, and when not everyone had a dedicated workstation (e.g., university), and for diskless workstations (which we used to call something rude, and now call "thin clients"), and for (an ISV) facilitating work on porting systems.
But even when you needed a volume only very infrequently, if there was a server or network problem, then even doing an `ls -l` in the directory where the volume's mount point was would hang the kernel.
Now that we often have 1TB+ of storage locally on a laptop workstation (compare to the 100MB default of an early SPARCstation), I don't currently need NFS for anything. But NFS is still a nice tool to have in your toolbox, for some surprise use case.
> To his great credit, he trusted me after that when I said something was a bad idea based on my experience. It was an important lesson for me that even with experience, trust must be earned when you work with a new team.
True, though, on a risky moving-fast architectural decision, even with two very experienced people, it might be reasonable to get a bit more evidence.
And in that particular case, it might be that one or both of you were fairly early in your career, and couldn't just tell that they could bet on the other person's assessment.
Though there are limits to needing to re-earn trust from scratch with a new team. For example, the standard FAANG-bro interview of everyone having to start from scratch for credibility, like they are fresh out of school with zero track record, and zero better ways to assess, is ridiculous. The only thing more ridiculous is when companies that pay vastly less try to mimic that interview style. Every time I see that, I think that this company apparently doesn't have experienced engineers on staff who can get a better idea just by talking with someone, rather than through a fratbro hazing ritual.
> Now that we often have 1TB+ of storage locally on a laptop workstation (compare to the 100MB default of an early SPARCstation), I don't currently need NFS for anything.
While diskless (or very limited disk) workstations were one use case for NFS, that was far from the primary one.
The main use case was to have a massive shared filesystem across the team, or division, or even whole company (as we did at Sun). You wouldn't want to be duplicating these files locally no matter how much local disk, the point was to have the files be shared amongst everyone for collaboration.
NFS was truly awesome, it is sad that everything these days is subpar. We use weak substitutes like having files on shared google drives, but that is so much inferior to having the files of the entire company mounted on the local filesystem through NFS.
(Using the past tense, since it's not used so much anymore, but my home fileserver exports directories over NFS which I mount on all other machines and laptops at home, so very much using it today, personally.)
Other things that changed were the Web, and the popularity of Git.
For example, one of the big uses of NFS we had was for engineering documents, all of which could be accessed from FrameMaker or Interleaf running on your workstation. Nowadays, all the engineering documentation and more would be accessed through a Web browser from a non-NFS server, and no longer on a shared filesystem.
Another use of NFS we had was for collaborating on shared code by some projects, with SCM storing to NFS servers (other projects used DSEE and ClearCase). But nowadays almost everyone in industry uses distributed Git, syncing to non-NFS servers, with cached copies on their local storage.
I suppose a third thing that changed was CSCW distributed change syncing becoming popular and moving into other tools, such as live "shared whiteboard" document editing that people can access in their Web browsers. I have mixed feelings about some of the implementations and how they're deployed, but it's pretty wild to have 4 remote people during Covid editing a document in real time at once, and NFS isn't helping with the hard part of that.
Right now, the use case for NFS that first comes to mind is individual humans working with huge files (e.g., for AI training, or other big data), where you want the convenience of being able to access them with any tool from your workstation, and maybe also have big compute servers working with them, without copying things around. You could sorta do these things with big complicated MLops infrastructure, but sometimes that slows you down more than it speeds you up.
Interesting. I self-host Forgejo or GitLab, with SSH or HTTPS access from workstations' local repos, to the "origin" Git server.
The advantage you find to NFS for this is that you share workspaces between the client machines? Or reduce the local storage requirements on the client machines?
Mainly so I don't need to run any source control server, it's all just files.
Same for mercurial. Most of my internal use repositories are mercurial since it's so much more pleasant to use than git and for my hobby time I want pleasant tools that don't hate me. But I digress..
It's the model I've used since the 90s in the days of teamware at Sun.
That was still in place at least when I left, and I'd be amazed if it got replaced. It was one of those wonderful pieces of infrastructure that you rarely even notice because it just quietly works the whole time.
NCSA also used it for some data archival and I believe for hosting the website files.
I looked up at one point whatever happened to AFS and it turns out that it has some Amdahl’s Law glass ceiling that ultimately limits the aggregate bandwidth to something around 1 GBps, which was fine when it was young but not fine when 100Mb Ethernet was ubiquitous and gigabit was obtainable with deep enough pockets. If adding more hardware can’t make the filesystem faster you’re dead.
I don’t know if or how openAFS has avoided these issues.
The Amdahl's Law limitations are specific to the implementation and not at all tied to the protocols. The 1990 AFS 3.0 server design was built upon a cooperative threading system ("Light Weight Processes") designed by James Gosling as part of the Andrew Project. Cooperative processing influences the design of the locking model since there isn't any simultaneous execution between tasks. When the AFS fileserver was converted to pthreads for AFS 3.5, the global state of each library was protected by wrapping it with a global mutex. Each mutex was acquired when entering the library and dropped when exiting it. To complete any fileserver RPC required acquisition of at least six or seven global mutexes depending upon the type of vnode being accessed. In practice, the global mutexes restricted the fileserver process to 1.7 cores regardless of how many cores were present in the system.
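As a rough illustration of why global mutexes cap scaling like this, here is Amdahl's Law as a few lines of Python. The serial fraction is an assumption, back-solved so that the asymptotic limit lands on the 1.7-core figure quoted above; it is not a measured property of any fileserver.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Amdahl's Law: speedup on `cores` when `serial_fraction` of the work is serialized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


# Assumption: pick the serial fraction whose limit (1 / serial_fraction) is 1.7 cores.
SERIAL_FRACTION = 1.0 / 1.7

for cores in (1, 2, 4, 16, 64):
    # Speedup approaches 1.7 from below no matter how many cores are added.
    print(cores, round(amdahl_speedup(SERIAL_FRACTION, cores), 2))
```

Under that assumption, adding cores past the first handful buys almost nothing, which is the "glass ceiling" described earlier in the thread.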
AuriStor's RX and UBIK protocol and implementation improvements would be worthless if the application services couldn't scale. To accomplish this required converting each subsystem so it could operate with minimal lock contention.
This 2023 presentation by Simon Wilkinson describes the improvements that were made to AuriStor's RX implementation up to that point.
> In practice, the global mutexes restricted the fileserver process to 1.7 cores regardless of how many cores were present in the system.
So in theory the bandwidth could scale with single CPU and/or point to point bandwidth but cannot scale horizontally at all. Except on the new implementations.
Correct, and the point-to-point bandwidth is limited by the maximum RX window size because of the bandwidth delay product. As round-trip latency increases, at some point the window size becomes insufficient to keep the pipe full, at which point data transfers stall.
One site which recently lifted and shifted their AFS cell to a cloud made the following observations:
We observed the following performance while copying a 1g file from local disk into AFS.
AuriStor Client (2021.05-65) -> OpenAFS server (1.6.24): 3m.11s
AuriStor Client (2021.05-65) -> AuriStor Server (2021.05-65): 1m
AuriStor Client (2025.00.11) -> AuriStor Server (2025.00.11): 30s
All of the above tests were performed from clients located on campus to fileservers located in the cloud.
There are many RX implementation differences between the three versions. It is important to note that the window size grows from 32 -> 128 -> 512.
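To make the window-size point concrete, here is a back-of-the-envelope sketch of the bandwidth-delay limit: a sender can have at most one window of data in flight per round trip. The per-packet payload size and the round-trip time below are assumptions picked for illustration only; they are not taken from the site's measurements, and real transfers depend on many other implementation differences.

```python
def max_throughput_bytes_per_s(window_packets: int, payload_bytes: int, rtt_s: float) -> float:
    """Upper bound on throughput when at most one window can be in flight per round trip."""
    return window_packets * payload_bytes / rtt_s


PAYLOAD_BYTES = 1400   # assumption: payload carried per packet
RTT_S = 0.020          # assumption: 20 ms campus-to-cloud round trip

for window in (32, 128, 512):
    rate = max_throughput_bytes_per_s(window, PAYLOAD_BYTES, RTT_S)
    print(f"window={window:4d} packets -> at most {rate / 1e6:.2f} MB/s")
```

The bound scales linearly with the window and inversely with latency, so a window that is fine on a LAN can become the bottleneck once the fileserver moves further away.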
I may be confusing two systems but I believe that AFS system also encompassed the first iteration of “AWS Glacier” I encountered in the wild. A big storage that required queuing a job to a tape array or pinging an undergrad to manually load something for retrieval.
AFS implements weak consistency, which may be a bit surprising. It also seems to share objects, not block devices. Judging by its features, it seems to make most sense when there is a cluster of servers. It looks cool though, a bit more like S3 than like NFS.
The cephfs model of a file system logically constructed from an object store closely mirrors the AFS architecture. The AFS fileserver is horribly misnamed. Whereas AFS 1.0 fileserver exported the contents of local filesystems much as NFS and CIFS do, AFS 2.x/3.x/OpenAFS/AuriStorFS fileservers export objects (aka vnodes) which are stored in an object store. Each AFS vice partition stored zero or more object stores each consisting of the objects belonging to a single volume group. A volume group consists of one or more of the RWVOL, ROVOL and/or BACKVOL instances.
The AFS consistency model is fairly strong. Each client (aka cache manager) is only permitted to access the data/metadata of a vnode if it has been issued a callback promise from the AFS fileserver. File lock transitions, metadata modifications, and data modifications as well as volume transactions cause the fileserver to break the promise, at which point the client is required to fetch updated status information before it can decide whether it is safe to reuse the locally cached data.
Unlike optimistic locking models, the AFS model permits cached data to be validated after an extended period of time by requesting up to date metadata and a new callback promise.
An AFS fileserver will not permit a client to perform a state changing operation as long as there exist broken callback promises which have yet to be successfully delivered to the client.
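A toy model of the callback-promise idea, for anyone who finds code easier to follow than prose. This is a deliberately simplified sketch: the class and method names are made up for illustration, there is no network, and it does not model the rule that state changes wait on undelivered callback breaks. It is not how any real AFS implementation is structured.

```python
class FileServer:
    def __init__(self):
        self.vnodes = {}    # vnode id -> (data_version, contents)
        self.promises = {}  # vnode id -> set of clients holding a callback promise

    def fetch_status(self, client, vnode):
        """Return the current data version and issue a callback promise."""
        version = self.vnodes[vnode][0]
        self.promises.setdefault(vnode, set()).add(client)
        return version

    def fetch_data(self, vnode):
        return self.vnodes[vnode][1]

    def store(self, writer, vnode, contents):
        """Apply a modification and break every other client's promise."""
        version = self.vnodes.get(vnode, (0, None))[0] + 1
        self.vnodes[vnode] = (version, contents)
        for client in self.promises.get(vnode, set()) - {writer}:
            client.callback_broken(vnode)
        self.promises[vnode] = {writer}
        return version


class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}        # vnode id -> (data_version, contents)
        self.promised = set()  # vnodes with a live callback promise

    def callback_broken(self, vnode):
        self.promised.discard(vnode)

    def read(self, vnode):
        if vnode not in self.promised:
            # No promise: revalidate with fresh status before trusting the cache.
            version = self.server.fetch_status(self, vnode)
            cached = self.cache.get(vnode)
            if cached is None or cached[0] != version:
                self.cache[vnode] = (version, self.server.fetch_data(vnode))
            self.promised.add(vnode)
        return self.cache[vnode][1]

    def write(self, vnode, contents):
        version = self.server.store(self, vnode, contents)
        self.cache[vnode] = (version, contents)
        self.promised.add(vnode)
```

While a client holds a promise it can serve reads from its cache without talking to the server; once the promise is broken, the next read revalidates and only refetches data if the version actually changed.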
Not everyone ignored it but unlike nfs it didn't come in the box with the operating system, and you had to pay for it. In addition, AFS provided strong cryptographic authentication and wire privacy which meant that it couldn't be licensed in many countries because the U.S. government did not grant appropriate export licenses.
I often wonder how the world would be different if AFS 3.0 could have been freely distributed world wide in 1989 precluding the need for HTTP to be developed at CERN.
There were a few technical obstacles which other people mentioned, but I think timing was biggest issue (remember--AFS dates to something like 1983-ish).
1) AFS, IIRC, required more than one machine in its original configuration. That meant hardware and sysadmins which were expensive--until, suddenly they weren't.
2) Disk, memory and bandwidth were scarce--and then they weren't. AFS made a bunch of solid architectural decisions and then wasted a bunch of time backing some of them down in deference to the hardware of the day and then all that work was wasted when Moore's Law overran everything, anyhow.
3) Everybody was super happy to be running everything locally to escape the tyranny of the "Mainframe Operator" (meaning no NFS or AFS or the like)--until they weren't. Once enough non-technical people appeared who didn't want to do system administration, like, ever, that flipped.
We lost the VMS filesystem in this timeframe, too. Which was also a distributed, remote filesystem.
Don't know about FreeBSD but hard hanging on a mounted filesystem is configurable (if it's essential configure it that way, otherwise don't). To this day I see plenty of code written that hangs forever if a remote resource is unavailable.
> Don't know about FreeBSD but hard hanging on a mounted filesystem is configurable (if it's essential configure it that way, otherwise don't).
In theory that should work, but I find that kind of non-default config option tends to be undertested and unreliable. Easier to just switch to Samba where not hanging is default/expected.
It's down to the mount options, use 'soft' and the program trying to access the (inaccessible) server gets an error return after a while, or 'intr' if you want to be able to kill the hung process.
The caveat is a lot of software is written to assume things like fread(), fopen() etc will either quickly fail or work. However, if the file is over a network obviously things can go wrong so the common default behaviour is to wait for the server to come back online. Same issue applies to any other network filesystem, different OS's (and even the same OS with different configs) handle the situation differently.
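For software that wants to avoid hanging its main thread on a possibly dead mount, one defensive pattern is to probe the path from a throwaway daemon thread and give up after a timeout. This is only a sketch: `stat_with_timeout` and its `stat_fn` parameter are names invented here, and on a hard mount the worker thread itself can stay blocked in the kernel indefinitely; the pattern only keeps the caller responsive.

```python
import os
import threading


def stat_with_timeout(path, timeout=2.0, stat_fn=os.stat):
    """Return "ok", "error", or "timeout" without blocking the caller past `timeout` seconds."""
    result = {}

    def worker():
        try:
            stat_fn(path)
            result["status"] = "ok"
        except OSError:
            result["status"] = "error"

    # Daemon thread: a worker stuck on a hung mount is not joined at interpreter exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        return "timeout"
    return result.get("status", "error")
```

A file manager could call something like `stat_with_timeout("/mnt/share")` (a hypothetical mount point) before listing a directory, and show a "server not responding" message on `"timeout"` instead of freezing.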
'After a while' usually requiring the users to wait with an unresponsive desktop environment, because they opened a file manager whilst NFS was huffing. So they'd manage to switch to a virtual terminal and then out of habit type 'ls', locking that up too.
After a few years of messing around with soft mounts and block sizes and all sorts of NFS config nonsense, I switched to SMB and never looked back
I heard rumors at first and later saw it once that the sparc lab at my university occasionally had to be shut down and turned on in a particular order to get the whole thing to spool back up after a server glitch. I think the problem got really nasty once you had NFS mounts from multiple places.
You probably gave bad advice. By the time Reddit existed, you could have just gotten a NetApp filer. They had higher availability than most data centers back then, so “the NFS server hung” wouldn’t be anywhere near the top of your “things that cause outages or interfere with engineering” list.
These days, there are plenty of NFS vendors with similar reliability. (Even as far back as NFSv3, the protocol makes it possible for the server to scale out).
I guess I have to earn your trust too. I was actually intimately familiar with Netapp filers at the time, since that is what we used to drive the NFS mounts for the desktops at the first place I mentioned. They were not as immune as you think and were not suitable.
Also, we were a startup, and a Netapp filer was way outside the realm of possibility.
Also, that would be a great solution if you have one datacenter, but as soon as you have more than one, you still have to solve the problem of syncing between the filers.
Also, you generally don't want all of your app servers to update to new code instantly, all at the same time, in case there is a bug. You want to slow roll the deploy.
Also, you couldn't get a filer in AWS, once we moved there.
And before we moved to AWS the rack was too full for a filer, I would have had to get a whole extra rack.
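For readers unfamiliar with the "slow roll" idea mentioned above, here is a minimal sketch of a batched rollout. It is not the Perl tool described earlier in the thread; `rolling_deploy`, `deploy`, and `healthy` are hypothetical names, and the two callables stand in for whatever push/pull and health-check mechanism a site actually uses.

```python
from typing import Callable, List, Sequence, Tuple


def rolling_deploy(
    servers: Sequence[str],
    batch_size: int,
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> Tuple[List[str], bool]:
    """Deploy in batches; stop at the first batch containing an unhealthy server.

    Returns (servers confirmed healthy on the new code, whether the rollout completed).
    """
    confirmed: List[str] = []
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        for server in batch:
            deploy(server)
        if not all(healthy(server) for server in batch):
            # A bad build only reaches this batch, not the whole fleet.
            return confirmed, False
        confirmed.extend(batch)
    return confirmed, True
```

The contrast with a shared NFS mount is that a bug in the new code is seen by one batch at a time, leaving the rest of the fleet on the old version while you investigate.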
FWIW, NetApps were generally pretty solid, and they should have no problem keeping in sync across datacenters. You pay handsomely for the privilege though.
Failover, latency, and so on are something you need to think about independently of what transfer protocol you use. NFS may present its own challenges with all the different extensions and flags, but that's true of any mature technology.
That said, live code updates probably aren't a very good idea anyway, for exactly the reasons you mention. Those are the reasons you were right at the time, not any inherent deficiencies in the NFS protocol.
100% this. Sometimes it's not even the filer itself. `hard` NFS mounts on clients in combination with network issues have led to downtimes where I work. Soft mounts can be a solution for read only workloads that have other means of fault tolerance in front of them, but it's not a panacea.
I haven’t seen these problems at much larger scales than are being discussed here. I’ve heard of people buying crappy nfs filers or trying to use the Linux server in prod (it doesn’t support HA!), but I’ve also heard of people losing data when they install a key value store or consensus protocol on < 3 machines.
The only counterexample involved a buggy RHEL-backported NFS client that liked to deadlock, and that couldn’t be upgraded for… reasons.
Client bugs that force a single machine/process restart can happen with any network protocol.
> You probably gave bad advice. By the time Reddit existed, you could have just gotten a NetApp filer. They had higher availability than most data centers back then, so “the NFS server hung” wouldn’t be anywhere near the top of your “things that cause outages or interfere with engineering” list.
Or distributed NFS filers like Isilon or Panasas: any particular node can be rebooted and its IPs are re-distributed among the still-live nodes. At my last job we used one for HPC and it stored >11PB with minimal hassle. OS upgrades can be done in a rolling fashion so client service is not interrupted.
Newer NFS vendors like Vast Data have all-NVME backends (Isilon can have a mix if you need both fast and archival storage: tiering can happen on (e.g.) file age).
NetApps were a game changer. Large Windows Server 2003 file servers that ran CIFS, NFS, and AFP simultaneously could take 60-90 minutes to come back online because of the resource fork enumeration scan required by AFP sharing.
I find it fascinating that the fact that NFS mounts hang the process when they don't work is due to the broken I/O model Unix historically employed.
See, unlike some other more advanced, contemporary operating systems like VMS, Unix (and early versions of POSIX) did not support async I/O; only nonblocking I/O. Furthermore, it assumed that disk-based I/O was "fast" (I/O operations could always be completed, or fail, in a reasonably brief period of time, because if the disks weren't connected and working you had much bigger problems than the failure of one process) and network-based or piped I/O was "slow" (operations could take arbitrarily long or even fail completely after a long wait); so nonblocking I/O was not supported for file system access in the general case. Well, when you mount your file system over a network, you get the characteristics of "slow" I/O with the lack of nonblocking support of "fast" I/O.
A sibling comment mentions that FreeBSD has some clever workarounds for this. And of course it's largely not a concern for modern software because Linux has io_uring and even the POSIX standard library has async I/O primitives (which few seem to use) these days.
And this is one of those things that VMS (and Windows NT) got right, right from the jump, with I/O completion ports.
But issues like this, and the unfortunate proliferation of the C programming language, underscore the price we've paid as a result of the Unix developers' decision to build an OS that was easy and fun to hack, rather than one that encouraged correctness of the solutions built on top of it.
It wasn’t until relatively recently that approaches like await became commonplace. Imagine all the software that wouldn’t have been written if they were forced to use async primitives before languages were ready for them.
Yes, it is to synchronous programming's great credit that it is simple, and to its great discredit that it is inefficient. Engineering tradeoffs, and all that.
Quote[0]:
> In Ingo's view, there are only two solutions to any operating system problem which are of interest: (1) the one which is easiest to program with, and (2) the one that performs the best. In the I/O space, he claims, the easiest approach is synchronous I/O calls and user-space processes. The fastest approach will be "a pure, minimal state machine" optimized for the specific task; his Tux web server is given as an example.
Granted, most software is not developed for the Linux kernel, but neither is asynchronous programming black magic. I think the software space has rather been negatively impacted by being slow to adopt asynchronous programming, among other old practices.
Imagine all the software that would've been written, or made much nicer, earlier on had Unix devs not been forced to use synchronous I/O primitives.
Synchronous I/O may be simple, but it falls down hard at the "complex things should be possible" bit. And people have been doing async I/O for decades before they got handholding constructs like 'async' and 'await'. Programming the Amiga, for instance, was done entirely around async I/O to and from the custom chips. The CPU needn't do much at all to blow away the PC at many tasks; just initiate DMA transfers to Paula, Denise, and Agnus.