
Your AI bottleneck isn’t compute. It’s the file system

When AI platforms misbehave in production, the instinct is to buy more GPUs. Many teams are solving the wrong problem.

The real bottleneck may not be compute, but storage. Teams add GPUs, optimize batching, tune the model, and still see latency spike unpredictably, scaling take longer than it should, and workers stall waiting for data. The compute isn’t the constraint; the infrastructure delivering data to that hardware is.


Key takeaways

  • Compute is rarely the constraint when AI platforms misbehave in production. The storage layer delivering data to that compute usually is.
  • Training and inference both demand fast, concurrent file access from many workers. That’s a file access problem, not a compute problem.
  • Common workarounds (local copies, object storage, custom pipelines) each carry hidden costs that scale poorly.
  • A high-performance shared file system gives every worker one authoritative copy and a consistent view, with no bespoke distribution logic.
  • In MLPerf testing, SMB 3 with RDMA, multichannel, and scale-out, as implemented by Fusion SMB, outperformed NFS on AI training workloads. Fusion SMB delivers SMB 3’s full potential on Linux.

Where storage pressure actually comes from 

The phase where storage matters most is training. AI training feeds models large datasets of unstructured data: files, documents, and images. It can run for days, weeks, or months. Checkpointing alone creates sustained, intensive storage demand throughout that process. Saving the model’s state at regular intervals so training can resume after failure means the cluster pauses repeatedly to write large amounts of data. In recent MLPerf llama3-8b checkpoint testing, Fusion SMB completed a 107.6 GB checkpoint write in 7.3 seconds; Samba took 27.1 seconds for the same operation. At training scale, that difference compounds across thousands of checkpoints into hours of lost compute time.
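To make the compounding concrete, here is a back-of-the-envelope sketch in Python. It uses the two checkpoint times quoted above. The count of 1,000 checkpoints is a hypothetical chosen for illustration, and the calculation assumes training pauses completely for every checkpoint write, which will not hold for every setup.

```python
# Checkpoint write times in seconds for a 107.6 GB llama3-8b checkpoint,
# as quoted in this article's MLPerf results.
FUSION_SMB_SECONDS = 7.3
SAMBA_SECONDS = 27.1

# Hypothetical number of checkpoints in a long training run (an assumption).
NUM_CHECKPOINTS = 1000


def stall_hours(seconds_per_checkpoint: float, checkpoints: int) -> float:
    """Hours spent writing checkpoints, assuming training pauses for each write."""
    return seconds_per_checkpoint * checkpoints / 3600


def extra_stall_hours(slow_seconds: float, fast_seconds: float, checkpoints: int) -> float:
    """Additional stall time of the slower implementation, in hours."""
    return stall_hours(slow_seconds, checkpoints) - stall_hours(fast_seconds, checkpoints)


extra = extra_stall_hours(SAMBA_SECONDS, FUSION_SMB_SECONDS, NUM_CHECKPOINTS)
print(f"Extra stall time over {NUM_CHECKPOINTS} checkpoints: {extra:.1f} hours")
```

Under these assumptions the 19.8-second gap per checkpoint adds up to about 5.5 hours per thousand checkpoints, during which the whole cluster is waiting on storage.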

After training comes inference. This is the operational phase. The trained model is now responding to live requests. The long, storage-intensive work is complete, though the training data rarely remains static, and the cycle continues as new data is added. Storage still matters at production scale, but the nature of the demand shifts. Rather than sustained data ingestion, the challenge becomes fast, concurrent access across many workers simultaneously.

In both phases, the same underlying problem emerges: dozens, hundreds, or thousands of compute workers need fast, concurrent access to the same data. That’s not a computing challenge; it’s a file access challenge.
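One rough way to check whether a shared mount keeps up is to have many workers read the same file at once and measure aggregate throughput. The Python sketch below does that with threads on a single client. The function names are illustrative, and the temporary file is only a stand-in so the script runs anywhere. For a meaningful measurement, point it at a large file on the shared mount, ideally larger than the client's RAM, because the operating system's page cache otherwise inflates the result.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple

CHUNK_SIZE = 1024 * 1024  # read in 1 MiB chunks


def read_file(path: str) -> int:
    """Read a file sequentially and return the number of bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            total += len(chunk)
    return total


def concurrent_read(path: str, workers: int) -> Tuple[int, float]:
    """Let `workers` threads read the same file; return (total bytes, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sizes = list(pool.map(read_file, [path] * workers))
    elapsed = time.perf_counter() - start
    return sum(sizes), elapsed


def make_sample_file(size_bytes: int) -> str:
    """Create a temporary file of random bytes (a stand-in for a real dataset file)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size_bytes))
    return path


sample_size = 4 * CHUNK_SIZE
num_workers = 8
sample_path = make_sample_file(sample_size)
try:
    total_bytes, elapsed = concurrent_read(sample_path, num_workers)
finally:
    os.remove(sample_path)

# Guard against a zero reading from the timer on very fast local reads.
throughput_mib_s = total_bytes / (1024 * 1024) / max(elapsed, 1e-9)
print(f"{num_workers} workers read {total_bytes} bytes at {throughput_mib_s:.0f} MiB/s")
```

A single client with threads only approximates a training cluster, where the readers are separate processes on separate nodes. Running the same script on several nodes at once against one shared path gives a closer picture.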

The common workarounds each carry hidden costs:

  • Local copies mean colocating data and storage with compute, which is logistically difficult or outright impossible at scale. Every model update then becomes a distribution problem.
  • Object storage can introduce latency variability that is difficult to predict, particularly under high-concurrency AI workloads. Most AI training data already exists as files, petabytes of them, rather than objects. Teams want to start work now, not spend time converting data into a different format first.
  • Custom pipelines trade one risk for another: operational complexity that compounds as the platform grows.

The solution: shared file access that was built for this 

A remote shared file system solves this cleanly: one authoritative copy of each model or dataset, a consistent view across all workers, and fast startup when new nodes come online. No bespoke distribution logic. No synchronization overhead.

A remote shared file system requires a network protocol with very high throughput and low latency. SMB has been evolving for over 40 years, but the version released in 2012 (SMB 3) is so different from what you saw in the 1990s that you could consider it a new protocol.

AI and ML training is not metadata-heavy; it is dominated by moving large volumes of data. NFS was designed with metadata performance as a priority. SMB 3, with RDMA, multichannel, and scale-out, was built to move very large amounts of data fast, and AI training rewards the latter. Recent MLPerf 3D-unet training results make this concrete: on identical hardware at 2x200GbE, Fusion SMB delivered 25.45 GB/s of training throughput, while NFSD managed 13.93 GB/s and Samba 2.82 GB/s. Forget what you know about SMB from open source and old Windows environments; this is a different protocol designed for a different job.

Figure: MLPerf benchmark comparison on identical hardware. Left panel: AI training throughput for 3D-unet at 2x200GbE (Fusion SMB 25.45 GB/s, NFSD 13.93 GB/s, Ganesha 12.92 GB/s, Samba 2.82 GB/s). Right panel: checkpoint save time for a 107.6 GB llama3-8b model (Fusion SMB 7.3 seconds, NFSD 8.8 seconds, Ganesha 9.8 seconds, Samba 27.1 seconds).
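The benchmark numbers can be put on a common footing with a few lines of Python. This sketch only rearranges the values quoted above: it divides Fusion SMB's training throughput by each alternative's, and converts each checkpoint save time into an effective write rate (107.6 GB divided by the save time). The dictionary and function names are illustrative.

```python
# Values quoted from the MLPerf comparison in this article.
TRAINING_GB_PER_S = {"Fusion SMB": 25.45, "NFSD": 13.93, "Ganesha": 12.92, "Samba": 2.82}
CHECKPOINT_SIZE_GB = 107.6
CHECKPOINT_SAVE_S = {"Fusion SMB": 7.3, "NFSD": 8.8, "Ganesha": 9.8, "Samba": 27.1}


def throughput_ratios(results: dict, reference: str) -> dict:
    """Reference throughput divided by each other entry's throughput."""
    ref = results[reference]
    return {name: ref / value for name, value in results.items() if name != reference}


def effective_write_rates(size_gb: float, save_seconds: dict) -> dict:
    """Checkpoint size divided by save time, in GB/s, for each implementation."""
    return {name: size_gb / seconds for name, seconds in save_seconds.items()}


ratios = throughput_ratios(TRAINING_GB_PER_S, "Fusion SMB")
write_rates = effective_write_rates(CHECKPOINT_SIZE_GB, CHECKPOINT_SAVE_S)

for name, ratio in ratios.items():
    print(f"Training throughput, Fusion SMB vs {name}: {ratio:.2f}x")
for name, rate in write_rates.items():
    print(f"Effective checkpoint write rate, {name}: {rate:.1f} GB/s")
```

The same caveat applies as for the benchmark itself: these ratios describe one hardware configuration and should not be read as a general prediction.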

Tuxera didn’t alter SMB itself (SMB is an open but Microsoft-owned protocol) but built an exceptionally high-performance implementation that fully realizes SMB 3’s potential. For teams with a Linux background, Samba is usually the reference point for SMB performance, but Samba isn’t representative of what SMB 3 can actually do.

Fusion SMB on Linux is significantly faster than Samba, scales to far more concurrent workers, and outperforms Windows Server’s own SMB implementation on the workloads that matter for AI. It’s the SMB engine behind several leading high-performance storage platforms, including Weka and IBM Storage Scale.

So what does this mean for your platform? 

Storage problems in AI infrastructure are rarely visible until they become production incidents. By the time latency is spiking and workers are stalling, the storage layer is already a liability.

Fusion SMB removes storage as a limiting factor, not by introducing an exotic new system, but by delivering shared, predictable, high-performance file access that scales with the workload. It lets teams reuse familiar tools and security models, the ones already governing the rest of their infrastructure, while meeting the performance demands of modern AI at scale.

For teams moving AI from experimentation into production, that means fewer incidents, faster scaling, and a storage layer that simply stops being a problem. Reliability isn’t a nice-to-have. It’s the foundation everything else is built on.

See the benchmarks for yourself

The numbers in this article come from MLPerf testing on a single hardware configuration. Your workload, your network, and your storage will look different. We will run a proof of concept on your infrastructure and share the results.

