Linux Data 'Write to Storage' Woes

Today’s CTO- and CIO-designed storage infrastructures are most often powered by Linux servers with either direct-attached or SAN-attached flash storage.

Either way, Linux serves most of the read and write needs of a myriad of databases, data lakes, lakehouses, and data warehouses that support the customer experience and application workloads. All of these logical data stores increasingly serve GenAI-assisted user and workload data access.

The recent surge of end-user interest in, and adoption of, GenAI to access information of all types hosted in Linux data stores is served by AI Large Language Models (LLMs) that answer ‘engineered prompts’. The quality of those GenAI answers is tuned iteratively by Linux-hosted machine learning systems that read from and write to these same Linux-hosted data stores.

Linux is everywhere, hosting almost all logical data stores.

Unfortunately, Linux ‘out of the box’ has an old, slow, ‘tape drive like’ way of writing data to flash storage. Because many writes are performed defragmenting data while the underlying drive controllers perform their own wear-levelling block moves, standard Linux can wear out flash drives fast. In high-write applications such as databases, flash drives will actually be retired after five months, before wear-related data corruption sets in a few months later. This is the real reason most flash drive vendors include media replacement for 25% of the initial system purchase price per year, so that in four years you have spent as much on flash media replacement as the system cost you in the first place.
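The replacement-cost arithmetic above can be checked with a few lines of Python. This is only a sketch of the stated 25%-per-year figure; the function name and the purchase price are hypothetical placeholders, not vendor pricing.

```python
def cumulative_replacement_cost(purchase_price, annual_rate=0.25, years=4):
    """Total media replacement spend after `years` at a flat yearly rate."""
    return purchase_price * annual_rate * years


# Hypothetical system price, used only to illustrate the ratio.
purchase_price = 100_000.0
total = cumulative_replacement_cost(purchase_price)

# Fraction of the original purchase price spent on replacement media.
ratio = total / purchase_price
print(f"Spent {total:,.0f} on media over 4 years ({ratio:.0%} of purchase price)")
```

At a flat 25% per year, the replacement spend reaches the original purchase price in year four, whatever the price is.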

Why flash drives wear out so fast in high-volume, small-payload write environments such as databases comes down to the write inefficiencies of the file system used and of the Linux kernel’s standard read and write data management processes, which were first designed for tape drives and rotating hard disks.

It’s important to note that today’s flash drives are physically quite fast, yet that physical write speed is never realized through today’s default Linux kernel data management approach. As a result, a standard Linux server (RHEL, SLES, Ubuntu, take your pick) writing to a new flash array will be fast initially. As that Linux data management system fills the drive capacity with data, write speeds slow down dramatically, to only 3X to 5X the speed of the fastest rotating hard disks instead of at least 15X.

Worse yet, many of the most popular Linux file systems that write to Linux ‘block’ volume storage, such as ZFS, carry big ‘Write Amplification’ costs, both in capital acquisition and in operations.

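Write amplification is the ratio of the data physically written to the media to the data the user or workload actually asked to write. Much of that extra writing exists to protect data from physical media flaws that emerge as the media wears out. On magnetic media a bit is stored as magnetic polarity (for example, north as a digital zero and south as a digital one); on flash a bit is stored as an electrical charge in a cell, and every program and erase cycle wears that cell a little.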

Each file system, such as ZFS, btrfs, ext3, ext4 or ReiserFS, implements its writes differently and so amplifies them differently, depending on the trade-offs its author made between read speed, write speed, and how much data safety is enough for the type of target media.
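Write amplification is commonly expressed as a factor: bytes physically written to the media divided by bytes the user or workload asked to write. The short Python sketch below computes that factor. The function name and the example byte counts are illustrative assumptions, not measurements of any particular file system.

```python
def write_amplification_factor(host_bytes, media_bytes):
    """Ratio of bytes physically written to bytes the host asked to write."""
    if host_bytes <= 0:
        raise ValueError("host_bytes must be positive")
    return media_bytes / host_bytes


# Hypothetical example: one 4 KiB record is written as three copies,
# and one of those copies is later relocated by a clean-up pass.
host_bytes = 4096
media_bytes = 3 * 4096 + 1 * 4096
waf = write_amplification_factor(host_bytes, media_bytes)
print(f"Write amplification factor: {waf:.1f}")
```

A factor of 1.0 would mean every byte reaches the media exactly once; anything above that is wear and bandwidth spent on writes the workload never requested.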

ZFS, for example, originated at Sun Microsystems and is older than btrfs. Some CIOs consider it the more trusted file system compared with, say, btrfs, which is newer and has less write amplification. ZFS uses more compute cycles and RAM to write multiple copies of data to different locations on the media every time there is a new data write. This scattering, which in certain cases produces as many as 20 copies for a single data write, is the method ZFS uses to protect user or workload data on the storage media against physical media wear ‘flipping a bit’.

Bit flipping means that the media, as it wears out and can no longer hold a charge within its designed threshold, turns a zero into a one (or a one into a zero), which corrupts the stored data the next time it is read. As the media wears out, this ‘bit rot’ is a big threat to your data integrity. ZFS scattering copies of the data to different physical locations on the media (sectors/cylinders) is therefore an especially important data protection method, even though this old method is write-inefficient, especially on flash.

Scattering data to different physical locations means ZFS, when writing to tape drives and rotating disk media, wears the media out faster than any other Linux file system. Rotating media, disk or tape, has the advantage that competing programs can alternately write their data blocks, block by block, to different locations on the media. Once the user session or workload data has been written this way, the underlying file system data management performs the ‘clean up’ procedure called defragmentation. The Linux kernel’s data management module implements it by re-organizing each user file or set of workload-related blocks into sequential order, which helps maintain the speed of user and application workload reads.

Any Linux kernel data manager uses as much ‘unassigned’ storage capacity on the media as it can for this process, so that the clean-up operations complete quickly and more of the system’s CPU clocks and RAM go to actually writing user session and workload data. The more unassigned space is available, the faster the defragmentation and wear-levelling operations run. That same ‘defrag’ process starts to slow down user session and workload data writes as drives get closer to their configured write capacity, which is usually set to 70% of the actual flash media capacity. How much production write speed declines depends on the volume of production writes plus the volume of drive ‘speed and data protection’ maintenance writes, multiplied by the average user session and workload write payload size. Lots of little record writes to databases will therefore slow flash systems down faster.

Any way you design your Linux data store flash infrastructure, the fact is that with standard Linux editions your user session and workload write speeds will decline as the drive fills. The ‘clean up’ processes lose the ‘write copy working space’ they glean from unassigned drive capacity, which shrinks as data accumulates. That forces the CPU to read and write more often from memory, stealing valuable cycles normally used for writing user session or workload data to flash.
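The slowdown described above can be illustrated with a deliberately simplified model. Assume the clean-up process reclaims an erase block in which a fraction u of the pages still hold valid data: it must copy those pages elsewhere to free the remaining 1 - u for new writes, so each page of new data costs 1 / (1 - u) page writes in total. The sketch also assumes, pessimistically, that u equals the overall fill level of the drive; real controllers choose emptier blocks and do better. The function name and fill levels are hypothetical.

```python
def gc_write_amplification(valid_fraction):
    """Page writes per page of new data when reclaimed blocks are
    `valid_fraction` full of data that must be copied first."""
    if not 0.0 <= valid_fraction < 1.0:
        raise ValueError("valid_fraction must be in [0, 1)")
    return 1.0 / (1.0 - valid_fraction)


# Hypothetical fill levels: half full, the 70% configured capacity
# mentioned in the text, and a nearly full drive.
fill_levels = [0.5, 0.7, 0.9]
results = {u: gc_write_amplification(u) for u in fill_levels}

for u, wa in results.items():
    print(f"fill {u:.0%}: about {wa:.1f} page writes per page of new data")
```

Even in this crude model, the cost per new write climbs steeply as the unassigned working space shrinks, which is the effect the paragraph describes.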

N.B. Data blocks close to falling below minimum charge tolerances are reported by the hardware media controllers to the Linux kernel, which moves the affected blocks to newer, less worn media locations. On more expensive drives, the drive’s onboard controller will itself interrupt, read and buffer the affected data, and write it to the new location. It marks the old locations to be zeroed out, or marks them in the block header for re-use, only after the ‘move’ writes of the affected blocks have completed.

ZFS, after first scattering multiple write copies across the media, then uses this compaction or defragmentation method at set intervals to read a few trusted block copies into new, side-by-side, ‘faster to read’ sets of block locations. In doing so, ZFS regularly uses ‘out of band’ compute cycles and RAM to de-scatter the user and workload data scattered in the first write. Once those ‘related session or workload’ blocks have been re-written together, ZFS starts erasing the old blocks that are no longer required, or marking them as available for new writes, a clean-up step that recovers disk or tape capacity. Where the media containing a block is flawed, that location is stored in a lookup table, which ensures ZFS will not make any future writes to that location.

As such, ZFS, like ALL file systems, incurs a write amplification overhead that slows user session and workload write speeds by consuming system CPU clock cycles and RAM. Each defragmentation read is written into RAM, then re-written by the Linux data manager on ZFS’s behalf to a new media location. ZFS repeats this process continually, on top of writing multiple copies to the media to avoid ‘bit flip’ data corruption.

Flash systems make this worse. Any change to existing data on a flash drive requires the underlying Linux data management system to first read the entire set of old blocks into memory, then have the Linux kernel’s flash management controller append or correct the data in memory and, finally, ‘re-flash’ the entire set of blocks, including the change, to the flash media. ZFS and other file systems originally designed to work only on tape and disk therefore incur a higher write-to-media count on flash, which means ZFS write amplification is higher than that of almost all other file systems.
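The read, modify and re-flash cycle described above can be sketched in a few lines of Python. This is a toy model, not kernel or controller code: the erase block is a plain list of pages, and the 64-page block size and function name are assumptions chosen for illustration.

```python
PAGES_PER_BLOCK = 64  # hypothetical erase-block size, in pages


def rewrite_in_place(block, page_index, new_data):
    """Change one page of an erase block; return the pages programmed."""
    # 1. Read the entire erase block into memory.
    buffered = list(block)
    # 2. Apply the change to the in-memory copy.
    buffered[page_index] = new_data
    # 3. Erase the whole block (flash cannot overwrite a single page).
    block[:] = [None] * len(block)
    # 4. Re-program every page, changed or not.
    programmed = 0
    for i, data in enumerate(buffered):
        block[i] = data
        programmed += 1
    return programmed


block = [b"old"] * PAGES_PER_BLOCK
pages_written = rewrite_in_place(block, 3, b"new")
print(f"1 page changed, {pages_written} pages programmed")
```

In this model a one-page update costs a full block of page writes, which is why small in-place database updates are so expensive on flash.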

Each Linux file system, whether ext3, ext4 or another design such as btrfs or ReiserFS, has different write amplification characteristics. These are most often fixed, with a few being configurable when the physical drive is formatted.

Regardless, every Linux file system measured with the same 4K block size random read and write pattern, the pattern produced by multiple users accessing multiple data tables simultaneously, and using the standard Linux data write and read management methods, will see a 2X improvement by switching to CloudProx TurboStorage.

With ZFS, database write patterns like the one above will normally be observed to fragment the location of data on the media quickly. ZFS must then regularly employ ‘in-band’ clean-up that steals write speed by using more CPU clock cycles and RAM, slowing down effective write IOPS, especially as the drive fills up to its engineered capacity. In the case of flash, that capacity requires 30% of the space to remain unassigned, giving these clean-up processes room to write their interim ‘data move’ steps before the ‘same session and workload’ related blocks are finally re-organized sequentially to keep related user session or workload database table reads fast.

TurboStorage handles all write and read management in system memory, linearizing and compressing session data into large blocks. TurboStorage’s ‘in memory’ Flash Translation Layer writes those blocks in the TurboStorage physical media format across all flash drives managed in the physical hardware system. TurboStorage also stores additional session metadata about age and media condition, which is both recorded on disk in write lookup and reverse-lookup read tables and constantly monitored by the TurboStorage kernel module in system RAM. The effect is to reduce file system write amplification by at least 2X across all drives in the Linux-powered flash system.
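The general idea of linearizing writes in memory can be sketched as a small log-structured buffer. This is not TurboStorage’s implementation, which the text does not detail; it is a generic illustration in which the class name, the four-block segment size and the in-memory ‘media’ list are all assumptions.

```python
class LinearizingWriter:
    """Toy log-structured writer: gathers small random writes in memory
    and emits them as large sequential segments."""

    def __init__(self, blocks_per_segment=4):
        self.blocks_per_segment = blocks_per_segment
        self.pending = []    # (logical block, data) waiting in memory
        self.segments = []   # stands in for sequential writes to media
        self.lookup = {}     # logical block -> (segment, slot)
        self.reverse = {}    # (segment, slot) -> logical block

    def write(self, lba, data):
        self.pending.append((lba, data))
        if len(self.pending) >= self.blocks_per_segment:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        seg_no = len(self.segments)
        # One large sequential write instead of many small random ones.
        self.segments.append([data for _, data in self.pending])
        for slot, (lba, _) in enumerate(self.pending):
            old = self.lookup.get(lba)
            if old is not None:
                self.reverse.pop(old, None)  # old copy is now stale
            self.lookup[lba] = (seg_no, slot)
            self.reverse[(seg_no, slot)] = lba
        self.pending = []

    def read(self, lba):
        # Newest data may still be waiting in memory.
        for pending_lba, data in reversed(self.pending):
            if pending_lba == lba:
                return data
        seg_no, slot = self.lookup[lba]
        return self.segments[seg_no][slot]


writer = LinearizingWriter(blocks_per_segment=4)
for lba, data in [(90, b"a"), (3, b"b"), (57, b"c"), (12, b"d")]:
    writer.write(lba, data)   # four scattered logical addresses
writer.write(3, b"x")         # update stays in memory until the next flush
print(len(writer.segments), writer.read(57), writer.read(3))
```

Four writes to scattered logical addresses reach the media as a single sequential segment, and the lookup tables keep reads pointed at the newest copy of each block.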

It’s also important to note that TurboStorage increases data capacity across all drives in the storage system, be it direct-attached or a SAN appliance, by using the hardware data compression built into Intel or AMD systems.

TurboStorage hardware data compression incurs minimal additional compute clock cycle and RAM overhead: with hardware data compression turned on, write IOPS drop by only 1% to 2%.

TurboStorage system hardware data compression is made possible by the TurboStorage in-memory Flash Translation Layer (FTL) linearization of all data writes, a Linux flash storage industry first. Because FTL linearization happens in memory for all drives in the system, CTOs and CIOs can design their flash storage infrastructure around inexpensive SSDs without costly on-board compression, saving at least 50% on the cost of media in the process.

By reducing write amplification, CloudProx TurboStorage speeds write IOPS by at least 2X and extends drive durability by at least 2X, depending on the actual session or workload write payload size and write frequency.
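The link between write amplification and drive durability follows from simple arithmetic: a drive rated for a fixed amount of physical writes lasts for that amount divided by the physical writes per day. The Python sketch below shows the relationship. The endurance rating, daily write volume and amplification factors are hypothetical numbers, not TurboStorage measurements.

```python
def drive_lifetime_days(rated_endurance_tb, host_tb_per_day, waf):
    """Days until the rated write endurance is consumed."""
    if host_tb_per_day <= 0 or waf <= 0:
        raise ValueError("host_tb_per_day and waf must be positive")
    return rated_endurance_tb / (host_tb_per_day * waf)


# Hypothetical drive: 1000 TB of rated endurance, 2 TB of host writes a day.
baseline = drive_lifetime_days(1000.0, 2.0, waf=4.0)
improved = drive_lifetime_days(1000.0, 2.0, waf=2.0)

print(f"baseline: {baseline:.0f} days, halved amplification: {improved:.0f} days")
```

Under these assumptions, halving write amplification doubles the time it takes to consume the drive’s rated endurance, all else being equal.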

How much more improvement TurboStorage delivers to your customer experience and workload write IOPS, drive durability and storage power savings will depend on the mix of file systems each CTO or CIO selects to service user and workload writes. Each file system’s own write amplification method consumes additional CPU clock cycles and RAM, which affects the overall write IOPS gains, drive life extension and storage power savings of your Linux storage infrastructure.