Data Compression | Tuxera

Data Compression

NTFS has a special storage format for compressed files. This format is compatible with compressing and decompressing on the fly, which makes transparent reading and writing compressed files by applications possible, saving storage space at the expense of computing requirements.

Currently reading compressed files is supported by all ntfs-3g versions. Creating new compressed files, clearing contents, and appending data to existing compressed files are supported since ntfs-3g-2009.11.14. Modifying existing compressed files by overwriting existing data (or existing holes) are supported since ntfs-3g-2010.8.8.

When the mount option compression is set, files created in a directory marked for compression are created compressed. They remain compressed when they are moved (by renaming) to a regular directory in the same volume, and data appended to them after they have been moved are compressed. Conversely files which were present in a directory before it is marked for compression, and files moved from a directory not marked for compression are not compressed. Copying a compressed file always decompresses it, just to compress it again if the target directory is marked for compression.

A directory is marked for compression by setting the attribute flag FILE_ATTRIBUTE_COMPRESSED (hex value 0×800). This can be done by setfattr applied to the extended attribute system.ntfs_attrib. Marking or unmarking a directory for compression has no effect on existing files or directories, the mark is only used when creating new files or directories in the marked directory.

# Mark a directory for compression (on a small-endian computer)
setfattr -h -v 0x00080000 -n system.ntfs_attrib directory-name
# Disable compression for files to be created in a directory
setfattr -h -v 0x00000000 -n system.ntfs_attrib directory-name

Notes :

  • compression is not recommended for files which are frequently read, such as system files or files made available on file servers. Moreover compression is not effective on files compressed by other means (such as zip, gz, jpg, gif, mp3, etc.)
  • ntfs-3g tries to allocated consecutive clusters to a compressed file, thus avoiding fragmentation of the storage space when files are created without overwriting
  • some programs, like gcc or torrent-type downloaders, overwrite existing data or holes in files they are creating. This implies multiple decompressions and recompressions, and causes fragmentation when the recompressed data has not the same size as the original. Such inefficient situations should be avoided.
  • compression is not possible if the cluster size is greater than 4K bytes.

Compression method

NTFS compression is based on the public domain algorithm LZ77 (Ziv and Lempel, 1977). It is faster than most widely used compression methods, and does not require to decompress the beginning of the file to access a random part of it, but its compression rate is moderate.

The file to compress is split into 4096 byte blocks, and compression is applied on each block independently. In each block, when a sequence of three bytes or more appears twice, the second occurrence is replaced by the position and length of the first one. A block can thus be decompressed, provided its beginning can be located, by locating the references to a previous sequence and replacing the references by the designated bytes.

If such a block compresses to 4094 bytes or less, two bytes mentioning the new size are prepended to the bloc. If it does not, the block is not compressed and two bytes mentioning a count of 4096 are prepended.

Several compressed blocks representing 16 clusters of uncompressed data are then concatenated. If the total compressed size is 15 clusters or less, the needed clusters are written and marked as used, and the remaining ones are marked as unneeded. If 16 or 17 clusters are needed, no compression is done, the 16 clusters are filled with uncompressed data. The cluster size is defined when formating the volume (generally 512 bytes for small volumes and 4096 for big volumes).

Only the allocated clusters in a set of 16 or less are identified in the allocation tables, with neighbouring ones being grouped. When seeking to a random byte for reading, the first cluster in the relevant set is directly located. If the set is found to contain 16 allocated clusters, it is not compressed and the requested byte is directly located. If it contains 15 clusters or less, it contains blocks of compressed data, and the first couple of bytes of each block indicates its compressed size, so that the relevant block can be located, it has to be decompressed to access the requested byte.

When ntfs-3g appends data to a compressed file, the data is first written uncompressed, until 16 clusters are filled, which implies the 16 clusters are allocated to the file. When the set of 16 clusters is full, data is read back and compressed. Then, if compression if effective, the needed clusters are written again and the unneeded ones are deallocated.

When the file is closed, the last set of clusters is compressed, and if the file is opened again for appending, the set is decompressed for merging the new data.

This page is maintained by Jean-Pierre André

Partners