What determines the compression ratio of a file? Concept, main aspects, and the compressibility of different formats

    Most users know that compression is used to reduce the size of files, making them more convenient to store or to send, for example, by email. For some reason, however, compression is usually associated only with archiver applications, while other data compression techniques are overlooked. Below we will look at what the compression ratio of a file depends on, using several of the most common situations as examples.

    What is meant by file compression ratio?

    Let's start with the theory. What is the compression ratio of a file? In the simplest interpretation, it is the ratio of the size of the final (compressed) object to the size of the original. However, this explanation applies mainly to archived data, since it ignores the issues involved in changing a multimedia format, where compression is also very common. In general, it cannot be said that the compression ratio depends on one characteristic alone: the type of object, the programs used to compress the data, and the speed of the compression process all play a role. Below we briefly discuss the main aspects that can affect the final result of reducing the size of the source data.

    The degree of file compression depends only on the file type: is this really true?

    Yes, indeed, the type of data being compressed has a large impact on how much the file size can be reduced, and not every format lends itself to such procedures. This is easy to see with sound files, which are often already compressed in themselves.

    When you try to pack such data into an archive, it is almost impossible to achieve any significant reduction in size. The same goes for the WAV format. However, if instead of archiving you transcode from WAV to MP3, the size can be reduced by a factor of ten or more. This leads many users to conclude that the compression ratio depends only on the initial and final format. That is not entirely true, since the recoding algorithm also plays an important role; it will be discussed separately. For now, let's focus on archivers.

    What determines the degree of file compression when packing into an archive?

    To understand the essence of this type of compression, we will use the popular WinRAR archiver as an example. We will not dwell on the types of data being packed, but will focus on the tools of the application itself.

    To begin with, pay attention to the target archive format and the packing method used. It is clear that the compression ratio achieved by the archiver depends on the chosen method: with the fastest method the compression is minimal, while with the maximum compression setting the size is reduced much more, but more time is required.
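    The trade-off is the same with any compressor. As a rough illustration (not WinRAR itself), here is a minimal Python sketch using the standard zlib module, whose levels 1 through 9 play the same role as an archiver's "fastest" and "best" methods; the file name document.txt is only a placeholder:

        import time
        import zlib

        # Any reasonably large, compressible file; the name is a placeholder.
        data = open("document.txt", "rb").read()

        for level in (1, 6, 9):
            start = time.perf_counter()
            packed = zlib.compress(data, level)
            elapsed = time.perf_counter() - start
            ratio = len(packed) / len(data) * 100
            print(f"level {level}: {ratio:5.1f}% of original size in {elapsed:.3f} s")

    Level 1 typically finishes noticeably faster but leaves a larger file; level 9 squeezes harder at the cost of time, which is exactly the choice an archiver's method setting presents.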

    If we consider file formats in relation to archivers, text documents of almost any format are among the most compressible.

    Some executable files in EXE format also compress relatively well (with the standard compression method, the size can be reduced by more than half). The least compressible, as already mentioned, are multimedia objects. Even if pictures can be shrunk somewhat, this does not work for audio and video without changing the original format, and archivers have nothing to do with that.
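    The difference is easy to demonstrate. The following illustrative Python sketch compresses highly redundant text-like data and high-entropy random data (a stand-in for already-compressed multimedia such as JPEG or MP3) with the standard zlib module:

        import os
        import zlib

        # Redundant "text-like" data vs. high-entropy data that stands in
        # for already-compressed multimedia (JPEG, MP3, video, archives).
        text_like = b"the quick brown fox jumps over the lazy dog\n" * 5000
        random_like = os.urandom(len(text_like))

        for name, blob in (("text-like", text_like), ("random-like", random_like)):
            packed = zlib.compress(blob, 9)
            print(f"{name}: {len(blob)} -> {len(packed)} bytes "
                  f"({len(packed) / len(blob):.0%} of original)")

    The repetitive text shrinks to a small fraction of its size, while the random bytes stay at essentially 100%, which is why archiving an MP3 or a JPEG achieves almost nothing.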

    Types of graphics, video and audio compression

    When it comes to multimedia, there are two main types of compression: lossy and lossless. Here the compression ratio depends primarily on the compression technology used.

    In the first case the compression is maximal; in the second it can vary, depending on the set of codecs used and the final container format. For example, the same AVI file can be a container holding completely different types of data with different degrees of compression. This, incidentally, is why problems sometimes arise when playing video on home players.

    In general, when speaking specifically about multimedia, you have to understand that it is practically impossible to achieve the maximum reduction in the size of a source file of any format without a noticeable loss of quality, despite technologies for removing redundant content (for graphics and video, for example, this only works for unchanging scenes). In the case of audio, the bitrate is reduced and certain frequencies are cut out. The average listener may not notice the difference, but a professional with a keen ear will immediately tell what is missing.
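    In practice such lossy reduction is usually done with a converter or encoder. As a hedged sketch (it assumes the ffmpeg command-line tool is installed and on the PATH; the file names are placeholders), this is what re-encoding a WAV file to a 128 kbit/s MP3 looks like when driven from Python:

        import subprocess

        # Assumes ffmpeg is installed; input.wav and output.mp3 are placeholders.
        # Lowering the audio bitrate (-b:a) is the lossy trade-off described above:
        # a much smaller file at the cost of discarded audio detail.
        subprocess.run(
            ["ffmpeg", "-i", "input.wav", "-b:a", "128k", "output.mp3"],
            check=True,
        )

    A CD-quality WAV stream is roughly 1,411 kbit/s, so re-encoding at 128 kbit/s shrinks the file by roughly a factor of ten, in line with the figure quoted earlier.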

    The most common programs for all occasions

    We have now covered, in broad terms, what the compression ratio of a file depends on. A few words should be said about the software products used. Among archivers, the most common are WinRAR, WinZIP and 7-Zip.

    As for multimedia compression, in the simplest case, you can use special converter applications that work on the principle of transcoding the source material into another format in order to reduce the file size.

    Brief summary

    To summarize, the compression ratio achieved by an archiver depends on several factors: most often on the type of data being compressed, the software used and the algorithm applied (usually the Huffman and Lempel-Ziv algorithms working in tandem). In the case of multimedia content the situation is much the same, but the dominant role is played by conversion from one format to another.

    The degree of information compression depends on several factors:

    First, the type of data being compressed matters a great deal. Graphic and text files compress best: for them the compression coefficient can be from five to forty percent. Files of executable programs, load modules and multimedia files compress worse.

    Second, the compression method matters.

    Third, the choice of archiver also matters. When choosing an archiver, the usual considerations are that the compression should be as strong as possible and the time spent packing and unpacking files as short as possible.

    Information compression programs

    Compression is performed by archiver programs. Today the four most common are WinRar, WinAce, 7Zip and WinZip. As for the last of these, it does not stand up well to comparison.

    Let's take a closer look at the archiver - WinRar. This archiver can be associated with the following file types: RAR, ZIP, CAB, ARJ, LZH, ACE, 7-Zip, TAR, GZip, UUE, BZ2, JAR, ISO.

    The program supports files of practically unlimited size (up to 8,589,934,591 GB). However, to work with files larger than 4 GB the archive must be located on an NTFS file system.

    There are a few things to consider when choosing the optimal compression settings:

    Although WinRAR supports the ZIP format, in most cases it is better to choose RAR, which gives a higher level of compression. Choose ZIP if you are not sure that the computer on which the files will be unpacked has a program capable of decompressing RAR files.

    You also need to decide which compression method to use. The higher the compression level, the more time archiving takes, so consider the purpose of the archive. For long-term storage it makes sense to wait and get an archive with the maximum compression level, but if you just need to send a few documents by mail, the Normal compression level is quite sufficient.

    If you need maximum compression, use the Create solid archive option. It has its drawbacks, however. First, unpacking such files takes longer than extracting them from a regular archive. Imagine an archive of two hundred files. If it was created in the usual way, you can easily extract any single file. With a solid archive, it matters where the file you need sits: if it is in the middle of the second hundred, the program must first unpack the roughly 150 files that precede it. Creating archives this way can also lead to greater losses: if the archive is damaged, you may lose all the files in it, whereas with conventional packing you can usually extract most, if not all, of the files from a damaged archive.
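    The reason solid archiving compresses better is that redundancy shared between files is exploited only when they are compressed as one stream. A rough illustration in Python (using the standard zlib module rather than RAR's actual algorithm) with two hundred small, similar "files":

        import zlib

        # Two hundred small, similar "files" (represented here as byte strings).
        files = [f"report {i}: status OK, nothing to declare\n".encode() * 20
                 for i in range(200)]

        # Conventional archive: every file compressed on its own.
        separate = sum(len(zlib.compress(f, 9)) for f in files)

        # Solid archive: all files treated as one continuous stream.
        solid = len(zlib.compress(b"".join(files), 9))

        print(f"compressed separately: {separate} bytes")
        print(f"compressed as one solid stream: {solid} bytes")

    The single stream comes out several times smaller, which matches the gain WinRAR advertises for solid archives of many small files of the same type.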

    Creating a large archive can take quite a lot of time. WinRar allows you to estimate approximately how long a particular task will take; the Benchmark and hardware test option is intended for this. Another reason to use this option is to identify errors that may occur when archiving on a particular computer configuration because of a hardware fault.

    Among other WinRar features is the ability to create self-extracting archives with a specified unpacking path. Such files do not require an archiver program on the computer on which they are to be unpacked. These archives are called SFX archives. Their disadvantage compared to conventional archives is a larger size, since in addition to the packed files they also contain an executable EXE module.

    The contents of a RAR archive can be hidden. To do this, in the Archiving with Password window of the program settings, check the box next to the Encrypt File Names line.

    You can also set a password required to open the archive. Separately, an archive may become damaged as a result of an error while transferring it over a local network or downloading it from the Internet, or because of a hardware failure or virus attack. WinRar lets you check the integrity of the data by testing the archive with the Test Archived Files option.

    In order to minimize the likelihood of data loss, when creating WinRar archives, it is recommended to use the Put Recovery Record option (this checkbox can be found on the General tab of the archive creation window).

    If this has been done, a damaged archive can usually be repaired.

    In addition, WinRar lets you improve the chances of repairing a damaged RAR archive by specifying the size of the recovery information when creating it. To do this, run the Commands > Protect Archive From Damage command in the WinRar window. The size of the Recovery Record cannot exceed ten percent of the total archive size.

    To restore damaged RAR archives, you need to select the desired file in the WinRar window and execute the command Tools > Repair.

    WinRAR can be integrated into the context menu, and not only into the Explorer menu but also into other programs, for example the popular file manager Total Commander. This makes it possible to archive files quickly, using the default settings and without opening the program window. The default settings, by the way, can be changed to match the requirements you place on your archives: open the WinRar window, run the Options > Settings command, go to the Compression tab and click the Create Default button. The settings specified there will be used for quick archiving. If you need different settings for a particular archive, this can also be done from the context menu by selecting the Add to Archive... command, where you can set the format and compression level, specify the archive name and choose other parameters.

    WinRar allows you to save user-specified settings to a file with the Reg extension. This file can later be imported into the program to reuse the given configuration. This file stores information such as the history of archives that were recently created, default compression settings, etc.

    Another convenient WinRar option is the ability to create your own bookmarks, Favorites. It is often necessary to archive the same folders on the hard drive regularly; by adding their locations to bookmarks, you can quickly navigate to them in the program window and archive the necessary files and subdirectories.

    Part one: history.

    Introduction

    Existing data compression algorithms can be divided into two large classes - lossy and lossless. Lossy algorithms are commonly used for image and audio compression. These algorithms allow high compression rates to be achieved through selective quality loss. However, by definition, it is impossible to recover the original data from the compressed result.
    Lossless compression algorithms are used to reduce the size of data, and work in such a way that it is possible to restore the data exactly as it was before compression. They are used in communications, archivers and some algorithms for compressing audio and graphic information. Next, we will consider only lossless compression algorithms.
    The basic principle of compression is that in any file containing non-random data, the information is partially repeated. Using statistical models, you can estimate the probability that a certain combination of symbols will recur, then create codes to represent the selected phrases and assign the shortest codes to the most frequently repeated ones. Various techniques are used for this, for example entropy coding, run-length (repetition) coding, and dictionary compression. With their help, an 8-bit character, or an entire string, can be replaced with just a few bits, thereby eliminating redundant information.

    History

    Hierarchy of algorithms (figure).

    Although data compression became widespread with the Internet and after the invention of the Lempel-Ziv (LZ) algorithms, several earlier examples can be cited. Morse, when devising his code in 1838, wisely assigned the most frequently used letters of the English language, "e" and "t", the shortest sequences (a dot and a dash, respectively). Shortly after the advent of mainframes, in 1949, the Shannon-Fano algorithm was invented, which assigned codes to characters in a block of data based on the probability of their appearance in the block: the more probable the character, the shorter its code, which made the data representation more compact.
    David Huffman was a student in Robert Fano's class and chose the search for an improved method of binary encoding as his course work. As a result, he managed to improve on the Shannon-Fano algorithm.
    Early versions of the Shannon-Fano and Huffman algorithms used predefined codes. Later, codes created dynamically from the data to be compressed came into use. In 1977, Lempel and Ziv published their LZ77 algorithm, based on a dynamically created dictionary (also called a "sliding window"). In 1978 they published the LZ78 algorithm, which first parses the data and builds a dictionary rather than creating it on the fly.

    Patent problems

    The LZ77 and LZ78 algorithms gained great popularity and spawned a wave of improved variants, of which DEFLATE, LZMA and LZX have survived to this day. Most popular algorithms are based on LZ77, because the LZ78-derived LZW algorithm was patented by Unisys in 1984, after which the company began going after everyone, even over the use of GIF images. At the time, a variation of LZW called LZC was used on UNIX, and because of the legal problems its use had to be phased out. Preference went to the DEFLATE algorithm (gzip) and the Burrows-Wheeler transform, BWT (bzip2), which was for the best, since these algorithms almost always beat LZW in compression.
    By 2003 the patent had expired, but the train had already left: LZW survives, perhaps, only in GIF files, while LZ77-based algorithms dominate.
    There was another patent battle in 1993, when Stac Electronics discovered that the LZS algorithm it had developed was being used by Microsoft in a disk compression program shipped with MS-DOS 6.0. Stac Electronics sued, won the case, and received more than $100 million.

    The rise of Deflate

    Large corporations used compression algorithms to store ever-growing volumes of data, but the real rise of the algorithms came with the birth of the Internet in the late 1980s, when channel bandwidth was extremely limited. To compress data transmitted over the network, the ZIP, GIF and PNG formats were invented.
    Tom Henderson invented and released the first commercially successful archiver, ARC, in 1985 (System Enhancement Associates). ARC was popular among BBS users because it was one of the first programs able to pack several files into one archive, and its sources were open. ARC used a modified LZW algorithm.
    Phil Katz, inspired by the popularity of ARC, released the PKARC program as shareware, improving the compression by rewriting the algorithms in assembly language. However, Henderson sued him and won: PKARC copied ARC so openly that it sometimes even repeated typos in the source-code comments.
    Phil Katz was not discouraged, and in 1989 he substantially reworked the archiver and released PKZIP. After he came under attack over the patented LZW algorithm, he changed the underlying algorithm to a new one called IMPLODE. The format changed again in 1993 with the release of PKZIP 2.0, when DEFLATE became the replacement. Among the new features was the ability to split an archive into volumes. This version is still widely used, despite its advanced age.
    The GIF (Graphics Interchange Format) image format was created by CompuServe in 1987. As is well known, the format supports lossless image compression and is limited to a 256-color palette. Despite all of Unisys's efforts, it could not stop the spread of this format; it is still popular today, especially thanks to its animation support.
    Prompted by the patent problems, the Portable Network Graphics (PNG) format appeared in 1994. Like ZIP, it used the then-new DEFLATE algorithm, and although DEFLATE had been patented by Katz, he never made any claims.
    DEFLATE is now the most popular compression algorithm. In addition to PNG and ZIP, it is used in gzip, HTTP, SSL and other data transfer technologies.

    Unfortunately, Phil Katz did not live to see DEFLATE's triumph; he died of alcoholism in 2000 at the age of 37. Citizens – excessive alcohol consumption is dangerous to your health! You may not live to see your triumph!

    Modern archivers

    ZIP reigned supreme until the mid-90s, but in 1993 a simple Russian genius, Evgeniy Roshal, came up with his own RAR format and algorithm. Its later versions are based on the PPM and LZSS algorithms. Today ZIP is perhaps the most common format, RAR was until recently the standard for distributing various not-quite-legal content over the Internet (thanks to increased bandwidth, files are increasingly distributed without archiving at all), and 7zip is used as the format with the best compression at an acceptable running time. In the UNIX world the combination tar + gzip is used (gzip is a compressor, and tar combines several files into one, since gzip itself cannot do that).

    Translator's note: personally, in addition to those listed, I also came across the ARJ archiver (Archived by Robert Jung), which was popular in the 90s in the BBS era. It supported multi-volume archives and, just like RAR after it, was used to distribute games and other software. There was also the HA archiver by Harri Hirvola, which used HSC compression (I found no clear explanation of it, only "a bounded context model with arithmetic coding") and did a good job of compressing long text files.

    In 1996 the open-source bzip2 implementation of the BWT algorithm appeared and quickly gained popularity. In 1999 the 7-zip program appeared with its 7z format. In compression it competes with RAR; its advantages are openness and the ability to choose between the bzip2, LZMA, LZMA2 and PPMd algorithms.
    In 2002 another archiver, PAQ, appeared. Its author, Matt Mahoney, used an improved version of the PPM algorithm with a technique called context mixing, which allows several statistical models to be combined to improve the prediction of symbol frequencies.

    The future of compression algorithms

    Of course, God only knows, but the PAQ algorithm appears to be gaining popularity thanks to its very good compression ratio (although it is very slow); with the growth of computer speed, speed is becoming less critical.
    On the other hand, the Lempel-Ziv-Markov chain algorithm (LZMA) represents a good trade-off between speed and compression ratio and may give rise to many interesting offshoots.
    Another interesting technique is "compression by substring enumeration" (CSE), which is still little used in practice.

    In the next part we will look at the technical side of the mentioned algorithms and the principles of their operation.

    All compression algorithms operate on an input stream of information in order to obtain a more compact output stream through some transformation. The main technical characteristics of compression processes and the results of their work are:

    · compression ratio - the ratio of the sizes of the original and resulting streams;

    · compression speed - the time spent compressing a given amount of data in the input stream until an equivalent output stream is obtained;

    · compression quality - a value showing how tightly the output stream is packed, judged by applying re-compression to it with the same or another algorithm.
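    The first two characteristics are easy to measure. A rough sketch using Python's standard compressors (the file name document.txt is a placeholder; these libraries are not necessarily the ones any particular archiver uses):

        import bz2
        import lzma
        import time
        import zlib

        data = open("document.txt", "rb").read()  # placeholder input file

        for name, compress in (("zlib / DEFLATE", zlib.compress),
                               ("bz2 / BWT", bz2.compress),
                               ("lzma / LZMA", lzma.compress)):
            start = time.perf_counter()
            packed = compress(data)
            elapsed = time.perf_counter() - start
            print(f"{name:14s} ratio {len(data) / len(packed):5.2f}:1  time {elapsed:.3f} s")

    The ratio and time columns correspond directly to the first two characteristics listed above.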

    Algorithms that eliminate data recording redundancy are called data compression algorithms, or archiving algorithms. Currently, there are a huge variety of data compression programs based on several basic methods.

    All data compression algorithms are divided into:

    1) lossless compression algorithms, with which the data at the receiving end is restored without the slightest change;

    2) lossy compression algorithms, which remove from the data stream information that has little effect on the essence of the data or is not perceived by humans at all.

    There are two main lossless archiving methods:

    Huffman algorithm, aimed at compressing sequences of bytes that are not related to each other,

    Lempel-Ziv algorithm (after Lempel and Ziv), aimed at compressing any kind of text, that is, exploiting the repeated occurrence of "words" - sequences of bytes.

    Almost all popular lossless archiving programs (ARJ, RAR, ZIP, etc.) use a combination of these two methods - the LZH algorithm.

    Huffman algorithm.

    The algorithm is based on the fact that some characters of the standard 256-character set occur in ordinary text more often than average, while others occur less often. Therefore, if short bit sequences (shorter than 8 bits) are used to record common characters and longer ones for rare characters, the total file size decreases.
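    As a rough sketch of the idea (an illustrative Python implementation built on the standard heapq module, not the code used by any particular archiver), the following builds a Huffman code and shows that frequent bytes receive the shortest bit strings:

        import heapq
        from collections import Counter

        def huffman_code(data: bytes) -> dict:
            """Build a Huffman code: frequent bytes get short bit strings."""
            heap = [[freq, i, {byte: ""}]
                    for i, (byte, freq) in enumerate(Counter(data).items())]
            heapq.heapify(heap)
            counter = len(heap)
            while len(heap) > 1:
                lo = heapq.heappop(heap)   # the two least frequent subtrees
                hi = heapq.heappop(heap)
                merged = {b: "0" + c for b, c in lo[2].items()}
                merged.update({b: "1" + c for b, c in hi[2].items()})
                heapq.heappush(heap, [lo[0] + hi[0], counter, merged])
                counter += 1
            return heap[0][2]

        text = b"abracadabra"
        codes = huffman_code(text)
        bits = sum(len(codes[b]) for b in text)
        print(codes)  # byte 97 ('a'), the most frequent, gets the shortest code
        print(f"{len(text) * 8} bits as plain bytes -> {bits} bits with Huffman codes")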

    Lempel-Ziv algorithm.

    The classic Lempel-Ziv algorithm, LZ77, named after the year of its publication, is extremely simple. It can be formulated as follows: if an identical sequence of bytes has already occurred earlier in the data, and a record of its length and offset from the current position is shorter than the sequence itself, then a reference (offset, length) is written to the output file instead of the sequence itself.
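    A deliberately naive sketch of this idea in Python (real implementations use hash chains and emit compact bit-packed tokens rather than Python tuples):

        def lz77_compress(data: bytes, window: int = 255, min_match: int = 3):
            """Toy LZ77: emit literal bytes or (offset, length) back-references."""
            out, pos = [], 0
            while pos < len(data):
                best_len, best_off = 0, 0
                # Search the sliding window for the longest match with the data at pos.
                for off in range(max(0, pos - window), pos):
                    length = 0
                    while (pos + length < len(data)
                           and data[off + length] == data[pos + length]):
                        length += 1
                    if length > best_len:
                        best_len, best_off = length, pos - off
                if best_len >= min_match:
                    # The (offset, length) pair is shorter than the bytes it replaces.
                    out.append(("ref", best_off, best_len))
                    pos += best_len
                else:
                    out.append(("lit", data[pos]))
                    pos += 1
            return out

        print(lz77_compress(b"abcabcabcabc"))
        # [('lit', 97), ('lit', 98), ('lit', 99), ('ref', 3, 9)]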

    File compression ratio

    Compression of information in archive files is achieved by eliminating redundancy in various ways: by simplifying the codes, for example eliminating constant bits from them, or by representing repeated characters or repeating sequences of characters as a repetition factor plus the characters themselves. Such compression algorithms are implemented in special archiver programs (the best known of which are arj/arjfolder, pkzip/pkunzip/winzip and rar/winrar). One file or several can be compressed; in compressed form they are placed in a so-called archive file, or archive.

    The purpose of file packaging is usually to ensure a more compact placement of information on disk, reducing the time and, accordingly, the cost of transmitting information over communication channels in computer networks. Therefore, the main indicator of the effectiveness of a particular archiver program is the degree of file compression.

    The degree of file compression is characterized by the coefficient Kc, defined as the ratio of the volume of the compressed file Vc to the volume of the original file Vo, expressed as a percentage (some sources use the inverse ratio):

    Kc = (Vc / Vo) * 100%
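    The coefficient is trivial to compute for files on disk. A small sketch in Python (the file names are placeholders):

        import os

        def compression_coefficient(original_path: str, compressed_path: str) -> float:
            # Kc = (Vc / Vo) * 100%: the smaller the value, the better the compression.
            vo = os.path.getsize(original_path)
            vc = os.path.getsize(compressed_path)
            return vc / vo * 100

        # Placeholder names: a 1 MB document packed into a 300 KB archive gives Kc = 30%.
        print(f"Kc = {compression_coefficient('report.txt', 'report.zip'):.1f}%")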

    The degree of compression depends on the program used, the compression method, and the type of source file.

    The best-compressed files are graphic images, text files and data files, for which the compression coefficient can reach 5-40%; files of executable programs and load modules compress less well, with Kc = 60-90%. Archive files are hardly compressed at all. This is not difficult to explain if you know that most archiving programs use variants of the LZ77 (Lempel-Ziv) algorithm, the essence of which is special encoding of repeating sequences of bytes (that is, characters). The frequency of such repetitions is highest in text and bitmap graphics and drops practically to zero in archives.

    In addition, archiving programs still differ in the implementation of compression algorithms, which accordingly affects the degree of compression.

    Some archiver programs additionally include tools aimed at lowering the compression coefficient Kc, that is, at compressing the data further. Thus, WinRAR implements a continuous (solid) archiving mechanism, with which compression 10-50% better than that of conventional methods can be achieved, especially when a significant number of small files of the same type are packed.

    The characteristics of archivers are inversely related: the higher the compression speed, the weaker the compression, and vice versa.

    There are many archivers on the market; each has its own set of supported formats, its own pros and cons, and its own circle of admirers who firmly believe that the archiver they use is the best. We will not try to dissuade anyone; we will simply try to evaluate the most popular archivers impartially in terms of functionality and efficiency. These are WinZip, WinRAR, WinAce and 7-Zip, which lead in the number of downloads on software servers. It is hardly worth considering other archivers, since the percentage of users who use them (judging by the number of downloads) is small.
