Data compression methods


    Data compression

    1. Information. Its types and properties

In the literature one can find quite a few definitions of the term "information", reflecting different approaches to interpreting this concept. Ozhegov's dictionary of the Russian language gives two definitions of the word "information":

    Information about the surrounding world and the processes occurring in it, perceived by a person or a special device.

    Messages informing about the state of affairs, the state of something. (Scientific, technical and newspaper information, media - print, radio, television, cinema).

Information and its properties are the object of study of a number of scientific disciplines: information theory (the mathematical theory of information transmission systems), cybernetics (the science of communication and control in machines and animals, as well as in society and human beings), semiotics (the science of signs and sign systems), the theory of mass communication (the study of the mass media and their influence on society), computer science (the study of the processes of collecting, transforming, storing, protecting, searching for and transmitting all types of information and the means of their automated processing), and physics and mathematics.

Information has a dual nature: it is material - it can be transmitted, stored, and so on - and intangible - as it is transmitted it can be replenished. Information cannot exist without a material carrier, a means of transferring it in space and time. The carrier can be a physical object itself or its energy equivalent in the form of sound, light, electrical and other signals.

Many methods have now been invented for storing information on external media (external relative to the human brain) and for transmitting it over vast distances.

The main types of information, classified by its form of presentation and by the methods of encoding and storing it, that are of greatest importance for computer science are:

· graphic or pictorial - the first type for which a method of storing information about the surrounding world was implemented: first in the form of rock paintings, and later as paintings, photographs, diagrams and drawings on paper, canvas, marble and other materials depicting pictures of the real world;

· sound - the world around us is full of sounds, and the problem of storing and reproducing them was solved with the invention of sound-recording devices in 1877; a subtype is musical information, for which a coding method using special symbols (musical notation) was invented, making it possible to store it in much the same way as graphic information;

· text - a method of encoding human speech with special symbols, letters; different peoples have different languages and use different sets of letters to represent speech; this method became especially important after the invention of paper and printing;

· numeric - a quantitative measure of the objects of the surrounding world and their properties; it acquired especially great importance with the development of trade, economics and monetary exchange; as with text information, a coding method with special symbols - digits - is used to represent it, and the coding (number) systems can differ;

· video information - a method of preserving "living" pictures of the surrounding world, which appeared with the invention of cinema.

To transmit information over long distances, coded light signals were used at first; with the invention of electricity came the transmission of a suitably coded signal over wires, and later by means of radio waves.

With the advent of computers (initially called, in Russia, EVMs - electronic computing machines), a means first appeared for processing numerical information. Later, especially after personal computers (PCs) became widespread, computers began to be used for storing, processing, transmitting and retrieving text, numeric, visual, sound and video information. Since the advent of the first personal computers in the 1980s, up to 80% of their working time has been devoted to working with text information.

When computers are used, information is stored on magnetic disks or tapes, on optical discs (CDs and DVDs), and in special non-volatile memory devices (flash memory, etc.). These methods are constantly being improved, and new devices and storage media are being invented. Information processing (reproduction, transformation, transmission, recording on external media) is performed by the computer's processor. New information of any kind can be created and stored with a computer using special programs and information input devices.

A special type of information today is the information presented in the global Internet. Special techniques are used there for storing, processing, searching and transmitting large volumes of distributed information, along with special ways of working with the various types of information. The software that supports collective work with information of all types is constantly being improved.

    Information Properties

Many different properties of information can be listed. Each scientific discipline considers those most important to it. From the point of view of computer science, the following properties seem the most important:

1. Objectivity and subjectivity of information. Information is considered more objective when the methods of obtaining it introduce less of a subjective element. In the course of an information process, the degree of objectivity of information always decreases.

    2. Completeness of information. Completeness of information largely characterizes the quality of information and determines the sufficiency of data for making decisions or creating new data based on existing ones.

    3. Reliability of information. Data arises when signals are recorded, but not all signals are “useful” - there is always a level of extraneous signals present.

    5. Availability of information.

    6. Relevance.

    2. Data compression

There is a well-known rule in the computer world: hard drive capacity can never be too large. Indeed, it is hard to disagree: no matter how huge a hard drive may seem when you buy it, it quickly fills up with all sorts of unnecessary information. Since it is a pity to delete everything, it is worth packing this material, time and again, into some kind of storage - an archive.

Data compression is a procedure for re-encoding data carried out in order to reduce its volume. It is used to make more rational use of data storage and transmission devices. When compression methods are applied to finished documents, the term "data compression" is often replaced by the term "data archiving".

Compression is based on eliminating the redundancy contained in the source data. The simplest example of redundancy is the repetition of fragments in a text (for example, words of a natural or machine language). Such redundancy is usually eliminated by replacing a repeated sequence with a shorter value (a code). Another kind of redundancy arises because some values in the data occur more often than others; in that case frequently occurring data can be replaced with shorter codes and rare data with longer ones (probabilistic compression). Compressing data that has no redundancy (for example, a random signal or noise) is impossible without loss. Likewise, it is usually impossible to compress encrypted information.

    Compression algorithms for texts/files of unknown format

There are two main approaches to compressing files of unknown format.

In the first approach, at each step of the compression algorithm either the next character is output as is (with a special flag indicating that it is not compressed), or the boundaries of a fragment of the previously processed data that coincides with the characters that follow are output. Decompressing files compressed in this way is very fast, so these algorithms are used for creating self-extracting programs.

In the second approach, statistics of occurrence in the file are collected for each symbol or sequence as the data is processed. Based on these statistics, the probability of each value of the next symbol is calculated. Some form of statistical coding, such as arithmetic coding or Huffman coding, can then be applied so that frequently occurring sequences receive shorter codes and infrequent ones longer codes.
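As a minimal sketch of the statistical approach (an illustration only, not taken from any particular archiver), the following C++ fragment collects byte frequencies from a buffer and turns them into the probability estimates that an entropy coder such as a Huffman or arithmetic coder would consume:

#include <array>
#include <cstdint>
#include <cstdio>
#include <string>

// Count how often each byte value occurs and convert the counts to
// probability estimates; an entropy coder (Huffman, arithmetic) would
// assign shorter codes to the more probable values.
std::array<double, 256> byte_probabilities(const std::string& data) {
    std::array<uint64_t, 256> counts{};              // zero-initialized
    for (unsigned char c : data) ++counts[c];
    std::array<double, 256> p{};
    if (!data.empty())
        for (int i = 0; i < 256; ++i)
            p[i] = static_cast<double>(counts[i]) / data.size();
    return p;
}

int main() {
    const std::string text = "abracadabra";
    auto p = byte_probabilities(text);
    std::printf("p('a') = %.2f, p('b') = %.2f\n", p['a'], p['b']);   // 0.45, 0.18
    return 0;
}

In an adaptive coder the same counts would simply be updated after every symbol, so that the probabilities always reflect the data seen so far.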

Compression can be lossless (when it is possible to restore the original data without distortion) or lossy (when restoration is possible only with distortions that are insignificant from the point of view of the further use of the recovered data). Lossless compression is commonly used for computer programs and data, and less often for reducing the volume of audio, photo and video information. Lossy compression is used to reduce the volume of audio, photo and video information, and it is much more effective than lossless compression.

    3. Data compression software

When compression methods are applied to finished documents, the term "data compression" is often replaced by the term "data archiving", and the programs that perform these operations are called archivers.

Archivers are designed to compress files, i.e. to reduce the disk space they occupy. Using special methods of packing information, they compress the information on disks, creating copies of files combined into a single archive file. Even though computer memory volumes are constantly growing, the need for archiving does not decrease.

    So, archiving can come in handy:

1) When storing copies of files on floppy disks, because a floppy disk is limited in size;

    2) To free up space on your hard drive;

    3) When transmitting information over a network.

Archiving information is a transformation of information in which its volume decreases while the amount of information it carries remains the same.

The compressed file is called an archive. An archive file is a specially organized file containing one or more files in compressed or uncompressed form, together with service information such as their names.

The degree of compression depends on the type of source file, the program used and the chosen packing method. Graphic objects, text files and data files compress best: for them the compressed size can reach 5-40% of the original. Executable programs and load modules compress less well, to about 60-90% of the original.

Various developers have created many archiver programs. The most common of them for Windows are WinRAR and WinZip.

In terms of popularity, the WinRAR archiver is without doubt in first place in Russia and among the leaders worldwide. The archiver was developed by Eugene Roshal in 2003. The program provides full management of files in archives, recovery of damaged archives, encryption, and the creation of self-extracting and multi-volume archives.

    WinZip is one of the most popular programs on the Internet, having collected a significant number of awards from various computer publications all over the world.

The Zip algorithm itself is freely used in dozens of programs; nevertheless, for many Windows users WinZip is the standard program for working with archives. WinZip's built-in archive-processing tools allow you to pack, view and extract files from widely used archive formats such as ZIP, CAB, Microsoft Compress, GZIP, TAR, etc. WinZip is very simple and easy to use.

However, it is not always justified to use separate archivers with their own graphical shells. Often the most convenient shell for archivers is an ordinary file manager, for example Windows Commander, which can view and unpack archive files in the ZIP, ARJ, RAR, TAR, GZ, CAB and ACE formats. After all, most operations on files, including archives, are performed in such managers.

    4. Lossy data compression

    Lossy data compression is a data compression method where the decompressed file is different from the original, but is "close enough" to be useful in some way. This type of compression is often used on the Internet, especially in streaming and telephony. These methods are often called codecs in this context. An alternative is lossless compression.

    Types of lossy compression

    There are two main lossy compression schemes:

    In transform codecs, frames of images or sound are taken, cut into small segments, transformed into a new basis space, and quantized. The result is then compressed using entropy methods.

    In predictive codecs, previous and/or subsequent data is used to predict the current frame of an image or sound. The error between the predicted data and the actual data, along with the additional information needed to make the prediction, is then quantized and encoded.

    Some systems combine these two techniques by using transform codecs to compress error signals generated during the prediction stage.
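The sketch below is a toy DPCM-style illustration of the predictive idea (an assumption-laden example, not any specific codec): each sample is predicted from the previously reconstructed sample, only the quantized prediction error is stored, and the decoder repeats the same prediction to rebuild an approximation of the signal. The quantization step controls how much information is lost.

#include <cstdint>
#include <vector>

// Toy lossy DPCM coder: predict each sample as the previous reconstructed
// sample, quantize the prediction error with a fixed step, and keep only
// the quantized residuals. (No clamping, kept minimal for illustration.)
std::vector<int> dpcm_encode(const std::vector<int16_t>& samples, int step) {
    std::vector<int> residuals;
    int prediction = 0;                         // decoder starts from the same state
    for (int16_t s : samples) {
        int error = s - prediction;
        int q = (error >= 0 ? error + step / 2 : error - step / 2) / step;  // round to nearest
        residuals.push_back(q);
        prediction += q * step;                 // track what the decoder will reconstruct
    }
    return residuals;
}

std::vector<int16_t> dpcm_decode(const std::vector<int>& residuals, int step) {
    std::vector<int16_t> out;
    int prediction = 0;
    for (int q : residuals) {
        prediction += q * step;
        out.push_back(static_cast<int16_t>(prediction));
    }
    return out;
}

Because the residuals cluster around zero, they compress well with a subsequent entropy coder, which is how the prediction and transform stages are combined in practice.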

    Lossy vs. Lossless Compression

    The advantage of lossy compression methods over lossless compression methods is that the former are significantly superior in terms of compression ratio, while continuing to meet the specified requirements.

    Lossy compression methods are often used to compress audio or images.

In such cases, the decompressed file may differ greatly from the original at the bit level but is virtually indistinguishable to the human ear or eye in most practical applications.

    Many methods focus on the structural features of human sensory organs. The psychoacoustic model determines how much sound can be compressed without degrading the perceived sound quality. Imperfections caused by lossy compression that are noticeable to the human ear or eye are known as compression artifacts.

Audio data that has undergone lossy compression is not accepted by courts as material evidence (and is not even taken into account), because compressed information acquires compression artifacts and loses the natural noise of the environment in which the recording was made; it therefore becomes impossible to determine whether the recording is genuine or synthesized. For this reason it is recommended to make important recordings in PCM format or to use a tape recorder.

Photos recorded in JPEG format can be accepted by a court (even though the data has undergone lossy compression), but the camera with which they were taken, or the corresponding colour-rendition table, must be provided.

    Lossy data compression methods

Image compression:

    · Reduced color depth;

    · Principal component method;

    · Fractal compression;

Video compression:

· Flash (also supports Motion JPEG);

    · MPEG-1 Part 2;

    · MPEG-2 Part 2;

    · MPEG-4 Part 2;

Audio compression:

    · MP3 - Defined by the MPEG-1 specification;

· Ogg Vorbis (notable for the absence of patent restrictions and for higher quality);

· AAC, AAC+ - exists in several versions, defined by the MPEG-2 and MPEG-4 specifications; used, for example, by Apple Computer;

    · eAAC+ - a format offered by Sony as an alternative to AAC and AAC+;

    · WMA is the property of Microsoft;


    5. Data compression without loss of information

Lossless compression is a method of information compression in which the encoded information can be restored with bit-for-bit accuracy: the original data is completely recovered from its compressed form. This type of compression is fundamentally different from lossy data compression. For each type of digital information there are, as a rule, optimal lossless compression algorithms of its own.

Lossless data compression is used in many applications. For example, it is used in the popular ZIP file format and in the Unix utility gzip. It is also used as a component in lossy compression.

Lossless compression is used when it is important that the compressed data be identical to the original after decompression. Common examples are executable files and source code. Some graphic file formats, such as PNG and GIF, use only lossless compression, while others (TIFF, MNG) can use both lossy and lossless compression.

    Lossless compression technique

It follows from combinatorics that no lossless compression algorithm can shorten every file by even one byte: there are more distinct files of a given length than there are distinct shorter files, so some inputs must map to outputs that are no shorter. This, however, says nothing about the quality of a compression algorithm - an algorithm only needs to work effectively on the data for which it is designed.

Multi-purpose compression algorithms are distinguished by the fact that they can reduce a wide range of data - executable files, data files, texts, graphics, etc. - and are used in archivers. Specialized algorithms are designed for a particular type of file (text, graphics, sound, etc.) but compress such files much more strongly. For example, archivers compress audio by only about a third (a factor of 1.5), while FLAC compresses it by a factor of 2.5. Most specialized algorithms are of little use for "foreign" file types: for example, audio data is compressed poorly by an algorithm designed for texts.

Most lossless compression algorithms operate in two stages: the first builds a statistical model of the incoming data, and the second maps the incoming data onto a bit representation, using the model so that "probable" (that is, frequently occurring) data receive shorter codes than "improbable" data.

Statistical modelling approaches for text (or text-like binary data such as executable files) include:

· the Burrows-Wheeler transform (a block-sorting preprocessing step that makes compression more efficient);

· LZ77 and LZ78 (used in DEFLATE).

Coding algorithms that generate the output bit sequences include:

· the Huffman algorithm (also used in DEFLATE);

· arithmetic coding.

    Lossless compression methods

    · Multi-purpose

· Run-length encoding - a simple scheme that gives good compression for data containing many runs of repeated values;

· LZW - used in GIF and many other formats;

· Deflate - used in gzip, in ZIP, and as part of the PNG compression process;

    · LZMA - used in 7-zip.

Audio compression:

    · Apple Lossless - ALAC (Apple Lossless Audio Codec);

    · Audio Lossless Coding - also known as MPEG-4 ALS;

    · Direct Stream Transfer - DST;

    · Free Lossless Audio Codec - FLAC;

Graphics compression:

    · ABO - Adaptive Binary Optimization;

· GIF - (lossless only for images containing no more than 256 colors);

· JBIG2 - (lossy or lossless compression of black-and-white images);

    · JPEG-LS - (lossless / almost lossless compression standard);

    · JPEG 2000 - (includes lossless compression; also tested by Sunil Kumar, professor at San Diego State University);

· PGF - Progressive Graphics File (lossless/lossy compression);

    · PNG - Portable Network Graphics;

    · WMPhoto - (including lossless compression method);

Video compression:

    · Animation codec;

    · CamStudio Video Codec;

    6. Storage of information (text, graphic, sound)

    Information is stored using certain storage media. A person stores his knowledge either in his own memory or on some external media.

At first, a person used his memory to store and accumulate information: he simply memorized the information received and remembered it for some time. Gradually, people came to the conclusion that this method of storing information has a number of disadvantages. Realizing its unreliability, man began to record information in the form of drawings and, with the invention of writing, on papyri and later in books. Then photographic plates and sound-recording devices appeared as external memory for video and audio information, along with notebooks, reference books, encyclopedias and the like - what we call external data storage. By the middle of the 20th century the computer had been invented, and the question immediately arose of how it would store information.

An information medium can be of various natures: paper, mechanical, magnetic, electrical. Information recorded on a medium can be in a symbolic form understandable to humans or in coded form. Sound for tape recorders, video recorders and movie cameras is stored on special media: audio cassettes, video cassettes, film. Using a microphone and other devices, audio information is recorded onto magnetic tape.

In computers, the following devices for recording and reading information came into use: punched-card readers; magnetic tape drives; floppy-disk and hard-disk drives; compact-disc drives (CD-ROM); and other, more modern devices for accumulating and storing information.


    Purpose of the lecture: study the main types and algorithms of data compression and learn to solve data compression problems using the Huffman method and using code trees.

Claude Shannon is considered the founder of the science of information compression. His optimal coding theorem shows what we should strive for when encoding information and by how much a given piece of information can be compressed. In addition, he conducted experiments on the empirical assessment of the redundancy of English text: Shannon asked people to guess the next letter and estimated the probability of a correct guess. On the basis of a series of experiments he concluded that the amount of information in English text ranges from 0.6 to 1.3 bits per character. Even though the results of Shannon's research were truly in demand only decades later, it is difficult to overestimate their importance.

Data compression is a process that reduces the volume of data by reducing its redundancy. Data compression is associated with the compact arrangement of portions of data of a standard size. Data compression can be divided into two main types:

• Lossless (fully reversible) compression is a data compression method in which a previously encoded portion of data is restored after unpacking completely and without changes. For each type of data there are, as a rule, optimal lossless compression algorithms.
• Lossy compression is a data compression method in which, in order to achieve the maximum degree of compression of the original data array, part of the data it contains is discarded. For text, numeric and tabular data, the use of programs implementing such methods is unacceptable. Such algorithms are mainly used to compress audio and video data and still images.

A data compression algorithm (archiving algorithm) is an algorithm that eliminates redundancy in the recording of data.

    Let us introduce a number of definitions that will be used further in the presentation of the material.

Code alphabet – the set of all symbols of the input stream. When compressing English-language texts, the set of 128 ASCII codes is usually used. When compressing images, the set of pixel values may contain 2, 16, 256 or another number of elements.

Code symbol – the smallest unit of data subject to compression. Typically a symbol is one byte, but it can also be a bit, a trit (0, 1, 2), or something else.

Code word – a sequence of code symbols from the code alphabet. If all words have the same length (number of symbols), the code is called uniform (fixed-length); if words of different lengths are allowed, the code is called non-uniform (variable-length).

Code – the complete set of code words.

Token – a unit of data written to the compressed stream by a compression algorithm. A token consists of several fields of fixed or variable length.
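For illustration only (the field layout here is an assumption made for the sake of the example, not a definition from the text), a token of an LZ77-style compressor could be represented as either a literal byte passed through as is, or a back-reference described by an offset and a length:

#include <cstdint>

// One possible layout of an LZ77-style token (illustrative assumption):
// either a raw literal byte, or a back-reference into already-seen data
// given by (offset, length).
struct Token {
    bool is_match;      // flag field: literal or match
    uint8_t literal;    // valid when is_match == false
    uint16_t offset;    // distance back into the window, valid when is_match == true
    uint16_t length;    // match length, valid when is_match == true
};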

Phrase – a piece of data placed in a dictionary for later use in compression.

Coding – the process of data compression.

Decoding – the process inverse to coding, in which the data is restored.

Compression ratio is one of the quantities most commonly used to characterize the efficiency of a compression method. It is the ratio of the size of the output stream to the size of the input stream:

compression ratio = (size of output stream) / (size of input stream).

A value of 0.6 means that the data occupies 60% of its original volume. Values greater than 1 mean that the output stream is larger than the input stream (negative compression, or expansion).

Compression factor – the reciprocal of the compression ratio.

Values greater than 1 indicate compression, and values less than 1 indicate expansion.

Average code word length is a quantity calculated as the probability-weighted sum of the lengths of all the code words:

L_avg = p_1·L_1 + p_2·L_2 + ... + p_n·L_n,

where p_1, p_2, ..., p_n are the probabilities of the code words and L_1, L_2, ..., L_n are their lengths.
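For example (an illustrative calculation, not taken from the source): for four code words with probabilities 0.4, 0.3, 0.2 and 0.1 and lengths 1, 2, 3 and 3 bits, the average code word length is L_avg = 0.4·1 + 0.3·2 + 0.2·3 + 0.1·3 = 1.9 bits per symbol.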

    There are two main ways to perform compression.

Statistical methods – compression methods that assign variable-length codes to the symbols of the input stream, with shorter codes assigned to symbols or groups of symbols that are more likely to appear in it. The best statistical methods use Huffman coding.

Dictionary methods – compression methods that store pieces of data in a "dictionary" (some data structure). If a string of new input data is identical to a fragment already in the dictionary, a pointer to that fragment is placed in the output stream. The best dictionary methods use the Ziv-Lempel approach.

    Let's look at several well-known data compression algorithms in more detail.

    Huffman method

This information coding algorithm was proposed by D. A. Huffman in 1952. Huffman coding (compression) is a widely used compression method that assigns variable-length codes to the symbols of an alphabet based on the probabilities with which those symbols appear.

The idea of the algorithm is as follows: knowing the probabilities of occurrence of the symbols in the source text, one can describe a procedure for constructing variable-length codes consisting of an integer number of bits. More probable symbols are assigned shorter codes. Thus, in this method, when data is compressed, each symbol is assigned an optimal prefix code based on the probability of its occurrence in the text.

    Prefix code is a code in which no codeword is a prefix of any other codeword. These codes have variable length.

    Optimal prefix code is a prefix code that has a minimum average length.

The Huffman algorithm can be divided into two stages.

    1. Determining the probability of characters appearing in the source text.

      Initially, you need to read the entire source text and calculate the probabilities of characters appearing in it (sometimes counting the number of times each character appears). If all 256 characters are taken into account, then there will be no difference in compression of a text file or a file of another format.

    2. Finding the optimal prefix code.

Next, the two symbols a and b with the lowest probabilities of occurrence are found and replaced with one fictitious symbol x, whose probability of occurrence equals the sum of the probabilities of a and b. Applying this procedure recursively, the optimal prefix code is found for the smaller set of symbols (in which a and b are replaced by the single symbol x). The codes for the original symbols are then obtained from the code of the replacement symbol: appending 0 to the code of x gives the code of a, and appending 1 gives the code of b.

Huffman codes have the prefix property, which allows them to be decoded unambiguously despite their variable length.
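As an illustrative example (not taken from the source): for an alphabet with probabilities p(a) = 0.5, p(b) = 0.25, p(c) = p(d) = 0.125, one optimal prefix code is a→0, b→10, c→110, d→111. Its average length is 0.5·1 + 0.25·2 + 0.125·3 + 0.125·3 = 1.75 bits per symbol, and the bit string 010110111 decodes unambiguously as "abcd".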

Example 1. A software implementation of the Huffman method (a simplified C++ version).

#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <vector>

// A node of the Huffman tree: a leaf stores a symbol, an internal
// node stores the summed frequency of its two children.
struct Node {
    char symbol = 0;
    long freq = 0;
    Node* left = nullptr;
    Node* right = nullptr;
};

// Order nodes so that the priority queue pops the least frequent one first.
struct ByFreq {
    bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
};

// Walk the tree and record the bit string leading to each leaf.
void buildCodes(const Node* n, const std::string& prefix,
                std::map<char, std::string>& codes) {
    if (!n->left && !n->right) {                 // leaf
        codes[n->symbol] = prefix.empty() ? "0" : prefix;
        return;
    }
    buildCodes(n->left, prefix + "0", codes);
    buildCodes(n->right, prefix + "1", codes);
}

int main() {
    std::string text;
    std::cout << "Enter a string to encode: ";
    std::getline(std::cin, text);
    if (text.empty()) return 0;

    // Stage 1: count symbol frequencies in the source text.
    std::map<char, long> freq;
    for (char c : text) ++freq[c];

    // Stage 2: repeatedly merge the two least frequent nodes into one.
    std::vector<Node*> storage;                  // owns all allocated nodes
    std::priority_queue<Node*, std::vector<Node*>, ByFreq> pq;
    for (const auto& kv : freq) {
        storage.push_back(new Node{kv.first, kv.second});
        pq.push(storage.back());
    }
    while (pq.size() > 1) {
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        storage.push_back(new Node{0, a->freq + b->freq, a, b});
        pq.push(storage.back());
    }

    // Read off the prefix codes and encode the input string.
    std::map<char, std::string> codes;
    buildCodes(pq.top(), "", codes);
    std::string encoded;
    for (char c : text) encoded += codes[c];

    std::cout << "Encoded string: " << encoded << std::endl;
    for (Node* n : storage) delete n;
    return 0;
}

The Huffman algorithm is universal: it can be used to compress data of any type, but it is ineffective for small files (because of the need to store the code table). At present the method is practically never used in its pure form; it usually serves as one of the stages of compression in more complex schemes. It is the only algorithm that does not increase the size of the source data in the worst case (apart from the need to store the table of codes along with the file).

Data compression methods have a fairly long history of development that began long before the advent of the first computer. This article attempts to give a brief overview of the main theories and concepts and of their implementations, without, however, claiming absolute completeness. More detailed information can be found, for example, in the works of Krichevsky, Ryabko, Witten, Rissanen, Huffman, Gallager, Knuth, Vitter and others.

Information compression is a problem with a fairly long history - much longer than the history of computing technology - and one that has usually run parallel to the history of the problem of encoding and encrypting information. All compression algorithms operate on an input stream of information whose minimal unit is the bit and whose maximal units are several bits, a byte or several bytes. The purpose of the compression process is, as a rule, to obtain a more compact output stream of information units from an initially non-compact input stream by means of some transformation. The main technical characteristics of compression processes and of the results of their work are:

the degree of compression, or the ratio of the volumes of the original and resulting streams;

the speed of compression - the time spent compressing a certain volume of the input stream before an equivalent output stream is obtained;

the quality of compression - a value showing how tightly the output stream is packed, judged by applying re-compression to it with the same or another algorithm.

There are several different approaches to the problem of information compression. Some have a very complex theoretical and mathematical basis; others are based on properties of the information stream and are algorithmically quite simple. Any approach or algorithm that implements data compression is designed to reduce the volume of the output information stream, in bits, by means of a reversible or irreversible transformation of the input. Therefore, first of all, by a criterion related to the nature or format of the data, all compression methods can be divided into two categories: reversible and irreversible compression.

Irreversible compression is a transformation of the input data stream in which the output stream, based on a certain information format, represents an object fairly similar to the input stream in its external characteristics but differing from it in volume. The degree of similarity between the input and output streams is determined by the degree of correspondence of certain properties of the object (the compressed and uncompressed information, in accordance with some specific data format) represented by the given stream. Such approaches and algorithms are used, for example, to compress raster graphics files with a low degree of byte repetition in the stream. They exploit the structure of the graphics file format and the possibility of presenting the image in several (more precisely, n) ways that are approximately equivalent in display quality for the human eye. Therefore, in addition to the degree or magnitude of compression, the concept of quality arises in such algorithms: since the original image changes during compression, quality can be understood as the degree of correspondence between the original and the resulting image, assessed subjectively with respect to the information format. For graphics files this correspondence is judged visually, although there are also corresponding intelligent algorithms and programs. Irreversible compression cannot be used in areas where an exact match between the information structure of the input and output streams is required. This approach is implemented in popular formats for representing video and photo information, known as the JPEG and JFIF algorithms and the JPG and JIF file formats.

Reversible compression always leads to a reduction in the volume of the output information stream without changing its information content, that is, without loss of information structure. Moreover, the input stream can be obtained from the output stream using a reconstruction algorithm; this recovery process is called decompression or unpacking, and only after decompression is the data suitable for processing in accordance with its internal format.

In reversible algorithms, encoding as a process can be viewed from a statistical point of view, which is useful not only for constructing compression algorithms but also for assessing their effectiveness. For all reversible algorithms there is the concept of coding cost: the average length of a code word in bits. Coding redundancy is the difference between the coding cost and the entropy of the source, and a good compression algorithm should always minimize redundancy (recall that the entropy of information is a measure of its disorder). Shannon's fundamental theorem on the encoding of information says that the cost of encoding is never less than the entropy of the source, although it can be arbitrarily close to it. Therefore, for any algorithm there is always a limit on the achievable degree of compression, determined by the entropy of the input stream.
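For example (an illustrative calculation, not from the source): a source emitting three symbols with probabilities 0.5, 0.25 and 0.25 has entropy 0.5·1 + 0.25·2 + 0.25·2 = 1.5 bits per symbol. A prefix code with code lengths of 1, 2 and 2 bits then has a coding cost of exactly 1.5 bits and zero redundancy, while a fixed two-bit code has a cost of 2 bits and a redundancy of 0.5 bit per symbol.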

    Let us now move directly to the algorithmic features of reversible algorithms and consider the most important theoretical approaches to data compression associated with the implementation of encoding systems and methods of information compression.

Compression by the run-length encoding method

The best-known simple approach and algorithm for compressing information reversibly is run-length encoding (RLE). The essence of the methods in this approach is to replace chains, or runs, of repeating bytes (or byte sequences) with one coding byte and a counter of the number of repetitions. The problem for all such methods is merely how the decompressing algorithm is to distinguish an encoded run from other, unencoded byte sequences in the resulting stream. The problem is usually solved by placing marks at the beginning of the encoded runs; such marks can be, for example, characteristic bit values in the first byte of an encoded run, the value of the first byte of an encoded run, and so on. These methods are, as a rule, quite effective for compressing raster graphics images (BMP, PCX, TIF, GIF), because the latter contain a great many long runs of repeating byte sequences. The disadvantage of the RLE method is the rather low compression ratio, or the cost of encoding files with a small number of runs and, even worse, with a small number of repeating bytes within runs.
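For example (an illustrative encoding whose exact byte format is chosen here only for the sake of the example): the 15-byte sequence AAAAAAABBBBBCCC can be stored as the three pairs (7, A)(5, B)(3, C), i.e. 6 bytes, whereas a sequence without repetitions, such as ABCDE, would grow if every byte had to carry its own counter - which is why practical RLE schemes mark encoded runs with a flag byte or a reserved bit.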

    Compression without using the RLE method

The process of data compression without using the RLE method can be divided into two stages: modelling and encoding proper. These processes, and the algorithms implementing them, are quite independent and diverse.

    Coding process and its methods

Coding usually means processing a stream of symbols (in our case bytes or nibbles) over some alphabet in which the symbols occur with different frequencies. The purpose of coding is to convert this stream into a stream of bits of minimal length, which is achieved by reducing the redundancy of the input stream by taking symbol frequencies into account. The length of the code representing a symbol of the stream's alphabet should be proportional to the amount of information it carries, and the length of the stream's symbols, in bits, need not be a multiple of 8 and may even be variable. If the probability distribution of the frequencies of the symbols of the input alphabet is known, an optimal coding model can be constructed. However, because of the huge number of different file formats the task becomes much more complicated, since the frequency distribution of the data symbols is not known in advance. In that case, two approaches are used in general.

The first is to scan the input stream and construct the encoding on the basis of the collected statistics. This requires two passes through the file - one to scan it and collect statistical information, the second to encode it - which somewhat limits the scope of such algorithms, since it rules out the single-pass, on-the-fly encoding used in telecommunication systems, where the volume of data is sometimes unknown and its retransmission or re-parsing could take an unreasonably long time. In this case, the statistical scheme used for encoding is written to the output stream. This method is known as static Huffman coding.

    Introduction.

Compression reduces the amount of space required to store files on a computer and the amount of time required to transmit information over a channel of fixed bandwidth. It is a form of coding. Other purposes of coding are error detection and correction, and encryption. Error detection and correction is the opposite of compression: it increases the redundancy of data when they do not need to be presented in a form convenient for human perception. By removing redundancy from text, compression assists encryption, since it makes it harder for an attacker to break the cipher by statistical methods.

Let us consider reversible compression, or compression without loss, where the original text can be exactly restored from its compressed state. Irreversible, or lossy, compression is used for the digital recording of analog signals such as human speech or drawings. Reversible compression is especially important for texts written in natural and artificial languages, since in this case errors are usually unacceptable. Although the primary field of application of the methods under consideration is text compression, which is reflected in our terminology, the technique may find application in other cases as well, including the reversible coding of sequences of discrete data.

There are many good reasons to allocate computer resources to a compressed representation, since faster data transfer and reduced storage space allow significant money to be saved and often improve the performance of a computer. Compression is likely to remain in the focus of attention because of the ever-increasing volumes of data stored and transmitted by computers; in addition, it can be used to overcome certain physical limitations, such as the relatively low bandwidth of telephone channels.

    APPLICATION OF EXPANDING TREES FOR DATA COMPRESSION.

Compression algorithms can improve the efficiency of data storage and transmission by reducing their redundancy. A compression algorithm takes the source text as input and produces the corresponding compressed text, while a decompression algorithm takes the compressed text as input and produces the original source text as output. Most compression algorithms treat the source text as a set of strings composed of letters of the source text's alphabet.

The redundancy of a representation of a string S is L(S) - H(S), where L(S) is the length of the representation in bits and H(S) is its entropy - a measure of information content, also expressed in bits. There is no algorithm that can compress a string, without loss of information, into fewer bits than its entropy. If the letters of the source text are extracted one at a time from some random source over the alphabet A, then the entropy is given by the formula

H(S) = C(S) · Σ_{c ∈ A} p(c) · log2(1 / p(c)),

where C(S) is the number of letters in the string and p(c) is the static probability of occurrence of the letter c. If the frequency of occurrence of each letter c in the string S is used to estimate p(c), then H(S) is called the self-entropy of the string S. In this article H(S) will be used to denote the self-entropy of a string taken from a static source.
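As an illustrative calculation (not part of the original article): for the 8-letter string "AABABCAB" the letter frequencies give p(A) = 4/8, p(B) = 3/8 and p(C) = 1/8, so the self-entropy is H(S) = 8 · (0.5·log2 2 + 0.375·log2(8/3) + 0.125·log2 8) ≈ 11.2 bits, whereas a fixed 2-bit code for the three letters would spend 16 bits.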

Expanding trees usually describe forms of lexicographically ordered binary search trees, but the trees used in data compression need not maintain a constant ordering. Eliminating the ordering leads to a significant simplification of the basic expansion operations. The resulting algorithms are extremely fast and compact. When expansion is applied to Huffman codes, it yields a locally adaptive compression algorithm that is remarkably simple and fast, although it does not achieve optimal compression. When it is applied to arithmetic codes, the result is close to optimal in compression and approximately optimal in time.

    PREFIX CODES.

The most widely studied data compression algorithms are based on Huffman codes. In a Huffman code, each letter of the source text is represented in the archive by a variable-length code. More frequent letters are represented by short codes, less frequent ones by long codes. The codes used in the compressed text must satisfy the prefix property, namely: a code used in the compressed text may not be a prefix of any other code.

Prefix codes can be represented by a tree in which each leaf corresponds to one letter of the source alphabet. Figure 1 shows the prefix code tree for a four-letter alphabet. The prefix code of a letter can be read off by traversing the tree from the root to that letter, where 0 corresponds to choosing the left branch and 1 the right branch. A Huffman code tree is a weight-balanced tree in which each leaf has a weight equal to the frequency of occurrence of its letter in the source text, while the internal nodes have no weight of their own. The tree in the example will be optimal if the frequencies of the letters A, B, C and D are 0.125, 0.125, 0.25 and 0.5 respectively.

Conventional Huffman codes require prior information about the frequencies of occurrence of the letters in the source text, which makes it necessary to scan the text twice: once to obtain the letter frequencies and once to perform the compression itself. The frequency values must subsequently be combined with the compressed text itself so that it can later be decompressed. Adaptive compression is performed in a single pass, because the code used for each letter of the source text is based on the frequencies of the letters already processed. The foundations for efficient implementations of adaptive Huffman coding were laid by Gallager, Knuth published a practical version of such an algorithm, and Vitter developed it further.

Vitter's optimal adaptive code always lies within one bit per source letter of the optimal static Huffman code, which in turn is usually within a few percent of H. In addition, static Huffman codes always lie within one bit per source letter of H (they coincide with H only when, for all letters, p(C) is an integer power of 1/2). There are compression algorithms that can overcome these limitations. The Ziv-Lempel algorithm, for example, assigns fixed-length archive words to variable-length strings of the source text, and arithmetic compression can spend even a fraction of a bit on encoding a source letter.

Applying expansion to prefix codes.

Expanding (splay) trees were first described in 1983 and were considered in more detail in 1985. Initially they were understood as a kind of self-balancing binary search tree, and it was also shown that they allow the fastest implementation of priority queues. When a node of an expanding tree is accessed, it is expanded: the accessed node becomes the root, all nodes to its left form the new left subtree, and the nodes to its right form the new right subtree. Expansion is achieved by traversing the tree from the old root to the target node and making local changes along the way, so the cost of expansion is proportional to the length of the path traversed.

Tarjan and Sleator showed that expanding trees are statically optimal. In other words, if the accessed nodes are drawn according to a static probability distribution, then the access speeds of an expanding tree and of a statically balanced tree optimized for that distribution differ from each other only by a constant factor, noticeable over sufficiently long series of accesses. Since a Huffman tree is an example of a statically balanced tree, when expansion is used for data compression the size of the compressed text lies within a constant factor of the size of the archive obtained with a Huffman code.

As originally described, expansion applies to trees that store data in their internal nodes rather than in the leaves. Prefix code trees carry all of their data only in the leaves. There is, however, a variant of expansion, called semi-expansion, that is applicable to prefix code trees. In it the target node is not moved to the root and its descendants are not modified; instead, the path from the root to the target is simply halved in length. Semi-expansion achieves the same theoretical bounds, to within a constant factor, as expansion.

In the case of a zigzag traversal of a lexicographic tree, carrying out both expansion and semi-expansion becomes more complicated, in contrast to a direct route along the left or right edge of the tree to the target node. The simple case is shown in Figure 2. The effect of semi-expansion on the route from the root (node w) to the leaf node A is to swap the places of each pair of consecutive internal nodes, as a result of which the length of the path from the root to the leaf node is halved. In the process of semi-expansion, the node of each pair that lies farther from the root is included in the new path (nodes x and z), while the nearer one is excluded from it (nodes w and y).

Preservation of lexicographic order in prefix code trees by the semi-expansion operation is not required. The only thing that matters in operations on prefix codes is that the tree used by the compression procedure exactly match the tree used by the decompression procedure; any change made between the coding of consecutive letters is performed only if both procedures make the same changes in the same order.

Since lexicographic order does not have to be maintained, the semi-expansion operation is greatly simplified by eliminating the zigzag case.

    My supervisor and I are preparing a small monograph on image processing. I decided to present to the Habra community a chapter dedicated to image compression algorithms. Since it’s hard to fit a whole chapter into one post, I decided to break it into three posts:
    1. Data compression methods;
    2. Lossless image compression;
    3. Lossy image compression.
    Below you can read the first post in the series.

    Currently, there are a large number of lossless compression algorithms, which can be divided into two large groups:
    1. Stream and dictionary algorithms. This group includes algorithms from the RLE (run-length encoding), LZ*, etc. families. A feature of all algorithms in this group is that the encoding uses not information about the frequencies of symbols in the message, but information about sequences encountered previously.
2. Algorithms for statistical (entropy) compression. This group of algorithms compresses information by exploiting the unequal frequencies with which different characters occur in a message. Algorithms of this group include arithmetic coding and prefix coding algorithms (using Shannon-Fano trees, Huffman trees, secant trees).
    Algorithms for converting information can be classified into a separate group. Algorithms of this group do not directly compress information, but their use greatly simplifies further compression using stream, dictionary and entropy algorithms.

    Stream and Dictionary Algorithms

    Run length encoding

Run-Length Encoding (RLE) is one of the simplest and most common data compression algorithms. In this algorithm, a sequence of repeated characters is replaced by the character and the number of times it is repeated.
For example, the string "AAAAA", which requires 5 bytes to store (assuming one byte is allocated per character), can be replaced by "5A", consisting of two bytes. Obviously, the longer the run of repetitions, the more effective this algorithm is.

    The main disadvantage of this algorithm is its extremely low efficiency on sequences of non-repeating characters. For example, if we consider the sequence “ABABAB” (6 bytes), then after applying the RLE algorithm it will turn into “1A1B1A1B1A1B” (12 bytes). There are various methods to solve the problem of non-repeating characters.

    The simplest method is the following modification: the byte encoding the number of repetitions should store information not only about the number of repetitions, but also about their presence. If the first bit is 1, then the next 7 bits indicate the number of repetitions of the corresponding character, and if the first bit is 0, then the next 7 bits indicate the number of characters that must be taken without repetition. If we encode “ABABAB” using this modification, we will get “-6ABABAB” (7 bytes). It is obvious that the proposed technique can significantly increase the efficiency of the RLE algorithm on non-repeating character sequences. The implementation of the proposed approach is shown in Listing 1:

type
  TRLEEncodedString = array of byte;

function RLEEncode(InMsg: ShortString): TRLEEncodedString;
var
  MatchFl: boolean;
  MatchCount: shortint;
  EncodedString: TRLEEncodedString;
  N, i: word; { word rather than byte: the output can exceed 255 bytes }
begin
  N := 0;
  SetLength(EncodedString, 2 * length(InMsg));
  while length(InMsg) >= 1 do
  begin
    { does a run of repeated characters start here? }
    MatchFl := (length(InMsg) > 1) and (InMsg[1] = InMsg[2]);
    MatchCount := 1;
    while (MatchCount <= 126) and (MatchCount < length(InMsg)) and
          ((InMsg[MatchCount] = InMsg[MatchCount + 1]) = MatchFl) do
      MatchCount := MatchCount + 1;
    if MatchFl then
    begin
      { repeated run: store (count + 128) and the repeated character }
      N := N + 2;
      EncodedString[N - 2] := MatchCount + 128;
      EncodedString[N - 1] := ord(InMsg[1]);
    end
    else
    begin
      { literal run: store (128 - count) followed by the characters themselves }
      if MatchCount <> length(InMsg) then
        MatchCount := MatchCount - 1;
      N := N + 1 + MatchCount;
      EncodedString[N - 1 - MatchCount] := -MatchCount + 128;
      for i := 1 to MatchCount do
        EncodedString[N - 1 - MatchCount + i] := ord(InMsg[i]);
    end;
    delete(InMsg, 1, MatchCount);
  end;
  SetLength(EncodedString, N);
  RLEEncode := EncodedString;
end;

    Decoding a compressed message is very simple and boils down to a single pass through the compressed message, see Listing 2:
    type
      TRLEEncodedString = array of byte;

    function RLEDecode(InMsg: TRLEEncodedString): ShortString;
    var
      RepeatCount: shortint;
      i, j: word;
      OutMsg: ShortString;
    begin
      OutMsg := '';
      i := 0;
      while i < length(InMsg) do
      begin
        { a counter byte above 128 denotes a run, below 128 a literal block }
        RepeatCount := InMsg[i] - 128;
        i := i + 1;
        if RepeatCount < 0 then
        begin
          { literal block: copy the next |RepeatCount| characters unchanged }
          RepeatCount := abs(RepeatCount);
          for j := i to i + RepeatCount - 1 do
            OutMsg := OutMsg + chr(InMsg[j]);
          i := i + RepeatCount;
        end
        else
        begin
          { run: repeat the single stored character RepeatCount times }
          for j := 1 to RepeatCount do
            OutMsg := OutMsg + chr(InMsg[i]);
          i := i + 1;
        end;
      end;
      RLEDecode := OutMsg;
    end;
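    As a quick check of the two listings above, the following fragment (a small test harness of ours, not part of the original listings) encodes a string containing both a run and a literal block and then decodes it back:

    var
      Encoded: TRLEEncodedString;
      Decoded: ShortString;
    begin
      Encoded := RLEEncode('AAAAABCDE');   { a run of five 'A's followed by four literal characters }
      Decoded := RLEDecode(Encoded);
      writeln('packed size: ', length(Encoded), ' bytes');  { 7 bytes instead of 9 }
      writeln('round trip ok: ', Decoded = 'AAAAABCDE');
    end.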

    The second method for increasing the efficiency of the RLE algorithm is to use information transformation algorithms that do not compress the data directly, but bring it into a form more convenient for compression. As an example of such an algorithm we will consider the Burrows-Wheeler transform (BWT), named after its inventors. This permutation does not change the characters themselves, but only their order in the string; after it is applied, repeated substrings gather into dense groups, which are compressed much better by the RLE algorithm. The forward BWT transform comes down to the following steps:
    1. Adding to the original string a special end-of-line character that does not occur anywhere else;
    2. Obtaining all cyclic permutations of the original string;
    3. Sorting the received strings in lexicographic order;
    4. Returning the last column of the resulting matrix.
    An implementation of this algorithm is shown in Listing 3.
    const
      EOMsg = '|';

    function BWTEncode(InMsg: ShortString): ShortString;
    var
      OutMsg: ShortString;
      ShiftTable: array of ShortString;
      LastChar: AnsiChar;
      N, i: word;
    begin
      InMsg := InMsg + EOMsg;
      N := length(InMsg);
      SetLength(ShiftTable, N + 1);
      { build all cyclic shifts of the string }
      ShiftTable[1] := InMsg;
      for i := 2 to N do
      begin
        LastChar := InMsg[N];
        InMsg := LastChar + copy(InMsg, 1, N - 1);
        ShiftTable[i] := InMsg;
      end;
      { Sort is assumed to order the rows lexicographically; see the note on sorting below }
      Sort(ShiftTable);
      OutMsg := '';
      for i := 1 to N do
        OutMsg := OutMsg + ShiftTable[i][N];
      BWTEncode := OutMsg;
    end;

    The easiest way to explain this transform is with a concrete example. Let's take the string “ANANAS” and agree that the end-of-string character will be “|”. All cyclic permutations of this string and the result of their lexicographic sorting are given in Table 1.

    Table 1. Cyclic permutations of “ANANAS|” (left) and the same rows sorted lexicographically (right):
    ANANAS|        ANANAS|
    NANAS|A        ANAS|AN
    ANAS|AN        AS|ANAN
    NAS|ANA        NANAS|A
    AS|ANAN        NAS|ANA
    S|ANANA        S|ANANA
    |ANANAS        |ANANAS

    Thus, the result of the forward transform is the string “|NNAAAS” (the last column of the sorted table). It is easy to see that this string is compressed by the RLE algorithm much better than the original one, because it contains long runs of repeated letters.
    A similar effect can be achieved with other transformations, but the advantage of the BWT is that it is reversible, although the inverse transform is more complicated than the forward one. In order to restore the original string, the following steps must be performed:
    1. Create an empty matrix of size n*n, where n is the number of characters in the encoded message;
    2. Fill the rightmost empty column with the encoded message;
    3. Sort the table rows in lexicographic order;
    4. Repeat steps 2-3 as long as there are empty columns;
    5. Return the row that ends with the end-of-line character.

    At first glance, implementing the inverse transform presents no difficulty; one possible implementation is shown in Listing 4.

    const
      EOMsg = '|';

    function BWTDecode(InMsg: ShortString): ShortString;
    var
      OutMsg: ShortString;
      ShiftTable: array of ShortString;
      N, i, j: word;
    begin
      OutMsg := '';
      N := length(InMsg);
      SetLength(ShiftTable, N + 1);
      for i := 0 to N do
        ShiftTable[i] := '';
      { N times: prepend the encoded column to every row, then sort the rows }
      for i := 1 to N do
      begin
        for j := 1 to N do
          ShiftTable[j] := InMsg[j] + ShiftTable[j];
        Sort(ShiftTable);
      end;
      { the decoded message is the row that ends with the end-of-line character }
      for i := 1 to N do
        if ShiftTable[i][N] = EOMsg then
          OutMsg := ShiftTable[i];
      delete(OutMsg, N, 1);
      BWTDecode := OutMsg;
    end;

    In practice, however, the efficiency of this implementation depends heavily on the chosen sorting algorithm. Trivial algorithms with quadratic complexity will obviously have a very negative impact on performance, so it is recommended to use efficient sorting algorithms.
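    For completeness: the Sort procedure used in Listings 3 and 4 is not shown in the source. A minimal lexicographic sort that makes the listings self-contained could look like the sketch below (a simple insertion sort, adequate for short demonstration strings; as just noted, an efficient algorithm should be substituted for real data):

    { Placeholder for the Sort procedure assumed by Listings 3 and 4:
      sorts an array of ShortString in lexicographic order (insertion sort). }
    procedure Sort(var A: array of ShortString);
    var
      i, j: integer;
      tmp: ShortString;
    begin
      for i := 1 to High(A) do
      begin
        tmp := A[i];
        j := i - 1;
        while (j >= 0) and (A[j] > tmp) do
        begin
          A[j + 1] := A[j];
          j := j - 1;
        end;
        A[j + 1] := tmp;
      end;
    end;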

    After the final sorting of the table (the seventh pass for our seven-character example), you need to select the row that ends with the “|” character; it is easy to see that there is exactly one such row. Thus, we have walked through the BWT transform on a concrete example.

    To summarize, we can say that the main advantage of the RLE group of algorithms is the simplicity and speed of operation (including decoding speed), and the main disadvantage is inefficiency on non-repeating character sets. The use of special permutations increases the efficiency of the algorithm, but also greatly increases the running time (especially decoding).

    Dictionary compression (LZ algorithms)

    The group of dictionary algorithms, unlike the algorithms of the RLE group, encodes not the number of repetitions of characters but previously encountered sequences of characters. While the algorithms under consideration are running, a table is created dynamically with a list of already encountered sequences and their corresponding codes. This table is often called a dictionary, and the corresponding group of algorithms is called dictionary algorithms.

    The simplest version of the dictionary algorithm is described below:
    1. Initialize the dictionary with all the characters appearing in the input string;
    2. Find in the dictionary the longest sequence (S) that matches the beginning of the message being encoded;
    3. Output the code of the found sequence and remove it from the beginning of the message being encoded;
    4. If the end of the message has not been reached, read the next character c, add the sequence Sc to the dictionary and go to step 2; otherwise, exit.

    For example, the freshly initialized dictionary for the phrase “KUKUSHKAKUKUSHONKUKUPILAKAPYUSHON” (a Russian tongue-twister written without spaces) is shown in Table 3:

    During the compression process, the dictionary will be supplemented with sequences found in the message. The process of updating the dictionary is given in Table. 4.

    When describing the algorithm, the situation in which the dictionary becomes completely full was deliberately left out. Depending on the variant of the algorithm, different behaviour is possible: complete or partial clearing of the dictionary, stopping further additions to it, or expanding the dictionary with a corresponding increase in code width. Each of these approaches has certain disadvantages. For example, stopping the replenishment of the dictionary can lead to a situation where the dictionary stores sequences that occur at the beginning of the string being compressed but not later on. At the same time, clearing the dictionary can remove frequent sequences. Most practical implementations, once the dictionary is full, start monitoring the compression ratio, and when it drops below a certain level the dictionary is rebuilt. Below we consider the simplest implementation, which stops adding to the dictionary when it is full.

    First, let's define a dictionary as a record that stores not only the substrings encountered, but also the number of substrings stored in the dictionary:
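    The corresponding type declarations are not reproduced here, so the fragment below is a reconstruction based on how the dictionary and the encoded string are used in the listings that follow; the exact field types are our assumption.

    type
      { codes are dictionary indices, so for a dictionary of up to 256 entries a byte is enough }
      TEncodedString = array of byte;

      TDictionary = record
        Words: array of ShortString; { previously encountered subsequences }
        WordCount: integer;          { number of words currently stored }
      end;

    { The procedure InitDict(D), called by LZWEncode and LZWDecode below, is also not shown
      in the source; it is assumed to set WordCount to zero and then add every distinct
      character of the source alphabet to the dictionary as a one-character word. }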

    Previously encountered subsequences are stored in the Words array, and their code is the subsequence numbers in this array.
    We will also define the functions of searching in the dictionary and adding to the dictionary:

    const
      MAX_DICT_LENGTH = 256;

    function FindInDict(D: TDictionary; str: ShortString): integer;
    var
      r: integer;
      i: integer;
      fl: boolean;
    begin
      { returns the index of str in the dictionary, or -1 if it is absent }
      r := -1;
      fl := false;
      if D.WordCount > 0 then
      begin
        i := D.WordCount;
        while (not fl) and (i > 0) do
        begin
          i := i - 1;
          fl := D.Words[i] = str;
        end;
      end;
      if fl then
        r := i;
      FindInDict := r;
    end;

    procedure AddToDict(var D: TDictionary; str: ShortString);
    begin
      { the dictionary stops growing once it is full }
      if D.WordCount < MAX_DICT_LENGTH then
      begin
        D.WordCount := D.WordCount + 1;
        SetLength(D.Words, D.WordCount);
        D.Words[D.WordCount - 1] := str;
      end;
    end;

    Using these functions, the encoding process according to the described algorithm can be implemented as follows:
    function LZWEncode(InMsg: ShortString): TEncodedString;
    var
      OutMsg: TEncodedString;
      tmpstr: ShortString;
      D: TDictionary;
      N: byte;
    begin
      SetLength(OutMsg, length(InMsg));
      N := 0;
      InitDict(D);
      while length(InMsg) > 0 do
      begin
        { grow tmpstr while it is still found in the dictionary }
        tmpstr := InMsg[1];
        while (FindInDict(D, tmpstr) >= 0) and (length(InMsg) > length(tmpstr)) do
          tmpstr := tmpstr + InMsg[length(tmpstr) + 1];
        if FindInDict(D, tmpstr) < 0 then
          delete(tmpstr, length(tmpstr), 1);
        { output the code of the longest match }
        OutMsg[N] := FindInDict(D, tmpstr);
        N := N + 1;
        delete(InMsg, 1, length(tmpstr));
        { add the match extended by the next character to the dictionary }
        if length(InMsg) > 0 then
          AddToDict(D, tmpstr + InMsg[1]);
      end;
      SetLength(OutMsg, N);
      LZWEncode := OutMsg;
    end;

    The result of encoding will be the numbers of words in the dictionary.
    The decoding process is reduced to direct decoding of codes, and there is no need to transfer the created dictionary; it is enough that during decoding the dictionary is initialized in the same way as during encoding. Then the dictionary will be completely restored directly during the decoding process by concatenating the previous subsequence and the current symbol.

    A problem can arise only in the following situation: when a subsequence that is not yet in the dictionary has to be decoded. It is easy to see that this is possible only when the substring to be extracted is the very one that should be added at the current step. This means the substring matches the pattern cSc, i.e. it begins and ends with the same character, where cS is the substring added at the previous step. This is the only case in which a string that has not yet been added must be decoded. With this in mind, we can propose the following variant of decoding a compressed string:

    function LZWDecode(InMsg: TEncodedString): ShortString;
    var
      D: TDictionary;
      OutMsg, tmpstr: ShortString;
      i: byte;
    begin
      OutMsg := '';
      tmpstr := '';
      InitDict(D);
      for i := 0 to length(InMsg) - 1 do
      begin
        { a code not yet in the dictionary can only be the cSc special case }
        if InMsg[i] >= D.WordCount then
          tmpstr := D.Words[InMsg[i - 1]] + D.Words[InMsg[i - 1]][1]
        else
          tmpstr := D.Words[InMsg[i]];
        OutMsg := OutMsg + tmpstr;
        { rebuild the dictionary exactly as the encoder did }
        if i > 0 then
          AddToDict(D, D.Words[InMsg[i - 1]] + tmpstr[1]);
      end;
      LZWDecode := OutMsg;
    end;

    The advantages of dictionary algorithms include their greater compression efficiency compared to RLE. However, it must be understood that the actual use of these algorithms involves some implementation difficulties.

    Entropy coding

    Coding using Shannon-Fano trees

    The Shannon-Fano algorithm is one of the first compression algorithms ever developed. It is based on the idea of representing more frequent characters with shorter codes. Moreover, the codes obtained with the Shannon-Fano algorithm have the prefix property: no code is the beginning of any other code. The prefix property guarantees that decoding is unambiguous. The algorithm for constructing Shannon-Fano codes is given below:
    1. Divide the alphabet into two parts, the total probabilities of the symbols in which are as close as possible to each other.
    2. Add 0 to the prefix code of the first part of the characters, add 1 to the prefix code of the second part of the characters.
    3. For each part (which contains at least two characters), recursively perform steps 1-3.
    Despite its comparative simplicity, the Shannon-Fano algorithm is not without drawbacks, the most significant of which is the non-optimality of the resulting code. Although the partition at each step is optimal, the algorithm does not guarantee an optimal result as a whole. Consider, for example, the string “AAAABVGDEJ”. The corresponding Shannon-Fano tree and the codes obtained from it are presented in Fig. 1:

    Without encoding, the message occupies 40 bits (assuming each character is encoded with 4 bits), while with the Shannon-Fano codes it takes 4*2 + 2 + 4 + 4 + 3 + 3 + 3 = 27 bits. The message size decreased by 32.5%, but below we will show that this result can be improved significantly.
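    The splitting procedure described above translates almost directly into a recursive routine. The sketch below is our illustration rather than a listing from the monograph: it assumes the symbols are passed already sorted by descending frequency and chooses the split point greedily by accumulating frequencies.

    type
      TSymbolCodes = array of string; { Code[i] receives the code of symbol i }

    { Assigns Shannon-Fano codes to symbols Lo..Hi (sorted by descending frequency);
      Prefix is the code built so far. Call with Lo = 0, Hi = High(Freq), Prefix = ''. }
    procedure ShannonFano(const Freq: array of integer; var Code: TSymbolCodes;
                          Lo, Hi: integer; const Prefix: string);
    var
      i, Split, Total, Acc, BestDiff: integer;
    begin
      if Lo >= Hi then
      begin
        Code[Lo] := Prefix; { a single symbol: its code is complete }
        Exit;
      end;
      Total := 0;
      for i := Lo to Hi do
        Total := Total + Freq[i];
      { choose the split that makes the totals of the two parts as close as possible }
      Acc := 0;
      Split := Lo;
      BestDiff := Total;
      for i := Lo to Hi - 1 do
      begin
        Acc := Acc + Freq[i];
        if abs(Total - 2 * Acc) < BestDiff then
        begin
          BestDiff := abs(Total - 2 * Acc);
          Split := i;
        end;
      end;
      ShannonFano(Freq, Code, Lo, Split, Prefix + '0');     { first part gets 0 }
      ShannonFano(Freq, Code, Split + 1, Hi, Prefix + '1'); { second part gets 1 }
    end;

    For the frequencies of the example message (4, 1, 1, 1, 1, 1, 1) this procedure produces code lengths 2, 2, 3, 3, 3, 4, 4, i.e. the same 27-bit total as computed above.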

    Coding with Huffman Trees

    The Huffman coding algorithm, developed a few years after the Shannon-Fano algorithm, also produces prefix codes and, in addition, has provably minimal redundancy, which is what accounts for its extremely wide use. To obtain Huffman codes, the following algorithm is used:
    1. All characters of the alphabet are represented as free nodes, and the weight of the node is proportional to the frequency of the character in the message;
    2. From the set of free nodes, two nodes with minimal weight are selected and a new (parent) node is created with a weight equal to the sum of the weights of the selected nodes;
    3. The selected nodes are removed from the free list, and the parent node created on their basis is added to this list;
    4. Steps 2-3 are repeated as long as more than one node remains in the free list;
    5. Based on the constructed tree, each character of the alphabet is assigned a prefix code;
    6. The message is encoded with the received codes.

    Let's consider the same example as in the case of the Shannon-Fano algorithm. The Huffman tree and codes obtained for the message “AAAABVGDEJ” are presented in Fig. 2:

    It is easy to calculate that the encoded message will occupy 26 bits, which is less than with the Shannon-Fano algorithm. It is worth noting separately that, owing to the popularity of the Huffman algorithm, there are currently many variants of Huffman coding, including adaptive coding, which does not require the symbol frequencies to be transmitted.
    A significant share of the Huffman algorithm's drawbacks are problems related to the complexity of implementation. Storing the symbol frequencies in real-valued variables entails a loss of precision, so integer variables are often used in practice; but since the weights of the parent nodes grow continuously, sooner or later an overflow occurs. Thus, despite the simplicity of the algorithm, implementing it correctly can still cause some difficulties, especially for large alphabets.
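    As an illustration of steps 1-5, the sketch below (ours, not a listing from the monograph) builds a Huffman tree over an array of integer symbol weights and returns the resulting code lengths; the two lightest free nodes are found by a simple linear scan.

    type
      THuffNode = record
        Weight: integer;
        Left, Right: integer; { child indices, -1 for leaves }
        Alive: boolean;       { still in the list of free nodes }
      end;

    { Len[i] receives the Huffman code length of symbol i with weight Freq[i]. }
    procedure HuffmanCodeLengths(const Freq: array of integer; var Len: array of integer);
    var
      Nodes: array of THuffNode;
      Count, i, m1, m2: integer;

      function PickMin(Exclude: integer): integer;
      var
        k, best: integer;
      begin
        best := -1;
        for k := 0 to Count - 1 do
          if Nodes[k].Alive and (k <> Exclude) and
             ((best = -1) or (Nodes[k].Weight < Nodes[best].Weight)) then
            best := k;
        PickMin := best;
      end;

      procedure AssignDepth(Node, Depth: integer);
      begin
        if Nodes[Node].Left = -1 then
          Len[Node] := Depth { leaf: the code length equals the depth }
        else
        begin
          AssignDepth(Nodes[Node].Left, Depth + 1);
          AssignDepth(Nodes[Node].Right, Depth + 1);
        end;
      end;

    begin
      Count := Length(Freq);
      if Count = 0 then
        Exit;
      SetLength(Nodes, 2 * Count - 1);
      for i := 0 to Count - 1 do
      begin
        Nodes[i].Weight := Freq[i];
        Nodes[i].Left := -1;
        Nodes[i].Right := -1;
        Nodes[i].Alive := true;
      end;
      { merge the two lightest free nodes until a single root remains }
      while Count < Length(Nodes) do
      begin
        m1 := PickMin(-1);
        m2 := PickMin(m1);
        Nodes[Count].Weight := Nodes[m1].Weight + Nodes[m2].Weight;
        Nodes[Count].Left := m1;
        Nodes[Count].Right := m2;
        Nodes[Count].Alive := true;
        Nodes[m1].Alive := false;
        Nodes[m2].Alive := false;
        Count := Count + 1;
      end;
      AssignDepth(Count - 1, 0);
    end;

    For the frequencies of the “AAAABVGDEJ” example (4, 1, 1, 1, 1, 1, 1) the code lengths obtained this way sum, weighted by frequency, to 26 bits, in agreement with the figure above.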

    Coding with secant function trees

    Coding using secant functions is an algorithm developed by the authors that allows you to obtain prefix codes. The algorithm is based on the idea of ​​constructing a tree, each node of which contains a secant function. To describe the algorithm in more detail, it is necessary to introduce several definitions.
    A word is an ordered sequence of m bits (the number m is called the word capacity).
    A secant literal is a pair of the form (bit position, bit value). For example, the literal (4, 1) means that bit 4 of the word must be equal to 1. If the literal's condition is satisfied, the literal is considered true, otherwise it is false.
    A k-bit secant is a set of k literals. If all literals are true, then the secant function itself is true, otherwise it is false.
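    With these definitions, evaluating a secant function on a word comes down to a handful of bit tests. The fragment below is only a small illustration of the definitions; the type and function names are ours, and the 16-bit Word type stands in for an m-bit word.

    type
      TLiteral = record
        Bit: byte;   { bit position within the word }
        Value: byte; { required value of that bit: 0 or 1 }
      end;
      TSecant = array of TLiteral; { a k-bit secant is a set of k literals }

    { A literal (Bit, Value) is true if the given bit of W equals Value. }
    function LiteralTrue(W: word; const L: TLiteral): boolean;
    begin
      LiteralTrue := ((W shr L.Bit) and 1) = L.Value;
    end;

    { The secant function is true only if all of its literals are true. }
    function SecantTrue(W: word; const S: TSecant): boolean;
    var
      i: integer;
    begin
      SecantTrue := true;
      for i := 0 to High(S) do
        if not LiteralTrue(W, S[i]) then
        begin
          SecantTrue := false;
          Exit;
        end;
    end;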

    The tree is constructed so that each node splits the alphabet into parts that are as equal as possible. Fig. 3 shows an example of a secant tree:

    In the general case, a tree of secant functions does not guarantee optimal coding, but it provides extremely high speed thanks to the simplicity of the operation performed at each node.

    Arithmetic coding

    Arithmetic coding is one of the most effective methods of information compression. Unlike the Huffman algorithm, arithmetic coding makes it possible to encode messages using fewer than one bit per symbol when the entropy of the source is below one bit per symbol, something prefix codes cannot do. Because most arithmetic coding algorithms are protected by patents, only the basic ideas are described below.
    Let us assume that the alphabet in use contains N symbols a_1, …, a_N with frequencies p_1, …, p_N, respectively. Then the arithmetic coding algorithm looks like this:
    1. Take the half-interval [0; 1) as the working half-interval;
    2. Divide the working half-interval into N sub-intervals whose lengths are proportional to the frequencies p_1, …, p_N;
    3. Take the sub-interval corresponding to the next symbol of the message as the new working half-interval;
    4. Repeat steps 2-3 until the entire message has been processed;
    5. Output any number from the resulting half-interval as the code of the message.
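    To make the interval-narrowing idea concrete, the sketch below (ours) encodes a message directly from the description above using floating-point arithmetic. It assumes that every character of Msg occurs in Symbols and that the probabilities in Prob sum to 1. Practical coders work with integer intervals and renormalization instead, precisely to avoid the precision problems that floating point runs into on long messages.

    { Illustration only: returns a number inside the final half-interval for Msg,
      given the alphabet Symbols and the matching probabilities Prob. }
    function ArithmeticEncode(const Msg: ShortString; const Symbols: ShortString;
                              const Prob: array of double): double;
    var
      LowBound, HighBound, Range, CumLow: double;
      i, j: integer;
    begin
      LowBound := 0.0;
      HighBound := 1.0;
      for i := 1 to length(Msg) do
      begin
        Range := HighBound - LowBound;
        CumLow := 0.0;
        { find the sub-interval of the current symbol and make it the working interval }
        for j := 1 to length(Symbols) do
        begin
          if Symbols[j] = Msg[i] then
          begin
            HighBound := LowBound + Range * (CumLow + Prob[j - 1]);
            LowBound := LowBound + Range * CumLow;
            break;
          end;
          CumLow := CumLow + Prob[j - 1];
        end;
      end;
      ArithmeticEncode := (LowBound + HighBound) / 2; { any number from the final half-interval will do }
    end;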