• What parameters determine the quality of digital sound; what formats digital sound is stored in. The quality of digitized sound depends on the sampling rate and the encoding depth.

    Sound is a wave that propagates, most often in air, water or another medium, with continuously changing intensity and frequency.

    A person perceives sound waves (air vibrations) by means of hearing, distinguishing loudness and pitch.

    The greater the intensity of a sound wave, the louder the sound; the higher the frequency of the wave, the higher the pitch.


    The dependence of loudness, as well as the pitch of the sound, on the intensity and frequency of the sound wave

    Hertz (Hz) is the unit of measure for the frequency of periodic processes (for example, oscillations). 1 Hz means one occurrence of such a process per second: 1 Hz = 1/s.

    If we have 10 Hz, this means ten occurrences of such a process per second.

    The human ear can perceive sound at frequencies ranging from 20 vibrations per second (20 Hertz, low sound) to 20,000 vibrations per second (20 kHz, high sound).

    In addition, a person can perceive sound over a wide range of intensities, in which the maximum intensity is 10^14 times greater than the minimum (one hundred thousand billion times).

    To measure the loudness of sound, a special unit, the decibel (dB), is used.

    A decrease or increase in sound volume by 10 dB corresponds to a decrease or increase in sound intensity by 10 times.

    Sound volume in decibels


    In order for computer systems to process audio, a continuous audio signal must be converted into a digital, discrete form using temporal sampling.

    To do this, a continuous sound wave is divided into separate small time sections, and for each such section a certain value of sound intensity is assigned.

    Thus, the continuous dependence of sound loudness on time A(t) is replaced by a discrete sequence of loudness levels. On the graph, this looks like replacing a smooth curve with a sequence of "steps".


    Temporal Audio Sampling


    A microphone connected to the sound card is used to record analog audio and convert it to digital form.

    The more densely the discrete steps are spaced on the graph, the more accurately the original sound can be reconstructed.

    The quality of the received digital sound depends on the number of measurements of the sound volume level per unit of time, i.e. the sampling frequency.

    Audio sampling rate is the number of measurements of sound volume in one second.

    The more measurements are made in one second (the higher the sampling rate), the more accurately the "ladder" of the digital audio signal repeats the curve of the analog signal.

    Each "step" on the graph is assigned a certain value of the sound volume level. The volume levels can be thought of as a set of possible states N (gradations), whose encoding requires a certain amount of information I, called the audio encoding depth.

    Audio encoding depth is the amount of information needed to encode discrete digital audio loudness levels.

    If the encoding depth is known, then the number of digital sound volume levels can be calculated using the general formula N = 2^I.

    For example, if the audio encoding depth is 16 bits, then the number of audio volume levels is:

    N = 2^I = 2^16 = 65,536.

    During the encoding process, each sound volume level is assigned its own 16-bit binary code: the lowest sound level corresponds to the code 0000000000000000, and the highest to 1111111111111111.
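
    As a quick check of the formula (a minimal sketch, not part of the original text; the function name is illustrative):

    // Number of distinguishable loudness levels for a given encoding depth I (in bits)
    var levelsForDepth = function(bits) {
        return Math.pow(2, bits);
    };
    console.log(levelsForDepth(8));   // 256
    console.log(levelsForDepth(16));  // 65536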

    Quality of digitized sound


    So, the higher the sampling rate and the encoding depth, the better the quality of the digitized sound and the closer it comes to the original sound.

    The lowest quality of digitized audio, corresponding to the quality of telephone communication, is obtained at a sampling rate of 8000 times per second, sampling depth of 8 bits and recording of one audio track ("mono" mode).

    The highest quality of digitized sound, corresponding to the quality of an audio CD, is achieved with a sampling rate of 48,000 times per second, a sampling depth of 16 bits and recording of two audio tracks (stereo mode).

    It must be remembered that the higher the quality of digital sound, the greater the information volume of the sound file.

    You can easily estimate the information volume of a 1 second digital stereo sound file with medium sound quality (16 bits, 24,000 samples per second). To do this, the encoding depth must be multiplied by the number of measurements per 1 second and multiplied by 2 channels (stereo sound):

    16 bits × 24,000 × 2 = 768,000 bits = 96,000 bytes = 93.75 KB.
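
    The same estimate can be written as a short calculation (a sketch; the function name is illustrative):

    // Size of an uncompressed recording: depth (bits) x sample rate (Hz) x channels x seconds
    var soundFileSizeBytes = function(bitsPerSample, sampleRate, channels, seconds) {
        return (bitsPerSample * sampleRate * channels * seconds) / 8;
    };
    var bytes = soundFileSizeBytes(16, 24000, 2, 1);
    console.log(bytes, bytes / 1024);  // 96000 bytes, 93.75 KB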

    Sound editors


    Sound editors allow you not only to record and play sound, but also to edit it. Among the best known are Sony Sound Forge, Adobe Audition, GoldWave and others.

    Digitized sound is presented in sound editors in a clear visual form, so the operations of copying, moving and deleting parts of an audio track can be easily performed using a computer mouse.

    In addition, you can overlay audio tracks on top of each other (mix sounds) and apply various acoustic effects (echo, reverse playback, etc.).

    Sound editors allow you to change the quality of digital sound and the size of the resulting sound file by changing the sampling rate and the encoding depth. Digitized audio can be saved uncompressed as audio files in the universal WAV format (Microsoft's format) or in compressed formats such as OGG or MP3 (lossy).
    Less common but noteworthy lossless-compression formats are also available.

    When sound is saved in a lossy compressed format, low-intensity frequency components that are imperceptible to human hearing ("redundant" components, for example those coinciding in time with high-intensity components) are discarded. Using such formats allows sound files to be compressed by tens of times, but leads to an irreversible loss of information (the files cannot be restored to their original form).

    Bits, hertz, shaped dithering...

    What is hidden behind these concepts? When the audio compact disc (CD Audio) standard was developed, the values 44.1 kHz, 16 bits and 2 channels (i.e. stereo) were chosen. Why exactly these? What motivated this choice, and why are there attempts to raise these values to, say, 96 kHz and 24 or even 32 bits?

    Let's deal first with the sampling resolution - that is, with the bit depth. It so happens that one has to choose between 16, 24 and 32 bits. Intermediate values would, of course, be more convenient in terms of sound, but are too awkward for digital hardware (a rather controversial statement, given that many ADCs have 11- or 12-bit digital outputs - ed. note).

    What is this parameter responsible for? In a nutshell - for the dynamic range: the range of simultaneously reproducible volumes, from the maximum amplitude (0 decibels) down to the smallest amplitude the resolution allows, for example about minus 93 decibels for 16-bit audio. Oddly enough, this is closely related to the noise level of the recording. In principle, 16-bit audio can carry signals with a level of -120 dB, but such signals are hard to use in practice because of a fundamental concept: quantization noise. When taking digital values we constantly make errors, rounding the real analog value to the nearest available digital value. The smallest possible error is zero, and the largest is half of the last digit (the least significant bit, hereinafter LSB). This error gives us the so-called quantization noise - a random discrepancy between the digitized signal and the original. This noise is always present and has a maximum amplitude equal to half the least significant bit; it can be thought of as random values mixed into the digital signal. It is sometimes also called rounding noise (quantization noise is the more accurate name, since the encoding of amplitude is called quantization, while sampling is the process of converting a continuous signal into a discrete, pulse sequence - ed. note).

    Let us dwell in more detail on what is meant by signal level measured in bits. The strongest signal in digital sound processing is usually taken as 0 dB; it corresponds to all bits set to 1. If the most significant bit is set to zero, the resulting digital value is half as large, which corresponds to a level drop of 6 decibels (20 * log10(2) ≈ 6). Thus, zeroing bits from the most significant to the least significant, we reduce the signal level by six decibels per bit. Clearly, the minimum signal level (a one in the least significant bit and zeros everywhere else) is minus (N - 1) * 6 decibels, where N is the bit depth of the sample. For 16 bits this gives a weakest-signal level of minus 90 decibels.

    When we say "half the LSB" we do not mean -90/2, but half a step down towards the next bit - that is, roughly another 3 decibels lower: about minus 93 decibels.
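
    These figures are easy to check numerically (a small sketch; the function names are illustrative, and 6.02 dB per bit is simply 20·log10(2)):

    var DB_PER_BIT = 20 * Math.log(2) / Math.log(10);          // ~6.02 dB
    var weakestSignalDb = function(bits) {
        return -(bits - 1) * DB_PER_BIT;                        // a one in the LSB only
    };
    var quantizationNoiseDb = function(bits) {
        return weakestSignalDb(bits) - DB_PER_BIT / 2;          // half a step below the LSB
    };
    console.log(weakestSignalDb(16));      // ~ -90 dB
    console.log(quantizationNoiseDb(16));  // ~ -93 dB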

    Let us return to the choice of digitization resolution. As already mentioned, digitization introduces noise at the level of half the least significant bit, which means that a recording digitized at 16 bits hisses constantly at minus 93 decibels. It can carry even quieter signals, but the noise still stays at -93 dB. This is how the dynamic range of digital sound is defined: the lower boundary of the range lies where the signal-to-noise ratio turns into noise-to-signal (there is more noise than useful signal). Thus the main criterion for choosing a digitization resolution is: how much noise can we afford in the restored signal? The answer partly depends on how much noise was in the original recording. An important consequence: if we are digitizing material with noise at minus 80 decibels, there is no reason whatsoever to digitize it at more than 16 bits, because on the one hand the -93 dB quantization noise adds very little to the already (comparatively) huge -80 dB noise, and on the other hand everything quieter than -80 dB in the recording itself is already noise rather than signal, and there is simply no need to digitize and transmit it.

    In theory, this is the only criterion for choosing a digitization resolution; beyond that we introduce no distortions or inaccuracies at all. Practice, oddly enough, repeats the theory almost exactly. This is what guided the people who chose 16-bit resolution for audio CDs. Noise at minus 93 decibels is a rather good figure that corresponds almost exactly to the conditions of our perception: the difference between the pain threshold (140 decibels) and the usual background noise of a city (30-50 decibels) is only about a hundred decibels, and since nobody listens to music at a volume bordering on pain - which narrows the range even further - it turns out that the real noise of the room, or even of the equipment, is much stronger than the quantization noise. If we could hear levels below minus 90 decibels in a digital recording, we would hear and perceive the quantization noise; otherwise we would simply never be able to tell whether the audio is digitized or live - in terms of dynamic range there is no other difference. But in principle a person can meaningfully hear over a range of about 120 decibels, and it would be nice to preserve that whole range, which 16 bits alone seem unable to do.

    But that is only at first glance: with the help of a special technique called shaped dithering, the frequency spectrum of the quantization noise can be changed, pushing it almost entirely into the region above 7-15 kHz. We effectively trade frequency resolution (giving up the reproduction of quiet high frequencies) for additional dynamic range in the remaining frequency band. Combined with a peculiarity of our hearing - our sensitivity in the sacrificed high-frequency region is tens of dB lower than in the main region (2-4 kHz) - this makes it possible to carry relatively noise-free useful signals another 10-20 dB quieter than -93 dB; thus the dynamic range of 16-bit sound, as perceived by a person, is about 110 decibels. And in any case, a person simply cannot hear a sound 110 decibels quieter than a loud sound heard a moment before: the ear, like the eye, adapts to the loudness of its surroundings, so the instantaneous range of our hearing is relatively small - about 80 decibels. We will discuss dithering in more detail after covering the frequency aspects.

    For CDs, the sampling rate is 44100 Hz. There is an opinion (based on a misreading of the Kotelnikov-Nyquist theorem) that all frequencies up to 22.05 kHz are reproduced, but this is not quite true. The only thing we can state unambiguously is that the digitized signal contains no frequencies above 22.05 kHz. The real picture of digitized-sound reproduction always depends on the specific hardware and is never as perfect as we would like, or as the theory suggests. Everything depends on the specific DAC (the digital-to-analog converter that produces the audio signal from the digital sequence).

    Let's first figure out what we would like to get. A person of average age (more precisely, a young one) can sense sounds from 10 Hz to 20 kHz and meaningfully hear from 30 Hz to 16 kHz. Sounds above and below this are perceived but do not form an acoustic sensation. Sounds above 16 kHz are felt as an annoying, unpleasant factor - pressure on the head, pain; especially loud ones cause such sharp discomfort that you want to leave the room. The unpleasant sensations are so strong that security devices are based on them: a few minutes of very loud high-frequency sound will drive anyone mad, and stealing anything in such an environment becomes quite impossible. Sounds below 30-40 Hz of sufficient amplitude are perceived as vibration coming from objects (the speakers) - or rather, simply as vibration. A person can barely localize such low sounds acoustically, so other senses take over: we feel these sounds with the body.

    With high frequencies things are somewhat worse, or at least more complicated. Almost the entire point of the improvements and complications in DACs and ADCs is a more faithful transmission of high frequencies - by "high" we mean frequencies comparable to the sampling frequency, i.e., in the case of 44.1 kHz, roughly 7-10 kHz and above.

    Imagine a sinusoidal signal with a frequency of 14 kHz digitized at a sampling rate of 44.1 kHz. There are only about three points (samples) per period of the input sinusoid, and to restore the original frequency as a sinusoid one has to show some imagination. The reconstruction of the waveform from the samples happens in the DAC; this is the job of the reconstruction filter. And while relatively low frequencies come out as almost ready-made sinusoids, the shape - and hence the quality - of the reconstructed high frequencies rests entirely on the conscience of the DAC's reconstruction system. Thus, the closer the signal frequency is to half the sampling frequency, the harder it is to restore the signal's shape.

    This is the main problem in reproducing high frequencies. The problem, however, is not as bad as it might seem. All modern DACs use oversampling (multirate) technology, which consists in digitally upsampling the signal to a several times higher sampling rate and only then converting it to analog at the increased rate. Thus the burden of reconstructing high frequencies is shifted onto digital filters, which can be of very high quality - so high that in expensive devices the problem is removed entirely and undistorted reproduction of frequencies up to 19-20 kHz is achieved. Resampling is also used in fairly inexpensive devices, so in principle this problem can be considered solved there as well. Devices in the $30-$60 range (sound cards), or music centers up to $600 whose DACs are usually similar to those sound cards, reproduce frequencies up to 10 kHz excellently, up to 14-15 kHz tolerably, and the rest after a fashion. This is quite enough for most real musical applications, and whoever needs more quality will find it in professional-grade devices, which are not that much more expensive - they are simply made intelligently.

    Returning to dithering, let's see how we can usefully extend the dynamic range beyond 16 bits.

    The idea of dithering is to mix noise into the signal. Strange as it may sound, in order to reduce noise and unpleasant quantization effects we add our own noise. Consider an example - let's use CoolEdit's ability to work in 32 bits. 32 bits is 65 thousand times more precise than 16 bits, so in our case the 32-bit data can be considered the analog original, and converting it to 16 bits counts as digitization. Let the highest level in the original 32-bit sound correspond to minus 110 decibels. This is well below the dynamic range of 16-bit audio, whose weakest representable level corresponds to minus 90 decibels; therefore, if we simply round the data to 16 bits, we get complete digital silence.

    Let's add "white" noise (broadband and uniform over the whole frequency band) to the signal at a level of minus 90 decibels, roughly corresponding to the level of the quantization noise. If we now convert this mixture of signal and white noise to 16 bits (only integer values are possible - 0, 1, -1, ...), it turns out that some part of the signal survives: where the original signal was higher there are more ones, and where it was lower, more zeros.

    For experimental verification of the above method, you can use the Cool Edit audio editor (or any other that supports 32-bit format). To hear what happens, you should amplify the signal by 14 bits (by 78 dB).

    The result is noisy 16-bit audio containing the original signal, which was at minus 110 decibels. In principle this is the standard way of extending dynamic range, and it often happens almost by itself - there is enough noise everywhere. In itself, however, this is rather pointless: the quantization noise stays at the same level, and transmitting a signal weaker than the noise does not make much sense from the point of view of logic... (A rather mistaken opinion, since transmitting a signal whose level is below the noise level is one of the fundamental methods of data encoding. - ed. note)
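
    A rough numerical sketch of the plain dithering just described (the scaling and names are illustrative and not taken from any particular editor):

    // Quantize a sample in the range -1..1 to a 16-bit integer, with optional dither
    var quantize16 = function(x, dither) {
        var step = 1 / 32768;                       // one LSB of a 16-bit signal
        if (dither) {
            x += (Math.random() - 0.5) * step;      // white noise about one LSB wide
        }
        return Math.round(x * 32768);
    };

    // A sine at -110 dBFS: far below the 16-bit quantization step
    var amp = Math.pow(10, -110 / 20);
    var plain = [], dithered = [];
    for (var n = 0; n < 1000; n++) {
        var s = amp * Math.sin(2 * Math.PI * 440 * n / 44100);
        plain.push(quantize16(s, false));           // always 0 - digital silence
        dithered.push(quantize16(s, true));         // occasional +1/-1 values whose density still follows the signal
    }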

    A more sophisticated method, shaped dithering, relies on the fact that since we cannot hear high frequencies in very quiet sounds anyway, the bulk of the noise power should be directed into those frequencies; the noise level can then even be higher - here I will use a level of 4 quantization steps (two least significant bits of a 16-bit signal). We convert the resulting mixture of 32-bit signal and noise into a 16-bit signal, filter out the high frequencies (which a person does not really perceive by ear) and raise the signal level so that the result can be assessed.

    This is already quite decent sound transmission (for an extremely low volume): the noise is roughly equal in power to the signal itself, whose initial level was minus 110 decibels! An important note: we raised the real quantization noise from half of the least significant bit (-93 dB) to several least significant bits (-84 dB), while lowering the audible quantization noise from -93 dB to roughly -110 dB. The signal-to-noise ratio got worse, but the noise moved into the high-frequency region and stopped being audible, which gave a substantial improvement in the real (humanly perceived) signal-to-noise ratio.

    (In other words, since the noise power is "smeared" over the frequency range without omitting the upper frequencies, we take part of that power away from the audible band, and as a result the signal-to-noise ratio in the time-domain representation of the signal improves. - ed. note)

    In practice this already corresponds to the quantization-noise level of 20-bit audio. The only requirement of this technique is that there be frequencies to carry the noise. 44.1 kHz audio makes it possible to place the noise in the 10-20 kHz region, which is inaudible at low volume. But if you digitize at 96 kHz, the frequency region available for (humanly inaudible) noise becomes so large that shaped dithering really turns 16 bits into a full 24.

    [Note: the PC Speaker is a one-bit device, but with a fairly high maximum sampling rate (of switching that single bit on and off). Using a process similar in essence to dithering - more precisely, pulse-width modulation - quite decent digital sound was played on it: 5-8 bits of low-frequency resolution were extracted from one bit and a high sampling rate, relying on the hardware's inability to reproduce such high frequencies and on our inability to hear them. The audible part of that noise - a slight high-frequency whistle - could, however, be heard.]

    Thus, shaped dithering makes it possible to significantly lower the already low quantization noise of 16-bit audio, quietly extending the useful (noise-free) dynamic range over the entire region of human hearing. Since shaped dithering is now always applied when converting from the 32-bit working format to the final 16 bits for a CD, our 16 bits are entirely sufficient for a complete transfer of the sound picture.

    It should be noted that this technique operates only at the stage of preparing material for playback. When processing high-quality sound it is simply necessary to stay in 32 bits, rather than dithering the result back down to 16 bits after every operation. But if the noise level of the recording is above minus 60 decibels, all processing can be done in 16 bits without the slightest pang of conscience. Intermediate dithering will ensure the absence of rounding distortions, and the noise it adds will be hundreds of times weaker than the existing noise and therefore completely irrelevant.

    Q:
    Why is it said that 32-bit sound is better than 16-bit?
    A1: They are wrong.
    A2: [They mean something slightly different: when processing or recording sound, a higher resolution must be used - and it always is. But in sound as a finished product, a resolution of more than 16 bits is not required.]
    Q: Does it make sense to increase the sampling rate (e.g. to 48 kHz or to 96)?
    A1: It does not. With even a minimally competent DAC design, 44 kHz conveys the entire required frequency range.
    A2: [They mean something slightly different: it makes sense, but only when processing or recording sound.]
    Q: Why, then, do higher sampling rates and bit depths keep being introduced?
    A1: It is important for progress to keep moving. Where to and why is not so important...
    A2: Many processes are simply easier this way. If, for example, a device is going to process the sound, it is easier for it to do so at 96 kHz / 32 bits. Almost all DSPs use 32 bits for sound processing, and being able to forget about conversions means easier development and still a slight gain in quality. In general, sound intended for further processing does make sense to store at a resolution higher than 16 bits. For hi-end devices that only play sound, it makes no difference at all.
    Q: Are 32-, 24- or even 18-bit DACs better than 16-bit ones?
    A: In general, no. Conversion quality does not depend on bit depth at all. The AC'97 codec (a modern sound card under $50) uses an 18-bit converter, while $500 cards, whose sound cannot even be compared with that nonsense, use 16-bit ones. For playing 16-bit audio it makes absolutely no difference.
    It is also worth bearing in mind that most DACs actually reproduce fewer bits than they accept. For example, the real noise floor of a typical cheap codec is -90 dB, which is 15 bits, and even if the codec itself is 24-bit, you will get no benefit from the "extra" 9 bits - the result of their work, even if present, would drown in the codec's own noise. Most cheap devices simply ignore the extra bits: they are not really taken into account in the sound-synthesis process, even though they arrive at the DAC's digital input.
    Q: And for recording?
    A: For recording it is better to have an ADC with a higher bit depth - again, a higher real bit depth. The bit depth of the ADC should match the noise level of the original recording, or simply be sufficient to reach the desired low noise level.
    It is also handy to have a little extra bit depth so that the larger dynamic range can absorb less precise control of the recording level. But remember - you must always stay within the codec's real range. In reality, a 32-bit ADC, for example, is almost completely meaningless, since its lowest ten or so bits will simply produce continuous noise: noise as low as -200 dB simply cannot exist in an analog music source.

    You should not expect better quality from sound with increased bit depth or sampling rate compared to CD. 16 bit / 44 kHz, pushed to the limit with shaped dithering, is quite capable of fully conveying the information we are interested in, as long as we are not talking about the sound-processing stage. Do not waste space on extra data in finished material, and do not expect superior sound quality from DVD-Audio with its 96 kHz / 24 bits. With a competent approach, sound created in the standard CD format has a quality that simply needs no further improvement, and the responsibility for correctly encoding the final data has long been taken over by well-developed algorithms and the people who know how to use them. For the last few years you will hardly find a new disc made without shaped dithering and other techniques for pushing sound quality to the limit. Yes, for lazy or simply capricious people it may be more convenient to hand over finished material at 32 bits and 96 kHz, but is that really worth several times more audio data?..

    Sound information. Sound is a wave propagating in air, water or other medium with a continuously changing intensity and frequency.

    A person perceives sound waves (air vibrations) with the help of hearing in the form of sound of various volumes and tones. The greater the intensity of the sound wave, the louder the sound, the greater the frequency of the wave, the higher the tone of the sound (Fig. 1.1).

    Fig. 1.1. The dependence of the loudness and pitch of the sound on the intensity and frequency of the sound wave

    The human ear perceives sound at frequencies ranging from 20 vibrations per second (low sound) to 20,000 vibrations per second (high sound).

    A person can perceive sound over a huge range of intensities, in which the maximum intensity is 10^14 times greater than the minimum (one hundred thousand billion times). To measure the loudness of sound, a special unit, the decibel (dB), is used (Table 5.1). A decrease or increase in sound volume by 10 dB corresponds to a decrease or increase in sound intensity by a factor of 10.

    Table 5.1. Sound loudness
    Sound                                          Loudness, dB
    Lower limit of sensitivity of the human ear    0
    Rustle of leaves                               10
    Conversation                                   60
    Car horn                                       90
    Jet engine                                     120
    Pain threshold                                 140
    Time sampling of sound. In order for a computer to process audio, a continuous audio signal must be converted into discrete digital form using time sampling. A continuous sound wave is divided into separate small time sections, for each such section a certain value of sound intensity is set.

    Thus, the continuous dependence of sound loudness on time A(t) is replaced by a discrete sequence of loudness levels. On the graph, this looks like replacing a smooth curve with a sequence of "steps" (Fig. 1.2).

    Fig. 1.2. Temporal Audio Sampling

    Sampling frequency. A microphone connected to the sound card is used to record analog audio and convert it into digital form. The quality of the resulting digital sound depends on the number of measurements of the sound level per unit of time, i.e. on the sampling frequency. The more measurements made in 1 second (the higher the sampling frequency), the more accurately the "ladder" of the digital audio signal follows the curve of the analog signal.

    The audio sample rate is the number of measurements of the sound volume in one second.

    The audio sampling rate can range from 8,000 to 48,000 sound volume measurements per second.

    Audio encoding depth. Each "step" is assigned a certain value of the sound volume level. Sound loudness levels can be considered as a set of possible states N, for encoding of which a certain amount of information I is needed, which is called the sound coding depth.

    Audio encoding depth is the amount of information needed to encode discrete digital audio loudness levels.

    If the coding depth is known, then the number of digital sound volume levels can be calculated using the formula N = 2^I. Let the audio encoding depth be 16 bits, then the number of audio loudness levels is:

    N = 2^I = 2^16 = 65,536.

    During the encoding process, each sound volume level is assigned its own 16-bit binary code, the lowest sound level will correspond to the code 0000000000000000, and the highest - 1111111111111111.

    The quality of the digitized sound. The higher the sampling rate and the encoding depth, the better the quality of the digitized sound. The lowest quality of digitized audio, corresponding to the quality of telephone communication, is obtained at a sampling rate of 8000 times per second, a sampling depth of 8 bits and recording of one audio track ("mono" mode). The highest quality of digitized sound, corresponding to the quality of an audio CD, is achieved with a sampling rate of 48,000 times per second, a sampling depth of 16 bits and recording of two audio tracks (stereo mode).

    It must be remembered that the higher the quality of digital sound, the greater the information volume of the sound file. You can estimate the information volume of a digital stereo sound file with a sound duration of 1 second with an average sound quality (16 bits, 24,000 measurements per second). To do this, the encoding depth must be multiplied by the number of measurements per 1 second and multiplied by 2 (stereo sound):

    16 bits × 24,000 × 2 = 768,000 bits = 96,000 bytes = 93.75 KB.

    Sound editors. Sound editors allow you not only to record and play sound, but also to edit it. Digitized sound is presented in sound editors in a visual form, so the operations of copying, moving and deleting parts of the audio track can be easily performed using the mouse. In addition, you can overlay audio tracks on top of each other (mix sounds) and apply various acoustic effects (echo, reverse playback, etc.).

    Lesson

    Analog and discrete ways to represent sound

    Information, including graphics and sound, can be presented in analog or discrete form.

    An example of analog audio storage is a vinyl record (its groove changes shape continuously), while an example of discrete storage is an audio CD (whose sound track contains areas with different reflectivity).

    Human perception of sound

    Sound waves are captured by the auditory organ and cause irritation in it, which is transmitted through the nervous system to the brain, creating a sensation of sound.

    The vibrations of the tympanic membrane, in turn, are transmitted to the inner ear and irritate the auditory nerve. This is how a person perceives sound.

    Hertz (Hz) - the unit of measurement of oscillation frequency. 1 Hz = 1/s

    The human ear can perceive sound at frequencies ranging from 20 vibrations per second (20 Hertz, low sound) to 20,000 vibrations per second (20 kHz, high sound).

    Analog sound is continuous sound.

    Audio encoding

    In order for the computer to process sound, the continuous sound signal must be converted into a sequence of electrical impulses (binary zeros and ones).

    In the process of encoding a continuous audio signal, its temporal sampling is performed. A continuous sound wave is divided into separate small time sections, and for each such section a certain amplitude value is set.

    Thus, in binary encoding a continuous audio signal is replaced by a sequence of discrete signal levels.

    Fig. Temporal Audio Sampling

    Thus, the continuous dependence of the signal amplitude on time A(t) is replaced by a discrete sequence of loudness levels.

    On the graph, this looks like replacing a smooth curve with a sequence of "steps":

    Each "step" is assigned the value of the sound volume level, its code (1, 2, 3, and so on).

    Sound volume levels can be considered as a set of possible states, respectively, the more volume levels will be allocated in the coding process, the more information will be carried by the value of each level and the better the sound will be.

    Modern sound cards provide a 16-bit audio encoding depth. The number of different signal levels (states for a given encoding) can be calculated using the formula N = 2^i = 2^16 = 65,536, where i is the encoding depth.

    Thus, modern sound cards can encode 65536 signal levels. Each value of the amplitude of the audio signal is assigned a 16-bit code.

    The number of measurements per second can range from 8,000 to 48,000, i.e. the sampling rate of the analog audio signal can take values from 8 to 48 kHz. At a frequency of 8 kHz, the quality of the sampled audio signal corresponds to that of a radio broadcast, and at 48 kHz to the quality of an audio CD. Both mono and stereo modes are possible.

    TASK 1.

    You can estimate the information volume of a stereo audio file with a duration of 1 second at high sound quality (16 bits, 48 kHz). To do this, the number of bits per sample must be multiplied by the number of samples per second and multiplied by 2 (stereo):

    Solution: 16 bits × 48,000 × 2 = 1,536,000 bits = 192,000 bytes = 187.5 KB.

    TASK 2.

    Estimate the information volume of a digital stereo sound file with a duration of 1 minute with an average sound quality (16 bits, 24 kHz).

    Solution: 16 bits × 24,000 × 2 × 60 = 46,080,000 bits = 5,760,000 bytes = 5625 KB ≈ 5.5 MB

    The standard Sound Recorder application plays the role of a digital tape recorder and allows you to record sound, that is, to sample audio signals, and to save them in sound files in WAV format. The program also lets you edit sound files, mix them (overlay them on one another) and play them back.

    The quality of binary encoding of an image or sound is determined by the sampling rate and the encoding depth.

    Homework - solve the problems:

    1. Determine the number of signal levels of a 24-bit sound card.

    2. Can a song fit on a 1.44 MB floppy disk if it is 3 minutes of stereo sound with a quality of 16 bits, 16 kHz?


    We learned quite a lot about all this while working on our project, and today I will try to explain, in simple terms, some of the basic concepts you need to know when dealing with digital sound processing. This article contains no serious mathematics like fast Fourier transforms and the rest - those formulas are easy to find on the net. I will describe the essence and meaning of the things you will have to deal with.

    Digitization, or there and back

    First of all, let's figure out what a digital signal is, how it is obtained from an analog one, and where the analog signal actually comes from. The latter can be defined, as simply as possible, as voltage fluctuations caused by the vibrations of a microphone's membrane.

    Fig. 1. Sound waveform

    This is an oscillogram of sound - this is what an audio signal looks like; I think everyone has seen pictures like this at least once. To understand how an analog signal is converted into a digital one, imagine the waveform drawn on graph paper. For each vertical line, find the point where it crosses the waveform and the nearest integer value on the vertical scale - the set of such values is the simplest record of a digital signal.

    [The original article included an interactive example showing how waves of different frequencies add together and how digitization occurs.]

    In reality, to create a stereo effect, not one but several channels are usually recorded at once. Depending on the storage format used, they may be stored independently, or the signal levels may be recorded as the difference between the level of the main channel and that of the current one.

    The reverse conversion from a digital signal to an analog one is carried out using digital-to-analog converters, which can have various designs and operating principles. I will omit a description of those principles in this article.

    Sampling

    As you know, a digital signal is a set of signal-level values recorded at specified time intervals. The process of converting a continuous analog signal into a digital one is called sampling (in time and in level). There are two main characteristics of a digital signal: the sampling rate and the sampling (bit) depth.

    In the figure, green shows a frequency component whose frequency is above the Nyquist frequency (half the sampling rate). When such a component is digitized, not enough data is recorded to describe it correctly; as a result, an entirely different signal is obtained on playback - the yellow curve.
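
    A small sketch of this aliasing effect (the frequencies are chosen purely for illustration): a 30 kHz cosine sampled at 44.1 kHz produces exactly the same samples as a 14.1 kHz cosine, so on playback the lower frequency is heard instead.

    var FS = 44100;                                   // sampling rate
    var sample = function(freq, n) {
        return Math.cos(2 * Math.PI * freq * n / FS); // n-th sample of a cosine at freq Hz
    };
    // A 30 kHz component (above the 22.05 kHz Nyquist frequency) gives exactly the same
    // samples as a 14.1 kHz component - the aliased "yellow curve"
    for (var n = 0; n < 5; n++) {
        console.log(sample(30000, n).toFixed(6), sample(44100 - 30000, n).toFixed(6));
    }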

    Signal level

    To begin with, it should be understood right away that for a digital signal one can only speak of a relative signal level. The absolute level depends primarily on the playback equipment and is directly proportional to the relative one. Relative signal levels are customarily expressed in decibels, with a signal of the maximum possible amplitude at the given sampling depth taken as the reference point. This level is written as 0 dBFS (dB - decibel, FS = Full Scale). Lower signal levels are written as -1 dBFS, -2 dBFS, and so on; higher levels obviously do not exist (we took the highest possible level as the reference).

    At first it can be hard to see how decibels relate to the actual signal level. In fact it is simple: every ~6 dB (more precisely, 20 log10(2) ≈ 6.02 dB) indicates a change in signal level by a factor of two. So when we speak of a signal at -12 dBFS, we understand that its level is four times lower than the maximum, at -18 dBFS eight times lower, and so on. If you look at the definition of the decibel, it uses a factor of 10 - so where does 20 come from? The decibel is ten times the decimal logarithm of the ratio of two like energy quantities, and amplitude is not an energy quantity, so it must first be converted into one. The power carried by waves of different amplitudes is proportional to the square of the amplitude, so for amplitudes (all other conditions being equal) the formula can be written as L = 10 log10((A1/A0)^2) = 20 log10(A1/A0) dB.

    N.B. It is worth mentioning that the logarithm here is base 10, whereas in most libraries the function called log is the natural logarithm.

    At different sampling depths the signal level on this scale does not change: a -6 dBFS signal remains a -6 dBFS signal. One characteristic does change, though - the dynamic range. The dynamic range of a signal is the difference between its minimum and maximum values. It is calculated by the formula 20 log10(2^n), where n is the sampling depth (for rough estimates you can use the simpler formula n * 6). For 16 bits it is ~96.33 dB, for 24 bits ~144.49 dB. This means that the largest level difference that can be described at a 24-bit sampling depth (144.49 dB) is 48.16 dB greater than the largest level difference at 16 bits (96.33 dB). In addition, the quantization noise at 24 bits is 48 dB quieter.
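
    The same formula as a quick check (a sketch; the function name is illustrative):

    var dynamicRangeDb = function(bits) {
        return 20 * Math.log(Math.pow(2, bits)) / Math.log(10);  // 20*log10(2^n)
    };
    console.log(dynamicRangeDb(16));  // ~96.33 dB
    console.log(dynamicRangeDb(24));  // ~144.49 dB
    console.log(16 * 6, 24 * 6);      // rough estimate: 96, 144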

    Perception

    When we talk about human perception of sound, we must first understand how people perceive it. Obviously, we hear with our ears: sound waves interact with the eardrum and displace it, the vibrations are transmitted to the inner ear, where they are picked up by receptors. How far the eardrum moves depends on a characteristic called sound pressure. The perceived loudness, however, depends on the sound pressure not directly but logarithmically, so when talking about loudness it is customary to use the relative SPL (sound pressure level) scale, whose values are also expressed in decibels. It is also worth noting that the perceived loudness of a sound depends not only on the sound pressure level but also on the frequency of the sound:

    Volume

    The simplest example of sound processing is changing its volume: the signal level is simply multiplied by some fixed value. However, even in something as simple as volume adjustment there is a pitfall. As noted earlier, perceived loudness depends on the logarithm of the sound pressure, so a linear volume scale is not very effective. With a linear scale two problems arise at once: for a noticeable change in volume when the slider is above the middle of the scale you have to move it quite far, while near the very bottom of the scale a shift of less than a hair's breadth can change the volume by a factor of two (I think everyone has run into this). To solve this problem, a logarithmic volume scale is used: moving the slider by a fixed distance anywhere along its length then changes the volume by the same factor. Professional recording and processing equipment, as a rule, uses exactly such a logarithmic scale.

    Mathematics

    Here I will, perhaps, return to the mathematics a little, because implementing a logarithmic scale is not as simple and obvious as it may seem, and finding the formula on the Internet is not as easy as one would like. At the same time I'll show how easy it is to convert volume values to dBFS and back. This will be useful for the further explanation.

    // Minimum volume value - at this level, the sound is turned off
    var EPSILON = 0.001;
    // Coefficient for converting to and from dBFS
    var DBFS_COEF = 20 / Math.log(10);
    // Calculates the volume from the position on the scale
    var volumeToExponent = function(value) {
        var volume = Math.pow(EPSILON, 1 - value);
        return volume > EPSILON ? volume : 0;
    };
    // Calculates the position on the scale from the volume value
    var volumeFromExponent = function(volume) {
        return 1 - Math.log(Math.max(volume, EPSILON)) / Math.log(EPSILON);
    };
    // Convert volume value to dBFS
    var volumeToDBFS = function(volume) {
        return Math.log(volume) * DBFS_COEF;
    };
    // Convert dBFS value to volume
    var volumeFromDBFS = function(dbfs) {
        return Math.exp(dbfs / DBFS_COEF);
    };
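
    For example, with the functions above (assuming EPSILON = 0.001 as in the listing), the middle of such a scale corresponds to roughly -30 dBFS rather than to half of the amplitude:

    console.log(volumeToExponent(0.5));               // ~0.0316 - actual gain at mid-scale
    console.log(volumeToDBFS(volumeToExponent(0.5))); // ~ -30 dBFS
    console.log(volumeToDBFS(0.5));                   // ~ -6.02 dBFS - a plain halving of the amplitude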

    Digital processing

    Now back to the fact that we have a digital, not an analog signal. There are two features of a digital signal that you should consider when working with loudness:
    • the accuracy with which the signal level is specified is limited (and quite severely: 16 bits is half of what is used for a standard single-precision floating-point number);
    • the signal has an upper level limit that it cannot go beyond.

    The fact that the signal level has an accuracy limit implies two things:

    • the quantization-noise level rises as the volume is increased. For small changes this is usually not critical, since the initial noise level is far below what is audible, and the signal can safely be boosted by a factor of 4-8 (for example, with an equalizer limited to ±12 dB);
    • you should not first lower the signal level a great deal and then raise it a great deal - new quantization noise may appear that was not there originally.

    From the fact that the signal has an upper level limit it follows that it is not safe to raise the volume above unity: the peaks that end up above the limit will be "cut off" and data will be lost.

    In practice all this means that the standard Audio-CD sampling parameters (16 bits, 44.1 kHz) do not allow high-quality sound processing, because they have very little redundancy; it is better to use more redundant formats for that purpose. Keep in mind, however, that the total file size is proportional to the sampling parameters, so serving such files for online playback is not a good idea.

    Loudness measurement

    In order to compare the loudness of two different signals, it must first be measured somehow. There are at least three metrics for measuring the loudness of signals - the maximum peak value, the average value of the signal level, and the ReplayGain metric.

    The maximum peak value is a rather weak loudness metric. It does not take the overall volume level into account at all: if, for example, you record a thunderstorm, most of the recording will be the quiet rustle of rain with only a couple of thunderclaps. The maximum peak value of such a recording will be quite high, yet most of it will have a very low signal level. This metric is still useful, however - it lets you calculate the maximum gain that can be applied to the recording without losing data through "clipping" of the peaks.

    The average signal level is a more useful metric and is easy to calculate, but it still has significant drawbacks related to how we perceive sound: the squeal of a circular saw and the rumble of a waterfall, recorded at the same average signal level, will be perceived completely differently.
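
    The first two metrics are easy to compute directly; here is a minimal sketch over an array of samples in the -1..1 range (the function names are illustrative):

    // Maximum peak value, in dBFS
    var peakDb = function(samples) {
        var peak = 0;
        for (var i = 0; i < samples.length; i++) {
            peak = Math.max(peak, Math.abs(samples[i]));
        }
        return 20 * Math.log(peak) / Math.log(10);
    };
    // Average (RMS) level, in dBFS
    var rmsDb = function(samples) {
        var sum = 0;
        for (var i = 0; i < samples.length; i++) {
            sum += samples[i] * samples[i];
        }
        return 10 * Math.log(sum / samples.length) / Math.log(10);
    };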

    ReplayGain conveys the perceived loudness of a recording most accurately and takes into account the physiological and psychological characteristics of sound perception. Many recording studios use it for commercial releases, and it is supported by most popular media players. (The Wikipedia article contains many inaccuracies and does not really describe the essence of the technology correctly.)

    Volume normalization

    If we can measure the loudness of different recordings, we can normalize it. The idea of normalization is to bring different sounds to the same perceived loudness level. Several different approaches are used for this. As a rule, one tries to maximize the loudness, but that is not always possible because of the limit on the maximum signal level, so a value slightly below the maximum (for example -14 dBFS) is usually taken as the target to which all signals are brought.
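
    A minimal sketch of such normalization, assuming the peakDb and rmsDb helpers from the previous sketch and using -14 dBFS only as an example target:

    var normalizeTo = function(samples, targetDb) {
        var gainDb = targetDb - rmsDb(samples);               // how much louder we want the recording to be
        var maxGainDb = -peakDb(samples);                     // gain at which the peaks reach 0 dBFS
        gainDb = Math.min(gainDb, maxGainDb);                 // do not allow clipping
        var gain = Math.pow(10, gainDb / 20);
        return samples.map(function(s) { return s * gain; });
    };
    // var leveled = normalizeTo(samples, -14);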

    Loudness is sometimes normalized within a single recording, with different parts of the recording amplified by different amounts so that their perceived loudness is the same. This approach is very often used in computer video players: the soundtrack of many films contains sections with very different loudness. Problems arise when watching such films late at night without headphones - at a volume where the characters' whispers are audible, the gunshots can wake the neighbours, and at a volume where the gunshots do not hurt the ears, the whispers become indistinguishable. With intra-track normalization the player automatically raises the volume in quiet passages and lowers it in loud ones. However, this approach produces noticeable playback artifacts at sharp transitions between quiet and loud sound, and sometimes boosts sounds that were meant to be background and barely noticeable.

    Internal normalization is also sometimes performed to increase the overall loudness of a track; this is called normalization with compression. With this approach the average signal level is maximized by amplifying the whole signal by a given amount, while the sections that would have been "cut off" for exceeding the maximum level are amplified by a smaller amount, thus avoiding clipping. This way of increasing loudness noticeably degrades the sound quality of the track, yet many recording studios do not hesitate to use it.

    Filtering

    I will not describe absolutely every audio filter, limiting myself to the standard ones present in the Web Audio API. The simplest and most common of these is the biquad filter (BiquadFilterNode) - an active second-order filter with an infinite impulse response that can reproduce quite a wide range of effects. It works using two buffers of two samples each: one holds the last two samples of the input signal, the other the last two samples of the output signal. The resulting value is obtained by summing five values: the current sample and the samples from both buffers, each multiplied by a pre-computed coefficient. The coefficients of this filter are not set directly; they are calculated from the frequency, quality factor (Q) and gain parameters.
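
    The computation described above (two two-sample buffers, five products) is essentially a direct-form biquad; a sketch might look like this, with the caveat that the b0..a2 coefficients would normally be derived from the frequency, Q and gain parameters rather than passed in directly:

    var makeBiquad = function(b0, b1, b2, a1, a2) {
        var x1 = 0, x2 = 0, y1 = 0, y2 = 0;          // last two input and output samples
        return function(x0) {
            var y0 = b0 * x0 + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2;
            x2 = x1; x1 = x0;                        // shift the input buffer
            y2 = y1; y1 = y0;                        // shift the output buffer
            return y0;
        };
    };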

    All the graphs below cover the frequency range from 20 Hz to 20,000 Hz. The horizontal axis shows frequency on a logarithmic scale; the vertical axis shows either the magnitude (yellow graph), from 0 to 2, or the phase shift (green graph), from -Pi to Pi. The frequency to which all the filters are set (632 Hz) is marked on the graphs with a red line.

    Lowpass



    Fig. 8. Lowpass filter.

    Passes only frequencies below the set frequency. The filter is set by frequency and quality factor.
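
    With the Web Audio API this does not have to be implemented by hand. A sketch of creating such a lowpass (assuming a browser AudioContext and some source node, which is left commented out):

    var ctx = new AudioContext();
    var filter = ctx.createBiquadFilter();
    filter.type = 'lowpass';
    filter.frequency.value = 632;      // the cutoff frequency marked on the graphs
    filter.Q.value = 1;                // quality factor
    // source.connect(filter); filter.connect(ctx.destination);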

    Highpass



    Fig. 9. Highpass filter.

    Operates similarly to lowpass, except that it passes frequencies above the specified frequency, not below.

    Bandpass



    Fig. 10. Bandpass filter.

    This filter is more selective - it passes only a certain frequency band.

    Notch



    Fig. 11. Notch filter.

    It is the opposite of bandpass - it passes all frequencies outside the given band. It is worth noting, however, the difference in the attenuation curves and in the phase characteristics of these two filters.

    Lowshelf



    Fig. 12. Lowshelf filter.

    It is a "smarter" version of highpass - it boosts or attenuates frequencies below the set one and passes frequencies above it unchanged. The filter is set by frequency and gain.

    Highshelf



    Fig. 13. Highshelf filter.

    A "smarter" version of lowpass - it boosts or attenuates frequencies above the set one and passes frequencies below it unchanged.

    Peaking



    Fig. 14. Peaking filter.

    This is a "smarter" version of notch - it boosts or attenuates frequencies in a given band and passes the rest unchanged. The filter is set by frequency, gain and quality factor.

    Allpass



    Fig. 15. Allpass filter.

    Allpass differs from all the others - it does not change the amplitude characteristics of the signal; instead it shifts the phase of the given frequencies. The filter is set by frequency and quality factor.

    WaveShaperNode filter

    WaveShaper (WaveShaperNode) is used to create complex sound-distortion effects; in particular, it can be used to implement "distortion", "overdrive" and "fuzz" effects. This filter applies a special shaping function to the input signal. The principles of constructing such functions are rather involved and deserve a separate article, so I will omit them.
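
    A sketch of feeding it a simple soft-clipping curve (assuming the AudioContext ctx from the lowpass sketch above; the curve itself is illustrative, real "distortion" curves are usually more elaborate):

    var shaper = ctx.createWaveShaper();
    var curve = new Float32Array(1024);
    for (var i = 0; i < curve.length; i++) {
        var x = (i / (curve.length - 1)) * 2 - 1;    // input value in -1..1
        curve[i] = Math.tanh(3 * x);                 // soft clipping
    }
    shaper.curve = curve;
    shaper.oversample = '4x';                        // reduces aliasing of the distortion products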

    ConvolverNode filter

    A filter that performs linear convolution of the input signal with an audio buffer containing a certain impulse response. The impulse response is the response of a system to a single impulse; in simple terms it can be called a "photograph" of a sound. Where a real photograph contains information about light waves - how they are reflected, absorbed and interact - the impulse response contains the same kind of information about sound waves. Convolving an audio stream with such a "photograph" imposes on the input signal, as it were, the effects of the environment in which the impulse response was recorded.

    For this filter to work, the signal must be decomposed into frequency components. This decomposition is performed using the fast Fourier transform (unfortunately, the article in the Russian-language Wikipedia is practically empty and seems to have been written for people who already know what an FFT is and could write an equally empty article themselves). As I said in the introduction, I will not go into the mathematics of the FFT here, but it would be wrong not to mention this cornerstone algorithm of digital signal processing.

    This filter implements the reverb effect. There are many libraries of ready-made impulse responses (audio buffers) for this filter that implement various effects; such libraries are easy to find online.
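
    A sketch of using it for reverb (again assuming the AudioContext ctx from above; the impulse-response URL is hypothetical, any recorded room impulse response will do):

    var convolver = ctx.createConvolver();
    fetch('impulse-responses/hall.wav')                              // hypothetical file
        .then(function(response) { return response.arrayBuffer(); })
        .then(function(data) { return ctx.decodeAudioData(data); })
        .then(function(buffer) { convolver.buffer = buffer; });
    // source.connect(convolver); convolver.connect(ctx.destination);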