Desktop Music Handbook - Digital Audio Section

Last updated on 3/29/2016

One of the most exciting developments in desktop music in recent years is the ability to work with digital audio on a home PC. Long the province of research institutions and recording studios, digital audio editing software has become nearly commonplace on the desktop, and is now among the most accessible and powerful types of computer software available. Recording, editing, and playing digital audio on a home computer gives the user considerable power to design and produce new sounds, and to edit and craft one's own music with great precision. Digital audio can be a highly technical and elusive concept though, and we'll try to make the terms and concepts perfectly clear.

What is Digital Audio

Digital audio is a numeric representation of sound; it is sound stored as numbers. In order to understand what the numbers mean, we need to review some of the basic principles of acoustics, the study of sound.

Sound is produced when molecules in the air are disturbed by some type of motion produced by a vibrating body. This body, which might be a guitar string, human vocal cord or garbage can, is set into motion because energy is applied to it. The guitar string is struck by a pick or finger, while the garbage can is hit perhaps by a hammer, but the basic result is the same: they both begin to vibrate. The rate and amount of vibration are critical to our perception of the sound. If the vibration is not fast enough or strong enough, we won't hear it. But if the vibration occurs at least twenty times a second and the molecules in the air are moved enough (a more difficult phenomenon to measure), then we will hear sound. To understand the process better, let's take a closer look at a guitar string.

When the pick hits the string, the entire string moves back and forth at a certain rate (Figure 12). This rate is called the frequency of the vibration. Because a single back and forth motion is called a cycle, we use a measure of frequency called cycles per second, or cps. This measure is also known as hertz, abbreviated Hz. Like that of other bodies, the frequency of the string is often very fast, so it is useful to use the abbreviation kHz to measure frequency in thousands of vibrations per second. A frequency of 2 kHz, then, signifies a frequency of 2,000 cycles per second, meaning the string goes through its back and forth motion 2,000 times per second. The actual distance the string moves is called its displacement, and is proportional to how hard we pluck it. The actual measurement used for this distance is not particularly important for our purposes, but we will often refer to the amplitude or strength of the vibration.

As the string moves, it displaces the molecules around it in a wave-like pattern, i.e., while the string moves back and forth, the molecules also move back and forth. The movement of the molecules is propagated in the air; individual molecules bump against molecules next to them, which in turn bump their neighbors, etc., until the molecules next to our ears are set in motion. At the end of the chain, these molecules move our eardrum in a pattern analogous to the original string movement, and we hear the sound. This pattern of motion, which is an air pressure wave, can be represented in many ways, for example as a mathematical formula, or graphically as a waveform. Figure 13 below shows the movement of the string over time: the segment marked "A" represents the string as it is pulled back by the pick; "B" shows it moving back towards its resting point; "C" represents the string moving through the resting point and onward to its outer limit; then "D" has it moving back towards the point of rest. This pattern repeats continuously until the friction of the molecules in the air gradually slows the string down to a stop. In order for us to hear the string tone, the pattern must repeat at least twenty times per second. This threshold, 20 cps, is the lower limit of human hearing. The fastest sound we can hear is theoretically 20,000 cps, but in reality, it's probably closer to 15,000 or 17,000 cps.

Fig. 13 - The vibration pattern of a plucked string over time. Gradually, the motion will die out.

If this back and forth motion were the only phenomenon involved in creating a sound, then all stringed instruments would probably sound much the same. We know this is not true, of course, and alas, the laws of physics are not quite so simple. In fact, the string vibrates not only along its entire length, but also in halves, thirds, fourths, fifths, etc. of that length. These additional vibrations occur at a rate faster than the original vibration (known as the fundamental frequency), but are usually weaker in strength. Our ear doesn't hear each vibration individually, however. If it did, we would hear a multi-note chord every time a single note were played. Rather, all these vibrations are added together to form a complex or composite waveform that our ear perceives as a single tone (Figure 14).

Fig. 14 - The making of a complex waveform. Vibrations occurring at different frequencies are added together to form a complex tone.
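
The adding-together of vibrations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name composite_sample, the 110 Hz fundamental, and the rule that the nth vibration has 1/n the strength of the fundamental are all invented for the example and are not measurements of a real string.

```python
import math

def composite_sample(t, fundamental_hz=110.0, num_harmonics=5):
    """Sum the fundamental and its faster vibrations at time t (seconds).

    Assumption: vibration n is n times as fast as the fundamental and has
    1/n of its strength. Real strings have less tidy amplitudes.
    """
    total = 0.0
    for n in range(1, num_harmonics + 1):
        total += (1.0 / n) * math.sin(2 * math.pi * n * fundamental_hz * t)
    return total

# One cycle of the composite wave, measured at 8 evenly spaced points.
period = 1.0 / 110.0
cycle = [composite_sample(i * period / 8) for i in range(8)]
```

Changing num_harmonics or the 1/n weighting changes the shape of the composite wave while leaving the fundamental frequency, and so the perceived pitch, the same.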

This composite waveform still doesn't account for the uniqueness of the sound of different instruments, as there is one more major factor in determining the quality of the tone we hear. This is the resonator. The resonator in the case of the guitar is the big block of hollow wood to which the string is attached, i.e., the guitar body. This has a major impact on the sound we perceive when a guitar is played, as it enhances some of the vibrations produced by the string and diminishes, or attenuates, others. The ultimate effect of all the vibrations occurring simultaneously, as altered by the resonator, adds up to the sound we know as guitar.

Recording a Sound

So what has all this got to do with digital audio? What is it we need to record from all of this motion in the air? It is the strength of the composite pressure wave created by all the vibrations that we must measure very accurately and very often. That is the basic principle behind digital audio. When a microphone records a guitar playing, a small membrane in the mic (called the diaphragm) is set into motion in a pattern identical to the guitar wave's own pattern. The diaphragm moves back and forth, creating an electrical current that is sent through a cable. The voltages in the cable are also "alternating" in strength at a very rapid rate: strong, weaker, weak, strong again. When the signal arrives at our measuring device, called an analog to digital (A/D) converter, the device measures how strong the signal is at regular, very closely spaced instants and sends a numeric value for each measurement to a storage device, probably the hard drive in your computer. The A/D converter, along with its counterpart, the digital to analog (D/A) converter that turns the numbers back into voltages, is typically found as a component of your sound card, or as a stand-alone device.

There are several important aspects of this measuring process that we need to discuss. First is the rate at which we choose to examine the signal coming into the converter. It is a well-established principle of sampling (the Nyquist theorem) that we must measure or sample the signal at a rate at least twice as fast as the highest frequency we wish to capture. Let's say we are trying to record a moderately high note on a violin. Let's also assume that the fundamental frequency of this tone repeats 440 times per second (the note would be an "A," of course), and that we want to capture all vibrations up to five times the rate of the fundamental, or 2,200 cycles per second. To capture all the components of this note and convert the resulting sound into numbers, we would have to measure it at least 4,400 times per second.
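
The arithmetic of the violin example can be written out as a short Python sketch. The function name minimum_sampling_rate is made up for this illustration; the only rule it applies is the "twice the highest frequency" principle described above.

```python
def minimum_sampling_rate(fundamental_hz, highest_harmonic):
    """Return the lowest sampling rate (measurements per second) that can
    capture every component up to the given multiple of the fundamental."""
    highest_frequency = fundamental_hz * highest_harmonic
    return 2 * highest_frequency

# The violin "A": fundamental of 440 cps, components up to 5 x 440 = 2,200 cps.
violin_rate = minimum_sampling_rate(440, 5)   # 4,400 measurements per second
```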

But humans can hear tones that occur at rates well up into the tens of thousands of times per second, so our system must be capable of much better than that! In theory, we might want to capture an extremely high sound, for example one that actually contains a frequency component of 20,000 cps. In that case, our measurements must occur 40,000 times per second, which in fact would allow us to capture every possible sound that any human might be able to hear. In practice some margin above that minimum is needed, partly because the filters that remove frequencies above the audible range cannot cut off instantly, so we use a rate of 44,100 measurements or "snapshots" of a sound per second in our professional equipment. This sampling rate, abbreviated 44.1 kHz (44.1 kilohertz), is one aspect of what we call CD-quality recording, as it is the same rate that commercial CDs use. Other common sampling rates are 11 kHz and 22 kHz (more precisely 11.025 kHz and 22.05 kHz, one quarter and one half of the CD rate), and for some professional equipment, 48 kHz.

The other important issue is how accurate our measuring system will be. Will we have 20 different values to select from for each measurement? How about 200 or 2,000? How accurately do we need to represent the incredible variety of fluctuations in a pressure wave? Think about the different types of timepieces you know about. If your digital watch shows you minutes and seconds, that's adequate for most purposes. If you are doing scientific measurements of time, then you might need more accuracy, perhaps minutes, seconds, tenths, hundredths and even thousandths of seconds. Sound waves actually encompass an infinite range of strengths, but we must draw the line somewhere, or else we would need gigantic hard drives just to store the information for a short amount of sound. The music industry has settled on a system that provides 65,536 different values to assign to the amplitude (strength) of a waveform at any given instant. In a certain sense, that number represents a compromise, as we will definitely not capture every possible value that the amplitude can take. However, our ears can live with that compromise, and in any event, using a more sophisticated measuring system is simply not worth the extra cost in computing and storage resources.

Obviously you are wondering, "Why in the world did they choose 65,536?" The answer is simply because it is 2^16, that is, 2 to the 16th power (sixteen 2s multiplied together). This is the number of different values we can express in the binary numbering system if we use 16 bits, or 16 places. Recall from your high school math that the binary numbering system uses only two digits, 0 and 1, and that this is what computers use as well. A string of sixteen 1's in the binary system produces the number 65,535 in decimal, and a string of sixteen 0's is, of course, the decimal number 0. So from 0 through 65,535 we have 65,536 different numbers that we can express using 16 bits. Computers actually think in terms of 8-digit strings, which you will remember are called bytes. Therefore, if we use numbers that are two bytes long to represent every different value in our system, we have the total range described above. One byte, or a string of 8 bits, would allow us to represent the numbers 0 through 255, and MIDI is quite happy with that range, but there is so much more detail in the digital audio world that our system must be far more sophisticated.
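
For readers who like to check such figures, here is a small Python sketch of the counting described above. The function names distinct_values and largest_value are invented for this illustration.

```python
def distinct_values(bits):
    """How many different numbers can be written with this many bits."""
    return 2 ** bits

def largest_value(bits):
    """The largest number: a string of 1's, read as a binary number."""
    return int("1" * bits, 2)

sixteen_bit_count = distinct_values(16)   # 65,536 different values
sixteen_bit_max = largest_value(16)       # sixteen 1's = 65,535
eight_bit_max = largest_value(8)          # eight 1's = 255
```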

If you've followed the discussion up until now, you should have a pretty good idea of what is on a compact disc. It's a massive collection of numbers, each two bytes long, that represent the fluctuating amplitude of the pressure wave in front of the microphone that made the recording. No matter if the sound was an orchestra, a guitar or a car horn, the CD simply contains measurements for the pattern of motion produced by that sound. We can use our hard drives to record the information in the same form as that on a CD, or if we wish, we can use a somewhat less accurate representation. For example, if we choose not to capture the data as accurately as the CD, we might only use eight bits, or one byte, for each amplitude value. Such a measuring system has only what we call 8-bit accuracy or resolution. This will have a significant impact on the quality of our representation, but it may be adequate for the purpose at hand. Or we might wish to look at the sound and take a measurement only eleven or twenty-two thousand times a second, i.e., an 11 kHz or 22 kHz sampling rate, realizing that we will miss some detail, in particular the high end (upper frequencies) in the sound. In truth, that rate may be good enough to represent certain types of sound. For example, the frequencies produced by the human voice are much lower than those produced by a cymbal, so we might be able to get the whole picture by looking at the voice at a lower rate. The decision regarding how accurate we need to be will be determined by the material we are recording and the amount of storage space we have available to hold the recording. These choices are usually made from within our audio software, so perhaps it's time to turn our attention to the PC.
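
To get a feel for the storage trade-off, the following Python sketch estimates how many bytes one minute of uncompressed audio occupies. It is a rough calculation under stated assumptions: the function name bytes_per_minute is invented here, the CD figure assumes two channels (stereo), and "11 kHz" is taken to mean exactly 11,025 measurements per second.

```python
def bytes_per_minute(sample_rate_hz, bits_per_sample, channels):
    """Bytes needed to store one minute of uncompressed audio."""
    bytes_per_sample = bits_per_sample // 8
    return sample_rate_hz * bytes_per_sample * channels * 60

# CD-style recording: 44.1 kHz, 16 bits, stereo (assumed two channels).
cd_quality = bytes_per_minute(44100, 16, 2)   # 10,584,000 bytes

# A modest voice recording: 11.025 kHz, 8 bits, one channel.
voice_quality = bytes_per_minute(11025, 8, 1)   # 661,500 bytes
```

Under these assumptions the lower-quality recording takes one sixteenth of the space, which is why the choice matters when disk space is limited.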

Digital Audio Software

There are several common varieties of software used to manipulate digital audio data on a computer. The most popular is wave editing software, which is often included as part of the software packaged with sound cards. This type of software allows someone to work with a graphic representation of sound, the waveform, and cut, copy and paste it with the ease of a word processor (Figure 15). The software also typically includes a number of editing features that allow additional processing of the material; this processing can be used to create special effects, such as dogs barking backwards and gunshots stretched to one hundred times their length. Features of this type fall into the category of signal processing, or digital signal processing (DSP) functions. Professional versions of waveform editors often cost several hundred dollars, but offer the user tremendous flexibility in the type of manipulations they can perform. By the way, on the IBM-compatible platform, digital audio files are typically called Wave files and carry the extension .WAV. On the Macintosh, the standard audio file type is the AIFF file.
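
Because a Wave file is just a list of numbers, an effect like playing a sound backwards amounts to reversing the list. The Python sketch below shows one way this might be done with the wave module from Python's standard library. It is an illustration, not how any particular editor works: the function name reverse_wav is invented, and the code assumes 16-bit samples.

```python
import array
import io
import wave

def reverse_wav(wav_bytes):
    """Return a new WAV file (as bytes) with the audio played backwards.

    Assumption: 16-bit samples. Frames are reversed whole, so the left and
    right values of a stereo file stay paired with each other.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        if params.sampwidth != 2:
            raise ValueError("this sketch only handles 16-bit audio")
        samples = array.array("h", src.readframes(src.getnframes()))

    channels = params.nchannels
    frames = [samples[i:i + channels] for i in range(0, len(samples), channels)]

    reversed_samples = array.array("h")
    for frame in reversed(frames):
        reversed_samples.extend(frame)

    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(reversed_samples.tobytes())
    return out.getvalue()
```

To try it on a file, read the file's contents with open(path, "rb").read(), pass them to reverse_wav, and write the returned bytes to a new .WAV file.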

Usually, wave editing software can accommodate no more than a single, stereo file, though a new category, called multi-track software, lets the user work with several stereo files at once. After being manipulated and edited, these files are mixed together into a single composite stereo file that is sent to the left and right channel outputs of a sound card. In many cases, the multi-track software doesn't offer a full range of editing options; most often it is the signal processing functions that are omitted, but the ability to mix many different layers of audio is very appealing.
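
At its simplest, mixing means adding together the numbers that occur at the same moment in each track. The Python sketch below illustrates the idea for one channel. It makes several assumptions that the text does not state: the function name mix_tracks is invented, all tracks share one sampling rate, samples are signed 16-bit values (-32,768 to 32,767) as stored in 16-bit Wave files, and sums that exceed that range are simply clipped. Real multi-track programs are likely to be more careful about levels.

```python
def mix_tracks(tracks, low=-32768, high=32767):
    """Mix several tracks of 16-bit sample values into a single track.

    Samples at the same position are added together; sums outside the
    16-bit range are clipped. A shorter track counts as silence once it ends.
    """
    length = max(len(track) for track in tracks)
    mixed = []
    for i in range(length):
        total = sum(track[i] for track in tracks if i < len(track))
        mixed.append(max(low, min(high, total)))
    return mixed

vocal = [1000, -2000, 30000]
music = [500, 500, 5000, 250]
mix = mix_tracks([vocal, music])   # [1500, -1500, 32767, 250]
```

The third value shows the clipping: 30,000 + 5,000 would be 35,000, which does not fit in 16 bits, so it is held at 32,767.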

One other type of editing software is used with dedicated hard-disk recording systems. These professional products are very sophisticated, and often very expensive. Their key advantage is that they provide extensive editing capabilities, such as those needed to make commercial audio recordings, and often include storage devices devoted to holding large amounts of high quality audio. They also provide multiple tracks of digital audio, in some cases up to ten or even twelve simultaneous tracks on a single PC, as well as multiple audio outputs. This makes them well suited for the production of radio and television commercials, where a vocal narration, sound effects and music soundtrack are often combined.

Sound Cards

Far less expensive than the dedicated hardware described above are the massively popular sound cards found in nearly every PC today. Much of the success of these products can be attributed to the fact that IBM-compatible computers never enjoyed the quality of sound production that the Macintosh(TM) had from its inception. When card maker Creative Labs reached the consumer with its industry standard Sound Blaster(TM) card, they found a huge untapped market that is now quite saturated with products.

Sound cards typically serve several important functions. First, they contain a synthesizer that either uses frequency modulation (FM) synthesis to produce sound, or stores actual recorded audio data in wavetables for use in playback. FM is a somewhat dated method of synthesis that uses one or more waves, called modulators, to alter the frequency and amplitude of another wave, called the carrier. The range of sounds that can be produced is limited, though often adequate for simple sound effects or other game sounds. While the FM-style card has nearly disappeared from the market, most software manufacturers must include support for it in their products because of the vast number of cards that are still installed in computers.
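
The basic idea of FM can be sketched in Python with one modulator and one carrier. This is a textbook-style simplification and not the circuitry of any actual sound card: the function name fm_sample and the default frequencies and index are invented for the example, and only the carrier's frequency is modulated here, not its amplitude.

```python
import math

def fm_sample(t, carrier_hz=440.0, modulator_hz=110.0, index=2.0):
    """One sample of a simple two-wave FM tone at time t (seconds).

    The modulator wave is added to the carrier's phase, which makes the
    carrier's instantaneous frequency swing above and below carrier_hz.
    An index of 0 leaves a plain sine wave.
    """
    modulator = math.sin(2 * math.pi * modulator_hz * t)
    return math.sin(2 * math.pi * carrier_hz * t + index * modulator)

# Ten milliseconds of the tone at a 44.1 kHz sampling rate (441 samples).
sample_rate = 44100
tone = [fm_sample(n / sample_rate) for n in range(sample_rate // 100)]
```

Raising index makes the modulator push the carrier harder, which generally produces a brighter, more complex tone from the same two waves.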

Nearly all newer cards use the preferable wavetable approach because it provides far more realistic sound. Wavetables are digital recordings that exist in some type of compressed form in the card's ROM (read only memory). These sounds can never be erased, but can be altered in numerous ways as they play back. For example, a trumpet sound could be reversed, or a piano could be layered with a snare drum. Depending upon the programmability provided by the manufacturer, this type of card can be quite flexible in the sounds it makes. Most wavetable cards, regardless of their manufacturer, offer a General MIDI soundset, which makes them compatible with many popular multimedia programs. Despite what their ads may claim, sound cards vary tremendously in quality, even those that use the same playback method. Magazine reviews and roundups are a good source of information for evaluating a card's characteristics.

Most cards also contain a MIDI interface for MIDI input and output, plus the digital to analog (D/A) and analog to digital (A/D) converters described above. While all MIDI interfaces are essentially created equal, there can be major differences among the converters on these cards. Many cards claim "CD Quality Sound," which simply means they can record and play back audio at a sampling rate of 44.1 kHz using 16-bit resolution. Unfortunately, the personal computer was not originally intended to be a musical instrument, and the high level of electronic activity inside its case can cause interference problems with some cards. With properly built cards, these problems can be avoided, and most users won't experience any difficulties.

Putting It All Together

MIDI and digital audio have coexisted in separate worlds until very recently. Now, using an entirely new class of software, we have the potential to work with both types of data within a single program. This new category, called simply integrated MIDI and digital audio software, solves many of the most nagging problems desktop musicians have had for years. The capabilities it offers greatly facilitate the integration of "real world" audio with the "virtual" world of MIDI tracks. Before we discuss this software, let's look at how musicians combined audio and MIDI in the past.


For many years, in home and professional music studios around the world, musicians have employed elaborate and somewhat complex means to join live audio with MIDI music. Guitarists, vocalists, drummers and others have used different synchronization techniques to mix their live playing with the music produced by their MIDI software. Typically, a musician would record live audio onto a tape recorder, then use the tape recorder to send information to the computer which told it when to start and stop playing. In this way, the music on the tape and the sequenced music could be perfectly aligned.

The information sent by the tape recorder in this case is known as SMPTE time code, and is actually an audio signal recorded (or "striped") on the tape. SMPTE (pronounced "simp-tee") serves as a timing reference for both the tape and the computer running the MIDI software. In essence, this code tells the software "what time it is," i.e., how far into the music it should be. If a MIDI drum part must start exactly one minute after the music on the tape recorder begins, then the sequencer will watch the time pass from the beginning of the tape (time 00:00), until it reaches time 01:00, at which point it begins to play. Sequencers can jump instantly to any time point that's required, so the sequencer will simply wait for its "cue" and then start playing.

SMPTE time code takes its name from the Society of Motion Picture and Television Engineers, but it was initially created by the NASA space agency for use with its tracking stations. It provided an absolute timing reference that allowed the agency to keep track of when transmissions occurred. Like a digital clock, SMPTE works on a 24-hour cycle, and the precision it provides is considerable: a normal SMPTE time represents hours, minutes, seconds, and "frames" (Figure 16). The "frames" designation is important to the television and movie industry for tracking time in film and video productions. A frame in television occurs 30 times a second, while in film it represents an interval of 1/24th or 1/25th of a second, so SMPTE can measure time quite accurately. Because most professional video equipment is SMPTE-compatible, musicians creating audio for video productions can also use it to synchronize their music with the various types of video equipment they commonly work with. When scoring for films, it is an invaluable way for the composer to know exactly when a sound effect or music cue must begin and end.

Fig. 16 - An example of SMPTE time code, showing time in hours, minutes, seconds, and frames.
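
Since a SMPTE time is simply a count of hours, minutes, seconds, and frames, converting it to a single frame count is straightforward arithmetic. The Python sketch below shows the idea. The function names are invented for this illustration, and it assumes a whole number of frames per second (30 by default, as for television above) with every frame counted, so it ignores the "drop-frame" variant used with some video formats.

```python
def timecode_to_frames(timecode, frames_per_second=30):
    """Convert "HH:MM:SS:FF" into a total number of frames."""
    hours, minutes, seconds, frames = (int(part) for part in timecode.split(":"))
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return total_seconds * frames_per_second + frames

def frames_to_timecode(total_frames, frames_per_second=30):
    """Convert a total number of frames back into "HH:MM:SS:FF"."""
    seconds, frames = divmod(total_frames, frames_per_second)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}:{frames:02d}"

# A drum part that must enter exactly one minute into the tape:
cue = timecode_to_frames("00:01:00:00")   # 1,800 frames at 30 per second
```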

Integrated Software

Rather than deal with the intricacies of SMPTE, today's musician can work with integrated software to combine audio and MIDI tracks with great precision. New programs like Cakewalk Pro Audio represent digital audio data in the same form as MIDI data, and allow the user to manipulate the two with ease. Once audio files are recorded onto disk, they can be aligned for playback along with the MIDI information, and what's more, numerous tracks of audio can be performed simultaneously. If synchronization with an external device is needed, the entire project can still be controlled by that device. Thus, the best features of multi-track audio software can now be found integrated with the advanced options of MIDI sequencers.

The number of audio tracks that can be mixed together in an integrated program, or in a stand-alone audio editor for that matter, is very much a function of the computer hardware being used for the task. In the IBM world, the processor (CPU) speed, access or "seek" time of the hard drive, and available system RAM are among the key components to evaluate. In the early years of desktop multimedia, software leader Microsoft produced a "multimedia" specification that described the minimal requirements for work of this type. That spec has been modified to keep up with enhancements in today's computers, and has, as of this writing, reached "Level III" status. This calls for a computer with a Pentium 75 MHz or better processor, at least 8 MEGS of RAM, a 540 MEG hard drive, a quad-speed CD-ROM player, a sound card that uses wavetable synthesis, and a video card that is MPEG 1 (a form of compression) compliant. Keep in mind that any component of a system can slow the process: a fast CPU with an inadequate hard drive can bring a system to its knees, for example. It's important that all the pieces of the system are well balanced and in good working order.

Here's a tip to keep in mind: one of the easiest and most effective things you can do to prepare your system for recording or playing audio is to defragment your hard drive. A fragmented drive contains pieces of files spread over different physical locations, which makes the job of streaming data to and from that disk very difficult. Use one of the cleanup programs, such as defrag, which comes with your operating system, before making recordings. Also, if possible, devote a separate drive partition to digital audio. When you first set up your computer, you can create partitions easily using DOS's fdisk program, but later, you'll have to back up your drive and reformat it.


We hope you've enjoyed this initial presentation of the ins and outs of desktop music and that it will encourage you to experiment on your own. Much of today's software is very powerful, though manufacturers have done a good job in making it easy to use, and you've got many hours of pleasure and excitement to look forward to. Of course the more you can learn about desktop music, the more you will get out of your equipment, so keep your eyes on the numerous books and magazines devoted to the subject, and consider subscribing to some of the multimedia newsgroups on the Internet. There's a whole world of music waiting for you, right on your desktop. 

Continue forward for a Glossary of terms.

Copyright © 2024 Cakewalk, Inc. All rights reserved