Chapter 2 - A look at upsampling

Background - Understanding Audio Data

Sound is a series of pressure waves in air (compressions and rarefactions or, more usually, oscillations) which cause the ear drum to vibrate. The vibrations are interpreted by the brain as sound. Human hearing has an impressive ~132 dB dynamic range and a sensitivity ranging from 20 Hz to 20 kHz – we can hear sounds as soft as the rubbing together of fingers at about 3 dB or as loud as a jet engine at well over 110 dB.

The basic principle that Thomas Edison first used in recording sound remains in use to this day – he saw that it could be captured by emulating the ear drum. In 1877, he succeeded in recording the oscillations caused by his voice on a rotating metal cylinder wrapped with tin foil while he recited ‘Mary had a little lamb’. They were played back using a second diaphragm-and-needle unit.

Variations in the energy levels of a sound cause varying intensities of the oscillations in the ear drum. These are perceived as changes in loudness (amplitude). Variations in the frequency of the oscillations (how many there are in a second) are perceived as changes in pitch.

In an analogue recording system, a sound’s varying amplitudes and frequencies, its waveform, are changed (modulated) into continuous variations in voltage. An audio reproduction system follows the waveform and amplifies it to drive speaker diaphragms and recreate the original sound.

In digital recording, however, sounds are not recorded continuously; they are sampled at discrete points in time. It falls to the reproduction system to infer what happens between the samples, a process known as interpolation. How often samples are taken is known as the sampling frequency (Fs). How high Fs needs to be to enable accurate reproduction is well understood mathematically: frequencies up to but not including Fs/2 (half the sampling frequency, or the Nyquist limit) can be captured.

The Nyquist limit is not rocket science: as a signal’s frequency increases, there are fewer samples available to describe its waveform. As a minimum, slightly more than two per cycle are needed. If frequencies higher than Fs/2 are sampled, the result is distortion manifested below Fs/2 as ‘alias’ images of the frequencies above it. Digital recording must therefore filter out sounds higher than the upper frequency limit to prevent ‘aliasing distortion’. The process is known as low-pass, anti-alias filtering.
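The aliasing described above can be checked numerically. The following is a minimal sketch only: the function name `alias_frequency` and the 30 kHz test tone are illustrative assumptions, not taken from any particular recorder or filter design. It shows that a tone above Fs/2 produces exactly the same sample values as a lower 'alias' tone.

```python
import math

def alias_frequency(f_hz, fs_hz):
    """Frequency below Fs/2 at which a tone of f_hz appears after sampling at fs_hz."""
    r = f_hz % fs_hz  # sampling cannot tell f apart from f modulo Fs
    # Anything in the upper half of the band folds back below Fs/2.
    return fs_hz - r if r > fs_hz / 2 else r

FS = 44100
tone = 30000                        # hypothetical tone above the 22,050 Hz limit
alias = alias_frequency(tone, FS)   # 14100

# Once sampled, the two tones give the same values (to within rounding).
for n in range(8):
    a = math.cos(2 * math.pi * tone * n / FS)
    b = math.cos(2 * math.pi * alias * n / FS)
    print(n, round(a, 6), round(b, 6))
```

Because the samples are identical, nothing downstream can separate the two tones, which is why the filtering has to happen before sampling.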

The sampling frequency used on a CD is 44.1 kHz – it can reproduce frequencies up to 22.05 kHz. At that rate, a sine wave at 44.1 Hz has 1,000 samples describing its wave cycle, a 441 Hz wave has 100 samples, 4,410 Hz has ten – and 22,050 Hz has two. Frequencies equal to or higher than 22.05 kHz are filtered out.
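The samples-per-cycle figures above follow directly from dividing Fs by the frequency of the sine wave. A small sketch (the helper name `samples_per_cycle` is an assumption made for illustration):

```python
FS = 44100.0  # CD sampling frequency in Hz

def samples_per_cycle(freq_hz, fs_hz=FS):
    """Number of samples taken during one cycle of a sine wave at freq_hz."""
    return fs_hz / freq_hz

for f in (44.1, 441.0, 4410.0, 22050.0):
    print(f"{f:>8.1f} Hz: {samples_per_cycle(f):7.1f} samples per cycle")
```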

Each sample point records a voltage level. For CDs, this is a 16-bit integer ranging from -32768 to +32767; the number of bits is called the resolution. Each integer step represents a different voltage level, e.g. a nominal 4 V output ranges from -2 V to +2 V with 65,536 possible levels, and anything in between is mapped in a linear fashion. At a 24-bit resolution, 16.8 million voltage levels and correspondingly greater precision are available. It has been suggested that human hearing can detect changes in sounds equivalent to a resolution of about 22 bits.
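As a rough illustration of the arithmetic, the sketch below counts the available levels and maps a sample value onto the nominal 4 V output. The exact mapping is an assumption made here for simplicity (a linear scale centred on 0 V, so the highest code falls one step short of +2 V); real DACs differ in detail, and the function names are hypothetical.

```python
def level_count(bits):
    """Number of distinct values an integer of the given bit depth can hold."""
    return 2 ** bits

def code_to_voltage(code, bits=16, span_volts=4.0):
    """Map a signed integer sample onto a voltage (assumed linear, centred on 0 V)."""
    step = span_volts / level_count(bits)  # voltage represented by one integer step
    return code * step

print(level_count(16), level_count(24))
print(code_to_voltage(-32768), code_to_voltage(0), code_to_voltage(32767))
```

Under this assumption one 16-bit step is 4 V / 65,536, about 61 microvolts; at 24 bits the step is 256 times smaller.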

Temporal (time) information is implied by Fs: CDs can only store varying voltage levels at discrete time intervals of 1/Fs. The waveform is re-created at this sampling rate. Digital to Analogue Converters (DACs) combine digital data with a clock signal to create a continuous analogue voltage waveform.

Aside: The original audio data is created by ADCs (Analogue-to-Digital Converters). Here, very high values for Fs (where Fs/2 is much larger than any frequency in the waveform) followed by an optimal level of downsampling using a digital filter yield the best results.

Upsampling - A Frequency Domain Perspective

The process of creating more samples from a given (bandlimited) input sample stream has many names: upsampling, oversampling, resampling, reconstruction filters and so on. Whatever it may be called, what is essential is the accuracy of new signal amplitudes being calculated.

In the frequency domain perspective we have the pass band (and associated ripple), the steep transition band (~2 kHz for CD) and the stop band. There is a tradeoff between the width of the transition band and the amount of computation needed (each halving of the transition band requires nearly twice the computation). Gentler transitions (i.e. fewer computations) affect the high-frequency response in the pass band, which is not desired. Of course it doesn't end here...
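The tradeoff can be made concrete with a commonly quoted rule of thumb for FIR filter length, usually attributed to fred harris: the number of taps is roughly (Fs / transition width) x (stop-band attenuation in dB / 22). This rule is not part of the original discussion and is only an estimate; the 96 dB attenuation figure and the function name below are assumptions chosen for illustration.

```python
import math

def estimate_taps(fs_hz, transition_hz, stopband_atten_db):
    """Rough FIR length: (Fs / transition width) * (attenuation in dB / 22)."""
    return math.ceil(fs_hz / transition_hz * stopband_atten_db / 22.0)

FS = 44100.0
ATTEN_DB = 96.0  # assumed stop-band attenuation, for illustration only

for width in (4000.0, 2000.0, 1000.0):
    print(f"transition {width:6.0f} Hz -> about {estimate_taps(FS, width, ATTEN_DB)} taps")
```

Since the work per output sample grows with the number of taps, halving the transition band roughly doubles the computation under this estimate.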

To understand signal amplitude accuracy, we need a time domain perspective.

Upsampling - A Time Domain Perspective

Sound as we experience it is a realtime event and not a Fourier Transformation.

Reconstruction filters are often designed for frequency response, to reduce modulation (or beat frequencies). Such designs do not necessarily guarantee accuracy in the time domain, i.e. the newly calculated samples may result in a waveform that does not follow the original input samples.

This is interpolation error, which is audible. Many interpolation techniques exist. What bandlimited interpolation offers is accurate calculation of signal amplitude at any desired point in time, thus reducing the dependency on hard-wired reconstruction filters downstream.

Regardless of whether upsampling is done carefully by design or happens more or less by accident, upsampling is inherent in digital music reproduction. Any DAC (be it an ‘off the shelf’ design like Burr Brown’s PCM 1792 or a proprietary chip like the dCS Ring DAC) or a dedicated DSP implements it with varying degrees of success. Methodologies range from the downright crude to the sophisticated.

(Ironically, while much is written about Signal-to-Noise Ratios (SNR), Total Harmonic Distortion (THD), resolution, sampling rates, phase, filters and so on, the quality of the original data and subsequent calculations is often ignored even though audio systems derive their analogue signal from it.)

The plot above shows the waveform of a one-millisecond digitised sample of real sound. This is not a convincing analogue waveform: sound is a continuously varying signal, not a bunch of dots hanging in space. However, CDs offer only a series of discrete samples – how they are to be joined together (upsampled) to create a continuous waveform is left entirely open. In the case shown here, a major assumption has been made: the sample points have been joined together with straight lines.

This is called linear interpolation (y = mx + c, where m is the gradient and c is the intercept). Even if a thousand points were added between adjacent samples (giving a 44.1 MHz sampling rate), the result would be essentially the same. Some CD players and DACs, including high-end ones, use this approach.
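To see why extra points on a straight line add nothing, consider the sketch below. It uses a hypothetical test tone (11,025 Hz at Fs = 44.1 kHz, i.e. four samples per cycle) whose true peak falls midway between two samples; the function names are assumptions, and t is assumed to be non-negative.

```python
import math

def linear_interpolate(samples, t):
    """Value at fractional sample position t on the straight line joining its neighbours."""
    n = int(math.floor(t))
    if n >= len(samples) - 1:  # at or beyond the last sample: nothing to interpolate towards
        return samples[-1]
    frac = t - n
    return samples[n] + (samples[n + 1] - samples[n]) * frac

def tone(t):
    """Hypothetical test tone: four samples per cycle, peak midway between samples 0 and 1."""
    return math.sin(math.pi * t / 2 + math.pi / 4)

samples = [tone(n) for n in range(8)]
estimate = linear_interpolate(samples, 0.5)
print(f"true peak {tone(0.5):.4f}, straight-line estimate {estimate:.4f}")
```

The two neighbouring samples both sit at about 0.707, so every point on the line between them is also about 0.707, however many points are added; the true peak of 1.0 is never reached.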

Early ladder-type DACs hold a constant voltage at the level of the last sample point. When a new sample point arrives, the voltage is adjusted to this new level. (Hence the term ‘ladder’, where voltages are stepped either up or down.) This type of interpolation is a basic form of upsampling. A few CD players and DACs (again, including high-end ones) still use the technique.
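Holding the last value in this way is commonly called a zero-order hold. Expressed as upsampling, it simply repeats each sample until the next one arrives. A minimal sketch (the function name and the sample values are illustrative):

```python
def zero_order_hold(samples, factor):
    """Upsample by repeating each sample `factor` times, giving a stepped output."""
    held = []
    for value in samples:
        held.extend([value] * factor)
    return held

print(zero_order_hold([0, 5, 3], 4))
# [0, 0, 0, 0, 5, 5, 5, 5, 3, 3, 3, 3]
```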

Another interpolation technique is known as sinc interpolation. Using the Secret Rabbit Code (SRC) ‘Best Sinc’ interpolator at 24/96 on the previous 1 ms sample yields:

When overlaid onto the original waveform, it can be seen to be more ‘analogue like’ but it still has difficulties at peaks and troughs. Note the interpolation errors (the deviations of the red line from the curve) when using 44.1 kHz. There are many of these even during this one-millisecond sample. Over a second, the number increases a thousandfold; the audible result of such interpolation errors is best described as ‘digital artefacts’.

SRC remains the preferred upsampler for this application, based as it is on bandlimited interpolation. For a given set of bandlimited samples, this provides a mathematically solid foundation for the most accurate calculation of the original analogue waveform within well-defined error margins.
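The idea behind bandlimited interpolation can be sketched with the textbook sinc-summation formula: the value at any point in time is the sum of one sinc function per input sample. The code below is an illustration only and is not SRC's implementation; it sums over a finite block of samples (so a small truncation error remains), and the test tone and function names are assumptions.

```python
import math

def sinc(x):
    """Normalised sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    if x == 0.0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)

def bandlimited_interpolate(samples, t):
    """Estimate the waveform at fractional sample position t from all input samples."""
    return sum(value * sinc(t - n) for n, value in enumerate(samples))

def tone(t):
    """Hypothetical test tone with four samples per cycle (11,025 Hz at Fs = 44.1 kHz)."""
    return math.sin(math.pi * t / 2 + math.pi / 4)

samples = [tone(n) for n in range(2001)]
t = 1000.5  # midway between two samples, where the true waveform is at a peak
print(f"true {tone(t):.4f}, sinc estimate {bandlimited_interpolate(samples, t):.4f}")
```

Here both neighbouring samples sit at about 0.707, yet the sinc sum recovers a value very close to the true peak of 1.0 and can be evaluated at any t, not only at fixed subdivisions of the sample interval.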

Upsampling with SRC to 24/192 gives:

Clearly, at 24/192 we get more detail at peaks and troughs, bringing us closer to the original analogue waveform. This is preferred because the subsequent method of joining these samples (linear or otherwise, but not ladder) has less impact. Many modern DACs support 24/192 S/PDIF input.

Manufacturers of DACs and upsamplers rarely publish details of the interpolation techniques they employ (of which there are many) but, as a general guide, avoid DACs or upsamplers that:

Operate only at an integer multiple of the original Fs (e.g. 44.1 to 88.2). Correct upsampling provides data at any timing point to give a continuous waveform;
Implement their processing in integer arithmetic – such algorithms give significant rounding errors compared to those using real numbers: e.g. 1/3 in integer processing is rendered as zero whilst, in real numbers, it is 0.33333, i.e. in integer processing the values after the decimal point are lost.
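The rounding problem in the last point is easy to reproduce. The sketch below applies a gain of 1/3 to a few hypothetical low-level sample values, once in integer arithmetic and once in floating point; the function names are assumptions and the example is not drawn from any specific DAC or upsampler.

```python
def scale_integer(samples, numerator, denominator):
    """Apply a gain of numerator/denominator using integer arithmetic only."""
    return [(s * numerator) // denominator for s in samples]

def scale_float(samples, numerator, denominator):
    """Apply the same gain using floating-point arithmetic."""
    return [s * numerator / denominator for s in samples]

quiet = [1, 2, 4, 100]              # hypothetical low-level sample values
print(scale_integer(quiet, 1, 3))   # [0, 0, 1, 33]
print(scale_float(quiet, 1, 3))
```

The smallest values collapse to zero in the integer version, so the loss is proportionally greatest in the quietest parts of the signal.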