5. Digital Image Processing Fundamentals

There’s more to it than meets the eye.
– 19th century proverb
Digital image processing is electronic data processing on a 2-D array of numbers. The array is a numeric representation of an image.

A real image is formed on a sensor when an energy emission strikes the sensor with sufficient intensity to create a sensor output. The energy emission can have numerous possible sources (e.g., acoustic, optic, etc.). When the energy emission is in the form of electromagnetic radiation within the band limits of the human eye, it is called visible light [Banerjee]. Some objects will reflect only electromagnetic radiation. Others produce their own, using a phenomenon called radiancy. Radiancy occurs in an object that has been heated sufficiently to cause it to glow visibly [Resnick]. Visible light images are a special case, yet they appear with great frequency in the image processing literature.

Another source of images includes the synthetic images of computer graphics. These images can provide controls on the illumination and material properties that are generally unavailable in the real image domain.

This chapter reviews some of the basic ideas in digital signal processing. The review includes a summary of some mathematical results that will be of use in Chapter 15. The math review is included here in order to strengthen the discourse on sampling.
5.1. The Human Visual System
A typical human visual system consists of stereo electromagnetic transducers (two eyes) connected to a large number of neurons (the brain). The neurons process the input, using poorly understood emergent properties (the mind). Our discussion will follow the eye, brain and mind ordering, taking views with a selective focus.
The ability of the human eye to perceive the spectral content of light is called color vision. A typical human eye has a spectral response that varies as a function of age and the individual. Using clinical research, the CIE (Commission Internationale de L’Eclairage) created a statistical profile of human vision called the standard observer. The response curves of the standard observer indicate that humans can see light whose wavelengths have the color names red, green and blue. When discussing wavelengths for visible light, we typically give the measurements in nanometers. A nanometer is $10^{-9}$ meters and is abbreviated nm. The wavelengths for the red, green and blue peaks are about 570-645 nm, 526-535 nm, and 444-445 nm. The visible wavelength range (called the mesopic range) is 380 to about 700-770 nm [Netravali] [Cohen].
Fig. 5-1. Sketch of a Human Eye

Fig. 5-1 shows a sketch of a human eye. When dimensions are given, they refer to the typical adult human eye unless otherwise stated. Light passes through the cornea and is focused on the retina by the lens. Physiological theories use biological components to explain behaviour. The optical elements in the eye (cornea, lens and retina) form the primary biological components of a photo sensor. Muscles are used to alter the thickness of the lens and the diameter of the opening in the iris, called the pupil. The pupil diameter typically varies from 2 to 8 mm. Light passing through the lens is focused upon the retina.
The retina contains two types of photo sensor cells: rods and cones. There are 75 to 150 million rod cells in the retina. The rods contain a blue-green absorbing pigment called rhodopsin. Rods are used primarily for night vision (also called the scotopic range) and typically have no role in color vision [Gonzalez and Woods]. Cones are used for daylight vision (called the photopic range). The tristimulus theory of color perception is based upon the existence of three types of cones: red, green and blue. The pigment in the cones is unknown [Hunt]. We do know that the phenomenon called adaptation (a process that permits eyes to alter their sensitivity) occurs because of a change in the pigments in the cones [Netravali].

The retinal cells may also inhibit one another, creating a high-pass filter for image sharpening. This phenomenon is known as lateral inhibition [Mylers]. The current model for the retinal cells shows a cone cell density that ranges from 900 to 160,000 [Gibson]. There are 6 to 7 million cone cells, with the density increasing near the fovea. Further biological examination indicates that the cells are arranged in a noisy hexagonal array [Wehmeier].
Lest one be tempted to count the number of cells in the eye and draw a direct comparison to modern camera equipment, keep in mind that even the fixated eye is constantly moving. One study showed that the eyes perform over 3 fixations per second during a search of a complex scene [Williams]. Furthermore, there is nearly a 180-degree field of view (given two eyes). Finally, the eye-brain interface enables an integration between the sensors’ polar coordinate scans, focus, iris adjustments and the interpretation engine. These interactions are not typical of most artificial image processing systems [Gonzalez and Woods]. Only recently have modern camcorders taken on the role of integrating the focus and exposure adjustment with the sensor.
The optic nerve has approximately 250,000 neurons connecting to the brain. The brain has two components associated with low-level vision operations: the lateral geniculate nucleus and the visual cortex. The cells are modeled using a circuit that has an inhibit input, capacitive-type electrical storage and voltage leaks, all driving a comparator with a variable voltage output. The capacitive storage elements are held accountable for the critical fusion frequency response of the eye. The critical fusion frequency is the rate of display at which individual updates appear as if they are continuous. This frequency ranges from 10-70 Hz depending on the color [Teevan] [Netravali]. At 70 Hz, the 250,000-element optic nerve should carry 17.5 million neural impulses per second. Given the signal-to-noise ratio of a human auditory response system (80 dB), we can estimate that there are 12.8 bits per neural impulse leading to the brain [Shamma]. This gives a bit rate of about 224 Mbps. The data has been pre-processed by the eye before it reaches the optic nerve. This preprocessing includes lateral inhibition between the retinal neurons. Also, we have assumed that there is additive white Gaussian noise on the channel, an assumption that may not be justified.
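A small back-of-the-envelope calculation, using only the figures quoted above, reproduces this estimate (the class and variable names are illustrative):

public class OpticNerveEstimate {
    public static void main(String[] args) {
        double fibers = 250000;          // neurons in the optic nerve
        double fusionHz = 70;            // critical fusion frequency, Hz
        double bitsPerImpulse = 12.8;    // from the 80 dB SNR estimate above
        double impulsesPerSec = fibers * fusionHz;            // 17.5 million
        double bitsPerSec = impulsesPerSec * bitsPerImpulse;  // about 224 Mbps
        System.out.printf("%.1f million impulses/s, %.0f Mbps%n",
                impulsesPerSec / 1e6, bitsPerSec / 1e6);
    }
}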
Physiological study has shown that the response of the cones is given by a Gaussian sensitivity for the cone center and surrounding fields. The overall sensitivity is found by subtracting the surrounding response from the center response. This gives rise to a difference-of-Gaussians expression, which is discussed in Chap. 10. Further, the exponential response curve of the eye is the primary reason why exponential histogram equalization was used in Chap. 4.
5.2. Overview of Image Processing
An image processing system consists of a source of image data, a processing element and a destination for the processed results. The source of image data may be a camera, a scanner, a mathematical equation, statistical data, the Web, a SONAR system, etc. In short, anything able to generate or acquire data that has a two-dimensional structure is considered to be a valid source of image data. Furthermore, the data may change as a function of time.
The processing element is a computer. The computer may be implemented in a number of different ways. For example, the brain may be said to be a kind of biological computer that is able to perform image processing (and do so quite well!). The brain consumes about two teaspoons of sugar per hour and about 20 watts of power. An optical element can be used to perform computation and does so at the speed of light (and with very little power). This is a fascinating topic of current research [Fietelson]. In fact, the injection of optical computing elements can directly produce information about the range of objects in a scene [DeWitt and Lyon]. Such computing elements are beyond the scope of this book. The only type of computer that we will discuss in this book is the digital computer. However, it is interesting to combine hybrid optical and digital computing. Such an area of endeavor lies in the field of photonics.

The output of the processing may be a display, created for the human visual system. Output can also be to any stream. In Java, a stream is defined as an uninterpreted sequence of bytes. Thus, the output may not be image data at all. For example, the output can be a histogram, a global average, etc. As the output of the program renders a higher level of interpretation, we cross the fuzzy line from image processing into the field of vision. As an example, consider that image processing is used to edge detect an image of coins on a table. Computer vision is used to tell how much money is there. Thus, computer vision will often make use of image processing as a sub-task.
5.2.1. Digitizing a Signal
Digitizing is a process that acquires quantized samples of continuous signals. The signals represent an encoding of some data. For example, a microphone is a pressure transducer that produces an electrical signal. The electrical signal represents acoustic pressure waves (sound).

The term analog refers to a signal that has a continuously varying pattern of intensity. The term digital means that the data takes on discrete values. Let s(t) be a continuous signal. Then, by definition of continuous,

$\lim_{t \to a} s(t) = s(a), \quad a \in \mathbb{R}$   (5.1)

We use the symbol R to denote the set of real numbers. Thus $\mathbb{R} = \{x : x \text{ is a real number}\}$, which says that R is the set of all x such that x is a real number. We read (5.1) as saying that, in the limit, as t approaches a, such that a is a member of the set of real numbers, s(t) approaches s(a). The expression $\{x : P(x)\}$ is read as “the set of all x’s such that P(x) is true” [Moore 64]. This is an iff (i.e., if and only if) condition. Thus, the converse must also be true. That is, s(t) is not continuous iff there exists a value, a, such that

$\lim_{t \to a} s(t) \ne s(a)$   (5.2)

is true. For example, if s(t) has multiple values at a, then the limit does not exist at a.
The analog-to-digital conversion consists of a sampler and a quantizer. The quantization is typically performed by dividing the signal into several uniform steps. This has the effect of introducing quantization noise. Quantization noise is given, in dB, using

$SNR_{dB} \le 6.02\,b + 4.8$   (5.3)

where SNR is the signal-to-noise ratio and b is the number of bits. To prove (5.3), we follow [Moore] and assume that the input signal ranges from -1 to 1 volts. That is,

$-1 \le s(t) \le 1$   (5.3a)

Note that the number of quantization intervals is $2^b$. The least significant bit has a quantization size of $\Delta = 2/2^b = 2^{1-b}$. Following [Mitra], we obtain the bound on the size of the error with:

$|e| \le \Delta/2 = 2^{-b}$   (5.3b)

The variance of a random variable, X, is found by $\sigma^2 = \int (x - \mu)^2\,p(x)\,dx$, where p(x) is a probability distribution function. For the signal whose average is zero, the variance of the error bounded by (5.3b) is

$\sigma_e^2 = \Delta^2/12 = 2^{-2b}/3$   (5.3c)

Since the signal is confined to [-1, 1] and has zero mean, its power is at most 1, so the signal-to-noise ratio for the quantization power is

$SNR = \sigma_s^2/\sigma_e^2 \le \frac{1}{2^{-2b}/3} = 3 \cdot 2^{2b}$   (5.3d)

Hence, converting to dB, the upper bound for the signal-to-quantization noise power is (5.3). Q.E.D.
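As a numerical check of the roughly 6 dB-per-bit behavior derived above, the following sketch (class and variable names are illustrative) quantizes a zero-mean signal uniformly distributed on [-1, 1] with a b-bit uniform quantizer and measures the resulting SNR:

import java.util.Random;

public class QuantizationSnr {
    public static void main(String[] args) {
        Random rng = new Random(42);
        int n = 100000;                          // number of test samples
        for (int b = 4; b <= 12; b += 2) {
            double step = 2.0 / (1 << b);        // LSB size for b bits on [-1, 1]
            double sigPow = 0, errPow = 0;
            for (int i = 0; i < n; i++) {
                double s = 2 * rng.nextDouble() - 1;               // zero-mean test signal
                double q = Math.floor(s / step) * step + step / 2;  // center of the containing step
                sigPow += s * s;
                errPow += (s - q) * (s - q);
            }
            double snrDb = 10 * Math.log10(sigPow / errPow);
            System.out.printf("b=%2d  measured SNR=%5.1f dB  (6.02b = %5.1f dB)%n",
                    b, snrDb, 6.02 * b);
        }
    }
}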
In the above proof we assumed that uniform steps were used over a signal whose average value is zero. In fact, a digitizer does not have to requantize an image so that steps are uniform. An in-depth examination of the effects of non-linear quantization on SNR is given in [Gersho]. Following Gersho, we generalize the result of (5.3), defining the SNR as

$SNR = 10\log_{10}\frac{\sigma^2}{D}$   (5.3e)

where $\sigma^2$ is the signal variance and D is the mean-square distortion, defined by the inner product between the square of the quantization error for value x and the probability of value x. The inner product between e and p is given by

$D = \langle e, p \rangle = \int_{-\infty}^{\infty} e(x)\,p(x)\,dx$   (5.3f)

where

$e(x) = [x - Q(x)]^2$   (5.3g)

The inner product is an important tool in transform theory. We will expand our discussion of the inner product when we touch upon the topic of sampling.

We define Q(x) as the quantized value for x. Maximizing SNR requires that we select the quantizer to minimize (5.3f), given a priori knowledge of the PDF (if the PDF is available). Recall that for an image, we compute the PMF (using the Histogram class) as well as the CMF. As we shall see later, (5.3f) is minimized for k-level thresholding (an intensity reduction to k colors) when the regions of the CMF are divided into k sections. The color is then remapped into the center of each of the CMF regions. Hence (5.3f) provides a mathematical basis for reducing the number of colors in an image, provided that the PDF is of zero mean (i.e., no DC offset) and has even symmetry about zero. That is, $p(-x) = p(x)$. Also, we assume that the quantizer has odd symmetry about zero, i.e., $Q(-x) = -Q(x)$.
A simple zero-memory 4-point quantizer inputs 4 decision levels and outputs 4 corresponding values for input values that range within the 4 decision levels. When the decision levels are placed into an array of double precision numbers, in Java (for the 256 gray-scale values) we write:

public void thresh4(double d[]) {
    // Build a 256-entry look-up table from the four decision levels.
    short lut[] = new short[256];
    for (int i = 0; i < lut.length; i++) {
        if (i < d[0]) lut[i] = 0;
        else if (i < d[1]) lut[i] = (short) d[0];
        else if (i < d[2]) lut[i] = (short) d[1];
        else if (i < d[3]) lut[i] = (short) d[2];
        else lut[i] = 255;
        System.out.println(lut[i]);   // print the table as it is built
    }
}
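As a usage sketch (the decision levels below are illustrative values, not taken from the text), the quantizer might be invoked as follows:

double d[] = {64, 128, 192, 255};   // hypothetical decision levels
thresh4(d);                         // prints the resulting 256-entry look-up table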
We shall revisit quantization in Section 5.2.2.

Using the Java AWT’s Image class, we have seen that 32 bits are used per pixel (red, green, blue and alpha). Only 24 bits are used for color, however. Section 5.2.2 shows how this relates to the software of this book.

Recall also that the digitization process led to sampling an analog signal. Sampling a signal alters the harmonic content (also known as the spectrum) of the signal. Sampling a continuous signal may be performed with a pre-filter and a switch. Fig. 5-2 shows a continuous function, f(t), being sampled at a frequency of $f_s = 1/T$.
Fig. 5-2. Sampling System
The switch in Fig. 5-2 is like a binary amplifier that is being turned on and off every T seconds. It multiplies f(t) by an amplification factor of zero or one. Mathematically, sampling is expressed as a pulse train, $s_T(t)$, multiplied by the input signal f(t); i.e., sampling is $f_s(t) = f(t)\,s_T(t)$. To discuss the pulse train mathematically, we must introduce the notation for an impulse. The unit impulse, or Dirac delta, is a generalized function that is defined by

$\int_{-\epsilon}^{\epsilon} \delta(t)\,dt = 1, \qquad \delta(t) = 0 \ \text{for} \ t \ne 0$   (5.4)

where $\epsilon$ is arbitrarily small. The Dirac delta has unit area about a small neighborhood located at $t = 0$. Multiply the Dirac delta by a function and integrate, and it will sift out the value of the function at the point where the delta’s argument is zero:

$\int_{-\infty}^{\infty} f(t)\,\delta(t - t_0)\,dt = f(t_0)$   (5.5)

This is called the sifting property of the Dirac delta.
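A small numerical sketch (the test function and values are illustrative) approximates the Dirac delta by a narrow rectangular pulse and shows that the integral in (5.5) returns the value of the function at the offset:

public class SiftingDemo {
    public static void main(String[] args) {
        double t0 = 2.0;        // offset where the delta is centered
        double eps = 1e-4;      // width of the rectangular pulse approximating delta
        double dt = 1e-6;       // integration step
        double sum = 0;
        for (double t = t0 - eps / 2; t < t0 + eps / 2; t += dt) {
            sum += f(t) * (1.0 / eps) * dt;   // f(t) * delta(t - t0) dt
        }
        System.out.println("integral = " + sum + ",  f(t0) = " + f(t0));
    }
    static double f(double t) { return Math.sin(t) + t * t; }
}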
In fact, the Dirac delta is equal to zero whenever its argument is non-zero. To make the Dirac delta activate at a value other than zero, we bias the argument with an offset, $t_0$, as in (5.5). A pulse train is created by adding an infinite number of Dirac deltas together:

$s_T(t) = \sum_{n=-\infty}^{\infty} \delta(t - nT)$   (5.6)

Multiplying (5.6) by the input signal gives the sampled signal

$f_s(t) = f(t)\,s_T(t) = \sum_{n=-\infty}^{\infty} f(nT)\,\delta(t - nT)$   (5.7)
To find the spectra of (5.7) requires that we perform a Fourier transform. The Fourier transform, just like any transform, performs a correlation between a function and a kernel. The kernel of a transform typically consists of an orthogonal basis about which the reconstruction of a waveform may occur. Two functions are orthogonal if their inner product $\langle f, g \rangle = 0$. Recall that the inner product is given by

$\langle f, g \rangle = \int_{-\infty}^{\infty} f(t)\,g^*(t)\,dt$   (5.7a)
From linear algebra, we recall that a collection of linearly independent functions forms a basis if every value in the set of all possible values may be expressed as a linear combination of the basis set. Functions are linearly independent iff the only linear combination of them that sums to zero is the one whose coefficients are all zero. Conversely, functions are linearly dependent iff there exists a set of coefficients, not all zero, for which the summation is zero. For example:

$c_1 f_1(x) + c_2 f_2(x) + \cdots + c_n f_n(x) = 0, \quad \text{with some } c_i \ne 0$   (5.7b)

The ability to sum a series of sine and cosine functions together to create an arbitrary function is known as the superposition principle and applies only to periodic waveforms. This was discovered in the 1800’s by Jean Baptiste Joseph de Fourier [Halliday] and is expressed as a summation of sines and cosines, with constants that are called Fourier coefficients:

$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty}\left[a_n \cos(nx) + b_n \sin(nx)\right]$   (5.7c)
We note that (5.7c) shows that the periodic signal has discrete spectral components. We find the Fourier coefficients by taking the inner product of the function, f(x), with the basis functions, sine and cosine. That is:

$a_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\cos(nx)\,dx, \qquad b_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\sin(nx)\,dx$   (5.7d)

For an elementary introduction to linear algebra, see [Anton]. For a concise summary, see [Stollnitz]. For an alternative derivation, see [Lyon and Rao].
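As an illustration of (5.7d), the following sketch (the square-wave test function is an assumption made for the example) forms the inner products numerically and recovers the familiar coefficients of a square wave:

public class FourierCoefficients {
    // Square wave on [-pi, pi): -1 for negative x, +1 otherwise.
    static double f(double x) { return (x >= 0) ? 1 : -1; }

    public static void main(String[] args) {
        int steps = 10000;
        double dx = 2 * Math.PI / steps;
        for (int n = 1; n <= 5; n++) {
            double an = 0, bn = 0;
            for (int i = 0; i < steps; i++) {
                double x = -Math.PI + i * dx;
                an += f(x) * Math.cos(n * x) * dx / Math.PI;  // inner product with cos(nx)
                bn += f(x) * Math.sin(n * x) * dx / Math.PI;  // inner product with sin(nx)
            }
            System.out.printf("n=%d  a_n=%6.3f  b_n=%6.3f%n", n, an, bn);
        }
    }
}

The odd-harmonic b_n values come out near 4/(n*pi), while the a_n and even-harmonic terms are close to zero.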
It is also possible to approximate an aperiodic waveform. This is done with the Fourier transform. The Fourier transform uses sine and cosine as the basis functions to form the inner product, as seen in (5.7a):

$F(\omega) = \int_{-\infty}^{\infty} f(t)\,e^{-j\omega t}\,dt$   (5.8)

By Euler’s identity,

$e^{j\theta} = \cos\theta + j\sin\theta$   (5.9)

we see that the sine and cosine basis functions are separated by being placed on the real and imaginary axes. Substituting (5.7) into (5.8) yields

$F_s(\omega) = \frac{1}{2\pi}\,F(\omega) * S_T(\omega)$   (5.10)

where

$S_T(\omega) = \int_{-\infty}^{\infty} s_T(t)\,e^{-j\omega t}\,dt$   (5.11)

The term

$f(t) * g(t) = \int_{-\infty}^{\infty} f(\tau)\,g(t - \tau)\,d\tau$   (5.12)

defines a convolution. We can write (5.10) because multiplication in the time domain is equivalent to convolution in the frequency domain. This is known as the convolution theorem. Taking the Fourier transform of the convolution between two functions in the time domain results in

$\mathcal{F}\{f(t) * g(t)\}$   (5.13)

which is expanded by (5.8) to yield:

$\mathcal{F}\{f * g\} = \int_{-\infty}^{\infty}\left[\int_{-\infty}^{\infty} f(\tau)\,g(t - \tau)\,d\tau\right] e^{-j\omega t}\,dt$   (5.13a)

Changing the order of integration in (5.13a) yields

$\mathcal{F}\{f * g\} = \int_{-\infty}^{\infty} f(\tau)\left[\int_{-\infty}^{\infty} g(t - \tau)\,e^{-j\omega t}\,dt\right] d\tau$   (5.13b)

With

$\int_{-\infty}^{\infty} g(t - \tau)\,e^{-j\omega t}\,dt = e^{-j\omega\tau}\,G(\omega)$   (5.13c)

and

$\int_{-\infty}^{\infty} f(\tau)\,e^{-j\omega\tau}\,d\tau = F(\omega)$   (5.13d)

we get

$\mathcal{F}\{f(t) * g(t)\} = F(\omega)\,G(\omega)$   (5.14)

This shows that convolution in the time domain is multiplication in the frequency domain. We can also show that convolution in the frequency domain is equal to multiplication in the time domain. See [Carlson] for an alternative proof.
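For concreteness, here is a direct discrete analogue of the convolution integral in (5.12); it is a sketch for illustration, not a routine from this book’s library:

public class ConvolveDemo {
    // Discrete analogue of (5.12): (f * g)[n] = sum over k of f[k] g[n - k].
    static double[] convolve(double[] f, double[] g) {
        double[] out = new double[f.length + g.length - 1];
        for (int n = 0; n < out.length; n++)
            for (int k = 0; k < f.length; k++) {
                int m = n - k;
                if (m >= 0 && m < g.length) out[n] += f[k] * g[m];
            }
        return out;
    }
    public static void main(String[] args) {
        double[] f = {1, 2, 3};
        double[] g = {0.5, 0.5};   // simple two-point averaging kernel
        for (double v : convolve(f, g)) System.out.print(v + " ");
        System.out.println();
    }
}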
As a result of the convolution theorem, the Fourier transform of an impulse train is also an impulse train,

$S_T(\omega) = \frac{2\pi}{T}\sum_{n=-\infty}^{\infty} \delta(\omega - n\omega_s), \qquad \omega_s = 2\pi f_s = \frac{2\pi}{T}$   (5.15)

Finally, we see that sampling a signal at a rate of $f_s = 1/T$ causes the spectrum to be reproduced at $f_s$ intervals:

$F_s(\omega) = \frac{1}{T}\sum_{n=-\infty}^{\infty} F(\omega - n\omega_s)$   (5.16)

(5.16) demonstrates the reason why a band-limiting filter is needed before the switching function of Fig. 5-2. This leads directly to the sampling theorem, which states that a band-limited signal may be reconstructed without error if the sample rate is at least twice the bandwidth. Such a sample rate is called the Nyquist rate and is given by $f_s = 2 f_{max}$.
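The aliasing predicted by (5.16) is easy to see numerically. In the sketch below (the frequencies are chosen only for illustration), a 5 Hz sine sampled at 4 Hz, well below its 10 Hz Nyquist rate, produces exactly the same samples as a 1 Hz sine:

public class AliasDemo {
    public static void main(String[] args) {
        double fs = 4.0;   // sample rate, Hz (below the 10 Hz Nyquist rate for a 5 Hz sine)
        for (int n = 0; n < 8; n++) {
            double t = n / fs;
            double s5 = Math.sin(2 * Math.PI * 5 * t);   // 5 Hz signal
            double s1 = Math.sin(2 * Math.PI * 1 * t);   // 1 Hz alias
            System.out.printf("t=%.2f  5 Hz: %6.3f   1 Hz: %6.3f%n", t, s5, s1);
        }
    }
}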
5.2.2. Image Digitization
Typically, a camera is used to digitize an image. Modern CCD cameras have photodiodes arranged in a rectangular array. Flat-bed scanners use a movable platen and a linear array of photodiodes to perform the two-dimensional digitization.

Older tube-type cameras used a wide variety of materials on a photosensitive surface. The materials vary in sensitivity and output. See [Galbiati] for a more detailed description of tube cameras.
The key point about digitizing an image in two dimensions is that we are able to detect both the power of the incident energy and its direction.

The process of digitizing an image is described by the amount of spatial resolution and the signal-to-noise ratio (i.e., number of bits per pixel) that the digitizer has. Often the number of bits per pixel is limited by performing a thresholding.
Thresholding (a topic treated more thoroughly in Chap. 10) reduces the number of color values available in an image. This simulates the effect of having fewer bits per pixel available for display. There are several techniques available for thresholding. For the grayscale image, one may use the cumulative mass function for the probability of a gray value to create a new look-up table. Another approach is simply to divide the look-up table into uniform sections. Fig. 5-3 shows the mandrill before and after the thresholding operation. The decision about when to increment the color value was made based on the CMF of the image. The numbers of bits per pixel (bpp), shown in Fig. 5-3, ranging from left to right, top to bottom, are: 1 bpp, 2 bpp, 3 bpp and 8 bpp. Keep in mind that at a bit rate of 28 kbps (the rate of a modest Internet connection over a phone line), the 8 bpp image (128x128) will take about 4 seconds to download. Compare this to the uncompressed 1 bpp image, which will take about 0.5 seconds to download. Also note that the signal-to-noise ratio for these images ranges from 10 dB to 52 dB.
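As a rough check of the download times quoted above (the figures in the text are rounded), one might compute:

// Approximate download time for an uncompressed 128x128 image over a 28 kbps link.
int w = 128, h = 128, linkBps = 28000;
for (int bpp : new int[] {1, 2, 3, 8}) {
    double seconds = (double) (w * h * bpp) / linkBps;
    System.out.printf("%d bpp: about %.1f s%n", bpp, seconds);
}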
Fig. 5-3. Quantizing with Fewer Bits Per Pixel
The code snippet below allows the cumulative mass function of the image to bias decisions about when to increment the color value. The input to the code is the number of gray values, k. There are several methods to perform the quantization. The one shown in Fig. 5-3 is useful in edge detection (a topic covered in Chap. 10). The kgreyThresh method follows:
public void kgreyThresh(double k) {
    // Build a histogram of the red channel and get its cumulative mass function.
    Histogram rh = new Histogram(r, "red");
    double cmf[] = rh.getCMF();
    TransformTable tt = new TransformTable(cmf.length);
    short lut[] = tt.getLut();
    int q = 1;     // index of the current CMF region
    short v = 0;   // current output gray value
    short dv = (short) (255 / k);
    // Step the output value each time the CMF crosses the next q/k boundary.
    for (int i = 0; i < lut.length; i++) {
        if (cmf[i] > q / k) {
            v += dv;
            q++;
            if (q == k) v = 255;
        }
        lut[i] = v;
    }
    tt.print();
}
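As a usage sketch (the argument is illustrative), reducing an image to four gray levels, i.e., 2 bits per pixel, might be done with:

kgreyThresh(4);   // build and print a look-up table with 4 gray levels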
5.2.3. Image Display
One display device that has come into common use is the cathode-ray tube (CRT). The cathode-ray tube displays an image using three additive colors: red, green and blue. These colors are emitted using phosphors that are stimulated with a flow of electrons. Different phosphors have different colors (spectral radiance).

There are three kinds of television systems in the world today: NTSC, PAL and SECAM. NTSC, which stands for National Television System Committee, is used in North America and Japan. PAL stands for phase alternating line and is used in parts of Europe, Asia, South America and Africa. SECAM stands for Séquentiel Couleur à Mémoire (sequential chrominance signal and memory) and is used in France, Eastern Europe and Russia. The gamut of colors and the reference color known as white (called white balance) are different on each of the systems.
Another type of display held in common use is the computer monitor. Factors that afflict all displays include: ambient light, brightness (black level) and contrast (picture). There are also phosphor chromaticity differences between different CRTs. These alter the color gamut that may be displayed. Manufacturers’ products are sometimes adopted as a standard for the color gamut to be displayed by all monitors. For example, one U.S. manufacturer, Conrac, had a phosphor that was adopted by SMPTE (Society of Motion Picture and Television Engineers) as the basis for the SMPTE C phosphors.
The CRTs have a transfer function like that of (4.14). Assuming the value v ranges from zero to one, the displayed intensity is

$I(v) = v^{\gamma}$

Typically, the exponent is termed the gamma of a monitor and has a value of about 2.2 [Blinn]. As Blinn points out, for a gamma of 2, only 194 values appear in a look-up table of 256 values. His suggestion that 16 bits per color might be enough to perform image processing has been taken to heart, and this becomes another compelling reason to use the Java short for storing image values. Thus, the image processing software in this book does all its image processing as if intensity were linearly related to the value of a pixel. With the storage of 48 bits per pixel (16 bits each for red, green and blue) versus the Java AWT model of 24 bits (8 bits each for red, green and blue), we have increased the signal-to-noise ratio of our image representation by about 48 dB per color. So far, we have not made good use of this extra bandwidth, but it is nice to know that it is there if we need it.
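To make the gamma discussion concrete, the following sketch (the gamma value and class name are illustrative) builds a 256-entry gamma-correction look-up table and counts how many of the 256 output codes actually appear, which is the effect Blinn describes:

public class GammaLut {
    public static void main(String[] args) {
        double gamma = 2.2;                     // assumed display gamma
        short lut[] = new short[256];
        boolean used[] = new boolean[256];
        for (int i = 0; i < 256; i++) {
            double v = i / 255.0;               // normalized pixel value
            // pre-correct for the display's v^gamma response
            lut[i] = (short) Math.round(255 * Math.pow(v, 1.0 / gamma));
            used[lut[i]] = true;
        }
        int count = 0;
        for (int i = 0; i < 256; i++) if (used[i]) count++;
        System.out.println("distinct output codes used: " + count);
    }
}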