Intro

Computer vision, a core area of artificial intelligence, involves developing solutions that allow AI applications to interpret and understand the visual world. Although computers lack biological eyes, they can process images from various sources such as cameras or digital media. This capability enables the creation of software that mimics human visual perception.

Image processing

To a computer, an image is essentially a grid of numeric pixel values. In the case of a 7x7 pixel image, this translates into a 7x7 array. Each pixel in this array holds a value ranging from 0 (representing black) to 255 (indicating white), with values in between representing various shades of gray. Thus, the image is visualized as a matrix of grayscale intensity levels, where each element corresponds to a specific pixel’s brightness.


The pixel values for this image are organized into a two-dimensional array of rows and columns (equivalently, y and x coordinates) that defines a rectangular grid. A single-layer array like this represents a grayscale image. In practice, however, most digital images are in color and comprise three such layers, known as channels, which hold the red, green, and blue (RGB) intensities.
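As a rough sketch of the two layouts described above, the snippet below builds a 7x7 grayscale grid and a three-channel color image of the same size. It uses plain Python lists so that it runs without extra libraries; in practice an array library such as NumPy would typically be used. The `blank` helper and the particular pixel values (a white square on black, a pure red image) are illustrative assumptions.

```python
def blank(rows, cols, value=0):
    """Build a rows x cols grid filled with a single pixel value."""
    return [[value] * cols for _ in range(rows)]

# Grayscale: one 7x7 layer, 0 = black, 255 = white.
gray = blank(7, 7)
for r in range(2, 5):
    for c in range(2, 5):
        gray[r][c] = 255      # white 3x3 square in the center

# Color: three 7x7 layers (channels), one each for red, green, and blue.
# Full red with no green or blue makes every pixel pure red.
rgb = [blank(7, 7, 255), blank(7, 7), blank(7, 7)]

print(len(gray), len(gray[0]))                 # 7 7
print(len(rgb), len(rgb[0]), len(rgb[0][0]))   # 3 7 7
```

Storing the channels as three separate layers mirrors the description in the text; many libraries instead store the three channel values together for each pixel.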

A common way to perform image processing tasks is to apply filters, which alter the pixel values of an image to produce a specific visual effect. A filter is defined by one or more arrays of weight values known as filter kernels. For instance, you might define a filter with the following 3x3 kernel:

$$
\begin{bmatrix}
-1 & -1 & -1 \\
-1 & 8 & -1 \\
-1 & -1 & -1
\end{bmatrix}
$$

The kernel is convolved across the image: for every 3x3 patch of pixels, a weighted sum is computed and assigned to the corresponding position in a new image. To understand the filtering process, let’s walk through an example step by step, starting with this 7x7 grayscale image:

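$$
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 255 & 255 & 255 & 0 & 0 \\
0 & 0 & 255 & 255 & 255 & 0 & 0 \\
0 & 0 & 255 & 255 & 255 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}
$$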

Initially, we apply the filter kernel to the top-left portion of the image. This involves multiplying each pixel value by the corresponding weight value in the kernel and summing up the results.

$$
(0 \times -1) + (0 \times -1) + (0 \times -1) + (0 \times -1) + (0 \times 8) + (0 \times -1) + (0 \times -1) + (0 \times -1) + (255 \times -1) = -255
$$

The resulting value (-255) is then assigned as the first value in a new array. Next, we shift the filter kernel one pixel to the right and repeat the process.
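The first two steps can be checked with a few lines of code. This is a minimal sketch that assumes the kernel has -1 in every position except an 8 in the center, and that the image is a 7x7 black grid with a white (255) 3x3 square in the middle; the helper name `weighted_sum` is made up for illustration.

```python
# 3x3 kernel assumed from the example: -1 everywhere, 8 in the center.
kernel = [[-1, -1, -1],
          [-1,  8, -1],
          [-1, -1, -1]]

# Assumed 7x7 image: black (0) with a white (255) 3x3 square in the middle.
image = [[255 if 2 <= r <= 4 and 2 <= c <= 4 else 0 for c in range(7)]
         for r in range(7)]

def weighted_sum(image, kernel, top, left):
    """Multiply the 3x3 patch whose top-left corner is (top, left) by the
    kernel, element by element, and add up the nine products."""
    return sum(image[top + i][left + j] * kernel[i][j]
               for i in range(3) for j in range(3))

first = weighted_sum(image, kernel, 0, 0)    # top-left patch
second = weighted_sum(image, kernel, 0, 1)   # shifted one pixel to the right
print(first, second)  # -255 -510
```

The first patch contains a single white pixel in its bottom-right corner, which gives -255. After shifting right, the patch contains two white pixels in its bottom row, so the sum becomes -510.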

The filter is applied across the entire image, generating a new array of values. Some of these values might fall outside the pixel value range of 0 to 255, so they are clipped to fit within that range. Because a 3x3 kernel cannot be centered on the pixels at the outer edge of the image, those positions are not calculated and a padding value (typically 0) is used instead. The resulting array represents a transformed image in which the filter has emphasized the edges of shapes present in the original image.
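The whole procedure (slide the kernel, clip the results, pad the border) might be sketched as follows. The function name `apply_filter` is hypothetical, and the code reads the padding step as "fill the uncomputed border of the output with the padding value"; padding the input image before filtering is another common convention. The kernel and image are the same assumed values as in the worked example.

```python
def apply_filter(image, kernel, pad_value=0):
    """Slide a square kernel over a grayscale image (a list of rows).

    Interior pixels get the weighted sum of their neighborhood, clipped
    to 0..255. Border pixels, where the kernel does not fit, keep pad_value.
    """
    rows, cols = len(image), len(image[0])
    size = len(kernel)
    half = size // 2
    output = [[pad_value] * cols for _ in range(rows)]
    for r in range(half, rows - half):
        for c in range(half, cols - half):
            total = sum(image[r - half + i][c - half + j] * kernel[i][j]
                        for i in range(size) for j in range(size))
            output[r][c] = min(255, max(0, total))   # clip to the valid range
    return output

kernel = [[-1, -1, -1],
          [-1,  8, -1],
          [-1, -1, -1]]

# Assumed 7x7 image: black with a white 3x3 square in the middle.
image = [[255 if 2 <= r <= 4 and 2 <= c <= 4 else 0 for c in range(7)]
         for r in range(7)]

filtered = apply_filter(image, kernel)
for row in filtered:
    print(row)
```

For this image, the output keeps the outline of the white square at 255 while its center pixel and the background become 0, which is the edge-emphasizing effect described above.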


Machine learning for computer vision

Using filters to apply effects to images is useful in image processing tasks, such as those performed by image editing software. In computer vision, however, the goal is often to extract meaning, or at least actionable insights, from images. This typically requires machine learning models that are trained to recognize features from large volumes of existing images.

Convolutional neural networks (CNNs)

One of the most common machine learning architectures for computer vision is the convolutional neural network (CNN). CNNs use filters to extract numerical feature maps from images, which are then fed into a deep learning model to predict labels. In an image classification task, the label represents the main subject of the image (in other words, what is this an image of?). For example, you could train a CNN model with images of different types of fruit (such as apples, bananas, and oranges) so that it predicts the type of fruit in a given image.

When training of a CNN begins, the filter kernels are initialized with randomly generated weight values. As training progresses, the model’s predictions are compared against known label values, and the filter weights are adjusted to improve accuracy. Eventually, the trained fruit classification model uses the filter weights that best extract the features that help identify different types of fruit.


  1. The network is trained using images with known labels (e.g., 0: apple, 1: banana, or 2: orange).
  2. One or more layers of filters extract features from each image as it is fed through the network. The filter kernels start with randomly initialized weights and produce arrays of numeric values known as feature maps.
  3. The feature maps are flattened into a one-dimensional array of feature values.
  4. The feature values are fed into a fully connected neural network.
  5. The output layer of the neural network uses a softmax or similar function to produce a result that contains a probability value for each possible class, such as [0.2, 0.5, 0.3].
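To make steps 2 to 5 concrete, here is a deliberately tiny forward pass written with only the Python standard library. It is a sketch under several assumptions rather than a real CNN: there are two 3x3 filters, a ReLU activation after the filters (not mentioned in the text), a single fully connected layer, and untrained random weights, so the probabilities it prints carry no meaning. All function and variable names are illustrative.

```python
import math
import random

random.seed(0)  # reproducible "random" starting weights

def feature_map(image, kernel):
    """Slide the kernel over the image (no padding) and apply ReLU."""
    k = len(kernel)
    return [[max(0.0, sum(image[r + i][c + j] * kernel[i][j]
                          for i in range(k) for j in range(k)))
             for c in range(len(image[0]) - k + 1)]
            for r in range(len(image) - k + 1)]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Input: a 7x7 grayscale image with pixel values scaled to the range 0..1.
image = [[1.0 if 2 <= r <= 4 and 2 <= c <= 4 else 0.0 for c in range(7)]
         for r in range(7)]

# Step 2: filters with randomly initialized weights produce feature maps.
kernels = [[[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
           for _ in range(2)]
feature_maps = [feature_map(image, kernel) for kernel in kernels]  # two 5x5 maps

# Step 3: flatten the feature maps into a one-dimensional list.
features = [value for fmap in feature_maps for row in fmap for value in row]

# Step 4: one fully connected layer, with one row of weights per class.
num_classes = 3  # 0: apple, 1: banana, 2: orange
weights = [[random.uniform(-0.1, 0.1) for _ in features]
           for _ in range(num_classes)]
scores = [sum(w * f for w, f in zip(row, features)) for row in weights]

# Step 5: softmax converts the scores into class probabilities.
probabilities = softmax(scores)
print(len(features), probabilities)
```

Training would then adjust both `kernels` and `weights`; this sketch only shows how data flows through the layers.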

During training, the predicted probabilities are compared to the actual class label. For instance, if an image of a banana belongs to class 1, the expected output should be [0.0, 1.0, 0.0]. The difference between the predicted and actual class scores is used to calculate the model’s loss, and the weights in the fully connected neural network and the filter kernels in the feature extraction layers are then adjusted to reduce that loss.
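The text does not name a specific loss function. Cross-entropy is a common choice for classification and is assumed in the sketch below, which scores the example prediction [0.2, 0.5, 0.3] against the expected output [0.0, 1.0, 0.0].

```python
import math

def cross_entropy(predicted, expected):
    """Negative log of the probability assigned to the true class
    (expected is a one-hot list such as [0.0, 1.0, 0.0])."""
    return -sum(e * math.log(p) for p, e in zip(predicted, expected) if e > 0)

predicted = [0.2, 0.5, 0.3]   # model output for a banana image
expected = [0.0, 1.0, 0.0]    # banana is class 1

loss = cross_entropy(predicted, expected)
print(round(loss, 4))  # 0.6931
```

The loss shrinks toward 0 as the probability for the correct class approaches 1, and grows as that probability drops, which is what makes it usable as a training signal.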

This training process repeats over multiple epochs until the model has learned a well-performing set of weights. The weights are then saved, and the model can be used to predict labels for new images whose classification is unknown.
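The epoch loop can be illustrated with a much smaller stand-in for a CNN: a single softmax layer trained by gradient descent on made-up feature vectors. The optimizer, learning rate, epoch count, and data below are all assumptions chosen for illustration. In a real CNN the same loss would also be propagated back to the filter kernels, which this sketch omits.

```python
import math
import random

random.seed(1)

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical flattened feature vectors paired with known class labels.
samples = [([1.0, 0.0], 0), ([0.9, 0.1], 0),
           ([0.0, 1.0], 1), ([0.1, 0.9], 1),
           ([1.0, 1.0], 2), ([0.9, 1.1], 2)]
num_classes, num_features = 3, 2

# Weights start out random, as described for the filter kernels.
weights = [[random.uniform(-0.1, 0.1) for _ in range(num_features)]
           for _ in range(num_classes)]
biases = [0.0] * num_classes
learning_rate = 0.5

def predict(features):
    scores = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)

losses = []
for epoch in range(200):
    total_loss = 0.0
    for features, label in samples:
        probs = predict(features)
        total_loss += -math.log(probs[label])          # cross-entropy loss
        for k in range(num_classes):
            # Gradient of the loss with respect to the score of class k.
            error = probs[k] - (1.0 if k == label else 0.0)
            for j in range(num_features):
                weights[k][j] -= learning_rate * error * features[j]
            biases[k] -= learning_rate * error
    losses.append(total_loss / len(samples))

print(f"mean loss: epoch 1 = {losses[0]:.3f}, epoch 200 = {losses[-1]:.3f}")
```

Once the loop finishes, `weights` and `biases` are the values that would be saved and reused by `predict` for new, unlabeled inputs.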

Transformers and multi-modal models

CNNs have been at the core of computer vision solutions for many years. Although they are most commonly used for image classification, they also serve as the foundation for more complex computer vision models. For instance, object detection models combine CNN feature extraction layers with the identification of regions of interest in images, which makes it possible to locate multiple classes of object within a single image.

Transformers

Most advances in computer vision over the years have come from improvements in CNN-based models. However, in another field of AI, natural language processing (NLP), a different neural network architecture known as the transformer has enabled the creation of sophisticated language models. Transformers work by processing very large volumes of data and encoding language tokens (representing individual words or phrases) as vector-based embeddings (arrays of numeric values). Each embedding can be thought of as a set of dimensions, with each dimension representing some semantic attribute of the token. The embeddings are constructed so that tokens commonly used in the same context are closer together in the embedding space than unrelated tokens.

For instance, consider a simple illustration in which words are encoded as three-dimensional vectors and plotted in a 3D space.

Tokens with similar meanings are encoded closely together, forming a semantic language model. This enables the development of advanced NLP solutions for tasks such as text analysis, translation, language generation, and more.
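As a rough illustration of "closer together", the sketch below measures the straight-line distance between three-dimensional embeddings. The words and vector values are invented for the example and do not come from any trained model. Real embeddings have far more dimensions, and cosine similarity is another common way to compare them.

```python
import math

# Hypothetical 3-D embeddings, made up purely for illustration.
embeddings = {
    "dog":        [10.0, 3.0, 2.0],
    "cat":        [10.0, 3.0, 1.0],
    "skateboard": [1.0, 1.0, 9.0],
}

def distance(word_a, word_b):
    """Euclidean distance between the embeddings of two words."""
    return math.dist(embeddings[word_a], embeddings[word_b])

print(distance("dog", "cat"))                   # 1.0
print(round(distance("dog", "skateboard"), 2))  # 11.58
```

With these values, "dog" sits much closer to "cat" than to "skateboard", which is the kind of relationship a trained model is expected to capture.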

Multi-modal models

The success of transformers in building language models has led AI researchers to consider whether the same approach would also work for image data. The result is the development of multi-modal models, in which the model is trained on a very large collection of captioned images with no predefined labels. An image encoder extracts features from images based on their pixel values, and these features are combined with text embeddings produced by a language encoder. The overall model captures the relationships between natural language token embeddings and image features.
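One way such a model can be used, assumed here rather than stated in the text, is to place image features and caption embeddings in the same vector space and pick the caption whose embedding is most similar to the image. The vectors below are hypothetical stand-ins for encoder outputs, and cosine similarity is one common choice of similarity measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical output of an image encoder for one photo.
image_features = [0.9, 0.1, 0.3]

# Hypothetical output of a language encoder for three candidate captions.
caption_embeddings = {
    "a photo of an apple":  [0.8, 0.2, 0.2],
    "a photo of a banana":  [0.1, 0.9, 0.1],
    "a photo of an orange": [0.2, 0.1, 0.9],
}

scores = {caption: cosine_similarity(image_features, embedding)
          for caption, embedding in caption_embeddings.items()}
best_caption = max(scores, key=scores.get)
print(best_caption)  # a photo of an apple
```

Because the captions are free text rather than a fixed set of class labels, the same comparison can be run against any list of descriptions without retraining.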


Demo