What Is Computer Vision?

Computer Vision (CV) is a subfield of artificial intelligence and machine learning that develops techniques to train computers to interpret and understand the contents of images. These techniques also apply to videos, since a video is simply a sequence of consecutive images, or ‘frames’. Computer Vision aims to replicate parts of the complexity of the human vision system and visual perception by applying deep learning models to accurately detect and classify objects from the dynamic and varying physical world.

The first basic neural networks were developed around the 1950s to detect the edges of simple objects and sort them into categories (e.g. circles, triangles, squares and so on). These systems were further developed to help the blind by enabling them to recognise written and typed text and characters using a method known as optical character recognition. By the 1990s, the rise of the Internet meant that unprecedented datasets of millions of images were regularly being generated and shared across the web. These extensive visual datasets enabled researchers to better train their models and develop face recognition programs that helped computers identify specific people in photos and videos.

Today, advancements in smartphone technology and social media, together with their frequent use by billions of users (more than 3 billion images are shared online every day), are continuously generating greater amounts of visual data than ever seen before. Combined with increased access to large amounts of computing power and innovations in deep learning and neural network algorithms (e.g. the invention of convolutional neural networks), the availability of such immense numbers of images has brought invaluable opportunities for computers to learn the patterns and characteristics of these images and improve accuracy in object detection and classification. As a result, computer vision systems have surpassed the accuracy of human vision at certain detection, categorisation and reaction tasks, reaching accuracy rates of 99% in a number of their applications.

How Does Computer Vision Work?

Computer Vision is now able to perform a variety of tasks in a wide range of fields, from self-driving cars to medical diagnosis. Some of these tasks include photo classification, object detection, face recognition and searching image and video content. In order to perform these tasks, computers first need to be able to generate information from images (i.e. “see” the image). Since computers can only operate using numerical values (i.e. bits), they first need to read an image in its most raw numerical form: the matrix of its pixels. This matrix represents the brightness of each pixel in an image, from the darkest black (at value 0) to the brightest white (at value 255).
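As a rough illustration of the pixel matrix described above, the sketch below stores a tiny grayscale image as a list of rows and reads some basic information from it. The pixel values and the helper names (image_shape, brightest_pixel, mean_brightness) are made up for this example; real systems would typically load images through an imaging library rather than type them in by hand.

```python
# A tiny 3x4 grayscale "image": each entry is one pixel's brightness,
# from 0 (darkest black) to 255 (brightest white). Values are invented.
image = [
    [0,   64,  128, 255],
    [32,  96,  160, 224],
    [255, 192, 128, 0],
]


def image_shape(img):
    """Return (height, width) of a row-major pixel matrix."""
    return len(img), len(img[0])


def brightest_pixel(img):
    """Return (row, column, value) of the first brightest pixel found."""
    return max(
        ((r, c, value) for r, row in enumerate(img) for c, value in enumerate(row)),
        key=lambda item: item[2],
    )


def mean_brightness(img):
    """Average brightness over every pixel in the image."""
    values = [value for row in img for value in row]
    return sum(values) / len(values)


print(image_shape(image))
print(brightest_pixel(image))
print(mean_brightness(image))
```

Everything a model later "sees" in an image is derived from numbers like these.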

Images are made up of thousands of pixels. Each pixel is represented by numerical values from 0 to 255. A single colour image contains three different matrices, one for each of the three primary colours: red, green and blue (RGB). By combining different brightness levels of these primary colours (from 0 to 255), a pixel can display colours other than the primary ones. For example, a pixel that displays a vivid purple will have the values Red=128, Green=0 and Blue=128 (mixing red and blue results in purple), while a vivid yellow pixel will contain the values Red=255, Green=255 and Blue=0 (mixing red and green results in yellow). On the other hand, a grayscale image will only contain one single pixel matrix corresponding to the brightness of each pixel between black and white.
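The RGB representation can be sketched in a few lines, using the purple and yellow values from the example above. The function names are hypothetical, and the grayscale conversion uses the commonly quoted ITU-R BT.601 luma weights, which is an assumption of this sketch; other weightings are also in use.

```python
# Each pixel is an (R, G, B) triple with components from 0 to 255.
PURPLE = (128, 0, 128)
YELLOW = (255, 255, 0)

# A 2x2 colour image stored as rows of RGB triples (illustrative values).
rgb_image = [
    [PURPLE, YELLOW],
    [(0, 0, 0), (255, 255, 255)],
]


def split_channels(img):
    """Split an RGB image into three separate matrices: red, green, blue."""
    red = [[px[0] for px in row] for row in img]
    green = [[px[1] for px in row] for row in img]
    blue = [[px[2] for px in row] for row in img]
    return red, green, blue


def to_grayscale(img):
    """Collapse RGB into one brightness matrix using BT.601 luma weights
    (an assumed choice for this sketch)."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in img
    ]


red, green, blue = split_channels(rgb_image)
gray = to_grayscale(rgb_image)
```

Here `red`, `green` and `blue` are the three matrices of the colour image, while `gray` is the single matrix a grayscale version would contain.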

Computer vision algorithms make use of these pixel arrays to apply statistical learning methods, such as linear regression, logistic regression, decision trees or support vector machines (SVM). By analysing the brightness values of a pixel and comparing them to those of its neighbouring pixels, a computer vision model is able to identify edges, detect patterns and eventually classify and detect objects in an image based on previously learned patterns. These methods often require the model to have been trained beforehand, that is, to have already processed similar images containing the object of interest and stored the patterns it learned from them, before it can detect and track that object in a new, unseen image.
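The idea of comparing a pixel with its neighbours can be shown with a deliberately simple edge detector. This is only a sketch: it compares each pixel with the one to its right, and the threshold of 50 is an arbitrary value chosen for the example. Practical systems use more robust operators and learn their filters from data.

```python
def horizontal_edges(img, threshold=50):
    """Mark a pixel as an edge (1) when its brightness differs from the
    pixel to its right by more than `threshold`, otherwise 0."""
    edges = []
    for row in img:
        edge_row = []
        for c in range(len(row) - 1):
            edge_row.append(1 if abs(row[c + 1] - row[c]) > threshold else 0)
        edge_row.append(0)  # the last column has no right-hand neighbour
        edges.append(edge_row)
    return edges


# A dark region on the left and a bright region on the right.
image = [
    [10, 12, 200, 205],
    [11, 13, 198, 210],
    [9,  10, 202, 207],
]
edges = horizontal_edges(image)
for row in edges:
    print(row)
```

The column of ones in the result marks the vertical boundary where the dark region meets the bright one.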

For example, to be able to detect a person in an image, a very large number of pre-labelled images of people are fed into the system, allowing the model to learn on its own by recognising patterns in the features that make up a person. Once a new, previously unseen image is fed to that model, the computer will look for patterns in the colours, the shapes, the distances between the shapes, where objects border each other, and so on. It will then compare them to the characteristics it learned from the labelled images and decide, based on probabilistic rules, whether or not there is a person in this new image. In other words, computer vision systems are able to ingest many labelled examples of a specific kind of data, extract common patterns between those examples and transform them into a mathematical equation that will help classify future pieces of information.
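To make the "labelled examples in, mathematical equation out" idea concrete, here is a minimal sketch that trains a logistic regression classifier on synthetic data. Everything about the data is an assumption made for illustration: the 4x4 "images" are randomly generated, and an image counts as containing the object when its four centre pixels are bright. Real person detectors learn from far larger images and usually rely on deep neural networks.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(samples, labels, lr=0.5, epochs=300):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias


def predict_proba(weights, bias, x):
    """Estimated probability that the image contains the object."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def make_image(has_object, rng):
    """Hypothetical 4x4 image flattened to 16 values in [0, 1]: images with
    the object are bright in the centre, the others are dark everywhere."""
    pixels = [rng.uniform(0.0, 0.1) for _ in range(16)]
    if has_object:
        for idx in (5, 6, 9, 10):  # the four centre pixels
            pixels[idx] = rng.uniform(0.9, 1.0)
    return pixels


rng = random.Random(0)
labels = [i % 2 for i in range(40)]           # alternate 0 / 1 labels
samples = [make_image(label == 1, rng) for label in labels]
weights, bias = train_logistic(samples, labels)

new_with = make_image(True, rng)              # unseen image with the object
new_without = make_image(False, rng)          # unseen image without it
print(predict_proba(weights, bias, new_with) > 0.5)
print(predict_proba(weights, bias, new_without) > 0.5)
```

The learned `weights` and `bias` are the "mathematical equation": a new image is classified by plugging its pixel values into that equation and reading off a probability.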

Often, computers require images to be pre-processed before any detection and tracking models are applied to them. Image pre-processing simplifies and enhances the raw input by changing its properties, such as adjusting its brightness or colour, cropping it, or reducing noise. This modifies the pixel matrices of the images so that a computer can better perform its expected tasks, such as removing a background in order to detect objects in the foreground. This is particularly useful in video footage, where computer vision can track moving objects using a discriminative method to distinguish between objects in the image and the background. By separating the two, it can detect all possible objects of interest for all relevant frames and use deep learning techniques to recognise the specific object to track from the ones detected.
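The pre-processing steps mentioned above (brightness adjustment, cropping and noise reduction) can be sketched directly on a pixel matrix. The function names are hypothetical and the 3x3 median filter is just one common choice of noise reduction, picked here because it is easy to follow.

```python
from statistics import median


def adjust_brightness(img, factor):
    """Scale every pixel by `factor`, clamping the result to 0..255."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]


def crop(img, top, left, height, width):
    """Keep only the rectangle whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]


def median_filter(img):
    """Reduce salt-and-pepper noise by replacing each pixel with the median
    of its 3x3 neighbourhood (border pixels use the neighbours that exist)."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(h):
        out_row = []
        for c in range(w):
            window = [
                img[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
            ]
            out_row.append(int(median(window)))  # int() truncates any .5
        out.append(out_row)
    return out


noisy = [
    [100, 100, 100],
    [100, 255, 100],   # one noisy bright pixel in the centre
    [100, 100, 100],
]
cleaned = median_filter(noisy)
brighter = adjust_brightness([[100, 200]], 1.5)
patch = crop(noisy, 1, 1, 1, 2)
```

After filtering, the isolated bright pixel in `noisy` is replaced by the value of its surroundings, which is the kind of simplification that makes later detection steps more reliable.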

Deep learning models are often trained to automate this process by feeding them thousands of pre-processed, labelled or pre-identified images. Training can follow a variety of techniques, such as partitioning the images into multiple pieces to be examined separately, using edge detection to identify the edges of an object and better recognise what is in the image, using pattern detection to recognise repeated shapes, colours or other indicators, or even using feature matching to detect similarities between images to help classify them. Models may also use X and Y coordinates to create bounding boxes and identify everything within each box, such as a football field, an offensive player, a defensive player, a ball and so on. Several techniques are frequently used in conjunction to improve the accuracy and precision of object detection and tracking in an image or video.
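A bounding box is simply the smallest rectangle, expressed in X and Y coordinates, that encloses a detected object. The sketch below derives one from a binary mask in which 1 marks pixels belonging to the object. The mask and function names are invented for the example, and the coordinate convention (origin at the top-left, X as column, Y as row) is an assumption that matches common image libraries.

```python
def bounding_box(mask):
    """Return (x_min, y_min, x_max, y_max) enclosing every non-zero pixel
    of a binary mask, or None if the mask is empty."""
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)


def contains(box, x, y):
    """True if the point (x, y) lies inside the box (edges included)."""
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


player_mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
box = bounding_box(player_mask)
print(box)
```

Once each object has a box, later steps can reason about boxes (which one holds the ball, which overlap, which move) instead of individual pixels.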

The Applications Of Computer Vision In Sport

In sports, artificial intelligence was virtually unknown less than five years ago, but today deep learning and computer vision are making their way into a number of sports industry applications. Whether it is used by broadcasters to enhance spectator experience of a sport or by clubs themselves to become more competitive and achieve success, the reality is that the industry has substantially increased its adoption of these modern techniques.

Most major sports involve fast and accurate motion that can be challenging for coaches and analysts to track and analyse in great detail. This is particularly difficult in situations where the use of wearable tracking equipment and sensors to augment data collection is not an option. In training sessions and certain matches, especially if they are untelevised, performance analysts are only able to obtain a limited number of angles of video footage. This footage is limited to providing visualisation of the player’s movement rather than detailed analysis. Obtaining data and insights from the footage requires the analyst to spend numerous hours manually notating and collecting events as they replay the video. Scenarios such as these are where the application of computer vision techniques can bridge the gap between the sporting event and analytical insights, by offering novel ways to gather data and obtain valuable analysis through automated systems that locate and segment each player of interest and follow them over the duration of the video.

In the context of sports, footage is usually acquired through one or more cameras installed in close proximity to where the event takes place (e.g. the sidelines of a training field or the stands in a stadium during a match). The angle, positioning, hardware and other filming configurations of these cameras can vary greatly from sport to sport, from event to event, or even between the different cameras used for the same match or training session. This can make it difficult for certain computer vision applications to accurately detect the precise position of objects or their direction of movement, as they may fail to account for the varying configurations used to capture the footage presented to them, whether it is for training the models or classifying new, unseen images.

Traditionally, costly camera calibration for multi-camera tracking systems was essential for ball and player tracking. For fixed-angle cameras, this could be done through scene calibration, where balls were rolled over the ground to account for non-planarity of the playing surface. However, broadcast cameras present additional challenges in that they often change their pan, tilt and zoom. This dynamism needed to be accounted for by using sensors on the camera mounting and lens to measure zoom and focus settings and to relate the raw values from the lens encoders to focal length. Gaining access to this advanced filming equipment is often not an option for Performance Analysis departments within sporting clubs, limiting their capacity to apply advanced tracking of players.

Computer vision has partially solved these limitations. Through image processing, computer vision systems are now able to distinguish between the ground, players and other foreground objects. Methods such as colour-based elimination of the ground on playing surfaces with a uniform colour allow computer vision models to detect the zones of a pitch, track moving players and identify the ball. For instance, colour-based segmentation algorithms are currently being used to detect the grass by its green colour and treat it as the background of the image or video frame, in front of which players and objects move. Moreover, image differencing and background subtraction methods have also been used on static footage to detect the motion of the segmented foreground players against the image background.
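Both ideas from this paragraph, colour-based removal of the grass and image differencing between frames, can be sketched in a few lines. The colour values, the `margin` and `threshold` parameters and the function names are all assumptions made for illustration; production systems typically work in other colour spaces and tune these thresholds to the lighting conditions.

```python
def grass_mask(rgb_img, margin=40):
    """1 where a pixel is clearly green-dominant (treated as pitch), else 0."""
    return [
        [1 if g > r + margin and g > b + margin else 0 for (r, g, b) in row]
        for row in rgb_img
    ]


def foreground_mask(rgb_img, margin=40):
    """Invert the grass mask: whatever is not pitch is a candidate object."""
    return [[1 - v for v in row] for row in grass_mask(rgb_img, margin)]


def frame_difference(prev_gray, curr_gray, threshold=30):
    """1 where brightness changed by more than `threshold` between frames."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_gray, curr_gray)
    ]


GRASS = (40, 160, 50)
SHIRT = (200, 30, 30)
BALL = (250, 250, 250)
frame = [
    [GRASS, GRASS, GRASS, GRASS],
    [GRASS, SHIRT, GRASS, BALL],
    [GRASS, SHIRT, GRASS, GRASS],
]
foreground = foreground_mask(frame)
motion = frame_difference([[100, 100, 100]], [[100, 180, 100]])
```

The `foreground` mask isolates the shirt and ball pixels from the pitch, while `motion` flags the single pixel whose brightness changed between two consecutive frames of static footage.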

Player Tracking

One of the key aims when applying computer vision in sports is player tracking. This involves detecting the position of all players at a given moment in time. Player tracking is a pivotal tool for coaches looking to improve the performance of their teams, allowing them to instantly analyse the ways in which individual players move on the field and the overall formation of their team. Today, the most advanced applications of computer vision in sport use automated segmentation techniques to identify regions that are likely to correspond to players.
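One simple way to turn a segmented foreground mask into candidate player positions is to group touching foreground pixels into regions and take the centre of each region. The sketch below does this with a breadth-first flood fill. It is an illustrative stand-in rather than the method used by any particular tracking product, and the `min_pixels` noise threshold is a made-up value.

```python
from collections import deque


def find_regions(mask, min_pixels=2):
    """Group touching foreground pixels (4-connectivity) into regions and
    return each region's centroid as (x, y). Regions with fewer than
    `min_pixels` pixels are discarded as noise."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    centroids = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            queue = deque([(x, y)])
            seen[y][x] = True
            pixels = []
            while queue:
                px, py = queue.popleft()
                pixels.append((px, py))
                for nx, ny in ((px + 1, py), (px - 1, py), (px, py + 1), (px, py - 1)):
                    if 0 <= nx < w and 0 <= ny < h and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if len(pixels) >= min_pixels:
                cx = sum(p[0] for p in pixels) / len(pixels)
                cy = sum(p[1] for p in pixels) / len(pixels)
                centroids.append((cx, cy))
    return centroids


mask = [
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],   # a single stray pixel, treated as noise
]
players = find_regions(mask)
print(players)
```

Repeating this for every frame yields a position per player per frame, which is the raw tracking data that later analysis builds on.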

The results obtained from a computer vision system can be augmented by applying machine learning and data mining techniques to the raw player tracking data. Once key elements in an image or video frame are detected, semantic information can be generated in order to create context on what actions the players are performing (e.g. ball possession, pass, run, defend and so on). These techniques can label semantic events, such as ‘a one-two pass’ in football, and be used for advanced statistical analysis of player and team performance. Suggestions on the optimal positions of players on the pitch can also be generated and displayed to coaches so that they can compare ideal player positioning against actual positions in a given play. The vast opportunities created by this player tracking technology have the potential to revolutionise training and scouting for players in sports.
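As a simplified illustration of how semantic events might be derived from raw tracking data, the sketch below assigns possession to the player closest to the ball and labels a pass whenever possession moves between two teammates. The rule, the 2-metre possession radius, the player ids and the frame format are all hypothetical; real systems use far richer models.

```python
import math


def closest_player(ball, players, max_distance=2.0):
    """Return the id of the player nearest the ball, or None if nobody is
    within `max_distance` metres of it."""
    best_id, best_dist = None, max_distance
    for player_id, (x, y) in players.items():
        dist = math.hypot(x - ball[0], y - ball[1])
        if dist <= best_dist:
            best_id, best_dist = player_id, dist
    return best_id


def label_passes(frames, teams):
    """Emit ('pass', from_id, to_id) whenever possession moves between two
    different players on the same team."""
    events = []
    holder = None
    for frame in frames:
        current = closest_player(frame["ball"], frame["players"])
        if current is None:
            continue  # ball in flight: keep the previous holder
        if holder is not None and current != holder and teams[current] == teams[holder]:
            events.append(("pass", holder, current))
        holder = current
    return events


teams = {"A7": "home", "A9": "home", "B4": "away"}
frames = [
    {"ball": (10.0, 10.0),
     "players": {"A7": (10.5, 10.0), "A9": (30.0, 12.0), "B4": (20.0, 20.0)}},
    {"ball": (20.0, 11.0),
     "players": {"A7": (11.0, 10.0), "A9": (29.0, 12.0), "B4": (20.0, 19.0)}},
    {"ball": (28.5, 12.0),
     "players": {"A7": (12.0, 10.0), "A9": (29.0, 12.0), "B4": (21.0, 18.0)}},
]
events = label_passes(frames, teams)
print(events)
```

More complex events, such as a one-two pass, could be described in the same spirit as sequences of these simpler labelled events.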

Data Collection

Action and event recognition techniques aim to localise, in both space and time, the sets of actions that a player performs. These techniques can detect events (such as goals, penalties, near misses and shots) in video clips by identifying visual information about the environment, such as court colour and lines on the pitch. They then use that information to classify each action into sport-specific groups by assigning them labels (e.g. shot, pass, etc.). Ultimately, action recognition and classification can be used to automatically generate performance statistics for a match or training session, such as shot types, passes or possession. They can also be applied to index videos by predefined themes based on their contents, making it easy to browse through footage and automatically generate highlight reels.
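Once events have been labelled, producing statistics and highlight clips is mostly a matter of counting and filtering. The sketch below assumes a hypothetical list of event records, as an action recognition model might output them; the field names, labels and clip padding are invented for the example.

```python
from collections import Counter

# Hypothetical output of an action recognition model: one record per event.
events = [
    {"time_s": 12.5, "player": "A7", "label": "pass"},
    {"time_s": 15.0, "player": "A9", "label": "shot"},
    {"time_s": 47.5, "player": "B4", "label": "pass"},
    {"time_s": 63.0, "player": "A9", "label": "goal"},
    {"time_s": 90.0, "player": "A7", "label": "pass"},
]


def count_by_label(events):
    """How many times each type of event occurred."""
    return Counter(e["label"] for e in events)


def count_by_player(events, label):
    """How many events of one type each player was involved in."""
    return Counter(e["player"] for e in events if e["label"] == label)


def highlight_clips(events, labels=("shot", "goal"), before_s=5.0, after_s=3.0):
    """(start, end) time windows around events worth showing in highlights."""
    return [
        (max(0.0, e["time_s"] - before_s), e["time_s"] + after_s)
        for e in events
        if e["label"] in labels
    ]


print(count_by_label(events))
print(count_by_player(events, "pass"))
print(highlight_clips(events))
```

The same event list therefore serves two purposes: match statistics for analysts and a time index into the video for browsing or highlight generation.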

How Is Computer Vision Used In Different Sports?

In racket and bat-and-ball sports, such as tennis, badminton or cricket, computer vision has been widely used since the mid-2000s. Ball tracking systems look through each available camera image to identify all possible objects resembling the characteristics of a ball (e.g. searching for elliptical shapes in an expected size range). Once these objects have been detected, the systems construct a 3D trajectory of the ball by linking the detections across multiple frames and camera angles to define the ball’s path. The results can then be used to instantly determine whether a ball has landed in or out of bounds. The system can also provide further analysis, such as predicting the path that a cricket ball would have taken if the batsman had not hit it.
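The in-or-out decision can be sketched once a 3D trajectory is available. The example below assumes the trajectory has already been reconstructed as a list of (x, y, z) positions in metres with the origin at the centre of the net, which skips the hard multi-camera part entirely. It uses standard singles court dimensions, an approximate ball radius and straight-line interpolation between samples, all of which are simplifications for illustration.

```python
# Singles court dimensions in metres, origin at the centre of the net.
HALF_LENGTH = 23.77 / 2
HALF_WIDTH = 8.23 / 2


def landing_point(trajectory):
    """Linearly interpolate where the ball first reaches the ground (z = 0)
    between two consecutive 3D detections. Returns (x, y) or None."""
    for (x0, y0, z0), (x1, y1, z1) in zip(trajectory, trajectory[1:]):
        if z0 > 0 >= z1:
            t = z0 / (z0 - z1)  # fraction of the way from sample 0 to 1
            return x0 + t * (x1 - x0), y0 + t * (y1 - y0)
    return None


def is_in(point, ball_radius=0.033):
    """Treat the ball as in if any part of it touches the line, so the
    court is effectively extended by the ball's (approximate) radius."""
    x, y = point
    return abs(x) <= HALF_LENGTH + ball_radius and abs(y) <= HALF_WIDTH + ball_radius


trajectory = [(8.0, 3.0, 1.0), (10.0, 3.5, 0.5), (12.0, 4.0, -0.5)]
bounce = landing_point(trajectory)
print(bounce, is_in(bounce))
```

A real system would fit a physical model of the ball's flight rather than a straight line, which is also what makes it possible to extrapolate a path the ball never actually took.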

An example of the use of computer vision in tennis can be found at one of the sport’s major tournaments. In 2017, Wimbledon partnered with IBM to produce automated video highlights that pick out key moments in a match by gathering data from players and fans, such as crowd noise, player movements and match data. Similarly, on the commercial side, Grégoire Gentil designed a pocket-sized device that makes in and out calls in a tennis match by using computer vision to detect the speed and placement of a shot and determine whether the ball was out of bounds.

Other major invasion team sports have not been indifferent to the emergence of these new technologies. In football, FIFA certified goal-line technology installations in major stadiums using a 7-camera computer vision system developed by Hawk-Eye. It uses a goal detection system with multiple-view high-speed cameras covering each goal area that detect moving objects by sorting potential objects resembling the ball based on area, colour and shape. With a margin of error of 1.5cm and a detection speed of 1s, it enables football referees to immediately decide whether or not the ball has crossed the goal line and a goal should be awarded.
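The final decision step can be expressed very compactly once the ball's position is known: a goal requires the whole of the ball to be over the line. The sketch below is not Hawk-Eye's implementation. It assumes a one-dimensional position for the ball's centre along an axis pointing into the goal, measured from the far edge of the goal line, uses an approximate ball radius, and ignores the posts and crossbar.

```python
BALL_RADIUS_M = 0.11  # approximate radius of a size 5 football


def ball_crossed_line(ball_centre_x, goal_line_x=0.0, radius=BALL_RADIUS_M):
    """True when the whole ball is beyond the goal line. The x axis points
    from the pitch into the goal, so larger x means deeper into the goal."""
    return ball_centre_x - radius > goal_line_x


def first_goal_frame(centre_positions, goal_line_x=0.0):
    """Index of the first frame in which the ball is wholly over the line,
    or None if it never is."""
    for index, x in enumerate(centre_positions):
        if ball_crossed_line(x, goal_line_x):
            return index
    return None


# Ball centre positions (metres) over five consecutive frames.
positions = [-0.30, -0.05, 0.08, 0.15, 0.25]
print(first_goal_frame(positions))
```

Note that in the third frame the centre of the ball is already past the line but part of the ball is not, so no goal is signalled until the following frame.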

Aside from widespread implementations of computer vision, such as FIFA’s goal-line technology, other ad-hoc projects have also attempted to incorporate computer vision into football. In the 2011/2012 football season in Germany, Stemmer Imaging helped Impire develop an automatic player tracking system using two cameras in the press area of any stadium. This reduced the number of operators required to get accurate data without losing the quality of the information.

In American sports, such as the NFL, computer vision has been applied to automatically generate offensive formation labeling by classifying video footage based on the coordinates of players when tracked throughout a particular play. This application has supported coaches and analysts in the evaluation of oppositions’ patterns of play by generating a wealth of data on the most common formations employed by rival teams. Furthermore, the system has provided teams with additional information on oppositions’ tactics, such as the likelihood of passing or running out of each formation, run frequency for each side of the field, split between right guard and right end, frequency of runs up the middle, pass frequency on short routes, and average yard gains between running and passing plays.
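Once each play has been labelled with a formation, tendency reports like those described above reduce to grouping and averaging. The sketch below uses a handful of invented plays and hypothetical field names; it only covers pass rate and average yards, not the full set of statistics listed.

```python
from collections import defaultdict

# Invented plays, as they might look after automatic formation labelling.
plays = [
    {"formation": "shotgun", "play_type": "pass", "yards": 8},
    {"formation": "shotgun", "play_type": "pass", "yards": 12},
    {"formation": "shotgun", "play_type": "run", "yards": 3},
    {"formation": "shotgun", "play_type": "pass", "yards": 0},
    {"formation": "i_formation", "play_type": "run", "yards": 4},
    {"formation": "i_formation", "play_type": "run", "yards": 6},
    {"formation": "i_formation", "play_type": "pass", "yards": 15},
    {"formation": "i_formation", "play_type": "run", "yards": 2},
]


def mean_or_none(values):
    return sum(values) / len(values) if values else None


def formation_tendencies(plays):
    """For each formation: number of plays, pass rate, and average yards
    gained on running and on passing plays."""
    grouped = defaultdict(lambda: {"run": [], "pass": []})
    for play in plays:
        grouped[play["formation"]][play["play_type"]].append(play["yards"])
    summary = {}
    for formation, by_type in grouped.items():
        total = len(by_type["run"]) + len(by_type["pass"])
        summary[formation] = {
            "plays": total,
            "pass_rate": len(by_type["pass"]) / total,
            "avg_run_yards": mean_or_none(by_type["run"]),
            "avg_pass_yards": mean_or_none(by_type["pass"]),
        }
    return summary


stats = formation_tendencies(plays)
for formation, row in stats.items():
    print(formation, row)
```

With enough labelled plays, a table like this gives analysts a quick view of what an opponent tends to do out of each formation.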

Challenges Of Computer Vision

Despite the great potential that computer vision can bring to the world of sport and the field of performance analysis, there are still critical challenges that need to be overcome before that potential can be fully exploited. Some of these challenges relate to the fact that computer vision cannot yet fully compete with the human eye. Fully automating the video analysis of sports by tracking and labelling players remains difficult, as optical tracking systems cannot yet cope with the varying body posture of a person during sports exercises, or with the partial or full occlusion of players by equipment or other players during collisions or interactions. Tracking sports players is also particularly challenging due to their fast and erratic motion, the similar appearance of players in team sports, and the often close interactions between players. Tracking the ball is a further challenge in team sports, where several players can occlude the ball (e.g. a ruck in Rugby Union) and where players may be in possession of the ball with either their hands or their feet.

The reason these remain challenges within the field of AI and computer vision is that we still do not completely understand how human vision truly works. Even though the field of Biology studies the eye, the visual cortex and the brain, we are still far from fully understanding all the components of such a fundamental function of the human brain. For instance, we do not fully understand how our memory, past experiences and knowledge inherited through billions of years of evolution influence our perception and our ability to identify elements in our world. This lack of detailed understanding of human vision and our abstract perception makes it difficult to replicate our inherited knowledge of the world through a computer. On top of that, the external dynamism, variance and complexity of our physical world prove an extreme challenge for computers that have to be thoroughly instructed on the types of objects, captured through the lens of a camera, that they must detect, particularly when they are unable to deviate from what they have been trained to identify.

Nevertheless, the field of AI and computer vision continues its rapid development thanks to heavy investment by key players, such as Google, Intel, Amazon and many others, to advance computing power, enlarge datasets and develop new techniques that get closer to human vision capabilities. These advances will continue to make their way into the world of sport as athletes and teams aim to leverage modern technologies to improve their performance and become even more competitive. As performance analysts continue to support these athletes and coaches in the objective evaluation of performance, there is no doubt that the expansion of computer vision will eventually transform key areas of Performance Analysis in sport.
