Analytics

Artificial Intelligence (AI) in Sports

Dr Patrick Lucey is the Chief Scientist at Stats Perform and has over 20 years of experience working in Artificial Intelligence (AI), particularly in face recognition and audio-visual speech recognition technology. He also worked at Disney Research (owners of ESPN), where he developed an automatic sports broadcasting system that tracked players in real-time by moving a robotic camera to capture their movements.

Patrick recently talked about the use of Artificial Intelligence in sports, what that means and how we can use AI to help coaches and analysts make better decisions in sport. Artificial Intelligence refers to technology that emulates human tasks, often using machine learning as the method to learn from data how to emulate these tasks. His talk emphasised the importance of sports data, provided an overview of the different types of sports data that exist today and explained what is meant by AI and why AI is needed in sport.

Stats Perform is one of the leaders in data collection in sports, offering a wide range of sports predictions and insights through world-class data and AI solutions. For over 40 years, they have been collecting the world’s deepest sports data, covering over 27,000 live-streamed events worldwide, with a total of 501,000 matches covered annually from 3,900 competitions. This huge coverage translates into billions of unique event and tracking data points available in their immense sports databases. To make use of this invaluable dataset, Stats Perform has created an AI Innovation Centre with more than 300 developers and 50 data scientists who build a series of AI products with the goal of measuring what was once immeasurable in sport.

Different Types Of Sports Data

Patrick and the Stats Perform AI Innovation Centre have worked with a wide range of data to make predictions in a number of different sports, from football to field hockey, volleyball to swimming. There are three main types of sports data available: box scores, event data and tracking data. All of these facilitate the reconstruction of the story of a match or a particular performance. However, the more granular the temporal and spatial data of a game is, the better the story an analyst can tell.

Box-Score Statistics

The use of high-level box-score statistics (half-time match score, full-time match score, goal scorers, time of goals, yellow cards, etc.) can summarise a 90-minute match of football and provide an idea of how the game was played in just a few seconds. Basic box-score statistics can tell you who won the match, which team took the lead first, when the goals were scored and how close together they were. Box-score statistics provide a fairly good snapshot of a game and a decent level of match reconstruction.

Box-score statistics Sevilla vs Dortmund (Source: Sky Sports)

Box-score statistics also offer a more detailed level of information. For example, they can illustrate which team had more shots and the quality of those shots by showing the number of shots and shots on goal. They can also explain the distribution of possession between the teams in the match, which team had more corners, committed more fouls, made more saves and so on. Within a few seconds they can capture the story of the match, which team dominated or how close the game was.

Detailed box-score statistics Sevilla vs Dortmund (Source: Sky Sports)

Event Data

Event data, or play-by-play data, provides a bit more detail than box-score statistics by offering additional contextual information about key moments during a match. For example, play-by-play commentary of a match can offer textual descriptions of what occurred at every minute of the match. Similarly, spatial data of the game (i.e. the spatial location of players) can provide visual reconstructions of some of the key events in a match, such as how a particular goal was scored. While it is not the same as watching the video, it is a quick digitised view of the real-world play that can be reconstructed in seconds.

Text commentary of Sevilla vs Dortmund match (Source: Sky Sports)

Stats Perform, particularly through Opta, is one of the industry leaders in event data collection. They provide event data to sportsbooks through a low latency feed that tells them when a goal, a shot, a dangerous attack or any other key moment occurs in close-to-real-time so that the sportsbooks can relay that information to their bettors. In these cases, speed of data is crucial, not only to reconstruct a story of what happens on the field through data but to be able to tell that story almost immediately.

Tracking Data

Tracking data is currently the most detailed level of data being captured in sports. It enables the projection of the location of all players and the ball onto a diagram of the pitch, reconstructing a match from its raw video footage. Having a digital representation of all players on the entire pitch through tracking data enables analysts to perform better querying than simply using a video feed that only displays a subsection of the pitch.

Tracking data plotted into a diagram of a football pitch (Source: Patrick Lucey at Stats Perform)

Sources Of Sports Data

Video Footage

The vast majority of data types are collected via video analysis. Video analysis uses raw match footage as the foundation from which to either manually observe or automatically capture (i.e. with computer vision) key events of the match and generate data. Today, all three types of sports data (box-score, event data and player tracking data) are fundamentally based on video. However, new technologies have recently been introduced into various sports to collect even greater detail.

Radio Frequency Identification (RFID)

The NFL is now using Radio Frequency Identification (RFID) wearables implemented on players’ shoulder pads to track x and y coordinates of each player’s location on the field.

Radar

In golf, radar and other sensor technology has also been implemented to track the ball’s trajectory and produce amazing visualisations with very accurate detection of the ball.

GPS Wearables

Football and other team sports use GPS devices that, although not as accurate as RFID, can track additional data from the athlete, such as heart rate and level of exertion. These wearable devices have the advantage that they can be used in a training environment as well as a competitive match.

Market Data (Wisdom Of The Crowds)

Market data in sports usually refers to betting data. It is an implicit way of reconstructing the story of a match: it relies on the predictions people make, from which information can be mined.

AI-Driven Sports Analysis

Sports analysis has traditionally been based on box-score and event data, all the way from Bill James’ 1981 grassroots campaign Project Scoresheet, which aimed to create a network of fans to collect and distribute baseball information, to Daryl Morey’s integration of advanced statistical analysis at the Houston Rockets in 2007.

However, in the 2010s, tracking data began to open up new ways of analysing sports. Over the last decade, a new era of sports analysis has emerged that maximises the value of traditional box-score and event data by complementing it with deeper tracking data. The AI revolution in sports driven by tracking data has focused on three key areas:

  1. Collecting deeper data using computer vision or wearables

  2. Performing a deeper type of analysis with that tracking data that humans would not be able to do without AI

  3. Performing deeper forecasting to obtain better predictions

Collecting Deeper Sports Data

The main objective of collecting sports data is to reconstruct the story of a match as closely as possible to what a human or a camera would see in the raw footage. The raw data collected from this footage can then be transformed into a digitised form so that we can read and understand the story of the match and produce actionable insights.

The reconstruction of a performance with data usually starts by segmenting a game into digestible parts, such as possessions. For each of these parts, we try to understand what happened in that possession (i.e. what was the final outcome of the possession), how it happened (i.e. describing the events that led to the outcome of that possession) and how well it was done (i.e. how well the events were executed).

Currently, the way play-by-play sports data is digitised from video footage is through the work of video analysts. Humans watch a game and notate the events that take place in the video (or live in the sports venue) as they happen. This play-by-play method of collecting data produces an account of end-of-possession events that describes what happened on a particular play or possession. However, when it comes to understanding how that play happened or how well it was executed, human notational systems do not produce the best information to accurately reconstruct the story. Humans have cognitive and subjective limitations when manually capturing very granular levels of information, such as getting the precise timeframe of each event or providing an objective evaluation of how well a play was executed.

In-Venue Tracking Systems

One way tracking data can be collected is through in-venue systems. Stats Perform uses SportVU, which was deployed a decade ago as a computer vision system that installed six fixed cameras on a basketball court to track players at 24 frames per second. The newer version, SportVU 2.0, is now widely deployed in football and uses three 4K cameras and a GPU server in-venue to collect and deliver tracking data at the edge in real-time.

Stats Perform SportVU system on a basketball court (Source: Patrick Lucey at Stats Perform)

However, tracking data has one main limitation: coverage. While tracking data provides an immense number of opportunities to do advanced sports analytics, its footprint across most sports is relatively low. This is because most in-venue solutions require a company like Stats Perform to be in the venue with all their tracking equipment installed. This makes it difficult to increase the coverage of tracking data across events worldwide, as it is not realistic to have sophisticated tracking equipment installed in every single pitch, field, court or stadium to cover every sporting event that takes place every day.

Tracking Data Directly From Broadcast Video

To overcome the limited coverage of in-venue systems, Stats Perform are now focusing their AI efforts on capturing tracking data directly from broadcast video, through an initiative called AutoStats. It leverages the fact that for every sports game being played, there is usually at least one video feed of that event being recorded and potentially broadcast. The way to get the best coverage of tracking data is therefore to capture the data directly from broadcast footage.

PSG attacking play converted to tracking data from broadcast footage (Source: Patrick Lucey at Stats Perform)

This means that the way tracking data is being collected is now evolving away from in-venue solutions to a more widespread approach that uses a broadcast camera. However, the advantage of using in-venue solutions is that you only need to calibrate the camera once. When collecting tracking data off broadcast, you need to calibrate the camera at every frame because it is constantly moving while following the play.

Computer vision systems that collect tracking data directly from broadcast video footage follow three simple steps:

  1. Transform pixels in the video into dots that represent trajectories of the movement of players and the ball. These dots can then be plotted on a diagram of the field for visualisation.

  2. The trajectories generated from the movement of the dots over a space of time can then be mapped to semantic events in the sport (i.e. a shot on goal).

  3. From the events identified, expected metrics can be derived to explain how well a player executes a particular event (i.e. Expected Goals).

Converting Pixels To Dots

Converting video pixels to dots refers to the process of taking the video footage of a game and digitally mapping each player's movement to trajectories that can be displayed as dots on a diagram of the pitch. The main advantage of this method is the compression of the footage. An uncompressed raw snapshot image of a game at 1920x1080px from a single camera angle can be as large as 50MB, meaning the video footage of that game can be as large as 50MB per frame. If instead of one camera angle you have 6 different camera angles, the data size multiplies to around 300MB per frame. This is an incredibly large amount of high-dimensional data, but not all of it is useful for sports analysis.

Conversion of video footage pixels into dots on a diagram (Source: Patrick Lucey at Stats Perform)

Instead, tracking data representing players on the court or pitch in the form of dots can substantially reduce the size of each frame. For example, in basketball, 10 players, 1 ball and 3 referees can be plotted with their x, y and z coordinates in a digital representation of the court at a size of 232 bytes per frame. This makes tracking data the ultimate compression algorithm for sports video, with compression rates of roughly 1 million to 1.
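
As a rough back-of-the-envelope check of that ratio, the following sketch (using only the sizes quoted above, which are taken as given) compares a six-camera raw frame with a 232-byte tracking frame:

```python
# Back-of-the-envelope comparison of raw video frames vs tracking "dots",
# using the figures quoted above (assumed sizes, for illustration only).

BYTES_PER_MB = 1024 * 1024

raw_single_camera = 50 * BYTES_PER_MB      # ~50MB per frame, one camera angle
raw_six_cameras = 6 * raw_single_camera    # ~300MB per frame across 6 angles

# Tracking data: 10 players + 1 ball + 3 referees, each with x, y, z coordinates
tracking_frame = 232                       # bytes per frame, as quoted

print(f"Six-camera frame:  {raw_six_cameras / BYTES_PER_MB:.0f} MB")
print(f"Tracking frame:    {tracking_frame} bytes")
print(f"Compression ratio: {raw_six_cameras / tracking_frame:,.0f} : 1")
# -> roughly 1.4 million to 1, in line with the "1 million to 1" figure
```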

The advantage of using tracking data instead of raw video footage is that it allows analysts to query the dots instead of the pixels in a way that maintains the interpretability and interactivity of the raw video footage. A game can be clearly reconstructed using dots plotted on a diagram of the field to illustrate how each possession unfolded, without the need for the extra detail contained in the millions of pixels of the video footage.

The conversion from pixels to dots occurs via supervised learning, where the computer learns, through machine learning, to map the input pixels to the desired output dots. A number of computer vision techniques can be applied to achieve this goal.

Mapping Dots to Events

Once the dots (coordinates) have been generated from the pixel data of the video, the trajectories (movements) of these dots over specific timeframes can be mapped to particular events. For example, in basketball, you can start mapping the dots in the tracking data to basketball-related events that describe how certain outcomes occur in terms of tactical themes, such as pick and rolls, the type of coverage on a pick and roll, whether a player drove or posted up, off-ball screens, hand offs, close outs, etc. The dot trajectories are mapped to the semantics of a basketball play, and to the players involved in that play, using a machine learning model trained on pre-labelled data.
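
As an illustration of this supervised mapping, the sketch below trains a classifier on windows of dot trajectories against pre-labelled event names. The window length, the hand-crafted features and the random stand-in data are all assumptions for illustration, not Stats Perform's actual model:

```python
# Minimal sketch of mapping dot trajectories to event labels via supervised
# learning. Feature construction and labels are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def window_features(window):
    """window: (frames, objects, 2) array of x, y positions for one time window."""
    velocities = np.diff(window, axis=0)              # frame-to-frame displacement
    return np.concatenate([
        window.mean(axis=0).ravel(),                  # average positions
        velocities.mean(axis=0).ravel(),              # average velocities
        velocities.std(axis=0).ravel(),               # movement variability
    ])

# Hypothetical pre-labelled data: 500 windows of 25 frames, 11 tracked objects
rng = np.random.default_rng(0)
windows = rng.random((500, 25, 11, 2))
labels = rng.choice(["pick_and_roll", "drive", "post_up", "off_ball_screen"], 500)

X = np.array([window_features(w) for w in windows])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict(X[:3]))                             # predicted event labels
```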

Mapping Events to Expected Metrics

Expected metrics explain the quality of execution of certain events. The labels assigned to events are often not informative enough on their own to explain them. Instead, expected metrics transform an outcome label of 0 or 1 (goal or no goal) into a probability between 0 and 100% using machine learning. For example, a shot that goes in has an outcome of 100%, while a shot that hits the post has an outcome of 0%; yet the expected metric might value that post-hitting attempt at 70%, reflecting how likely it was to result in a goal. Regardless of the final outcome of the event, expected metrics help to evaluate whether it was closer to 0% (unsuccessful), 100% (successful) or somewhere in between (i.e. 55% successful). This concept of expected metrics is the basis of the Expected Goals (xG) metric in football. It can also be extended to passes to calculate the likelihood of a pass reaching a certain teammate on the pitch.
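
The following is a minimal sketch of how such an expected metric could be fitted, assuming only two shot features (distance and angle to goal) and synthetic outcomes; real xG models use far richer features:

```python
# Minimal sketch of an expected-metric model: turn a binary shot outcome
# (goal / no goal) into a probability using logistic regression.
# Features (distance and angle to goal) are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 1000
distance = rng.uniform(5, 35, n)                 # metres from goal
angle = rng.uniform(0.1, 1.4, n)                 # opening angle to goal (radians)
# Synthetic outcomes: closer shots with wider angles score more often
p_goal = 1 / (1 + np.exp(0.15 * distance - 2.0 * angle))
goal = rng.random(n) < p_goal

X = np.column_stack([distance, angle])
xg_model = LogisticRegression().fit(X, goal)

shot = np.array([[11.0, 0.9]])                   # a shot from ~11m with a good angle
print(f"xG = {xg_model.predict_proba(shot)[0, 1]:.2f}")
```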

Expected metrics provide an additional degree of context to each situation. For example, in basketball, Expected Field Goal percentage (EFG) is used so that when a player misses a 3-point shot, rather than simply classifying the attempt as a miss, we can assess the likelihood that an average league player would have scored that shot from a similar situation. This provides a measure of a player's talent relative to the league average and better contextualises their performance.

Limitations of Event and Expected Metrics Data

The main limitation of solely using pre-labelled event and expected metrics data from this supervised machine learning process is that not everything can be digitised. Most analyses conducted today are based on events and expected metrics, but these are semantic layers that have been pre-described or pre-categorised by humans. We have put certain patterns of play or combinations of player movements into labelled boxes to make sport events easy to aggregate and analyse. However, the dots generated from tracking data, and the trajectories identified from them, open up numerous possibilities for further analysis that humans cannot perform manually, precisely because such analysis is not constrained to these pre-labelled categories of play patterns or specific player movements.

Performing Deeper Sports Analysis

The more granular the data, the better the analysis we can conduct of a sport. Tracking data provides that necessary level of granularity to conduct advanced analytics. Some of the key tasks that deeper data and better metrics can do much better than humans are strategy, search and simulation.

Strategy Analysis

Marcelo Bielsa once broke down the way he does analysis at Leeds United. His analysis team watches all 51 matches of their upcoming opponent from the current and prior seasons, each game taking 4 hours to analyse. In that analysis, they look for specific information about the team’s starting XI, the tactical system and formations and the strategic decisions that they make on set pieces. However, it can be argued that this methodology is time-consuming, subjective and often inaccurate. This is where technology can come in and help by making the analysis process more efficient than having a team of Performance Analysts spend 200 hours assessing the next opponent.

The idea is to transition strategy analysis in sports from a traditional qualitative approach to a more quantitative method. Tracking data has hidden structures. The strategies and formations of a team in a match of football are hidden within all the data points collected from tracking data. Insights on things like formation or team structure do not directly emerge from the tracking data without additional work, because tracking data is noisy, for reasons such as players constantly switching positions on the pitch. But what tracking data allows you to do is to find that hidden behaviour and structure of a team or its players and let it emerge.

Visual representation of a noisy tracking dataset of players in a football pitch (Source: Patrick Lucey at Stats Perform)

As a way to better visualise and interpret tracking data, Stats Perform have developed the software solution Stats Edge Analysis to enable the querying of infinite formations based on tracking data. The software shows the average formation of players throughout a match, how often each player is in a certain situation, how a team's structure evolves when attacking or defending and how the formation compares across different contexts, situations or playing styles.

Formation analysis in Stats Edge Analysis software (Source: Patrick Lucey at Stats Perform)

Search Analysis

How do we find similar plays in sport? How do we search across the history of a sport to find similar situations to the one we are interested in comparing with? One way is to use sport semantics and search using keywords such as a “3pt shot” play in basketball, a “pick and pop” play or a play “on top of the 3pt line”. However, if we want to know where all the players were located in a play, their velocity or their acceleration, as well as all the events that led up to that point, we would need to use too many words to describe that particular play very precisely. In other words, searching across the history of a sport for a similar play using just keywords does not capture the fine-grained location and motions of players and ball and does not provide a ranking of how similar the found plays are to the original play we want to compare them with.

A solution to this problem is to use tracking data. Tracking data is a low-dimensional representation of what we see in video. Therefore, instead of using keywords to find a similar play, we could use a tracking-data snapshot of a play as the input to a visual search query. Users could then interactively describe the type of play they want to search for, and the query tool would output a set of similar plays ranked by their degree of similarity to the queried play.
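
A minimal sketch of this idea, assuming each play snapshot is flattened into a vector of player coordinates and ranked by plain Euclidean distance (real systems would need alignment, normalisation and richer play representations), could look like this:

```python
# Minimal sketch of a visual play search: represent each play snapshot as a
# flattened vector of player coordinates and rank stored plays by distance
# to the queried snapshot. All data here is synthetic.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
n_plays, n_objects = 10_000, 11                    # plays in the database, tracked objects
database = rng.random((n_plays, n_objects * 2))    # x, y per object, flattened

index = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(database)

query = rng.random((1, n_objects * 2))             # snapshot drawn/selected by the user
distances, play_ids = index.kneighbors(query)
for rank, (pid, d) in enumerate(zip(play_ids[0], distances[0]), start=1):
    print(f"{rank}. play {pid} (distance {d:.3f})")
```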

Visual search query of similar plays (Source: Patrick Lucey at Stats Perform)

This type of visual search tool based on tracking data can offer the possibility of drawing out the play to search for. It can also offer the ability to move players around the court and use expected metrics to show the likelihood of a player scoring from various positions. It can even show the changes in scoring likelihood based on the position of the defensive players relative to the player with the ball.

Play Simulation

Technology in sports is entering the sidelines. The type of technology coaches need to evaluate plays during a game and simulate different outcomes needs to be highly interactive. One way Stats Perform has used tracking data to improve play simulations is through ghosting. The idea of ghosting is to show the average play movements at the same time as the live play represented with dots on a diagram of the field. For example, tracking data can display the home team in one colour (blue) and away team in another colour (red), but additionally it can add a third defensive team in a different colour (white) that represents how the average team in the league would defend that same situation.

Ghosting of an average team in the league (white) defending a situation (Source: Patrick Lucey at Stats Perform)

Another way Stats Perform is working with coaches on the sidelines to provide more interactive play simulations is through real-time interactive play sketching. A coach can draw out a play they want their players to perform on a clipboard, and tracking data and technology can turn that into an intelligent clipboard that simulates how the drawn play would unfold.

Performing Deeper Sports Forecasting

The more granular the data available, the better we can predict sports performance. Some of the applications of tracking data in forecasting include player recruitment (i.e. which players to buy, trade, draft or offer longer contracts) and match predictions (i.e. accurately predicting the final outcome, score and statistics of a match, both before the match takes place and in-play).

Player Recruitment

In the NBA, the league has a good level of coverage for tracking data. But what happens when a team wants to recruit someone from college? Tracking data might not exist in college leagues, which forces teams to use a very simplified version of reporting to forecast how that player is going to play once he is recruited onto the team.

This highlights the issue of tracking data coverage. Major leagues have that level of detailed tracking data, but most lower leagues and academy competitions do not. Also, historical matches from major leagues prior to the era of tracking data will not have had the systems and equipment in place at the time to produce highly detailed tracking data. This is where the generation of tracking data from broadcast video footage can fill that void.

Tracking data derived from broadcast footage is the ultimate method to produce detailed recruitment data. Analysts can go back in time and produce data on all previously untracked players simply by using the footage available from past games. Stats Perform achieves this through AutoStats. AutoStats is a data capture system that can identify where players are located, even though the camera is constantly moving, by applying continuous camera calibration. It detects the body pose of players and can re-identify a player once that player comes back into view after having left the frame. Additionally, AutoStats uses optical character recognition to collect the game and shot clock on every frame, as well as action recognition to track the duration of player events at frame level.

Once that tracking data has been generated from lower leagues or college games, AI-based forecasting can be applied to discover which professional players the scouted player of interest is most similar to. These solutions can even project a young player's future career performance, using prediction models built from historical data on former rookies and their eventual success to forecast the future performance of current prospects.

Given the limited coverage of tracking data in lower and junior leagues, another method to overcome that limitation is to maximise the value of the already collected event data, which has much wider coverage than tracking data. Machine learning can derive the specific attributes of two players so they can be compared with each other. These attributes can be spatial attributes, such as where they normally receive the ball, contextual attributes, such as their team's playing style (i.e. frequency of counter attacks, high press, crossings, direct play, build-up play, etc.) and quality attributes, such as expected metrics that capture the value and talent of each player. This method can provide a clear comparison of two different players relative to the context in which they play, for example, how often a player is involved relative to the playing style in a particular situation.

Taking all this data and the derived attributes from event data, you can then run unsupervised models, such as Gaussian mixture model clustering, to discover groupings of players based on their similarities, and then create a number of unique player clusters that divide pools of players. These clusters can then surface information about the roles that different groups of players play in their teams, whether they are “zone-movers”, “playmakers”, “risk-takers”, “facilitators”, “conductors”, “ball-carriers” or any other clusters that can emerge from applying unsupervised methods. This way, if a team wants to find a player similar to a specific successful player (i.e. players similar to Messi), but with some attributes that are slightly different (i.e. age, league, etc.), they are able to specify that search criteria and find players that fit the profile that they are after.
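
A minimal sketch of this clustering step, assuming a synthetic table of player attributes in place of the event-data-derived ones described above, might look as follows:

```python
# Minimal sketch of clustering players into roles with a Gaussian mixture
# model. The attribute table is a synthetic stand-in for the spatial,
# contextual and quality attributes described above.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n_players = 2000
attributes = np.column_stack([
    rng.random(n_players),          # e.g. average x position where the ball is received
    rng.random(n_players),          # e.g. share of involvements in counter-attacks
    rng.random(n_players),          # e.g. expected-metric value added per 90 minutes
])

X = StandardScaler().fit_transform(attributes)
gmm = GaussianMixture(n_components=6, random_state=0).fit(X)
roles = gmm.predict(X)              # cluster id per player ("playmaker", "ball-carrier", ...)

# Find players in the same cluster as a reference player (e.g. index 0)
reference = roles[0]
similar_players = np.flatnonzero(roles == reference)
print(f"{len(similar_players)} players share cluster {reference} with the reference player")
```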


Match Predictions

There are a couple of ways that AI can help with match predictions. One of them is implicitly, through crowd-sourced data. Prediction markets like betting exchanges provide a marketplace for customers to bet on the outcome of discrete events. This is a crowd-sourced method, and if there are enough participants to represent the entire collective wisdom of the market, with enough diversity of information and independence of decisions in a decentralised way, it is the best predictor you can get. It is an implicit market, as we do not know why people have made their betting choices, and it is therefore not interpretable. If enough people are participating in these markets, then all possible information to make a prediction is present in that market, and in that case it is not possible to beat the accuracy of the market's prediction.

Another method is to use an explicit data-driven approach using only data from historical matches together with machine learning techniques to predict probabilities of match outcomes. This method relies on the accuracy and depth of the data available and can only capture the performance present within the data points collected. The advantage of using a data-driven approach is that it can be interactive and interpretable. Also, it only needs the data feed of events, which makes it scalable. However, since not all data might be captured in the dataset used (i.e. injury data), there may be gaps in the analysis that can affect the predictions made.
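
As a sketch of such a data-driven model, assuming a few illustrative pre-match features and synthetic historical outcomes, one could fit a simple multinomial classifier:

```python
# Minimal sketch of a data-driven match prediction model: historical match
# features in, home/draw/away probabilities out. Feature names are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n_matches = 5000
features = np.column_stack([
    rng.normal(0, 1, n_matches),    # e.g. difference in rolling xG between the teams
    rng.normal(0, 1, n_matches),    # e.g. difference in points per game
    rng.normal(0, 1, n_matches),    # e.g. rest-days advantage
])
outcome = rng.choice(["home", "draw", "away"], n_matches, p=[0.45, 0.27, 0.28])

model = LogisticRegression(max_iter=1000).fit(features, outcome)

upcoming = np.array([[0.6, 0.3, 0.0]])           # feature vector for an upcoming match
for label, p in zip(model.classes_, model.predict_proba(upcoming)[0]):
    print(f"P({label}) = {p:.2f}")
```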

Sportsbooks normally use a hybrid approach of crowd-sourced data together with data-driven methods to balance the action on both sides of the wager and to manage their level of risk. They initialise the market with a data-driven approach and human intuition and then iterate based on volume, other sportsbooks' lines and any unique incentives they want to offer to their own customers.

AI-based solutions and tracking data can be used to support these prediction markets, particularly in those markets with insufficient coverage to achieve crowd wisdom. One way of doing so is through the calculation of win probability. Win probability is extensively used across nearly every sport for media purposes. The current limitation of win probability is that it is based on the likelihood that an average team would win given a particular match situation. However, simply using an average may miss contextual information about the specific strengths of particular teams or players involved. The way to overcome that is to use specific models that incorporate the players, teams and line-ups of the match in question.

Stats Perform uses models that learn compact representations with features such as the specific opponent, players involved and other raw features describing the lineup to improve prediction performance based on the players involved in the game. This allows them to create specific player props that can predict individual player statistics (i.e. expected points scored in basketball) for each player in the lineup and illustrate that player’s future game performance before the game starts.


Similarly, these predictions can also be made in real-time while a match is being played. For example, using tracking data, in-play predictions in a tennis match can predict who is more likely to win the next point while the rally is taking place. You can even go a level deeper and predict the location where the ball will land after the next stroke. In football, you could also predict which player is going to receive the next pass or where the next shot on goal is going to occur. This is the true value of highly granular data and a data-driven approach to sports analysis.

Contextual Analysis In Sport Using Tracking Networks

Javier Martin Buldu is an expert on the analysis of non-linear systems and the understanding of how complex systems organise themselves, adapt and evolve. He focuses on the application of network science and complex systems theory in the analysis of sports. Buldu’s work is based on the principle that teams are far more than the simple aggregation of their individual players. By collaborating with organisations such as the Centre of Biomedical Technology in Madrid, La Liga, ESADE Business School, IFISC research institute and the ARAID Foundation, he has been able to combine elements of graph theory, non-linear dynamics, statistical physics, big data and neuroscience to construct various networks using positional tracking data of a football match. These networks are then able to explain what happens on the pitch beyond conventional ways of assessing the performance of individual players to understand team behaviours.

What Is Complex System Theory?

A complex system is a system composed of different parts that are connected and interact with one another. Such a system has properties and behaviours that cannot be explained by simply breaking it down into its individual parts and analysing each part independently. For example, the human brain is a complex system, and it has proven extremely challenging for scientists to fully understand how it performs all its functions, from how memory is stored to how cognition appears and disappears during certain illnesses. On the other hand, the human brain's most fundamental component, the neuron, has been thoroughly studied and documented by science. Scientists have been able to recreate models and simulations of neuron behaviour and understand neurons' shape and how they communicate with one another. However, this robust understanding of single-neuron behaviour has not been sufficient to allow scientists to comprehend the interplay and interdependencies of the 80 billion neurons that form the human brain and allow it to perform all of its complex behaviours. Instead, in order to appropriately study the brain, scientists need to pay attention to the entire human cognitive system as a whole.

The idea behind complex systems like the human brain is what Buldu wanted to introduce in the analysis of football. While it is interesting to have information about isolated player performance, such as the number of shots, passes or successful dribbles, it is also important to understand the context in which these events take place. Additional insights on the performance of players and teams can be obtained by analysing information about how a player interacted with his teammates and the opposition’s players. Paying attention to individual player performances and aggregating those together is not enough to fully understand how a team behaves during a match.

Instead, a complex system approach to football analysis would, for example, look at the link created between two or more players when they pass the ball between them. A network of these players can then be created by simply leveraging event data collected from notational video analysis to count the number of passes from player A to player B and vice versa. These types of passing networks are increasingly common in football match analysis and team reports, as they clearly illustrate information about how a team played during a match, where its players were most frequently located on the pitch and how they interacted with each other.

Passing Network between FC Barcelona players (Source: Javier Martin Buldu at FC Barcelona Sports Tomorrow)

However, more complex and informative networks can be developed by leveraging positional tracking data instead of event data. While event data is generated through notational analysis by tagging specific actions, positional tracking data describes the position of the 22 players and the ball on the pitch at any moment in time during a match of football. Unfortunately, positional tracking data is challenging for most analysts to access. That is why Buldu collaborated with La Liga to obtain a positional tracking dataset containing Spanish football league matches. To capture this information, La Liga uses Mediacoach, a software platform that acquires the positional coordinates of players and the ball using a TRACAB optical video tracking system, which requires the installation of specialised cameras across football stadiums. Mediacoach's system tracks a player's position at 25 frames per second with a precision of 10cm. Thanks to this detailed tracking dataset received from La Liga, Buldu was able to explore the different interactions between players and construct a number of complex tracking networks in football.

Proximity Networks

The first network that Buldu produced explored the proximity between players on the pitch. He first defined an arbitrary radius around a player, say 5m, and used it as a threshold to identify any other players falling inside that particular player's area. If another player was located inside the first player's surrounding area, a link was created between those two players. If the two players were from the same team, a positive link was created, while if they were from opposing teams a negative link was assigned to that interaction instead. By increasing or decreasing the radius surrounding each player (i.e. 5m, 10m or 15m), Buldu produced different networks and links between players following this method.

Proximity radius at 5m, 10m and 15m showing links with players of the same team (green) and with opposing players (red) (Source: Javier Martin Buldu at FC Barcelona Sports Tomorrow)

The challenge of producing a variety of proximity networks is that they may prove difficult to analyse, as the links identified in a single video frame using a 5m radius around each player may be very different from those found using a 15m radius. On top of that, the analysis should look at how those proximity networks evolve over a number of frames during the match. In order to gather practical insights from these networks, Buldu aimed to study the number of positive and negative links for each team, as well as the organisation of the proximity network structure, its temporal evolution and how it changes in relation to the zone of the pitch and the various phases of the game.

Proximity analysis of the 3-player links for all players in a match between Atletico Madrid and Real Valladolid (Source: Javier Martin Buldu at FC Barcelona Sports Tomorrow)

He first counted the number of links between three different players forming a triangle. He then classified each triangle into two categories: positive triangles (all players from the same team) or mixed triangles (at least one player from the opposing team). Buldu was then able to determine which team had dominance over the other at different times of the match by counting the number of positive and mixed triangles produced with a certain threshold distance. The team with the highest proportion of positive triangles (i.e. all three players in close proximity to each other forming a triangle were from the same team) was deemed to have been dominant over its opposition.
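
A minimal sketch of this triangle-counting idea on a single frame of positional data, assuming synthetic coordinates and a 10m threshold, could look like this:

```python
# Minimal sketch of a proximity network: link players within a threshold
# radius, then count positive (same-team) vs mixed triangles for one frame.
# Coordinates and the 10m threshold are illustrative assumptions.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
positions = rng.uniform([0, 0], [105, 68], size=(22, 2))   # 22 players on a 105x68m pitch
team = np.array([0] * 11 + [1] * 11)                       # team id per player
radius = 10.0                                              # proximity threshold in metres

# Build links: player pairs closer than the threshold
dist = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
linked = (dist < radius) & ~np.eye(22, dtype=bool)

positive = {0: 0, 1: 0}   # triangles fully within one team
mixed = 0                 # triangles containing players from both teams
for i, j, k in combinations(range(22), 3):
    if linked[i, j] and linked[j, k] and linked[i, k]:
        if team[i] == team[j] == team[k]:
            positive[team[i]] += 1
        else:
            mixed += 1

print(f"Positive triangles per team: {positive}, mixed triangles: {mixed}")
```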

Marking Networks

The second type of network that Buldu constructed with positional tracking data measured the time a player spent covering an opposing player during defensive phases of play. Again, by setting an arbitrary threshold distance around a defender, a link between the defender and an opposing player can be created by counting the time both players are in close proximity to one another. This process produces a matrix with the defenders on one axis and the attackers on the other, and provides a rough idea of the amount of time each attacking player was being marked and by which defensive player. By interpreting the marking matrix, analysts can identify the players with the highest accumulated time being marked by a defensive player.
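
A minimal sketch of building such a marking matrix, assuming synthetic positional data, a 3m marking threshold and a 25fps frame rate, might look as follows:

```python
# Minimal sketch of a marking matrix: accumulate, frame by frame, the time each
# defender spends within a threshold distance of each attacker.
# All inputs here are synthetic assumptions.
import numpy as np

rng = np.random.default_rng(6)
n_frames, fps, threshold = 2500, 25, 3.0
defenders = rng.uniform([0, 0], [105, 68], size=(n_frames, 11, 2))
attackers = rng.uniform([0, 0], [105, 68], size=(n_frames, 11, 2))

marking_seconds = np.zeros((11, 11))          # rows: defenders, columns: attackers
for f in range(n_frames):
    dist = np.linalg.norm(defenders[f][:, None] - attackers[f][None, :], axis=-1)
    marking_seconds += (dist < threshold) / fps   # add 1/fps seconds for each close pair

most_marked = marking_seconds.sum(axis=0).argmax()
print(f"Attacker {most_marked} was marked longest: "
      f"{marking_seconds[:, most_marked].sum():.1f}s in total")
```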

Player marking matrix between Real Madrid (y-axis) and Leganes (x-axis) showing how often each Real Madrid player was marked by a Leganes player (Source: Javier Martin Buldu at FC Barcelona Sports Tomorrow)

Since a matrix is the mathematical representation of a network, this information can be drawn onto a diagram of a football pitch to plot the position of players during defensive actions. The size of each node in this network indicates the time an attacking player was being defended. By using these marking networks, analysts can clearly visualise the interactions and efforts of attacking and defending players during a match of football.

Player marking network between Real Madrid and Leganes (Source: Javier Martin Buldu at FC Barcelona Sports Tomorrow)

Coordination Networks

The third network that Buldu produced evaluated the coordination of movements between players of the same team. The network computed the velocity and direction of movement of two players to measure the alignment of their vectors. When this vector alignment was high, a high-value link between the two players was created; when the alignment was low, a lower-value link was created instead. This method results in a matrix that illustrates how well players are coordinated with their own teammates. Two different matrices can be produced, one to analyse offensive phases of play and one for defensive phases.
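
A minimal sketch of this vector-alignment idea, using cosine similarity of frame-to-frame movement vectors on synthetic positions for one team, could look like this:

```python
# Minimal sketch of a coordination matrix: measure the alignment of teammates'
# velocity vectors with cosine similarity and average it over the frames of a
# phase of play. Positions are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(7)
n_frames = 500
positions = rng.uniform([0, 0], [105, 68], size=(n_frames, 11, 2))   # one team
velocities = np.diff(positions, axis=0)                              # per-frame movement vectors

norms = np.linalg.norm(velocities, axis=-1, keepdims=True)
unit = velocities / np.clip(norms, 1e-9, None)                       # unit direction vectors

# Average pairwise cosine similarity across frames -> 11x11 coordination matrix
coordination = np.einsum("fid,fjd->ij", unit, unit) / unit.shape[0]

player_coordination = (coordination.sum(axis=1) - 1) / 10            # exclude self-similarity
print("Most coordinated player:", player_coordination.argmax())
```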

Vector alignment of two attacking players (Source: Javier Martin Buldu at FC Barcelona Sports Tomorrow)

Similarly to marking networks, coordination network matrices can also be translated into diagrams on a football pitch, where the nodes represent each player on the pitch while the size of each node indicates the amount of coordination the player has with the rest of his teammates. The links between two nodes also indicate the level of coordination between two particular players of the same team.

Movement coordination of each player with the rest of his teammates (Source: Javier Martin Buldu at FC Barcelona Sports Tomorrow)

This type of analysis, especially when split between offensive and defensive phases, can help analysts better understand the level of coordination in attacking and defensive play. For instance, an analyst or coach may want to see high degrees of coordination when the team defends as a block, as well as how that coordination changes during the different phases of the game.

Ball Flow Networks

The final network developed by Buldu focused on ball movement between different areas of the pitch. This network was produced by splitting the football pitch into different sections and counting the number of times the ball travelled from one section to another in order to create links between sections. This ball flow network can also be visualised on a diagram of a football pitch, with the nodes representing each section of the pitch and the links indicating the number of times the ball moved from one section to the next. The size of each node indicates the amount of time the ball was in play inside that particular section of the pitch. By constructing an entire ball flow network for a match, analysts can identify which sections of the pitch are most important for their team and assess how to exploit different sections of the opposition's half in order to create dangerous opportunities.
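
A minimal sketch of constructing such a ball flow network, assuming synthetic ball coordinates and an arbitrary 6x4 grid of zones, might look as follows:

```python
# Minimal sketch of a ball flow network: split the pitch into a grid of zones
# and count ball transitions between zones. Ball coordinates are synthetic;
# a 6x4 grid on a 105x68m pitch is an assumed choice.
import numpy as np

rng = np.random.default_rng(8)
ball_xy = rng.uniform([0, 0], [105, 68], size=(5000, 2))   # ball position per frame

n_cols, n_rows = 6, 4
col = np.minimum((ball_xy[:, 0] / (105 / n_cols)).astype(int), n_cols - 1)
row = np.minimum((ball_xy[:, 1] / (68 / n_rows)).astype(int), n_rows - 1)
zone = row * n_cols + col                                  # zone id per frame

transitions = np.zeros((n_cols * n_rows, n_cols * n_rows), dtype=int)
for a, b in zip(zone[:-1], zone[1:]):
    if a != b:                                             # only count zone changes
        transitions[a, b] += 1

time_in_zone = np.bincount(zone, minlength=n_cols * n_rows)
busiest = transitions.sum(axis=1).argmax()
print(f"Zone {busiest} launched the most ball movements "
      f"({transitions[busiest].sum()} transitions, {time_in_zone[busiest]} frames of possession)")
```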

Ball flow network for a match between FC Barcelona and Espanyol (Source: Javier Martin Buldu at FC Barcelona Sports Tomorrow)

Buldu’s work provides a great analytical framework to assess the complexities of sports in which a large diversity of factors can influence different outcomes of the game. It is crucial that when analysing a sport, all the available contextual information is analysed from various perspectives that can together provide a more complete evaluation of performance. Researchers, scientists and analysts are increasingly producing exciting work with positional tracking data that can open the door to new sophisticated methodologies and models to help coaches better understand the key influential factors of their team’s performance.

Further Reading:

  • Futbol y Redes Website

  • Buldu, J. M., Busquets, J., & Echegoyen, I. (2019). Defining a historic football team: Using Network Science to analyze Guardiola’s FC Barcelona. Scientific reports, 9(1), 1-14. Link to article.

  • Buldu, J. M., Busquets, J., Martínez, J. H., Herrera-Diestra, J. L., Echegoyen, I., Galeano, J., & Luque, J. (2018). Using network science to analyse football passing networks: Dynamics, space, time, and the multilayer nature of the game. Frontiers in psychology, 9, 1900. Link to article.

  • Garrido, D., Antequera, D. R., Busquets, J., Del Campo, R. L., Serra, R. R., Vielcazat, S. J., & Buldú, J. M. (2020). Consistency and identifiability of football teams: a network science perspective. Scientific reports, 10(1), 1-10. Link to article.

  • Herrera-Diestra, J. L., Echegoyen, I., Martínez, J. H., Garrido, D., Busquets, J., Io, F. S., & Buldú, J. M. (2020). Pitch networks reveal organizational and spatial patterns of Guardiola’s FC Barcelona. Chaos, Solitons & Fractals, 138, 109934. Link to article.

  • Martínez, J. H., Garrido, D., Herrera-Diestra, J. L., Busquets, J., Sevilla-Escoboza, R., & Buldú, J. M. (2020). Spatial and Temporal Entropies in the Spanish Football League: A Network Science Perspective. Entropy, 22(2), 172. Link to article.

Automating Data Collection And Match Analysis From Video Footage

Dr Manuel Stein has spent over 7 years researching and analysing player movement using detailed positional football data. His work has focused on the investigation of real-time skeleton extraction to perform match analysis of player movement, with the aim of fostering the understanding of comparative and competitive behaviours in football. He has revolutionised the way match and tactical analysis is performed by teaching computers how to measure key playing aspects of the sport, such as team dominance or a player's control of space, derived directly from video footage. Stein has developed an automatic and dynamic model that takes into account the contextual factors that influence the movement and behaviour of players during a match. This novel player detection system is able to automatically display complex and advanced 5-D visualisations superimposed on the original video footage.

Generating Data From Match Video Footage

The first step for any meaningful quantitative analysis is to obtain highly detailed data with which to properly test our assumptions. However, highly detailed sports data may be challenging to obtain unless sophisticated tracking technology is used and the results of such tracking are easily accessible to the analyst. On top of that, when it comes to positional player data in football (i.e. xy coordinates of players on the pitch), gaining access to this level of granular data is especially challenging for most analysts. This is the problem Stein faced during the initial phases of his research, and it led him to develop his own data extraction method using television footage and computer vision techniques.

Identifying Players On The Pitch

Stein’s method of extracting data from television footage started with the detection of each player on the pitch. In order to automatically identify the players, Stein exploited the distinctive colours present on the football pitch, more specifically the colours of the players' shirts. By picking a player in the video, he constructed a colour histogram that best described the most prominent colours in that player's shirt. Once those colours were identified, he then automatically searched across the video frame for contours of a minimum size containing the same colours detected in that player's shirt, in order to spot all other players wearing the same colour. The computer then automatically calculated the centroid of each detected area (i.e. the players, as well as minor noise) and used average human body proportions to draw boxes enclosing the entire player on the screen.
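
A minimal sketch in the spirit of this colour-based detection, using OpenCV with an assumed HSV shirt-colour range, minimum contour area and body-proportion factor (not Stein's actual implementation), could look like this:

```python
# Minimal sketch of colour-based player detection with OpenCV.
# The HSV shirt-colour range, minimum contour area and body-proportion
# factor are illustrative assumptions.
import cv2
import numpy as np

def detect_players(frame_bgr, shirt_lower_hsv, shirt_upper_hsv, min_area=150):
    """Return bounding boxes (x, y, w, h) around shirt-coloured regions."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, shirt_lower_hsv, shirt_upper_hsv)   # pixels matching the shirt colour
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    boxes = []
    for contour in contours:
        if cv2.contourArea(contour) < min_area:                 # discard small noise blobs
            continue
        x, y, w, h = cv2.boundingRect(contour)                  # shirt region
        # Grow the box downwards using rough human proportions (torso ~1/3 of height)
        boxes.append((x, y, w, int(h * 3)))
    return boxes

# Example usage on a synthetic frame (a real broadcast frame would be loaded instead)
frame = np.zeros((720, 1280, 3), dtype=np.uint8)
cv2.rectangle(frame, (600, 300), (630, 340), (0, 0, 200), -1)   # a fake red shirt
red_lower, red_upper = np.array([0, 120, 120]), np.array([10, 255, 255])
print(detect_players(frame, red_lower, red_upper))
```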

Colour-based player detection (Source: Manuel Stein at FC Barcelona Sports Tomorrow)

This colour-based player detection method enabled Stein to identify all players on the pitch. The additional noise captured on the sidelines and in the stadium crowd was later removed by using thresholds and ignoring areas that only appear on screen for a brief moment. However, this colour-based detection approach has certain limitations depending on the match footage. Lighting variations during matches that kick off under sunlight and finish around dusk do not impact colour perception in humans, but they do affect automatic colour-based player detection systems, as towards the end of the match the computer will no longer be detecting the same colours it did at kick-off.

In order to overcome this limitation and develop a system that works in all match conditions, Stein explored additional automated real-time methods to simultaneously extract player body poses and movement data directly from the video footage. One of those methods was OpenPose, a well-known and established computer vision system for human body pose detection. However, OpenPose was not a suitable option when working with football footage, as the system struggles to detect small-scale people on the screen and cannot run in real-time during a match. Instead, Stein developed and trained his own deep learning model completely from scratch.

Body pose detection system (Source: Manuel Stein at FC Barcelona Sports Tomorrow)

Stein’s human body detection model uses a skeleton model based on a hierarchical graph structure that represents a body's pose. Every node in the hierarchical graph corresponds to the position of a body part of the person's skeleton, such as joints, ears, eyes and so on, called key points. The edges of this hierarchical graph represent an anatomically correct connection between two body parts. Stein's body pose detection process follows two stages: the detection of individual body parts, followed by the probabilistic reconstruction of the skeletons by connecting all identified body parts together. The constructed skeletons of the players are then overlaid on the original video footage for easy visualisation. Stein's model outperformed OpenPose in estimation accuracy when estimating the skeletons of medium-scale people from the Microsoft COCO dataset. Moreover, the model architecture is also optimised for real-time, low-latency video analysis, unlike OpenPose, which struggles to run at resolutions close to 4K.

Identifying The Ball

The next step was to detect the ball. For that, the model followed a two-step approach: a per-frame candidate detection step followed by a temporal integration phase. It first detected all possible objects on the screen that could potentially be the ball by using a convolutional neural network. The computer detected things such as the penalty spot, the corner kick spot, the centre spot, white football boots or the ball itself as being possible candidates. The next step was to identify an accurate and realistic ball trajectory over a period of time from the previously identified candidates using a recurrent neural network. This enabled the model to specify which one out of the previously detected objects was indeed the ball, as it was moving throughout the footage as a ball would be expected to move. By using this approach, the ball could be tracked even when it was not visible on the video footage. For instance, the computer continued to track the ball even when a player picked it up before a penalty kick and happened to hide it from the camera.

Determining Player And Ball Location On The Pitch

Once both players and the ball have been detected, the following step is to determine their location on the full football pitch. The challenging part in this section is the fact that the camera is continuously focusing on different parts of the pitch rather than the pitch as a whole. To solve this issue, Stein had to produce a static camera shot by creating a panoramic view of the complete stadium using a subset of input frames from the video footage (i.e. all frames from the first two minutes of a match). The overlap of all these snapshots from the video footage was then used to recreate a panoramic view of the pitch that allowed Stein to calculate the pitch’s homography. He was then able to identify how two different images connected together, or detect whether one image was simply a subset of a larger image. The homography calculation then enabled Stein to project each of the frames from the video footage into the panoramic view of the pitch as a unique reference frame and fully visualise where on the full pitch each frame took place.

Projection of frames on the panoramic view of the full pitch (Source: Manuel Stein at FC Barcelona Sports Tomorrow)

With all players and the ball correctly identified and their positions accurately projected onto the panoramic view, the next step was to project these player locations onto a normalised football pitch to start generating usable positional data for further analysis. By providing the system with a standard image of a football pitch, a user can select a minimum of four corresponding points in both the panoramic view and the standard pitch image, so that the system can use the homography calculations to translate positions from the panoramic view into the standard image of the pitch. This allows the system to automatically plot accurate player positional data on a standard diagram of a football pitch.
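
A minimal sketch of this projection step, assuming four hand-picked point correspondences and OpenCV's homography utilities, might look as follows:

```python
# Minimal sketch of projecting detected player positions onto a standard pitch
# diagram with a homography. The four point correspondences (e.g. penalty-box
# corners picked by the user) and the player pixel positions are assumptions.
import cv2
import numpy as np

# Pixel coordinates of four reference points in the panoramic/broadcast view ...
view_pts = np.array([[220, 410], [890, 395], [960, 640], [150, 655]], dtype=np.float32)
# ... and the same four points in a 105x68m pitch coordinate system
pitch_pts = np.array([[16.5, 13.85], [16.5, 54.15], [0.0, 54.15], [0.0, 13.85]], dtype=np.float32)

H, _ = cv2.findHomography(view_pts, pitch_pts)

# Player positions detected in the image (pixels) -> positions on the pitch (metres)
players_px = np.array([[[300, 500]], [[700, 450]], [[500, 600]]], dtype=np.float32)
players_pitch = cv2.perspectiveTransform(players_px, H)
print(players_pitch.reshape(-1, 2))
```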

Player locations and movements illustrated in real-time on a diagram of the pitch on the top right corner (Source: Manuel Stein at FC Barcelona Sports Tomorrow)

Automatically Measuring Contextual Information From Video

Stein took his research further by incorporating the tracking of elements in a match that are not clearly visible to a computer, areas such as the dependencies, influences and interactions between players during the various scenarios of a game. For a fully automated football analysis system to work, this context information that is obvious to humans also needs to be taken into account and measured by the computer. In a dynamic team sport like football, players are more than simple and independently moving dots on a pitch. There is a complex network of interactions and dependencies that dictate how a player reacts to a situation, how they cooperate with teammates and how they attempt to prevent the opposing players’ actions.

Interaction Spaces

One way to automatically measure contextual information from player positional data was to identify the specific regions of the pitch that are controlled by different players. Stein argued that each player has a surrounding area that they fully control based on their position on the pitch. These control regions are what he called 'interaction spaces': areas of the pitch a player can reach before any opposing player or the ball could reach that same space. The size and shape of these interaction spaces are influenced by player speeds and directions, as well as the distance between the players and the ball, because players further away from the ball may have more time to react.

Interaction spaces for each player (Source: Manuel Stein at FC Barcelona Sports Tomorrow)

On top of that, competition between two opposing players to control a certain zone also has an impact on the shape of these interaction spaces, as players from the opposing team will aim to restrict certain movements. Therefore, when defining interaction spaces on the football pitch, Stein aimed to account for these interdependencies, which may prevent a player from reaching a particular zone before an opposing player does and so maintaining ball possession. This can be seen in the illustration above between the blue team’s defensive line and the red team’s forwards, where players close to opposing players restrict each other’s interaction spaces. Lastly, Stein was able to take the pitch visualisations of the previously recorded positional data and enrich them with additional contextual information that clearly illustrates each interaction space in real time.
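
Ignoring player speed and direction for a moment, a rough first approximation of these interaction spaces is a plain Voronoi partition of the pitch around the players’ current positions. The snippet below is a minimal sketch of that simplification using scipy, with invented positions; Stein’s actual model additionally weights the regions by speed, direction and distance to the ball:

import numpy as np
from scipy.spatial import Voronoi

# Illustrative player positions (x, y) in metres on a 105 x 68 m pitch.
players = np.array([
    [10, 34], [25, 20], [25, 48], [40, 34],   # blue team
    [60, 34], [75, 20], [75, 48], [90, 34],   # red team
])

# Each Voronoi cell holds the points this player reaches before anyone else,
# assuming equal speeds and straight-line runs.
vor = Voronoi(players)

for player_idx, region_idx in enumerate(vor.point_region):
    region = vor.regions[region_idx]
    if -1 in region:
        print(f"player {player_idx}: unbounded region (clipped by the pitch boundary)")
    else:
        print(f"player {player_idx}: region with vertices {vor.vertices[region]}")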

Free Spaces

An alternative way of contextualising automatic tracking data was the inclusion of free spaces. Stein calculated free spaces by segmenting the pitch into grid cells of one square metre. He then assigned each cell to the player with the highest probability of reaching it, based on the distance to the cell and the player’s speed and direction of movement. Similarly to interaction spaces, free spaces were the grid cells a player could reach before any opposing player. Ultimately, free spaces represented the pitch regions a specific team or player owned.
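
A minimal sketch of this grid-based assignment is shown below, using straight-line time-to-reach as a stand-in for the probability of reaching a cell; the positions, speeds and team sizes are assumptions made purely for illustration:

import numpy as np

PITCH_LENGTH, PITCH_WIDTH = 105, 68  # metres, so each grid cell is 1 square metre

# Illustrative positions (x, y) and a single running speed (m/s) for all players.
blue_positions = np.array([[20.0, 30.0], [35.0, 40.0]])
red_positions = np.array([[60.0, 30.0], [70.0, 45.0]])
SPEED = 7.0

def fastest_time(positions, cell_centre):
    # Straight-line time for the quickest player of a team to reach the cell.
    return np.linalg.norm(positions - cell_centre, axis=1).min() / SPEED

# Assign every cell to the team that can reach its centre first.
ownership = np.empty((PITCH_LENGTH, PITCH_WIDTH), dtype="U4")
for x in range(PITCH_LENGTH):
    for y in range(PITCH_WIDTH):
        centre = np.array([x + 0.5, y + 0.5])
        blue_t = fastest_time(blue_positions, centre)
        red_t = fastest_time(red_positions, centre)
        ownership[x, y] = "blue" if blue_t < red_t else "red"

print("cells owned by blue:", (ownership == "blue").sum())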

All free spaces identified for a team in blue (Source: Manuel Stein at FC Barcelona Sports Tomorrow)

To evaluate which free spaces were most meaningful to analyse, Stein ranked all free spaces on the pitch by their value, taking into account their size, the number of opposing players overlapping them and their distance to the opposing goal.

All high value free spaces shortlisted for a team in blue (Source: Manuel Stein at FC Barcelona Sports Tomorrow)

Dominant Regions

Stein expanded his concept of region control on a football pitch by using calculations similar to those for interaction spaces to create a model that highlights the dominant regions for each team. These dominant regions are the areas of the pitch that can be reached by at least three players of the same team simultaneously. Ultimately, they represent the areas in which a particular team has substantially more control than the other.
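
The following sketch extends the same grid idea to the at-least-three-players criterion described above; the positions and the equal-speed assumption are illustrative only and are not Stein’s exact calculation:

import numpy as np

def dominant_cells(team, opponents, speed=7.0, min_players=3, length=105, width=68):
    # Boolean grid of 1 m cells that at least `min_players` of the team can reach
    # before the closest opponent, assuming equal speeds and straight-line runs.
    dominant = np.zeros((length, width), dtype=bool)
    for x in range(length):
        for y in range(width):
            centre = np.array([x + 0.5, y + 0.5])
            team_times = np.linalg.norm(team - centre, axis=1) / speed
            opp_time = (np.linalg.norm(opponents - centre, axis=1) / speed).min()
            dominant[x, y] = (team_times < opp_time).sum() >= min_players
    return dominant

# Illustrative positions only.
blue = np.array([[20.0, 30.0], [25.0, 40.0], [30.0, 25.0], [40.0, 34.0]])
red = np.array([[70.0, 30.0], [75.0, 40.0], [80.0, 25.0], [90.0, 34.0]])
print("dominant cells for the blue team:", dominant_cells(blue, red).sum())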

Dominant zones by players in the blue team (Source: Manuel Stein at FC Barcelona Sports Tomorrow)

Cover Shadows

Similarly, Stein extended the concept of interaction spaces to calculate player cover shadows, the area a player can cover in relation to the position of the ball; in other words, the region into which a player can prevent the ball from being played. Cover shadows can be thought of as the shadow cast by the player from a hypothetical light source located at the ball and shining in all directions. These cover shadows represent the regions that the player is able to control before the ball could reach them.

Cover shadows illustrating a player’s area coverage in relation to the ball (Source: Manuel Stein at FC Barcelona Sports Tomorrow)

Applications Of This Automated Player Tracking System

When considering the possible applications of his automated tracking system, Stein had to take into account the roles of Performance Analysts and coaches. For a Performance Analyst, video and movement data are key when analysing the strengths and weaknesses of their team and the opposition. On one side of their screen, analysts have a window with their video analysis software, such as SportsCode or Dartfish, open to notate events and analyse playing actions; on the other side, they have a window with the original match footage that they use to verify and interpret any observations captured from their coding. In practice, this means the analyst is constantly switching between two windows and comparing them to one another. While this is common practice in the field of Performance Analysis, switching focus between two screens is often an inefficient approach to video analysis: following two windows simultaneously is visually demanding and frequently turns the analysis into a ‘pause and play’ exercise.

Stein aimed to solve this problem by combining the benefits of the visualisation of the pitch from his new automatic player tracking system with the original match footage. By simply inverting the homography from the abstract pitch into the video footage, he was able to draw visualisations directly on the real pitch. This allowed him to illustrate in real-time different types of analysis, from evaluating offensive free spaces to looking at players’ interaction spaces.
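
A minimal sketch of that inversion with OpenCV is shown below: it assumes a homography H mapping video-frame pixels to pitch coordinates has already been estimated, inverts it, and projects a pitch-space polygon (for example an interaction space) back onto the frame. The matrix and coordinates are placeholders for illustration only:

import numpy as np
import cv2

# H is assumed to map video-frame pixels to pitch coordinates, as estimated
# earlier; an identity matrix is used here purely as a placeholder.
H = np.eye(3)
H_inv = np.linalg.inv(H)

# A polygon describing an interaction space in pitch coordinates (metres).
space_on_pitch = np.array([[[30.0, 20.0]], [[40.0, 20.0]],
                           [[40.0, 30.0]], [[30.0, 30.0]]], dtype=np.float32)

# Project the polygon back into the video frame and draw it on the image.
space_in_frame = cv2.perspectiveTransform(space_on_pitch, H_inv).astype(np.int32)
frame = np.zeros((720, 1280, 3), dtype=np.uint8)  # stand-in for a real video frame
cv2.polylines(frame, [space_in_frame], isClosed=True, color=(0, 255, 0), thickness=2)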

Interaction spaces automatically displayed directly on real match footage (Source: Manuel Stein at FC Barcelona Sports Tomorrow)

Stein’s dynamic and automatic real-time visualisation offered a whole new range of design opportunities for match analysis in football. For instance, the system was able to change a player’s shirt colour based on their behaviour (e.g. based on fatigue). It was also able to illustrate the best passing options available to the player in possession of the ball.

Automatically computed best passing options (Source: Manuel Stein at FC Barcelona Sports Tomorrow)

This novel tracking method provides an invaluable automatic measurement of the context of a match situation. However, like any other analytical tool, it needs to be applied correctly in order to make a difference to team and player performance. Aside from the clear operational efficiencies brought by automating tedious notational work, the knowledge gained from the system needs to be appropriately incorporated into the analysis loop. For instance, data on free spaces can be used to automatically detect suboptimal player movements and suggest potential improvements. An analyst can select specific situations where there was a shot on goal or dangerous play by the opposition and identify which of their own players controlled free spaces that could have prevented that chance. From that selection, analysts can assess which player lost control of their space the fastest and how that player could have kept control over their opponent. The identified player can then receive information about their optimal position on the pitch and their control of field space, reducing the free spaces towards their own goal left to be exploited by opponents. Stein’s system provides this guidance to analysts, coaches and players by automatically calculating the player’s movement trajectory, based on their speed and interaction space, and suggesting a realistic optimal movement for that player from their starting position to the optimal point. This means the system can automatically suggest improvements in collective behaviour based entirely on the contextual information being processed.

Click and drag interactivity (Source: Manuel Stein at FC Barcelona Sports Tomorrow)

The system also offers interactivity: analysts and coaches can drag and drop players around the pitch to explore the control spaces a player would benefit from in a different location. By moving a player, the system automatically updates that player’s trajectory and interaction spaces relative to their new location and the players around them. This gives coaches and analysts the possibility to interact with the analysis and to adapt the system based on their own knowledge of the sport.

Automated systems such as the one developed by Manuel Stein are bringing exciting levels of innovation to the sport by directly integrating data and video. Thanks to these systems, football experts, coaches and analysts become more aware of the power of analytics once they are shown the context of real-world scenarios, which in turn leads to better analytical approaches that are more easily incorporated into the daily realities of the analyst’s and coach’s roles. Ultimately, such systems reduce or completely remove much of the tedious, time-consuming work performed by analysts today, freeing up time from simple data collection that can instead be spent on more dedicated and advanced analysis of the sport.

Further Reading:

  • Manuel Stein’s publications

  • Stein, M., Janetzko, H., Breitkreutz, T., Seebacher, D., Schreck, T., Grossniklaus, M., Couzin, I. & Keim, D. A. (2016). Director's cut: Analysis and annotation of soccer matches. IEEE computer graphics and applications, 36(5), 50-60. Link to article

How The NFL Developed Expected Rushing Yards With The Big Data Bowl

Michael Lopez, the Director of Data and Analytics at the NFL, recently discussed at the FC Barcelona Sports Tomorrow conference the way that his Football Operations team and the wider NFL analytics teams leverage a large community of NFL data enthusiasts to obtain a better understanding of the game of American Football. In his talk, Michael walked through the journey that the NFL took to develop expected rushing yards, a concept that began as an initial idea within their Football Operations group and ended up making its way up to the NFL’s Next Gen Stats Group and the media.

What To Analyse With The Data Available In The NFL?

The first step the NFL Football Operations team took to figure out what should be answered with data was to try to understand what the general public thinks about when they watch an NFL game. To do so, they looked at a single example of a running play in a 2017 season game between Dallas and Kansas, where the running back Ezekiel ‘Zeke’ Elliot took 11 yards on a 3rd down and 1 yard-to-go. This run by Zeke Elliot eventually allowed Dallas to move further down the field and score points.

Sports Performance Analysis - NFL Big Data Bowl.gif

Statisticians at the NFL then tried to understand what could be learned from a play like this one by breaking it down to obtain as many insights as possible on the teams involved, the offence, the defence, and even the ball carrier. An initial eye test by simply looking at the video footage told the analysts that in this particular play Zeke Elliot - the ball carrier - had a significant amount of space in front of him to pick up those 11 yards. But how could data be applied to this play to tell a similar story? To do so, NFL analysts first needed to look at the data and information collected from that play, to understand what was available to them and the structure of the datasets, which would allow them to come up with possible uses for that data.

There are three types of data collected and used by the NFL analytics teams: play level data, event level data and tracking level data. Each of these types of data presents a different level of complexity, with some having been around for longer than others.

  • Play Data:  

    This data contains the largest amount of historical records and includes variables like the down, distance, yard line, time on the clock, participating teams, number of timeouts and more. It also includes outcome variables like the number of yards gained, passer rating to evaluate QBs, win probability and expected points.

  • Event Data:

    This data is generated from notating video footage. It is usually performed by organisations such as Pro Football Focus or Sports Info Solutions by leveraging their football expertise. These companies tag events using video analysis software and collect data points such as the offensive formation, number of defenders in the box, defenders closer to the line of scrimmage, whether a cover scheme was man versus zone, the run play called and so on.

  • Tracking Data:

    This type of data refers to 2-D player location data that provides the xy coordinates as well as the speed and direction of players. It is usually captured at 10fps using radio frequency identification (RFID) chips located on each player’s shoulder pads as well as on the ball, and it tracks every player during every play of every game. This is the most novel type of data collected by the NFL: player tracking data only started to be shared with teams from the 2016 season onwards.

2D Player Tracking Data (Source: Mike Lopez at FC Barcelona Sports Tomorrow)

The sample sizes available for NFL analysts to develop new metrics vary for each of these data types. When it comes to play data, there is an average of 155 plays per game and 256 games in a season, which for the longest time gave analysts a maximum of roughly 40,000 plays per season (155 × 256 ≈ 39,700) to answer NFL analytics questions. A similar scenario holds for event data, where the dataset available to analysts is a multiple of the number of observations produced by notating events from those same 40,000 or so plays per season.

A very different scenario occurs with player tracking data, where the sample size is substantially larger. With the 2-D location of every player captured at 10fps on plays that usually last around 7 seconds, the data collected jumps from those 155 observations (plays) per game at play level to between 200,000 and 300,000 observations per game at tracking level (roughly 23 tracked entities, the 22 players plus the ball, × 10 frames per second × 7 seconds × 155 plays). This brought a more complex dataset to the sport and opened the door to new questions and metrics for NFL analysts to explore.

Applying The Available Data To The Analysis Of The Game

There are various approaches that NFL analysts could have taken to evaluate the running play in which Zeke Elliot gained 11 yards for Dallas. Ultimately, they wanted to figure out the likelihood of Zeke Elliot picking up those 11 yards on that running play.

One of these approaches was to assign a value to the play to evaluate how the running back performed, using metrics like yards gained, win probability or expected points. Using play level data, analysts would merely be calculating the probability of those 11 yards being achieved from simple descriptive variables, such as the fact that it was a 3rd down and 1 yard-to-go at a certain location of the field during the first minutes of a scoreless match. If they then compared Zeke Elliot’s outcome to similar plays, all of these metrics would have shown positive values, as gaining yards increases both the team’s win probability and expected points. Zeke Elliot’s 11 yard run may well have been above average when plays are described using play level data alone. However, this approach misses how much credit the running back, the offensive line and the offensive team should really receive for this outcome given the specific situation they faced.

Another approach was to leverage event level data to provide additional context for the play. This type of data could have helped explain Zeke Elliot’s performance through additional variables, such as the number of defenders in the box or the play options available, allowing analysts to compare the probability of gaining 11 yards against other plays with similar characteristics. However, this approach may also have shown positive results simply because of the relatively large yardage gain Zeke Elliot achieved. Moreover, appropriately describing the situation using event data alone may be challenging or inaccurate, as it is conditioned on the video analyst’s level of football expertise and ability to define the key elements of the play.

Instead, NFL analysts decided to make use of the 2D player tracking data for that play to build a spatial mapping of the field. With a spatial mapping, analysts could visualise the direction and speed in which each player was moving over the duration of the run, as well as the percentage of space on the field owned by the different players of each team. This gave analysts an idea of the areas owned by the offence and those owned by the defence, providing a better understanding of the amount of space in front of the running back, Zeke Elliot, to take on extra yardage. The information obtained from the spatial mapping could then be used to calculate yardage probabilities conditioned on that space, to more accurately assess how well the offensive team performed.

Spatial Mapping of Zeke Elliot’s Run (Source: Mike Lopez at FC Barcelona Sports Tomorrow)

In the diagram above, it is clear that the offence owned most of the space in front of Zeke Elliot, not only 11 yards ahead but even 15 yards in front of the running back, with defenders nowhere near him. As opposed to evaluating the play with play or event level data, using tracking data raised further questions about Zeke Elliot’s performance on that play, which may not be as impressive as the other approaches suggested given the amount of space he had in front of him.

Following this example, NFL analysts next tried to answer the question of how to leverage player tracking data more widely to better understand what happens during plays. The NFL Football Operations analysis team wanted to learn how this data could be used to compare player performances given the positioning, direction and speed of all 22 players on the field. More specifically, this involved understanding the probability distribution over all possible yardage outcomes - i.e. the running back gaining or losing 5, 10, 15, 20 yards and so on - to obtain a range of outcomes and their likelihoods that would allow analysts to compare performances across different plays. A probability distribution over yardage outcomes could then be explored further to give analysts additional insights, such as first down probability, touchdown probability or the probability of losing yardage on a given play, all in spatial mapping terms. Ultimately, this probability distribution could be turned into an expected yards metric for running backs by multiplying each yardage outcome by the probability of reaching it and summing all the values together.
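
The sketch below illustrates that final calculation on an invented probability distribution over yardage outcomes; the numbers are not real NFL model output, they simply show how expected yards and related probabilities fall out of the same distribution:

# Invented probability distribution over yardage gained on a carry; the values
# are purely illustrative and are not real NFL model output.
yardage_distribution = {
    -2: 0.05, 0: 0.10, 1: 0.15, 2: 0.20, 3: 0.15,
    5: 0.15, 8: 0.10, 11: 0.05, 15: 0.03, 25: 0.02,
}

# Expected yards: every possible gain weighted by its probability.
expected_yards = sum(yards * prob for yards, prob in yardage_distribution.items())

# The same distribution answers related questions, e.g. the chance of converting
# a 3rd-and-1 (gaining at least 1 yard) or of losing yardage on the play.
p_first_down = sum(p for yards, p in yardage_distribution.items() if yards >= 1)
p_lost_yards = sum(p for yards, p in yardage_distribution.items() if yards < 0)

print(f"expected yards: {expected_yards:.2f}")
print(f"P(gain >= 1 yard): {p_first_down:.2f}, P(lose yards): {p_lost_yards:.2f}")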

Leveraging AWS And The Wider NFL Community

The main goal of the NFL Football Operations team was to better understand player and team performance by leveraging the new xy spatial data from player tracking to come up with new metrics, such as expected yards, touchdown probability or other run play metrics. The NFL Football Operations team worked closely with the NFL’s Next Gen Stats group to understand the value such metrics would provide to the sport and to define a roadmap for developing them. Sunday Night Football and other media broadcasters also showed a strong interest in using these new metrics to better evaluate performances on air.

In their first attempt at producing new metrics from player tracking data, NFL analysts partnered with data scientists from Amazon Web Services (AWS) to figure out how this large dataset could be used to develop new football metrics. Unfortunately, after trying a wide set of tools, ranging from traditional statistical methods to gradient boosting and other machine learning techniques, the NFL Football Operations and AWS partnership never produced results satisfactory enough to be used by the NFL’s Next Gen Stats group or the media. While they learned about possible applications of the spatial ownership distribution on the field, when it came to validating the results against the single example of Zeke Elliot’s 11 yard running play, the results did not inspire enough confidence to be used for wider analysis of the sport. The AWS-NFL data science collaboration had reached a dead end.

To unblock this situation and produce a metric from tracking data that would match what was seen in the video footage, the NFL Football Operations team turned to crowd-sourced wisdom in football statistics through the Big Data Bowl, an event they have organised since 2019 and which is also sponsored by AWS. The Big Data Bowl is an annual event that serves as a pipeline for NFL club hiring, as it helps identify qualified talent that can support the NFL’s Next Gen Stats domain in analysing player tracking data. Since player tracking data has not been around for long, the event enabled the NFL to understand what the right questions to ask of this data are and how to go about answering them. The Big Data Bowl also serves core NFL data analytics enthusiasts who want extra information about the sport, helping them understand more about the NFL through more intuitive metrics that better reflect how fans think about the game. Over the past couple of years, the event has also proven to be a great opportunity for NFL innovation, as it has successfully tapped into global data science talent to solve problems that the data scientists at AWS and the NFL could not resolve on their own. The first Big Data Bowl in 2019 saw 1,800 people sign up, with around 100 final submissions completing the given task. Out of this pool of analysts and data scientists, 11 went on to be hired by NFL teams and vendors, and the winner of the 2019 competition is now an analyst for the Cleveland Browns.

Source: NFL Football Operations

The success of the 2019 edition meant that NFL Football Operations decided to use the Big Data Bowl 2020 to develop their highly anticipated expected yards metric from the 2D player tracking data. Instead of trying to figure out the metric internally on their own, they took a ‘the more the merrier’ approach to exploit the analytics talent available across the world. The NFL Football Operations team shared the raw player tracking data with the participants, who were given the task of predicting how many yards the running back would gain on a handoff play, such as the one discussed earlier between Dallas and Kansas. With this player tracking data, participants had valuable data points specifying the positions of all the players on the field, their speed, the number of players in front of the running back, who those players were, and more. All they needed to do was come up with a method that would allow the NFL to understand whether Zeke Elliot’s performance was above or below average.

The competition launched in October 2019, when the data was shared and released by the NFL. There were a total of 2,190 submissions, with participants from over 32 countries. The launch was followed by a 3-month model building phase in which teams developed their algorithms. These algorithms were later evaluated in real time during the 5-week model evaluation phase of the competition, which tested each algorithm’s predictions on out-of-sample data and compared the results with the true outcomes. The competition used Kaggle as its data science platform to encourage interaction and communication across teams through forums. Kaggle also provided a live leaderboard where teams could see how well their algorithms were performing against others, with team scores computed automatically from how accurate the algorithms were against real data. The winning team, ‘The Zoo’, formed by two Austrian data scientists, came up with a 5 dimensional convolutional neural network containing only five inputs: the location of the defenders, the routed distance between defenders and the ball carrier, the routed speed of the defenders and ball carrier, the routed distance between all offensive players and all defensive players, and the routed speed of all offensive players and all defensive players. They eventually presented their model at the NFL Scouting Combine in front of more than 225 team and club officials, and received a cash prize of $75,000.

The winning team’s model significantly outperformed those of the rest of the participants. Their calibration showed an almost perfectly calibrated model, where the predicted number of yards closely matched the observed number of yards on an out-of-sample dataset. The model was able to take data from a carry and predict the yardage that carry would achieve, not only for small gains of 3 to 5 yards but also for the rarer long gains of 15 to 20 yards. Thanks to this model, an expected yards metric can be produced for every running play, providing a valuable tool to assess the performance of runs such as Zeke Elliot’s. For example, when a player gains 29 yards on a run and the model calculated an expected gain of 25 yards given the spacing the running back had at the handoff, that player should only be credited with 4 yards above average. This way of interpreting a 29 yard run would not be possible without a model that conditions its probability calculation on the space available to the running back, which is what determines whether the player performed above or below expectation.

Winning team’s calibration plot (Source: NFL Football Operations)

The benefit of the Big Data Bowl format was that, unlike hackathons where participants may only get one or two weekends to produce something of value, the event gave teams enough time to navigate the complex player tracking dataset and come up with actionable insights. The NFL was then able to immediately share the newly derived metrics with the media and their Next Gen Stats group for their football analytics initiatives. Thanks to this approach, clubs can now better evaluate their running backs. Other industries, such as the growing betting industry in the USA, may also benefit from the development of expected yards for their betting algorithms. Lastly, expected yards are now widely used by NFL broadcasters to show whether running backs are performing well during a game. Metrics like this would not have been possible without the NFL tapping into a global talent pool of data scientists to help develop this novel expected yards metric.

The NFL is continuing to run the Big Data Bowl, with the 2021 edition being far more open ended than previous ones. This time the task focuses on defensive play: the NFL is sharing pass plays from the 2018 season and asking participants to build models that define who the best players are in man and zone coverage, identify whether a defence is playing man or zone, predict whether a defender will commit a penalty, and describe what skills are required to be a good defensive player. The interpretation and approach are left to the participants to define, allowing them to apply the right conditioning to the data provided. This approach of opening data to the public to push innovation forward has proven successful, and it will be interesting to see whether other sports adopt similar initiatives.

Collecting Sports Data Using Web Scraping

What Is Web Scraping?

Web scraping is the process of automatically extracting data and collecting information from the web. It could be described as a way of replacing the time-consuming, often tedious exercise of manually copy-pasting website information into a document with a method that is quick, scalable and automated. Web scraping enables you to collect larger amounts of data from one or various websites faster.

The process of scraping a website for data often consists of writing a piece of code that runs automated tasks on your behalf. This code can either be written by yourself or executed through a specialised web scraping program. For example, with just a few basic lines of code, you can tell your computer to open a browser window, navigate to a certain web page, load the HTML code of the page, and create a CSV file with the information you want to retrieve, such as a data table.

These pieces of code - called bots, web crawlers or spiders - use a web browser on your computer (e.g. Chrome, Firefox or Safari) to access a web page, retrieve specific HTML elements and download them into CSV files or Excel files, or even upload them directly into a database for later analysis. In short, web scraping is an automated way of copying information from the internet into a format that is more useful for the user to analyse.

The process of web scraping follows a few simple steps:

  1. You provide your web crawler with the URL of the page where the data you are interested in lives.

  2. The web crawler starts by fetching (downloading) the page’s HTML code - the code that represents all the text, links, images, tables, buttons and other elements of the page you want information from - and stores it for you to perform further actions with it.

  3. With the HTML code fetched, you can now start breaking it down to identify the key elements you want to save into a spreadsheet or local database, such as a table with all its data.

For example, you can use web scraping to collect the results of all Premier League matches without having to manually copy-paste every result from a web page. A web crawler can do this automatically. You would first provide your web crawler or web scraping tool the URL of the page you want to scrape (e.g. https://www.bbc.co.uk/sport/football/premier-league/scores-fixtures). The web crawler then fetches and downloads the HTML code from that URL. Finally, based on the specific HTML elements you asked the web crawler to retrieve, it exports the elements containing match information into a downloadable CSV file in a matter of seconds.

What Is Web Scraping Used For?

Web scraping is widely used across numerous industries for a variety of purposes. Businesses often use web scraping to monitor competitors’ prices, track product trends and understand the popularity of certain products or services, not only on their own website but across the web. These practices extend to market research, where companies seek a better understanding of market trends, research and development, and customer preferences.

Investors also use web scraping to monitor stock prices, extract information about companies of interest and keep an eye on the news and public sentiment surrounding their investments. This invaluable data helps their investment decisions by offering valuable insights on companies of interest and the macroeconomic factors affecting such enterprises, such as the political landscape.

Furthermore, news and media organisations are heavily dependent on timely news analysis, thus they leverage web scraping to monitor the news cycle across the web. These media organisations are able to monitor, aggregate and parse the most critical stories thanks to the use of web crawlers.

The above examples are not exhaustive, as web scraping has dramatically evolved over the years thanks to the ever-increasing availability of data across the web. More and more companies rely on this practice to run their operations and perform thorough analysis.

What Scraping Tools Are There?

Websites vary significantly in their structure, design and format, which means the functionality needed to scrape may differ depending on the website you want to retrieve data from. This is why specialised tools, called web scrapers, have been developed to make web scraping easier and more convenient. Web scrapers provide a set of tools for creating different web crawlers, each with their own predefined instructions for the different web pages you want to scrape data from.

There are two types of web scrapers: pre-built software and scraping libraries or frameworks. Pre-built scrapers are typically browser extensions (e.g. Chrome or Firefox extensions) or standalone scraping software. These types of scraping tools require little to no coding knowledge, can be installed directly into your browser and are very easy to use thanks to their intuitive user interfaces. However, that simplicity also means their functionality may be limited, and some complex websites may be difficult or impossible to scrape with them. One such tool, the Web Scraper Chrome extension, is used in the first walkthrough below.

Scraping frameworks and libraries offer the possibility of performing more advanced forms of scraping. These frameworks, such as Python’s Selenium, Scrapy or BeautifulSoup, can be easily installed on your computer using the terminal or command line. By writing a few simple lines of code, they allow you to extract data from almost any website. However, they require intermediate to advanced programming experience, as they are usually run by writing code in a text editor and executing it through your computer’s terminal or command line. BeautifulSoup, one of these open-source libraries, is used in the second walkthrough below.

Scraping Best Practices. Is It Legal?

Web scraping is simply a tool. The way in which web scraping is performed determines whether it is legitimate web scraping or malicious web scraping. Before undertaking any web scraping activity, it is important to understand and follow a set of best practices. Legitimate web scraping ensures that the least amount of impact is caused to the website where the data is being scraped.

Legitimate scraping is very commonly used by a wide variety of digital businesses that rely on the harvesting of data across the web. These include:

  • Search engines, such as Google, analyse web content and rank it to optimise search results.

  • Price comparison sites collect prices and product descriptions to consolidate product information.

  • Market research companies evaluate trends and patterns on specific products, markets or industries.

Legitimate web scraping bots clearly identify themselves to the website by including information about the organisation or individual the bot belongs to (e.g. Google’s bots set their user agents as belonging to Google so they can be easily spotted). Moreover, legitimate web scraping bots abide by a site’s scraping permissions. Websites often publish a robots.txt file at the root of their domain describing which pages are permitted to be scraped and which ones disallow scraping; examples can be found at https://www.bbc.co.uk/robots.txt, https://www.facebook.com/robots.txt and https://twitter.com/robots.txt. Lastly, legitimate web scraping bots only attempt to retrieve what is already publicly available, unlike malicious bots that may attempt to access an organisation’s private data from its nonpublic databases.
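
For instance, Python’s built-in urllib.robotparser module can be used to check a site’s robots.txt before any request is made; the sketch below combines that check with a polite delay between requests, using the BBC page scraped later in this article:

import time
import urllib.robotparser
import requests

# Check the site's robots.txt before scraping.
parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://www.bbc.co.uk/robots.txt")
parser.read()

url = "https://www.bbc.co.uk/sport/football/tables"
if parser.can_fetch("*", url):
    response = requests.get(url)
    print(response.status_code)
    time.sleep(10)  # be mindful of the site's bandwidth between consecutive requests
else:
    print("robots.txt disallows scraping this page")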

On the other side of legitimate web scraping, certain individuals and organisations attempt to illegally leverage the capabilities of web scraping to directly undercut competitor prices or steal copyrighted content, often causing financial damage to the website’s owner. Malicious web scraping bots frequently ignore robots.txt permissions, extracting data without the permission of the website owner, and impersonate legitimate bots by identifying themselves as other users or organisations to bypass bans or blocks. Examples of malicious web scraping include spammers retrieving detailed personal and contact information of individuals in order to send fraudulent or false advertising to large numbers of user inboxes.

This increase in illegal scraping activity has significantly damaged the reputation of web scraping over the years and drawn substantial controversy, fuelling many misconceptions about the practice of automatically extracting publicly available web data. Nevertheless, web scraping is a legal practice when performed ethically and responsibly. Reputable corporations such as Google rely heavily on web scraping to run their platforms and, in return, provide considerable benefits to the websites being scraped by generating large amounts of traffic for them. Ethical and responsible web scraping means the following:

  • Read the robots.txt page of the website you want to scrape and look out for disallowed pages (e.g. https://www.atptour.com/robots.txt).

  • Read the Terms of Service for any mention of web scraping-related restrictions.

  • Be mindful of the website’s bandwidth by spreading out your data requests (e.g. setting a delay of 10-15 seconds between requests instead of sending hundreds at once).

  • Don’t publish any content that was not meant to be published in the first place by the original website.

Where To Find Sports Data

A league’s official website is a good starting point for gathering basic sports data about a team’s or athlete’s performance and building a robust sports analytics dataset. Nowadays, however, many unofficial websites developed by sports enthusiasts, as well as media websites, contain invaluable information that can be scraped for sports analysis.

For example, in the case of football, the Premier League website’s Terms & Conditions permit you to “download and print material from the website as is reasonable for your own private and personal use”. This means that you may scrape their league data to obtain information about fixtures, results, clubs and players for your own analysis. Similarly, BBC Sports currently permits the scraping of its pages containing league tables and match information.

The data obtained from the Premier League and BBC Sports websites can later be augmented by scraping additional non-official websites that offer further statistics on match performances and other relevant data points for the sport.

The same process applies to other sports. However, the structure and availability of statistics on official sports websites varies significantly from sport to sport, and the popularity of a sport also dictates the number of non-official analytics websites offering relevant statistics to scrape.

Scraping Example: Premier League Table

Below is a practical example of how to scrape the BBC Sports website to obtain the Premier League table using two different scraping methods. The examples are designed around the structure of the BBC’s website at the time this article was published; future changes by the BBC to their Premier League table page may slightly alter the page’s HTML, in which case the scraping steps and code below may require some readjustment to reflect those design changes.

Using Web Scraper (Google Chrome extension)

1. Install Web Scraper (free) in your Chrome browser.

2. Once installed, an icon will appear on the top right hand side of your browser. This icon opens a small window with instructions and documentation on how to use Web Scraper.

 
Sport Performance Analysis - Web Scraping 1.png
 

3. Go to the BBC Sports website: https://www.bbc.co.uk/sport/football/tables

4. Right click anywhere on the page and select “Inspect” to open the browser Dev Tools (or press Option + ⌘ + J on a Mac, or Shift + CTRL + J on a Windows PC).

 
Sport Performance Analysis - Web Scraping 2.png
 

5. Make sure the Dev Tools sidebar is located at the bottom of the page. You can change its position under options and Dock side within the Inspect sidebar.

6. Navigate to the Web Scraper tab. This is where you can use the newly installed Web Scraper tool.

 
Sport+Performance+Analysis+-+Web+Scraping+3.jpg
 

7. To scrape a new page, you first need to create a new web crawler or spider by selecting “Create new sitemap”.

 
Sport+Performance+Analysis+-+Web+Scraping+4.jpg
 

8. Give the new sitemap a comprehensive name, in this case “bbc_prem_table” and then paste the URL of the web page you want to obtain data from: https://www.bbc.co.uk/sport/football/tables. Then click on “Create sitemap”.

 
Sport Performance Analysis - Web Scraping 5.png
 

9. Now that the spider is created, you would need to specify the specific elements of the page you would like data to be extracted from. In this example, we are looking to extract the table. To do so, click on “Add a new selector” to specify the HTML element that the web crawler needs to select and look for data in.

 
Sport+Performance+Analysis+-+Web+Scraping+6.jpg
 

10. Give the selector a lowercase name under “Id” and set the Type as a “Table”, since we will be extracting data from a table element within the HTML code of the page.

 
Sport Performance Analysis - Web Scraping 7.png
 

11. Under the Selector field, you need to specify the element on the page you would like to target. Since we have already specified in the field above that the element is a Table, using the “Select” option and then clicking on the league table on the BBC page will make Web Scraper auto-select the right element for us. Once you click on “Select” under the “Selector” field, hover over the table until it turns green. Once you are certain that the table is correctly highlighted, click on it until it turns red and the input bar reads “table”. Then press “Done selecting!” to confirm your selection.

 
Sport Performance Analysis - Web Scraping 8.png
 

12. The table header and row fields should now be automatically populated by Web Scraper, and a new field called Table columns should have appeared at the bottom of the window. Make sure the columns have been correctly captured from the table and change the column names to lowercase, since Web Scraper does not allow uppercase characters.

 
Sport Performance Analysis - Web Scraping 9.png
 

13. Above the Table columns, check the box for “Multiple” items so that the web crawler extracts every row of data from the table, rather than just the first row (the first team).

14. Now that the selector is correctly configured, click on “Save selector” to confirm all the settings and create the selector.

15. You are now ready to scrape the table. Go to the second option of the top menu (Sitemap + the name of your new sitemap) and select “Scrape”. Leave the interval and delay at 2s (2000ms) and select “Start scraping”. This will open and then close a new Chrome window in which your web crawler extracts the data.

 
Sport Performance Analysis - Web Scraping 10.png
 

16. Once the scraping is done, click on “refresh” next to the text “No data scraped yet”. This will display the scraped data.

 
Sport Performance Analysis - Web Scraping 11.png
 

17. To download the data to a CSV file, select the second option on the top menu once again and click on “Export data as CSV”. This will download a file with the Premier League data you have just scraped from BBC Sports.

 
Sport Performance Analysis - Web Scraping 12.png
 

Using Python’s BeautifulSoup

1. Open your computer’s command line (Windows) or Terminal (Mac).

2. Install pip by running the two lines below in your command line or terminal. pip is a Python package manager that allows you to download and manage packages that are not included with the standard Python installation (recent versions of Python already ship with pip, in which case this step can be skipped). The first line downloads the installer script and the second runs it with Python.

curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py

3. Install the Requests and BeautifulSoup packages (Python itself must already be installed on your computer, since pip runs on top of it; if it is not, download it from python.org first). These packages are required to write and execute the Python code that will perform your scraping. Enter the following lines in your command line or terminal and press Enter, one by one:

pip install requests
pip install beautifulsoup4

4. Open a text editor. This is where you will write your scraping code. If you don’t already have a text editor on your computer, consider downloading and installing Atom or Sublime Text.

5. Create a new file and name it, for example, “prem_table_spider.py”. The “.py” extension at the end of the file name tells your text editor that it is a Python file. Save the file to your Desktop for easier access later on.

6. The first lines of code are the package imports needed to run the rest of the script. The packages needed in this case are “requests” to get the HTML from the BBC page, “bs4” to use the tools provided by BeautifulSoup to select elements within the downloaded HTML, and “csv” to create the new CSV file the data will be exported to.

import requests
from bs4 import BeautifulSoup
import csv

7. The next line of code creates a blank CSV file to store the collected data. Use Python’s built-in “open” function in write mode (“w”) to create the file (named, for example, prem_table_bs.csv) and wrap it with the “csv.writer” function so that Python can write rows into this newly created file.

output_file = csv.writer(open('prem_table_bs.csv', 'w', newline=''))  # newline='' avoids blank rows on Windows

8. After the CSV file is created, we want the code to add table headers for the data we are going to export. Use the “writerow” function, which adds a new row of data to the CSV file. This first row will simply contain the header names shown in the league table on the BBC page.

output_file.writerow(['Position', 'Team', 'Played', 'Won', 'Drawn', 'Lost', 'For', 'Against', 'GD', 'Points'])

9. Now that the file is set up, the next steps consist of writing the actual web scraping code. The first step is to provide the web crawler with the URL of the page we want to extract information from. This is done using the requests package: we call the “requests.get” function with the URL as an argument to fetch the HTML from the BBC Sport football tables page, and we save the result of this request in a variable called “result”.

result = requests.get("https://www.bbc.co.uk/sport/football/tables")

10. From the “result” obtained when requesting the page, we are only interested in its content. The get function also offers other elements, such as headers or the response status code, which are not needed in this example. To specify that we only want to work with the content, we save the “content” attribute of “result” into a new variable labelled “src” (source) for later use.

src = result.content

11. We have successfully extracted the HTML code from the BBC Sports page and saved it into the variable “src”. We can now start using BeautifulSoup on “src” to select the specific elements of the page that we want to extract (i.e. the table, its rows and its cells). First, we need to tell BeautifulSoup to use the “src” variable we have just created by writing the following line, which sets up a new BeautifulSoup HTML parser variable called “soup” that works on the “src” contents:

soup = BeautifulSoup(src, 'html.parser')

12. Now that BeautifulSoup is parsing the BBC page’s HTML from the “src” variable, we can start breaking down the HTML elements inside “src” until we find the data we are after. Since we are looking for a table, this involves selecting the <table> HTML element, extracting the <tr> (table row) elements and then gathering each <td> (table data) cell from each row.

First, we set a new variable called “table” that holds all the <table> elements on the page. Since we use the “find_all” function, we receive a list of all tables; however, as there is only one table on the BBC page, that list contains a single item. To retrieve the league table from the “table” list, we set a new variable called “league_table” that refers to the first item of that list (at index 0).

table = soup.find_all("table")
league_table = table[0]

13. With the league table now selected, we can extract each row of data by running a new “find_all” call on league_table that looks for all HTML elements with the tag <tr> (table row). Each row of the table is a different team, so we can label this new list of table rows “teams”.

teams = league_table.find_all("tr")

14. Finally, we create a for loop that iterates through every row of the table and extracts the text from every column cell (<td>, or table data). On every loop, Python assigns the value of each <td> element in the row to a specific variable (e.g. the first element, at index 0, is the team’s league position). After every row is processed, a new row of data is written to the CSV file that was set up at the start of the code. Save the file; this is your completed scraping code.

# Skip the header row and loop over the 20 team rows of the league table
for team in teams[1:21]:

    # All the cells (<td> elements) in this team's row
    stats = team.find_all("td")

    position = stats[0].text
    team_name = stats[2].text
    played = stats[3].text
    won = stats[4].text
    drawn = stats[5].text
    lost = stats[6].text
    for_goals = stats[7].text
    against_goals = stats[8].text
    goal_diff = stats[9].text
    points = stats[10].text

    # Write this team's values as a new row in the CSV file
    output_file.writerow([position, team_name, played, won, drawn, lost, for_goals, against_goals, goal_diff, points])
Sport Performance Analysis - Web Scraping 13.png

15. To run the code, open your command line or terminal once again and navigate to the Desktop where your code file was saved. You can navigate backwards through your directories by typing “cd ..” in the command line, and navigate into a specific directory by typing the name or path of the directory after “cd” (e.g. “cd name_of_folder”). Once you are in your Desktop directory (its name appears on the left hand side of each command line prompt), you can run the web crawler file using the following command:

python prem_table_spider.py

Once run, you should find a new CSV file inside your Desktop folder that contains the Premier League table data you have just scraped.

Citations

  • Imperva (2020). Web scraping. Imperva. Link to article.

  • Perez, M. (2019). What is Web Scraping and What is it Used For? Parsehub. Link to article.

  • Rodriguez, I. (). What Is Pip? A Guide for New Pythonistas. Real Python. Link to article.

  • Scrapinghub. (2019). What is web scraping? Scrapinghub. Link to article.

  • Toth, A. (2017). Is Web Scraping Legal? 6 Misunderstandings About Web Scraping. Import.io. Link to article.

Computer Vision In Sport

What Is Computer Vision?

Computer Vision (CV) is a subfield of artificial intelligence and machine learning that develops techniques to train computers to interpret and understand the contents of images. This also applies to videos, as a video is simply a collection of consecutive images, or ‘frames’. Computer Vision aims to replicate parts of the complexity of the human visual system and visual perception by applying deep learning models to accurately detect and classify objects from the dynamic and varying physical world.

The first basic neural networks were developed around the 1950s to detect the edges of simple objects and sort them into categories (e.g. circles, triangles, squares and so on). These systems were later developed to help blind people by recognising written and typed text and characters, a method known as optical character recognition. By the 1990s, the rise of the Internet meant that unprecedented datasets of millions of images were regularly being shared and generated across the web. These extensive visual datasets enabled researchers to better train their models and develop face recognition programs that helped computers identify specific faces inside photos and videos.

Today, the advances in smartphone technology and social media and their use by billions of people - more than 3 billion images are shared online every day - continuously generate even greater amounts of visual data than ever before. Together with increased access to large-scale computing power and innovations in deep learning and neural network algorithms (such as the invention of convolutional neural networks), the availability of such immense numbers of images has given computers invaluable opportunities to learn the patterns and characteristics of these images and improve the accuracy of object detection and classification. As a result, computer vision systems have surpassed the accuracy of human vision at certain detection, categorisation and reaction tasks, reaching accuracy rates of 99% in a number of applications.

How Does Computer Vision Work?

Computer Vision is now able to perform a variety of tasks in a wide range of fields, from self-driving cars to medical diagnosis. These tasks include photo classification, object detection, face recognition and searching image and video content. To perform them, computers first need to be able to generate information from images (i.e. to “see” the image). Since computers can only operate on numerical values (bits), they first need to read an image in its most raw numerical form: the matrix of its pixels. This matrix represents the brightness of each pixel in an image, from the darkest black (at value 0) to the brightest white (at value 255).

Images are made up of thousands of pixels, each holding brightness values from 0 to 255. A single colour image contains three different matrices, one for each of the three primary colours of light: red, green and blue (RGB). By combining different brightness levels of these primary colours (from 0 to 255), a pixel can display colours other than the primary ones. For example, a pixel displaying a vivid purple will have the values Red=128, Green=0 and Blue=128 (mixing red and blue results in purple), while a vivid yellow pixel will contain the values Red=255, Green=255 and Blue=0 (mixing red and green results in yellow). A grayscale image, on the other hand, contains one single pixel matrix corresponding to the brightness of its black and white tones.
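
The snippet below illustrates these pixel matrices using the Pillow and numpy libraries; the image path is a placeholder for any RGB image:

import numpy as np
from PIL import Image

# "match_frame.jpg" is a placeholder path for any RGB image.
image = Image.open("match_frame.jpg").convert("RGB")
pixels = np.array(image)              # shape: (height, width, 3)

red, green, blue = pixels[..., 0], pixels[..., 1], pixels[..., 2]
print(pixels.shape)                   # three brightness matrices, values 0-255
print(pixels[0, 0])                   # e.g. [128, 0, 128] would be a purple pixel

gray = np.array(image.convert("L"))   # grayscale keeps a single brightness matrix
print(gray.shape)                     # (height, width)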

Deep learning and other machine learning algorithms in computer vision make use of these pixel arrays to apply statistical learning methods, such as linear regression, logistic regression, decision trees or support vector machines (SVM). By analysing the brightness value of a pixel and comparing it to its neighbouring pixels, a computer vision model is able to identify edges, detect patterns and eventually classify and detect objects in an image based on previously learned patterns. These methods usually require the model to have already processed, stored and learned patterns from (i.e. to have been trained on) similar images containing the object of interest to be detected and tracked in the new, unseen image.

For example, to be able to detect a person in an image, a significantly large number of pre-labelled images of people are fed into the system, allowing the model to learn on its own by recognising patterns in the features that make up a person. Once a new, previously unseen image is fed to that model, the computer looks for patterns in the colours, the shapes, the distances between shapes, where objects border each other, and so on. It then compares them to the characteristics and labels it had previously identified and decides, based on probabilistic rules, whether or not there is a person in the new image. In other words, computer vision systems ingest many labelled examples of a specific kind of data, extract common patterns between those examples and transform them into a mathematical model that helps classify future pieces of information.
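
As a very small illustration of this learn-from-labelled-examples idea, the sketch below fits a scikit-learn logistic regression on flattened pixel vectors; the “images” here are random noise with made-up labels, purely to show the fit-and-predict pattern rather than a working person detector:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Pretend training set: 200 tiny 32x32 RGB images flattened into pixel vectors,
# each labelled 1 (contains a person) or 0 (does not). The data is random noise,
# used only to illustrate the fit-and-predict pattern.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 256, size=(200, 32 * 32 * 3)) / 255.0
y_train = rng.integers(0, 2, size=200)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# A new, unseen image is flattened the same way and scored against the patterns
# learned from the labelled examples.
new_image = rng.integers(0, 256, size=(1, 32 * 32 * 3)) / 255.0
print(model.predict_proba(new_image))  # probabilities of "no person" vs "person"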

Sports Performance Analysis 12.png

Often, computers require images to be pre-processed prior to applying any detection and tracking models to them. Image pre-processing simplifies and enhances an image’s raw input by changing properties such as its brightness or colour, cropping it, or reducing noise. This modifies the pixel matrices of the image in a way that lets a computer better perform its expected tasks, such as removing a background in order to detect objects in the foreground. This is particularly useful in video footage, where computer vision can track moving objects using a discriminative method to distinguish between objects in the image and the background. By separating the two, it can detect all possible objects of interest in all relevant frames and use deep learning techniques to recognise, among those detected, the specific object to track.
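
A minimal sketch of that background separation step with OpenCV is shown below, using a built-in background subtractor on a placeholder video path; a real tracking pipeline would add a classification and tracking stage on top of the resulting foreground contours:

import cv2

# "training_session.mp4" is a placeholder path to any match or training video.
capture = cv2.VideoCapture("training_session.mp4")
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=True)

while True:
    ok, frame = capture.read()
    if not ok:
        break
    # The foreground mask keeps moving objects (players, ball) and suppresses
    # the static background (pitch, stands).
    foreground_mask = subtractor.apply(frame)
    # Contours of the mask give candidate objects that a deep learning model
    # could then classify and track across frames.
    contours, _ = cv2.findContours(foreground_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)

capture.release()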

Deep learning models are often trained to automate this process by inputting thousands of pre-processed, labelled or pre-identified images. Training can follow a variety of techniques, such as partitioning the images into multiple pieces to be examined separately, using edge detection to identify the outlines of an object and better recognise what is in the image, using pattern detection to recognise repeated shapes, colours or other indicators, or using feature matching to detect similarities between images and help classify them. Models may also use X and Y coordinates to create bounding boxes and identify everything within each box, such as a football field, an offensive player, a defensive player, a ball and so on. More than one technique is frequently used in conjunction to improve the accuracy and precision of object detection and tracking in an image or video.

The Applications Of Computer Vision In Sport

In sports, artificial intelligence was virtually unknown less than five years ago, but today deep learning and computer vision are making their way into a number of sports industry applications. Whether it is used by broadcasters to enhance spectator experience of a sport or by clubs themselves to become more competitive and achieve success, the reality is that the industry has substantially increased its adoption of these modern techniques.

Most major sports involve fast and accurate motion that can be challenging for coaches and analysts to track and analyse in great detail. This is particularly difficult in situations where the use of wearable tracking equipment and sensors to augment data collection is not an option. In training sessions and certain matches, especially untelevised ones, performance analysts are only able to obtain a limited number of angles of video footage, and that footage is limited to providing a visualisation of players’ movements rather than detailed analysis. Extracting data and insights from it requires the analyst to spend numerous hours manually notating and collecting events as they replay the video. Scenarios such as these are where computer vision techniques can bridge the gap between the sporting event and analytical insight, offering novel ways to gather data and obtain valuable analysis through automated systems that locate and segment each player of interest and follow them over the duration of the video.

In the context of sports, footage is usually acquired through one or more cameras installed in close proximity to where the event takes place (i.e. the sidelines of a training field or the stands in a stadium during a match). The angle, positioning, hardware and other filming configurations of these cameras can vary greatly from sport to sport, event to event or even between the different cameras used for the same match or training session. This can pose a challenge for certain computer vision applications to accurately detect the precise positioning of objects or their direction of movement, as they may fail to account for the varying configurations used to capture the different footage presented to them, whether it is for training the models or for classifying new, unseen images.

Traditionally, costly camera calibration for multi-camera tracking systems was essential for ball and player tracking. For fixed-angle cameras, this could be done through scene calibration, where balls were rolled over the ground to account for the non-planarity of the playing surface. Broadcast cameras present additional challenges in that they often change their pan, tilt and zoom. This dynamism needed to be accounted for by using sensors on the camera mounting and lens to measure zoom and focus settings and relate the raw values from the lens encoders to focal length. Gaining access to this advanced filming equipment is often not an option for most Performance Analysis departments within sporting clubs, limiting their capacity to apply advanced tracking of players.

Computer vision has partially solved these limitations. Through image processing, computer vision systems are now able to distinguish between the ground, players and other foreground objects. Methods such as colour-based elimination of the ground on courts with uniformly coloured surfaces allow computer vision models to detect the zones of a pitch, track moving players and identify the ball. For instance, colour-based segmentation algorithms are currently used to detect the grass by its green colour and treat it as the background of the image or video frame, in front of which players and objects move. Moreover, image differencing and background subtraction methods have also been used on static footage to detect the motion of the segmented foreground players against the image background.
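
A minimal sketch of such colour-based segmentation is shown below: it treats green-dominant pixels as the grass background so that everything else can be examined as candidate players, pitch lines or the ball. The margin value is illustrative, not a standard threshold.

```python
import numpy as np

def grass_mask(rgb_frame, margin=30):
    """Rough colour-based segmentation: flag pixels where the green channel
    clearly dominates red and blue, so they can be treated as background."""
    r = rgb_frame[..., 0].astype(np.int32)
    g = rgb_frame[..., 1].astype(np.int32)
    b = rgb_frame[..., 2].astype(np.int32)
    return (g > r + margin) & (g > b + margin)

# Everything that is not grass becomes candidate foreground:
# foreground = ~grass_mask(frame)
```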

Player Tracking

One of the key aims when applying computer vision in sports is player tracking: the detection of the position of every player at a given moment in time. Player tracking is a pivotal element for coaches looking to improve the performance of their teams, allowing them to instantly analyse the ways in which individual players move on the field and the overall formation of their team. Today, the most advanced applications of computer vision in sport use automated segmentation techniques to identify regions that are likely to correspond to players.

The results obtained from a computer vision system can be augmented by applying machine learning and data mining techniques to the raw player tracking data. Once key elements in an image or video frame are detected, semantic information can be generated to create context on what actions the players are performing (i.e. ball possession, a pass, a run, a defensive action and so on), as sketched below. These techniques can label semantic events, such as a one-two pass in football, and be used for advanced statistical analysis of player and team performance. Suggestions can also be constructed on the optimal positions of players on the pitch and displayed to coaches in a manner that allows them to compare ideal player positioning against actual positions in a given play. The vast opportunities created by this player tracking technology have the potential to revolutionise training and scouting in sports.
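
As a toy example of generating semantic context from raw tracking data, the sketch below labels the likely ball carrier in each frame as the nearest player to the ball; the player and ball coordinate arrays and the distance threshold are assumptions made for illustration.

```python
import numpy as np

def possession_labels(player_xy, ball_xy, max_distance=2.0):
    """player_xy: frames x players x 2 coordinates (metres);
    ball_xy: frames x 2 ball coordinates. Returns, per frame, the index
    of the nearest player within max_distance, or None (loose ball)."""
    labels = []
    for frame_players, frame_ball in zip(player_xy, ball_xy):
        dists = np.linalg.norm(frame_players - frame_ball, axis=1)
        nearest = int(np.argmin(dists))
        labels.append(nearest if dists[nearest] <= max_distance else None)
    return labels
```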

Data Collection

Action and event recognition techniques aim to localise the sets of actions that a player performs in both space and time. These techniques can detect events – such as goals, penalties, near misses and shots – during video clips by identifying visual information about the environment, such as court colour and lines on the pitch. They then use that information to classify each action into sport-specific groups by assigning labels (i.e. shot, pass, etc.). Ultimately, action recognition and classification can be used to automatically generate performance statistics for a match or training session, such as shot types, passes or possession. It can also be applied to index videos by predefined themes based on their contents, making it possible to easily browse through footage and automatically generate highlight reels.

How Is Computer Vision Used In Different Sports?

In racket and bat-and-ball sports, such as tennis, badminton or cricket, computer vision has been widely used since the mid-2000s. Ball tracking systems look through each available camera image to identify all possible objects resembling the characteristics of a ball (i.e. searching for elliptical shapes in an expected size range). Once these objects have been detected, they construct a 3D trajectory of the playing ball by linking multiple frames in which the ball was detected, defining the ball’s path across the various camera angles. The results from such a system can then be used to instantly determine whether a ball has landed in or out of bounds. The system can provide further analysis, such as predicting the path that a cricket ball would have taken if the batsman had not hit it.

An example of the use of computer vision in tennis can be spotted at one of the major tournaments in the sport. In 2017, Wimbledon partnered with IBM to include automated video highlights that picked up key moments in a match by gathering data from players and fans, such as crowd noise, player movements and match data. Similarly, on the commercial side, Grégoire Gentil designed a pocket-sized device that calls balls in and out in a tennis match, using computer vision to detect the speed and placement of a shot and determine whether the ball was out of bounds.

Other major invasion team sports have not been indifferent to the emergence of these new technologies. In football, FIFA certified goal-line technology installations in major stadiums using a 7-camera computer vision system developed by Hawk-Eye. It uses a goal detection system with multiple high-speed camera views covering each goal area, detecting moving objects by sorting potential candidates that resemble the playing ball based on area, colour and shape. With an accuracy error of 1.5cm and a detection speed of 1s, it enables football referees to immediately decide whether or not a ball has crossed the goal line and a goal should be awarded.

Aside from widespread implementations of computer vision, such as FIFA’s goal-line technology, other ad-hoc projects have also attempted to incorporate computer vision into football. In the 2011/2012 football season in Germany, Stemmer Imaging helped Impire develop an automatic player tracking system using two cameras in the press area of any stadium. This reduced the number of operators required to get accurate data without losing the quality of the information.

In American sports, such as the NFL, computer vision has been applied to automatically generate offensive formation labeling by classifying video footage based on the coordinates of players when tracked throughout a particular play. This application has supported coaches and analysts in the evaluation of oppositions’ patterns of play by generating a wealth of data on the most common formations employed by rival teams. Furthermore, the system has provided teams with additional information on oppositions’ tactics, such as the likelihood of passing or running out of each formation, run frequency for each side of the field, split between right guard and right end, frequency of runs up the middle, pass frequency on short routes, and average yard gains between running and passing plays.
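
A hedged sketch of this kind of formation labelling is shown below: the starting coordinates of the eleven offensive players are compared against a small, hypothetical library of previously labelled formation templates and the closest match is returned. Real systems are considerably more sophisticated, but the idea of classifying plays by player coordinates is the same.

```python
import numpy as np

def label_formation(player_xy, templates):
    """player_xy: (11, 2) array of starting coordinates for one play.
    templates: dict mapping a formation name -> (11, 2) reference array."""
    # Normalise so comparisons ignore where on the field the play starts.
    xy = player_xy - player_xy.mean(axis=0)
    best_name, best_cost = None, np.inf
    for name, template in templates.items():
        ref = template - template.mean(axis=0)
        # Greedy matching by sorted coordinates keeps the sketch simple;
        # a real system would solve the player-assignment problem properly.
        cost = np.linalg.norm(np.sort(xy, axis=0) - np.sort(ref, axis=0))
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name
```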

Challenges Of Computer Vision

Despite the great potential that computer vision can bring to the world of sport and the field of performance analysis, there are still critical challenges to overcome before that potential can be fully exploited. Some of these challenges relate to the fact that computer vision cannot yet fully compete with the human eye. A system that fully automates video analysis of sports by tracking and labelling players remains a challenge, as optical tracking systems cannot yet cope with the varying body postures a person adopts during sports exercises, nor with the partial or full occlusion of players by equipment or other players during collisions or interactions. Tracking sports players is also particularly challenging due to their fast and erratic motion, the similar appearance of players in team sports and the often close interactions between players. Tracking the ball is a further challenge in team sports, where several players can occlude the ball (i.e. a ruck in Rugby Union) and players may hold the ball in their hands or keep it between their feet.

The reason these continue to be challenges within the field of AI and computer vision is that we still do not completely understand how human vision truly works. Even though the field of biology studies the eye, the visual cortex and the brain, we are still far from fully understanding all the components of such a fundamental function of the human brain. For instance, we do not fully grasp how our memory, past experiences and knowledge inherited through billions of years of evolution influence our perception and our ability to identify elements in our world. This lack of detailed understanding of human vision and our abstract perception makes it difficult to replicate our inherited knowledge of the world in a computer. On top of that, the dynamism, variance and complexity of our physical world pose an extreme challenge for computers, which have to be thoroughly instructed on the types of objects, captured through the lens of a camera, that they must detect, and which are unable to deviate from what they have been trained to identify.

Nevertheless, the field of AI and computer vision continues its rapid development thanks to heavy investments by key players, such as Google, Intel, Amazon and many others, to advance computing power, grow datasets and develop new techniques that get closer to human vision capabilities. These advances will inevitably continue to make their way into the world of sport as athletes and teams aim to leverage modern technologies to improve their performance and become even more competitive. As performance analysts continue to support these athletes and coaches in the objective evaluation of performance, there is little doubt that the expansion of computer vision will eventually transform key areas of Performance Analysis in sport.

Citations and further reading:

  • Brownlee, J. (2019). A gentle introduction to computer vision. Machine Learning Mastery. Link to article.

  • Dickson, B. (2019). What is Computer Vision? TechTalks. Link to article.

  • Dickson, B. (2020). What is Computer Vision? PC Mag. Link to article.

  • Kaiser, A. (2017). What is Computer Vision? Hayo. Link to article.

  • Le, J. (2018). The 5 computer vision techniques that will change how you see the world. Heart Beat. Link to article.

  • Lu, W. L., Ting, J. A., Little, J. J., & Murphy, K. P. (2013). Learning to track and identify players from broadcast sports videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7), 1704-1716. Link to paper.

  • Mihajlovic, I. (2019). Everything you ever wanted to know about Computer Vision. Towards Data Science. Link to article.

  • Monier, E., Wilhelm, P., & Rückert, U. (2009). A computer vision based tracking system for indoor team sports. In The fourth international conference on intelligent computing and information systems. Link to paper.

  • Sennaar, K. (2019). Artificial Intelligence in sports – current and future applications. Emerj. Link to article.

  • Softarex. (2019). Computer vision and machine learning in sports analytics: injury and outcome prediction. Softarex. Link to article.

  • Thomas, G., Gade, R., Moeslund, T. B., Carr, P., & Hilton, A. (2017). Computer vision for sports: Current applications and research topics. Computer Vision and Image Understanding, 159, 3-18. Link to paper.

What is Performance Analysis in Sport?

Since the early-2000s, the analysis of performance in sport has seen a dramatic transformation in both its methods (i.e. incorporating advanced statistical modelling and new analytical frameworks) and technologies (i.e. GPS tracking, time-lapsed notational analysis software and a large variety of tracking sensors and other tracking equipment). What started as shorthand notations with pen and paper has since evolved to advanced computerised systems and technologies that collect vast amounts of performance-related data.

The rise in lucrative financial opportunities in most major sports, thanks to ever-growing revenues from broadcasting deals and rising global audiences, has inevitably raised the stakes of winning. Consequently, sporting organisations are now turning to more scientific, evidence-based approaches when managing their institutions and developing their athletes. The standards required to achieve and maintain success in elite sport are continuously being raised, placing increasing pressure on clubs, coaches and athletes to develop more efficient training structures, enhance athlete development processes and gain a better understanding of the factors that determine success in major tournaments.

This highly competitive environment with constantly narrowing margins has triggered the emergence of Performance Analysis as an independent, yet interdisciplinary, backroom function that specialises in the objective, and most often quantitative, evaluation of performance. This relatively new field aims to support coaches in identifying key areas of performance requiring attention and evaluating the effectiveness of tactical and technical performance, as well as the strengths and weaknesses of upcoming oppositions. Its purpose is to provide valid, accurate and reliable information to coaches, players and any relevant stakeholders to augment their knowledge of a particular area of the sport.

Traditionally, Sports Performance Analysis has been defined as an observational analysis task that runs from data collection all the way to the delivery of feedback, and aims to improve sports performance by involving coaches, players and the analysts themselves. The observation of performance is carried out either live during the sporting event or post-competition through video footage and gathered statistics. Performance Analysts can now be spotted in stadiums, whether in the coaching box or at another good viewing location within the stands, notating events and actions from the match using specialised software, such as SportsCode, Dartfish or Nacsport. In this process, they develop statistical reports that can be sent in real time to the devices used by coaches (i.e. iPhones or iPads) to display a summary of key performance metrics, as well as short video feeds of key highlights. The additional time available in post-match analysis, however, allows for a more detailed evaluation of performance using complementary sources of data. The data used during post-match analysis can come from sources beyond the analyst’s observations, such as qualitative data, video sequences and even measurements of athletes’ exertion, heart rate, blood lactate levels, acceleration, speed and location collected through wearable devices. Some of these data will often be sourced internally within the club, but external sources, such as data providers like Opta, are often utilised across multiple sports to complement internal databases. Training sessions are also subject to analysis, with continuous monitoring of players to inform debriefing sessions by coaches and help plan the next session.

Research in the field has also emerged as its own specialised field. The International Journal of Performance Analysis in Sport now regularly publishes studies on key sports analysis research areas, such as the identification of key performance indicators, injury prevention through work-rate analysis and physical analysis, movement analysis, coaches’ behaviours and feedback processes, effectiveness of technique and tactics, normative profiling, overall match analysis and even the analysis of referees’ performance.

Performance Analysis As Its Own Backroom Function

Over the last two decades, Performance Analysis has established itself in many top sporting clubs and organisations as a pivotal element in the extrinsic feedback process that coaches use to accelerate learning and help athletes reach their optimal performance levels. It is now considered its own separate function within the backroom staff of a team, having differentiated itself from other sports science disciplines through its core focus on quantitative performance evaluation, yet with a high degree of cross-functional overlap requiring it to maintain a close relationship with the wider sports science disciplines. For instance, a work-rate analysis performed by a Strength & Conditioning department may complement the work of a Performance Analysis team in informing player selection based on both performance metrics and player fitness.

The Purpose Of Performance Analysis In Sport

The large volume of quantitative and qualitative information produced from the complex and dynamic situations in sport needs to be carefully disseminated and clearly presented – using clear visuals such as tables, charts or special-purpose diagrams of the playing surface – to allow coaches to obtain quick insights into areas requiring their attention. Performance Analysis enhances the coach’s ability to ‘feed-forward’. It aims to anticipate an opposition’s strengths and weaknesses by performing thorough opposition analysis, producing knowledge that allows the team to rehearse appropriate plays and improve the individual skills that would help outperform the upcoming opponent.

The insights generated through Performance Analysis work, such as opposition analysis, help coaches make informed decisions on tactical choices and squad selection that better exploit the weaknesses and counter the strengths of a given opponent. Traditionally, these decisions were made entirely on the basis of a coach’s acquired wisdom from years of experience in the sport, often having previously played at elite level themselves. However, studies have repeatedly shown that coaches’ recall of critical incidents that take place in a sporting event is limited to between 42% and 59% of events. On top of that, the events that are remembered are prone to incompleteness, emotional bias, inaccuracy and misinterpretation due to natural flaws in human perception and cognitive capacity. To cover for these limitations in an increasingly competitive environment, coaches have turned to technology and analytics to have immediate access to both objective information on past events and instant video footage to review specific events they wish to recall and re-evaluate. For this, most top-level coaches now benefit from their own Performance Analysis departments that provide the necessary data collection, data manipulation, analytical and video analysis skills, allowing coaches to take advantage of the vast amounts of information generated from their sport while receiving the key elements most important to them in a clear, timely and concise manner.

The Scope Of Performance Analysis In Sport

Technical Analysis

The development of better athletes, from elite levels to grassroots programs, has been a key focus of the field of Performance Analysis in sport over recent years. The mechanical details of the skills performed by athletes are carefully analysed to detect flaws in technique, monitor progress, identify changes during preparation or even assess rehabilitation from injury. The effectiveness with which an athlete performs specific skills or a broader passage of play is measured, compared and classified, either positively or negatively, against a predetermined expected outcome. For example, a football coach may expect a minimum passing completion rate from their midfielders or a minimum speed from their wingers. Often, these measurements are presented as ratios or percentages of successfully performed skills, such as the percentage of passes completed or tackle success rate. They are then used to develop performance profiles of players that can be benchmarked and compared against teammates or rival players.

Tactical Analysis

Similarly, the tactical analysis carried out by Performance Analysts helps coaches better understand the impact of their tactical decisions. It can also help identify specific tendencies and preferred tactical setups of opposing teams. By leveraging the latest video analysis and player tracking technologies, Performance Analysts are increasingly capable of evaluating patterns of play in conjunction with the skills performed, location on the field, timings and players involved to draw an accurate representation of tactical variations in particular match scenarios.

Physiological Analysis

Player movements are also carefully assessed to ensure they achieve positions of advantage, as well as desired velocities, distances covered and speed ranges. This line of work by Performance Analysts closely complements the work of a Strength & Conditioning team. The aim is to enable the athlete to achieve their optimal physical condition by providing performance analysis in areas relating to their strength, power, endurance, agility, stability and mobility. Injury prevention is also a priority, especially in sports with intense physical contact where the likelihood of injury is high. GPS trackers and other wearable technologies are combined with video analysis to understand the physical efforts that players go through during training and matches and allow coaches to better manage the intensity of sessions.

Psychological Analysis

Psychological training is a key element of the coaching process when it comes to mentally preparing athletes for the pressures of a sport and the challenging conditions that may impact their motivation and their ambition of reaching their desired goals. Performance Analysts are able to support coaches through the evaluation of an athlete’s discipline, exertion, effort and other fluctuations in work-rate that could be associated with mental factors, in an attempt to minimise the effects of negative mental influences and positively influence athletes. Most often, Performance Analysts use their video analysis abilities to create motivational clips and video highlights that support coaches in the mental preparation of their teams and athletes.

Equipment And Technologies In Performance Analysis In Sport

Today, most Performance Analysis departments at elite clubs start their analytical process by recording video footage of training sessions and competitive events. Often, more than one HD camcorder is set up at high viewpoints on the sidelines of training pitches or stadiums to collect footage from various angles, whether a closer angle capturing just a few players or a wider angle covering full sections of the pitch. In some instances, drones are also used to capture an even wider angle from above the players to clearly identify gaps during plays or structural setups and formations. Certain actions during training sessions may also allow the Performance Analyst to get physically closer to the play and use a handheld camera, such as a GoPro, to capture an additional angle that shows closer movements and player technique. The footage from the camcorders is captured onto SD cards inside the cameras or directly into a laptop using media management software, such as Media Express from BlackMagic Design. Often both are used in conjunction to act as a backup of each other. Alternatively, for matches or competitive events that are televised, Performance Analysts may obtain video feeds directly from the broadcasters, freeing up their time to perform additional real-time data collection and analysis during the event.

Once the video footage is gathered, Performance Analysts leverage the capabilities of time-lapsed computerised video analysis software, such as SportsCode, Dartfish or Nacsport, to notate key events and actions and generate meaningful data for later analysis. These solutions allow them to replay the training session or match and tag key events to construct a database with frequency counts, the length of specific actions and supporting contextual information for each individual action (i.e. whether a tackle was successful or an opportunity was missed). Coaches and players can later go through the coded timeline of the event and view specific video highlights automatically generated by the software. Analysts then export the frequency data into data manipulation and analysis software, most often Microsoft Excel, to perform further analysis and combine it with historical datasets, data from wearable tracking devices - players often wear GPS trackers, such as Catapult, StatSports or Playertek - or even data obtained from external sources and data providers, such as Opta.
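
For illustration only, the sketch below shows how the frequency counts described above could be derived from a coded timeline exported to CSV; the column names (event, player, outcome, start_time, end_time) are hypothetical, and pandas stands in for the Excel step.

```python
import pandas as pd

# Hypothetical export of a coded timeline from the video analysis software.
timeline = pd.read_csv("coded_timeline.csv")

# How often each event occurred, how successful it was and how long it lasted.
timeline["duration"] = timeline["end_time"] - timeline["start_time"]
counts = timeline.groupby(["player", "event"]).agg(
    frequency=("event", "size"),
    success_rate=("outcome", lambda s: (s == "successful").mean()),
    avg_duration=("duration", "mean"),
)
counts.to_csv("frequency_report.csv")   # ready to combine with GPS or Opta data
```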

The insights generated from the analysis are then delivered to the interested parties, whether coaches or players. The method of delivery varies greatly from club to club and depends heavily on the audience receiving the information. Summary reports may be printed and distributed amongst players and coaches with key statistics and areas requiring attention. On other occasions, data visualisation software such as Tableau may be used to interactively display charts and other visuals of team and player performance. Most often, coaches and players get a great deal of value from watching replays and highlights of the areas being analysed. Therefore, analysts often create short highlight clips using video editing tools, such as CoachPaint, KlipDraw, Adobe After Effects or Premiere Pro, or simply Apple’s iMovie application, to produce a combination of notated footage that clearly conveys the information they want to portray to the coaching staff and team.

What Is Next For Performance Analysis In Sport?

As technology continues to evolve and data-related solutions increasingly bring new functionality, the field of Performance Analysis will continue to grow. New technologies will bring new opportunities for sporting organisations to become even more competitive and better maximise their athletes’ potential. Inevitably, as a club’s main goal is to outperform and outsmart its competitors, this will continue to raise the standards of success in all major sports, and investment in the solutions and human resources that allow clubs to exploit these new opportunities will continue to increase over time, given that the financial incentives of winning will remain lucratively attractive to owners and investors.

However, further advances in technology and the sophistication of processes will also bring new complexities to the environment in which Performance Analysts operate. This will place additional pressure on the skills demanded in the field, where not only a good acumen of a sport and its coaching processes will be needed, but also highly technical skills to effectively navigate a growing data ecosystem. Inevitably, some of the current manual and repetitive tasks will be automated using modern solutions. For instance, analysts often make use of video analysis software to manually code every single event as it takes place in the footage. Computer vision could eventually replace these repetitive and labour-intensive data collection tasks by automatically detecting and tracking players and moving objects (i.e. the ball) on the field and performing frequency counts using pre-programmed functions. This automation would enable clubs to free up resources in their Performance Analysis departments and allow analysts to reallocate their time to generating insights through deeper analysis of the collected data.

The field of Performance Analysis is, today, at an early stage. Different sports are at different points in their adoption of this new and critical function inside their backroom teams, and some do not yet consider Performance Analysis a priority when hiring and developing such teams. The novelty of the field, a limited understanding of its use and benefits by owners and club decision-makers, as well as a competitive labour market, where wealthy companies from other industries are also interested in hiring individuals with an analytical and technical skillset, have challenged the consolidation of Performance Analysis in certain sports. However, not all sporting clubs and institutions have been slow to incorporate specialised analysis of performance. Wealthier and more established clubs have been able to experiment with and appreciate the benefits of investing in the skillsets that have allowed them to better understand key factors of success and develop their athletes’ performance through acquired knowledge that has placed them above their rivals. These innovative actions taken by top-tier teams have usually had a trickle-down effect on the rest of the clubs within a sport, with rivals following suit in order to remain competitive. As the field continues to grow in line with technology, we will undoubtedly see an exciting evolution in the composition and structure of coaching teams and sporting organisations as a whole.

Citations and useful resources:

  • Laird, P., & Waters, L. (2008). Eyewitness recollection of sport coaches. International Journal of Performance Analysis in Sport, 8(1), 76-84.

  • McGarry, T., O'Donoghue, P., & de Eira Sampaio, A. J. (Eds.). (2013). Routledge handbook of sports performance analysis. Routledge.

  • O'Donoghue, P. (2009). Research methods for sports performance analysis. Routledge.

  • O'Donoghue, P. (2014). An introduction to performance analysis of sport. Routledge.

The Increasing Presence Of Data Analytics In Golf

Dating back to the 15th century, golf is one of the most traditional sports in the world. Even in its modern form, it continues to maintain most of its original characteristics and etiquette from centuries ago. However, golf has not been immune to the technological revolution that has seen many individual and team sports adopt the latest data technologies to optimise performance and enhance entertainment value for fans.

In today’s golf, every single aspect of the game, from a player’s swing to their round strategy and even the equipment they use, is being transformed through scientific advances, data analysis, machine learning and cloud technologies. Impressively, this highly traditional sport has rapidly embraced data analytics as a means to provide a deeper understanding and enjoyment of the game. As a sport with some of the tightest margins amongst its elite players, where a single dropped shot can cost a tournament, golfers have turned to technology to develop intelligent and information-rich training regimes and strategies to improve their chances of winning.

The Largest Golf Database By PGA Tour

One of the first developments that triggered the data revolution in golf dates back to 2003, when PGA Tour partnered with CDW to create an advanced ball-tracking system: ShotLink. The concept of ShotLink was first designed in 1983 as an electronic scorecard to catalogue historical data. However, technological advancements allowed CDW and PGA Tour to develop an improved system that aimed to break down every detail of every stroke taken by every player to facilitate the analysis of each player’s round and overall performance. The objective was not only to help players improve their game through data, but was also considered as an attempt by the Tour to help make the sport more accessible to modern players and fans.

Since its launch, ShotLink has dramatically evolved over the years to the point that it can now laser map each golf course and create a digital image of each hole to calculate exact locations and distances between any two coordinates, such as the location of all players and their distance to green. The system has been continuously upgraded in line with its increasing adoption by most of golf’s data ecosystem, through apps, devices, software and consultancy agencies available today.

One of the latest improvements PGA Tour has made to its data collection system is the installation of three fixed, high-resolution cameras on every green, replacing the human-operated lasers, to capture the ball in motion. Thanks to ShotLink, PGA Tour has managed to build a database of 174 million shot attributes and 80,000 hours of video over the system’s 20 years in operation. But once the data had been collected, practical insights needed to be produced from the large number of individual data points gathered over the years. To make sense of such a large dataset, they partnered with Microsoft to leverage artificial intelligence through Azure cloud-based services and create a Content Relevancy Engine (CRE) that processes ShotLink’s immense database to find the most relevant, interesting and contextual stats.

Today, ShotLink is used in 93 events per year. Its data feeds are accessed by broadcasters as well as top-flight players, who use the statistics from the system to analyse and compare their performance against competitors and improve their play. But players are not the only ones to have benefited from the introduction of this high-tech system. Through ShotLink, the PGA Tour has managed to enhance viewers’ entertainment experience by making the ball highly visible on television.

The statistics captured through ShotLink have also been turned into eye-opening insights that have increased the level of engagement of many golf fans. With unprecedented data available for analysis, PGA Tour was able to uncover valuable insights about the different patterns of play amongst top PGA players. Some of these interesting insights included:

  • Winning players tend to make a higher number of putts between 11 and 20 feet away.

  • A third of all putts are over 20 feet of distance, with better golfers often leaving themselves 3 feet or closer on the first putt.

  • 99% of PGA players make putts from within 3 feet.

  • Top golfers rarely three-putt or worse.

  • Hitting the fairway means the PGA golfer will, on average, score under par on the hole.

  • Top players average under par after hitting the rough, which adds 0.25 of a shot to the hole.

  • The most frequent approach shot distance range is 150-175 yards. From there, 71% of PGA golfers hit the green from the fairway; but need to be between 75-100 yards to hit 71% from the rough.

  • Golfers gain shot advantage instead of losing it if they aim 25, 30 or 35 yards back to avoid the rough or other hazards.

  • Golfers should always aim for the green instead of laying-up on a par 5 that has no water or hazards around the green. This allows them to hit their third shot from within 50 yards of the hole, increasing their chances of cutting their putting distance and error rates in half.

  • An improvement of a half-stroke per round increases a player’s earning potential by 73 percent.

Development Of Data Gathering Systems, Devices And Smart Equipment

The technological revolution in golf has brought new devices and systems that can now provide statistical analysis to enhance training, playing and viewing experience of the sport.

One of the most crucial and difficult aspects of golf is the swing. It is considered one of the most complex sequences of movements in any sport, with muscle groups across the whole body involved in providing the millimetric, biomechanical prerequisites to transfer the swing energy efficiently and accurately to the golf ball. It is therefore not surprising that swing sensors, grip guides, shot trackers, laser rangefinders and even virtual caddies that help inform and improve the swing in varying circumstances have become increasingly prevalent amongst professional and amateur golfers seeking the perfect swing.

Some of these devices include systems like TrackMan or K-Motion, which monitor granular variations in motion using a combination of HD cameras and microwave transmissions that reflect back from a moving golf club and golf ball and capture data of what happens at the exact moment of contact with the ball. Others, such as inertial sensors and depth cameras for 3D analysis like Golf Integrated, have been used to evaluate the swing of golfers in relation to their joint length and initial posture. These systems are able to display many factors of the golfer’s swing, such as club head launch speed, distance carried and ball spin. With the captured movement, they provide expert interpretative biomechanical reporting on body, arm, hand and club motions, as well as balance and weight distribution, during each golf swing.

Additionally, systems that use high-speed, high-resolution cameras, such as Foresight Sports’ GC2 Smart Camera System, are also able to measure club performance and ball launch data, such as ball speed, total spin, launch angle, deviation angle and spin tilt axis, to determine the ball’s trajectory, peak height, angle, distance in relation to initial launch conditions and total final distance including bounce and roll. In combination with Foresight Sports’ HMT Head Measurement Technology, these devices can measure the delivery of the club head in terms of path, face plane, closure rate, velocity and impact location on the golf ball. All these data points are intuitively displayed in Foresight Sports’ Performance Fitting app using illustrated depictions of ball flight and club head data.

Traditional golf equipment is also experiencing significant change with the incorporation of analytics and technology into its manufacturing. Cobra Golf’s KING F8 club line features connected smart grips powered by an embedded Arccos computer sensor that tracks and analyses a golfer’s performance through shot tracking, distance calculation and location. These clubs come with their own smartphone app that uses GPS to track positioning and displays multiple analytics on the golfer’s performance, such as strokes gained and handicap breakdowns for driving, approach, chipping, sand and putting. Golf balls are also getting smarter. Coach Labs’ GEN i1 and i2 smart golf balls and OnCore’s Genious Ball now contain a nine-axis sensor and an on-board MCU that act like a miniature launch monitor, measuring initial direction, speed, impact force and ball rotation during putting, and direction, spin rate, distance and speed in the full swing, before transmitting the data to a smartphone app.

Amateur players have also seen their golfing experience expand thanks to technology. For instance, recreational players can now enhance their playing skills and enjoyment of the game through systems such as virtual caddies. Arccos Golf developed its Arccos Caddie solution, which uses wireless sensors mounted directly on the player’s golf clubs, as well as GPS trackers from smartphones, to collect player performance data in real time. The system can track which clubs the player uses, where they hit the ball and how many shots it took to complete each hole, broken down into driving, approach, chipping, sand and putting. Arccos Caddie uses Microsoft’s Azure Machine Learning to leverage artificial intelligence against the 120 million shots and 368 million geotagged data points stored in its system from 40,000 golf courses to provide golfers with specific advice on how far to hit each shot, which club to use and how to make corrections as they play their round. It also offers golfers their optimal strategy off the tee after considering their likely shot distance as impacted by wind, weather, elevation and other factors. It can also calculate their expected score and odds of making par, their likelihood of hitting the fairway and their chances of missing to either side. For example, it can detect a player’s tendency to miss fairways to the left with the 3-wood, or even a glaring inability to hit the green with the 8-iron.

Generating Valuable Information By Contextualising The Data Collected

Sensors, GPS, cameras and other tracking devices are unable to paint a complete picture of a player’s performance without the underlying analytics to tell the story. Even though increasing amounts of raw data points, such as swing speed, can now be captured with these new devices, analytics is pivotal to generate value and context from such vast data.

In 2017, GolfTEC tested 13,000 pro golfers and amateurs across 48 different body motions per swing using motion sensors, cameras and monitors in a study they labelled the SwingTRU Motion Study. The study aimed to define what makes a great golfer. They found that the difference between a competent golfer and a top one can be summarised in their hip sway and shoulder tilt at the top of the swing and at the point of impact, as well as the hip turn at the point of impact and the shoulder bend at the finish of the swing. By statistically correlating these factors with better performance, GolfTEC developed a benchmark against which golfers can compare themselves in these different areas and make improvements.

Moreover, USGA is making use of its database of 2 million golfers and 50 million scores collected through the Golfer Handicap Information Network by developing an algorithm that creates a professional-style benchmarking ability at the recreational level to allow golfers at all levels to compare their game against others and gain insight into how they are playing. For example, this system enables amateur golfers to compare their Saturday’s round on a relative basis among the 150 others who played the same course that day.

Furthermore, numerous in-depth golf analytics websites, such as GOLFstats.com or the official PGA Tour website, have emerged to take advantage of the technological wave in golf and make the data accessible. These websites provide fans and players with access to vast amounts of statistics on professional golfers and tournaments at an incredible level of granularity (i.e. a player’s longest driving average or number of fairway hits). Additionally, the Canadian site DataGolf.org has made available a live statistical model that displays every player’s winning chances for every PGA Tour and PGA European Tour event as they happen. By mid-2018, their predictive model was outperforming most major betting companies. They present their data through outstanding charts and other visualisations, including historical numbers dating back to 1990.

Other websites and mobile apps, such as ShotByShot.com, Arccos 360, Anova or Golfmetrics, have also started to leverage advanced analytics to improve amateur golfers’ games. Any player can now access tools that allow them to easily and accurately track different data points of their game, from driving, approach shots and sand shots to putting. These apps statistically break down a player’s game to help them identify the areas that would most significantly improve their overall performance. They aim to accurately pinpoint a player’s strengths and weaknesses in driving, approach shots, short game and putting, and in more detailed subcategories, using the strokes gained metric popularised by Mark Broadie. A player enters their scores in the app, which in turn calculates their strokes gained values and compares them against golfers at various levels. The website or app records and analyses the player’s data, determines the relative handicaps of each part of their game and then identifies the highest improvement priority and the contributing factors to improve their game.

Data Analytics Agencies Are Helping Golfers Make Sense Of Their Performance Data

Performance Analysis agencies and consultancies, such as Golf Data Lab or TeeBox Golf, have started to emerge in professional golf. These agencies often provide golfers with tailored technical support and produce objective analysis of their game to identify trends and assess strengths and weaknesses. Teams of analysts record a golfer’s round and provide them, or their caddy, a detailed breakdown of their performance with comparisons against previous rounds and other competitors. Some of the statistics collected and analysed by these agencies include:

  • Driving accuracy to fairway

  • Par 3, 4 and 5 accuracy analysis

  • Long, medium and short iron approaches

  • Short game analysis (<50 yards)

  • Putting analysis (including data such as conversion per distance, 3 putt frequencies, tap-in rates and missed putts analysis)

  • Clubs used and club efficiency

  • Shot types

  • Dropped shots analysis

  • Comparisons with PGA averages

  • Drive versus approach analysis

  • Strike quality examination

  • Directional tendencies

Consultancies like 15th Club, an unofficial stats partner for the 2016 European Ryder Cup team, have now established themselves as key influencers in the European game, informing everything from the qualification process and captain’s picks to the partnerships and singles order. Through their valuable application of data intelligence, they have become another crucial voice in the preparation of European Tour players and the definition of their training structures. They now work with over 40 professional golfers, who have seen an average increase in earnings of $600,000 by improving their scoring by 0.15-0.25 strokes per round. Similar to ShotLink in America, 15th Club uses GPS, lasers and cameras operated by a team of people to collect all the necessary data points to build their algorithms and models. Additionally, they offer a visualisation platform, Waggle, for players to access their performance data. Some of the statistics available in Waggle include strokes gained against the field and top three and bottom three strokes, as well as other traditional stats.

New science-based, data-driven golf training centres, such as Every Ball Counts, have recently been established to help elite pros and serious amateur golfers through demanding physical and mental training sessions. Aside from leveraging several of the technologies previously mentioned, Every Ball Counts also developed an algorithm with Harvard University that takes a player’s ShotLink data, looks at 900 data points and calculates 19 different metrics to formulate a game plan for improving the golfer’s game.

New Metrics Are Leaving Traditional Statistics Behind

One of the most popular metrics to appear from the analytics revolution in golf in recent years is strokes gained. The strokes gained metric was developed in 2011 by Mark Broadie, author of the 2014 best seller Every Shot Counts, as an attempt to modernise the more traditional golfing stats previously employed, such as driving distance or putts per round. One of the issues Broadie discovered with traditional statistics related to counting the number of putts per round: this conventional metric did not take into account the distance of each putt. In other words, players who hit their approach shots closer to the hole may have fewer putts per green in regulation than a player who is a superior putter but doesn’t hit their approach shots as close. Instead, strokes gained adjusts for the initial distance of the putt and other relevant factors to give a more accurate representation of the golfer’s skill level.

To calculate strokes gained, an analysis was performed on ShotLink’s database of 15 million shots from players across every PGA tournament to determine the value of each shot by benchmarking it against the average of historical shots with similar characteristics. The result is a model that predicts a golfer’s expected score for each hole on a shot-by-shot basis. Mark Broadie applied simulation techniques to analyse different strategies using different clubs and targets off the tee, simulating thousands of shots and playing the hole thousands of times with different strategies to identify the most effective one. He also applied dynamic programming, optimising the sequence of play on a hole and deriving the best strategy off the tee by working backwards from the green to determine what the target of the first shot should be.
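
To make the idea concrete, the sketch below applies the usual strokes gained logic, crediting each shot with the expected strokes from its start position minus the expected strokes from its end position minus one, against a tiny, made-up baseline table; the real calculation uses ShotLink’s historical averages rather than these illustrative numbers.

```python
# Illustrative baseline of expected strokes to hole out from a few positions.
# These numbers are made up for the example, not real ShotLink averages.
BASELINE = {
    ("fairway", 150): 2.95,   # 150 yards out in the fairway
    ("green", 25): 1.95,      # a 25-foot putt
    ("green", 3): 1.05,       # a 3-foot putt
    ("hole", 0): 0.0,         # ball holed
}

def strokes_gained(start, end):
    """Credit one shot: expected strokes at start - expected strokes at end - 1."""
    return BASELINE[start] - BASELINE[end] - 1

# An approach from 150 yards that finishes 25 feet from the hole is exactly average:
print(round(strokes_gained(("fairway", 150), ("green", 25)), 2))   # 0.0
# Holing the 25-footer gains roughly a stroke on the field:
print(round(strokes_gained(("green", 25), ("hole", 0)), 2))        # 0.95
```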

Since its development, strokes gained has allowed golfers to better understand where they gain or lose ground. Mark Broadie began discovering aspects of the game that contradicted common beliefs. For example, he found that putting accounts for only 15 percent of the scoring difference between better players and average players, with the biggest difference actually lying in ball striking, especially the number of penalised shots that higher handicappers hit. In essence, the long game is the separator between the best pros and average pros, since it explains about two-thirds of the scoring differences. A putt from 27 or 30 feet on the green does not matter as much as a shot in the bunker or the shot that lands on the green instead of the rough. The distribution of importance that Broadie found suggested that approach shots accounted for 40% of players’ scoring advantage, while driving was responsible for 28%, the short game for 17% and putting for the remaining 15%.

Data Analytics In Course Management

Aside from the direct benefits to a golfer’s play, courses all around the world have also made use of technology to improve their grounds. Data systems are allowing golf clubs to track every single shot played on their course in relation to handicap, age, gender, weather conditions, pace of play, tee usage and pin locations and provide them a detailed understanding of the interaction between players and the various features of their golf course. The aim is to efficiently improve golfer experience by increasing playability, course strategy or difficulty, environmental impact or pace of play, while reducing maintenance costs through reductions in redundant water, chemical and fertiliser usage, green, fairway, tee sizes and bunker volume and size in areas of little to no play. Companies like Golf Course Architecture are also providing golf-course operators with smartwatches that are worn by members to track every shot hit and its location, while golfers get all their statistics in real-time as they play.

How Are Pro Tour Golfers Applying Data To Their Play?

In recent years, a new generation of professional players has employed statisticians and data analysts to analyse the vast amounts of data available and identify their strengths and weaknesses against those of their opponents in order to improve their performance and define winning strategies. One of these golfers is Rory McIlroy, who has made heavy use of the 32,000 data points per event that the ShotLink system captures to benchmark himself against everyone else, particularly using statistics such as strokes gained.

In 2012, Dustin Johnson found immediate results when data analysis revealed that he ranked 166th in the wedge game. After identifying this specific area of weakness and fine-tuning his wedges using a high-tech TrackMan device to monitor and improve the accuracy of his short game, he managed to improve his approach shots from 50 to 125 yards. By 2016, he had risen to fourth in that ranking.

Other golfers, like Brandt Snedeker, also embraced technology as early as 2011, when he became the first tour player to hire a full-time analyst. By 2015, using radar technology to track his swing, he determined that his best swing launched the ball at 12 degrees with a spin rate of 2,400 revolutions per minute. He then used this information as a baseline when testing and acquiring new equipment incorporating the latest advances in design, verifying whether it improved his performance.

Other examples include Danny Willett, who in 2016 made use of 15th Club to gain access to a team of golf professionals, data experts and software engineers who analysed ball locations at Augusta National and helped him plot his winning strategy for the 2016 Masters Tournament. The strategy consisted of taking advantage of Willett’s strong wedge game between 75 and 100 yards on par 5s when his tee shots went wrong. He went on to win the tournament, making 11% of shots above par compared to the field average of 26%.

Luke Donald, through his golf coach Pat Goss and with the help of Mark Broadie, also rose through the ranks by taking advantage of analytics and the strokes gained formula to understand where to improve and to inform the design of practices targeting specific statistics. These statistics showed Goss that even though Donald did not drive the ball far, he was very good at the short game and putting. This allowed them to define a winning strategy in which Donald had to gain almost a full shot in putting and the rest from the short game inside 100 yards and from iron play, while just breaking even with the driver.

Today, data analysts in golf are becoming as important to tour pros as swing instructors and fitness trainers. They parse statistics to create better training plans and arm the golfers with game plans for each week. As data gets more complex and margins tighter, data analytics and the integration of technology in the sport will continue to rise and gain in importance. Golfers seem to have understood and accepted that and appear to be embracing the ever-growing technological revolution in sport.

Citations:

  • Chansanchai, A. (2018). PGA TOUR launches a new solution that gives golf fans more personalized content experiences. Microsoft News. Link to article.

  • Cloke, H. (2019). Data-driven design. Golf Course Architecture. Link to article.

  • Corcoran, M. (2019). Wise guys: Data Golf is taking analytics to a whole new level (pay attention, gamblers). Golf.com. Link to article.

  • Dusek, D. (2018). By the numbers: Analytics become an increasingly important part of golf. Golf Week Digital Edition. Link to article.

  • Greenberg, N. (2018). PGA Tour is embracing artificial intelligence, and it could change how you watch golf. The Washington Post. Link to article.

  • Kramer, S. (2018). This Is How Technology Meets Golf. Forbes. Link to article.

  • Lisota, K. (2016). How Dustin Johnson used data and analytics to become one of the best golfers in the world. Geekwire. Link to article.

  • Martin, S. (2015). Q&A with the godfather of golf analytics. PGA Tour Online. Link to article.

  • Morgan, T. (2016). Data analytics in golf: How a revolution in preparation is changing the sport. International Business Times, Sport, Golf. Link to article.

  • Ray, S. (2017). Don’t let the pencils fool you: Golfers are teeing up a tech revolution. Microsoft News. Link to article.

  • Schupak, A. (2017). Pro Golfers Find Winning Rounds From Numbers Crunching. The New York Times. Link to article.

  • Tour Insider. (2019). The World Of Golf Analytics. Tour Insider Today’s Golf. Link to article.

  • Wacker, B. (2019). Why a little stat analysis goes a long way on the PGA Tour. Golf Digest, Golf World. Link to article.

  • Wooden, A. (2019). The Secret To The Perfect Golf Swing Is Hidden In The Numbers. Intel Technology Cloud Analytics Hub. Link to article.

  • Woodie, A. (2017). Optimize Your Golf Game with Advanced Analytics. Datanami. Link to article.

  • Wong, W. (2015). Golf Gets into the Swing of Analytics. BizTech Business Intelligence. Link to article.

Impact of Data Analysis And Technology in Rugby Union

In August 1995, the International Rugby Board declared Rugby Union a professional sport. As we approach the 25th anniversary of the professionalisation of Rugby Union, it is worth reflecting on the evolution of the sport over the last two and a half decades. The game has experienced incredible change, with multi-billion worldwide audiences, broadcasting agreements and lucrative contracts for players, coaches and clubs. This rise in popularity raised the standards of performance demanded at the elite level. Competitive margins became tighter as athlete development, coaching processes and overall club management became more complex. The incentive to win in order to attract sponsors and broadcasters became a major focus, and so did the efforts of clubs to acquire an extra competitive edge over their opponents. This added complexity triggered the emergence of new backroom functions, from those dealing with the physiological, psychological or biomechanical aspects affecting players (i.e. Strength & Conditioning coaches or Team Psychologists) to those providing an objective evaluation of performance and addressing the need for a better understanding of the determinants of success in the game (i.e. Performance Analysts).

Emergence of the Use of Technology and Data

Over the years, advancements in technology and data management processes across all top sports have led the way in better defining individual and team performances, and Rugby Union is no exception. Coaches and other backroom staff can now be seen in the stands with a wide variety of computers and technology monitoring all aspects of the match in great detail. Different camera angles, data and analysis are available to them there and then to make instant decisions, as well as for post-match reviews.

Sports Performance Analysis in Rugby

VIDEO ANALYSIS TECHNOLOGY

Amongst the many new practices emerging from the use of technology, the introduction of video analysis into the coaching process has enabled dynamic and complex situations in sport to be quantified in an objective, reliable and valid manner. Time-lapsed software packages like SportsCode have enabled Performance Analysts to analyse match or training footage by manually tracking event frequencies and creating datasets for later analysis. Thanks to SportsCode and other video analysis software, these datasets are also linked to video footage for better contextualisation during review.

RESHAPING BACKROOM STAFF PROFILES

The way in which the collected data is used is also evolving, from basic visualisations, historical reports and dashboards to more complex prescriptive approaches that provide more informed recommendations and can predict possible outcomes. This change is being driven by a new generation of Sport Scientists and Performance Analysts who have come into rugby with an increasingly strong background in data and analytics. With the support of coaches willing to listen to data, they are changing the culture within clubs towards a more evidence-based approach to performance. These analysts not only analyse all aspects of their team’s performance but also aim to detect the strengths and weaknesses of their next opposition for coaches to use in their game plan. Thanks to the latest technologies and the availability of data through third-party providers like Opta, they can now perform incredibly detailed analysis, such as breaking down an opposing key player’s kicking game (i.e. the types of kicks, when he made them, from which part of the field and the distance he tended to achieve) or identifying who the key players are in an opposition’s running game.

IMPROVED TRACKING EQUIPMENT AND DEVICES

In today’s modern rugby, all leading Rugby Union clubs use data to monitor fitness, prevent injuries and track players’ positions through devices such as wearable GPS trackers. The data captured from these technologies has played a key role in preventing player injuries. GPS technology company Catapult, which develops wearable devices sewn into the back of players’ shirts, recently aimed to deepen the use of data in rugby by launching a unique set of algorithms engineered to quantify key technical and physical demands in the sport. They achieve this by automatically detecting scrums, kicks and contact involvements in Rugby Union players. The insights this data provides on the physical demands imposed on players give coaches crucial information to manage the load placed on players during training and matches, maintaining adequate levels of fitness while preventing injuries from physical overexertion. Coaching staff can now see the levels of effort put in during training sessions and, by monitoring the players’ thresholds, they can better design training sessions to keep the players fresh for the games. One of the benefits of Catapult’s Rugby Suite is the measurement of contact involvement duration (i.e. the time a player takes to get back to their feet, also known as Back In Game Time). This allows strength & conditioning coaches to identify player fatigue levels and their intent when returning to the defensive or offensive line.
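
To make this concrete, below is a minimal sketch of how a Back In Game Time style metric could be derived from contact-event data. The column names, example timestamps and five-second fatigue threshold are all hypothetical and do not reflect Catapult’s actual data model or algorithms.

```python
# Hedged sketch: derive a "time back to feet" metric from hypothetical contact events.
import pandas as pd

# Hypothetical columns: when each contact involvement ended and when movement resumed.
events = pd.DataFrame({
    "player": ["A", "A", "B"],
    "contact_end": pd.to_datetime(["2024-03-01 15:04:10", "2024-03-01 15:12:42", "2024-03-01 15:05:01"]),
    "movement_resumed": pd.to_datetime(["2024-03-01 15:04:13", "2024-03-01 15:12:49", "2024-03-01 15:05:03"]),
})

# Back In Game Time: seconds between the end of the contact and movement resuming.
events["back_in_game_s"] = (events["movement_resumed"] - events["contact_end"]).dt.total_seconds()

# Per-player summary and involvements slower than an arbitrary 5-second fatigue threshold.
summary = events.groupby("player")["back_in_game_s"].agg(["mean", "max"])
slow = events[events["back_in_game_s"] > 5.0]
print(summary)
print(slow)
```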

Source: Business Insider - Credit: Harlequins/Catapult

INNOVATIVE TECHNOLOGY TO ENSURE PLAYER WELLBEING

Another key area strongly impacted by technology is concussion. Concussions are a growing issue in the sport, with affected players at risk of eventually suffering from chronic traumatic encephalopathy, a degenerative brain condition with symptoms similar to Alzheimer’s. This has been a focus of technological developments aiming to better prevent and monitor head injuries across various contact sports. Historically, pitch-side doctors have relied on player honesty for their risk assessment when deciding whether a player should return to play. However, companies like OPRO+ are now building impact sensors into the personalised gumshields frequently worn by players to protect their teeth. By having impact detection technology closer to the centre of the skull, doctors can paint a more accurate picture of the forces involved in each impact. OPRO+ can transmit impact data to a laptop in real time so that pitch-side doctors can assess whether a player requires further assessment. This has proven particularly important in training sessions, where 20% of head injuries take place, although most of them go unseen. Thanks to this technology, coaches are now able to assess the forces exerted by players during drills and adjust the practice accordingly to avoid undetected head injuries. This type of tracking technology could eventually help develop a digital passport of historical head impact data for individual players, which could help lengthen their careers by preventing early retirement due to poorly treated head injuries.

Further advancements in the use of technology to prevent concussions were introduced as recently as five years ago across the world of rugby. In 2015, World Rugby introduced a cloud-based technology developed by CSx into the Head Injury Assessment (HIA) process. This system collected neurocognitive information that medical staff could review to determine whether a player had suffered a concussion. The data on the players involved, incidents and medical assessments was transferred via an API to the data analytics platform Domo, where the various datasets were joined up in one single consolidated platform for further analysis. This new process introduced by World Rugby brought the estimated proportion of players allowed to continue playing after being concussed down from 56% to just 7%, while the chances of being removed from the game without being concussed only increased from 3% to 5%.

Source: The Times

How Are Unions And Clubs Managing The Relationship With Data And Technology In The Sport?

Successful rugby unions like New Zealand Rugby have started considering the balance between data and intuition. Their performance analysis department now operates in a highly dynamic technological environment where it provides its teams with the ability to quickly analyse data for performance insights. The All Blacks turned to SAS in 2013, when they adopted SAS Visual Analytics as their main reporting tool. It enabled them to establish a formal data management process that consolidated all real-time match data, post-match data and data retrieved from third-party providers in one unified and centralised platform.

New Zealand Rugby manages the relationship between players and technology by adopting the philosophy that, when it comes to match play, the players are the ones in control of the game, as they are the ones who see, hear and feel what is happening on the field. Technology is considered a supportive tool in the background that helps inform decisions by bringing context and evidence to conversations, but never takes them over.

At England Rugby Union, head coach Eddie Jones addressed the significance of data prior to travelling to the 2019 World Cup in Japan. He suggested that data has played a key role in helping him see what is important and decide where to invest to build the strength of his squad. England Rugby benefits from an extensive analytics team that provides post-match analysis as well as real-time tactical suggestions to coaches during matches. The department has implemented a philosophy of always looking for the winning edge. For instance, they aim to discover winning trends such as the now well-established theory that an effective kicking game tends to lead to more successful match outcomes, a theory now considered a basic principle of the sport.

Moreover, Rugby Australia also entered the world of data analytics by partnering with Accenture to develop a bespoke high-performance unit (HPU) analytics platform, built on Accenture's Insights Platform (AIP), that consolidated all their data activities. The system placed sports data at the core of all the team’s management processes. As data ecosystems have become more complex, with numerous sources and purposes for different datasets, Rugby Australia was able to integrate data, deliver insights and enable users in a single platform, providing a smarter, more automated and more effective way to manage their data assets. Insights are now available to players and staff via a mobile app that provides clear visibility of a particular player’s performance and health, as well as allowing deep-dive exploration into highly detailed statistics about daily performance.

The growth of data management systems and processes has also extended beyond the unions. Over time, media outlets, consultancies, tech companies and clubs themselves have begun to gather larger amounts of match data in an attempt to develop big data capabilities. For instance, Accenture and RBS developed an analytical package for the 2017 Six Nations tournament that contained six million data points per match. IBM and the RFU performed a similar exercise by developing a predictive analytics tool, TryTracker, to forecast the outcome of a game by mining data from historical rugby matches obtained from Opta.

However, when it comes to professional clubs, data is increasingly custom-made by the clubs themselves to suit particular coaching philosophies and needs, as well as team-specific insights. Most clubs will receive data from third-party providers like Opta at a certain level of granularity, but will then gather their own internal data, often at a much deeper level. They create their own datasets in which they might analyse the technique of every single player in the team individually. For example, teams may track a more detailed view of their defence, detailing the dominance of each tackle. Coaches can also have an input into the data captured by providing their expert insights as additional data points. Analysts will incorporate the coaches’ perceived effectiveness or quality of a given action by a player as a categorical variable in the dataset (i.e. positive or negative movements according to effectiveness in performing a set of moves).

Have Data and Technology Been Fully Accepted In The Sport?

In December 2019, a study by Andrew Manley and Shaun Williams from the University of Bath triggered a new debate over whether the essence of the sport (i.e. the enjoyment of the players) seen during the amateur era and the early professional years has been lost. Players are, allegedly, increasingly concerned that modern technology gives clubs greater surveillance over them and adds to the pressure to perform.

OVER-EXPOSURE THROUGH TECHNOLOGY

The qualitative study by Manley and Williams interviewed 10 professional rugby players and asked them about their experience with data and technology at their club. Like many others at the elite level, their club used a series of devices such as laptops, camcorders, GPS devices, heart rate monitors, body fat recordings, mood score sheets, iPhones/iPads and mobile apps to map, track and monitor individual performances and player wellbeing. Data from these devices was collected by analysts and matched against the team’s key performance indicators. Analysts and coaches would then assess each player’s performance and set appropriate improvement plans. Once collected and validated, the data was published in the club’s mobile app for players to access. According to the players interviewed, the open exposure of individual statistics created a climate of fear of public embarrassment when failing to meet personal performance indicators.

A CHANGING CULTURE

The club had also developed a global Work Efficiency Index for each player, derived from 70 different variables describing a player’s positive and negative actions and physical condition. The club’s use of this new metric extended all the way to contract negotiations. This raised serious concerns from players, who often failed to understand how to improve their Work Efficiency Index and thus became suspicious that the results were being manipulated to suit the management’s rhetoric at any given time. Players started to obsess over the metric, prioritising it above their individual contribution to overall team performance. On the field, they also became risk averse to avoid negatively impacting the specific stats defined by the club. They feared being called out by coaches and judged by teammates during post-match reviews. Even then, performing well in individual stats had other negative effects on team dynamics. Players with positive individual stats had an incentive to take it easy and ignore the additional contribution they could bring to the team once they had ticked all the boxes.

INVASIVENESS OF CONSTANT MONITORING

Players also found the introduction of technology to be invasive in nature. The mobile app used by their club involved continuous monitoring of their activities and sent frequent notifications and reminders to players’ phones. Its features included the monitoring of weight management, and the club had even introduced fines for players who failed to meet the body weight targets set in the system. Additionally, the new machine mentality at the club had coaches increasingly turning to technology to zoom in on the deficiencies affecting individual and team performance, in response to the pressures of a growing fan base and the increasing commercial interests of owners and sponsors demanding accelerated title success. Players felt that the excessive use of technology had introduced Big Brother-style surveillance and was being used as a coercive method of ensuring that players meet institutional objectives. Data and technology had simply become standard practice in the elite coaching of modern rugby. However, players felt that these unrelenting practices of constant monitoring had harmful consequences for their playing and private lives, as well as their relationships with coaches, which had not yet been addressed. In their interviews, they argued that technology has enabled coaches to formalise a regime of power, at the risk of turning the humanistic approach of coaching into pure data engineering.

PURISTS VERSUS OBJECTIVISTS

Other critics of the use of technology argue that Rugby Union is losing its way due to data. According to them, the individual wizardry and innate empathy created by the unpredictability of the game are suppressed by the digital data profiles created by analysts and coaches that players are constantly trying to meet. The researchers in the study argued that data is taking intelligence, creativity and human connection away from the sport through mechanistic and restrictive routines imposed on players. As players become more risk averse, predictable and formulaic, a culture without instinct, emotion and unpredictability is introduced into the sport, inevitably making it less attractive to fans. This culture, according to the researchers, encourages individualism over team dynamics and incites anxiety amongst players by throwing large amounts of data at them and pressuring them to perform to the stats. This has become detrimental to their enjoyment and performance in the game.

While he has recently praised the significance of data in achieving success, England coach Eddie Jones has also expressed concerns that grassroots levels are producing players who lack dimension. He stated that academy players are now coached to regimentally follow a game plan rather than react to dynamic and unpredicted events in a game; they are decision followers rather than decision makers. The study claimed that the surge in technological practices, to the detriment of players and the game, has also been accelerated by the new generation of head coaches entering top-division clubs. This group of coaches are former players who have only known Rugby Union as a professional sport and who feel the need to keep up with technology so as not to fall behind. They prioritise control over players through procedural management at the expense of the educational aspects of the job.

DATA OWNERSHIP AND SECURITY

Data ownership has also become a key concern for players. Even prior to the launch of the GDPR regulations, legal proceedings had been discussed between players and their respective clubs on this matter. Their main concern related to data accrued by clubs and unions through GPS units and other performance measurement devices, including data on a player’s medical history, such as injuries. They wanted to prevent clubs from using their data without their consent, or even selling it to third parties, which could have detrimental effects on their careers and future earnings. The International Rugby Player Association addressed this issue by pushing for personal statistical data relating to a player to be owned by the player themselves, who should also receive any benefits that may arise from the commercialisation of such data.

Player statistics may be used in contract negotiations not only by a player’s current club but also by clubs interested in signing them in the near future. For instance, if a player’s performance in training has statistically declined (i.e. speed tests, work rate or lifting in the gym), that information could be valuable to a club interested in signing said player. However, the information at the club’s disposal may lack completeness and paint an imprecise picture of the player’s true value. For example, there is a lack of measurement of the soft skills a player can bring to a team, such as leadership and motivational impact on the rest of his teammates. Additionally, the security of private and confidential data stored at the club is also an area of concern. As larger amounts of complex player data are gathered and stored in club systems, the risk of a data breach also increases, particularly from phishing or hacking attacks. This means that clubs and backroom departments now face structural and procedural challenges relating to the way they manage and secure the vast amounts of data they collect, and must have sufficient know-how to identify and prevent any serious security gaps.

Teething Problems Of A Rapidly Growing Field

The experiences described by the players interviewed in the study reflect the eagerness in today’s big data society to make use of ever-evolving technological advancements. Everything is turned into data in order to be objectively understood. However, one of the most important conclusions of the study is that a lot of the data used in professional Rugby Union lacked relevance. Instead of aiming to capture as many variables as technology allows, a smaller amount of substantially more meaningful data should be made available to players. That is not to say that conclusions should be drawn from insufficient data samples. Another important issue in the application of analytics to Rugby Union, particularly at international level where fewer matches are played, has been the generation of insights from sample sizes too sparse and small to support predictions. Focus should be placed on collecting and analysing a large enough sample of the data identified as truly meaningful for player and team development towards achieving excellence in the game.

Practical application should be placed at the core of any consideration for using data and technology. There have been numerous studies on different aspects of the game, but more often than not these have dubious practical application or usefulness in coaching practice. For instance, a study concluding that shared experience between players within the same team is correlated with better outcomes may have minor practical application for coaches, as it is rare or difficult to buy shared experience and there is little a coach can do in that regard. Instead, analysts should look at performance patterns and trends rather than one-dimensional statistics, such as ratios or frequency counts. For example, analytical studies should aim to identify the trends that develop before tackles are missed, helping coaches and players identify the root flaws within a team’s defensive pattern.

The use of data in the sport should advance into true rugby analytics and deep intelligence by effectively and meaningfully using the data available. Analysts should aim to fully understand what the team is trying to achieve and then identify the metrics that influence those goals. This will allow them to inform decisions that impact performance and change behaviours. Since context is key, it should become the central piece of most analytical work; without it, data insights presented to coaches lack value and practicality.

The role of analytics and technology is only going to grow further, with new technology constantly coming into Rugby Union. This places increasing demand on people who can process vast amounts of data and come up with relevant analysis while not losing touch with the nature of coaching practice in the game. While questions can be raised about the appropriate use of data analysis in defining and optimising team performance today, there is no doubt that technology has opened the door to a wide range of developments that have evolved the jobs of coaches and players. While the study by Manley and Williams exposes some concerns about how data is being applied at club level, it is also true that player wellbeing (i.e. concussion prevention) has seen substantial improvement with the aid of technological advancements. The idea of data analysis is not to replace all other aspects of the coaching practice but to combine the coaches’ experience and intuition with video and data analysis to help inform decisions on training priorities, team selection and tactics, and, longer term, on player recruitment and retention. There is an important place for technology and data in the sport but, like everything, a healthy balance needs to be established where data and intuition strongly complement each other.

Citations:

  • Barbaschow, A. (2019). New Zealand All Blacks balances data analytics with 'living in moment' of match. ZD Net Online. Link to article.

  • Braue, D. (2018). Rugby Australia taps big data to improve player performance. IT News Online. Link to article.

  • Cameron, I. (2019). Rugby Union legal battle brewing as players set to fight for right to 'data'. Rugby Pass. Link to article.

  • Carter, C. (2015). 27 August 1995: Rugby Union turns professional. Money Week Online. Link to article.

  • Creasey, S. (2013). Rugby Football Union uses IBM predictive analytics for Six Nations. ComputerWeekly.com. Link to article.

  • Dawson, A. (2017). How GPS, drones, and apps are revolutionizing rugby. Business Insider Online. Link to article.

  • Gerrard, B. (2015). Rugby Union analytics – five ways data is changing the sport. The Guardian Online. Link to article.

  • James, S. (2015). Statistics and data analysis are important in rugby team selection, but nothing beats personal opinion. The Telegraph Online. Link to article.

  • Katwala, A. (2019). Smart gumshields are monitoring rugby concussions. Wired Online. Link to article.

  • Leadbeater, S. (2019). How Big Data & Artificial Intelligence are having a positive impact in the sport of Rugby Union. Think Big Business Online. Link to article.

  • Macaulay, P. (2019). World Rugby turns to data analytics to tackle concussion risk. Computer World Online. Link to article.

  • Manley, A. & Williams, S. (2019). ‘We’re not run on Numbers, We’re People, We’re Emotional People’: Exploring the experiences and lived consequences of emerging technologies, organizational surveillance and control among elite professionals. Organization, 1-22. Link to study.

  • Rees, P. (2020). Is rugby union losing its way by becoming a numbers game? The Guardian Online. Sports: Rugby Union. Link to article.

  • Rees, P. (2020). Body fat recordings and mood scores: has technology gone too far in rugby? The Guardian Online. Sports: Rugby Union. Link to article.

  • Streeter, J. (2019). Catapult elevates use of data with all-new Rugby Suite. Insider Sport Online. Link to article.

  • Watt, D. (2019). Five things that business leaders can learn from England Rugby. Director Online. Link to article.

History Of Performance Analysis: The Controversial Pioneer Charles Reep

Thorold Charles Reep was born in 1904 in the small town of Torpoint, Cornwall, in the south west of England. At the age of 24, he joined the Royal Air Force to serve as an accountant, where he learned the mathematical skills and attention to detail that he went on to employ throughout his career. During World War II he was deployed to Germany, and he would eventually reach the rank of Wing Commander.

Thorold Charles Reep (1904-2002) - Source: The Sun

From a young age, Reep was a faithful supporter of his local club Plymouth Argyle and would frequently attend matches at Home Park Stadium. However, his relocation to London after joining the Royal Air Force gave him the opportunity to attend Tottenham Hotspur and Arsenal matches. In 1933, Arsenal’s captain Charles Jones came to Reep’s camp to talk about the analysis of wing play being used by the London club, which emphasised the objective of wide players quickly moving the ball up the pitch. The talk deeply inspired Reep, who soon became a keen enthusiast of Arsenal’s manager Herbert Chapman and his attacking style of football. This was the start of Reep’s passion for attacking football and its adoption across the country.

Arsenal FC 1933 squad including Herbert Chapman and Charles Jones - Source: Storie Di Calcio

In March 1950, during a match between Swindon Town and Bristol Rovers at the County Ground, Reep became increasingly frustrated during the first half of the match by Swindon’s slow playing style and continuously inefficient scoring attempts. He took his notepad and pen out at half time and started recording some rudimentary actions, pitch positions and passing sequences with outcomes using a system that mixed symbols and notes to obtain a complete record of play. He wanted to better understand Swindon’s playing patterns and scoring performance and suggest any possible improvements needed to guarantee promotion. He ended up recording a total of 147 attacking plays by Swindon in that second half of their 1-0 win against Bristol.

Swindon Town vs Bristol Rovers 1950 Match Report - Source: Swindon Town FC

Using a simple extrapolation, Reep estimated that a full match of football would consist of an average of 280 attacking moves, with an average of 2 goals scored per match. This indicated an average conversion rate of only 0.71% (2 goals from 280 attacking moves), suggesting that only a small improvement, roughly one extra attack converted per match, was needed for a side to increase their average from 2 to 3 goals per game.
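
The arithmetic behind this extrapolation is simple enough to verify directly; the figures below are taken from the text and the calculation is ours.

```python
# Reep's back-of-the-envelope conversion rate, using the figures quoted above.
attacking_moves = 280   # estimated attacking moves per match
goals = 2               # average goals per match

conversion = goals / attacking_moves
target_conversion = 3 / attacking_moves

print(f"Current conversion: {conversion:.2%}")          # ~0.71% of attacks end in a goal
print(f"Needed for 3 goals: {target_conversion:.2%}")   # ~1.07%, i.e. one extra attack converted
```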

In the years that followed, Charles Reep quickly established himself as the first performance analyst in professional football, as he witnessed how the information he was collecting was being used to plan strategy and analyse team performance. He never stopped developing his theory of the game, watching and notating an average of 40 matches a season, with the analysis of each match taking him around 80 hours. He was often spotted recording match events from the stand at Plymouth's Home Park wearing a miner's helmet to illuminate his notebook, meticulously scribbling down play-by-play spatial data by hand.

In 1958, he attended the World Cup in Solna, near Stockholm, and produced a detailed record of the total number of goals scored, shots and possessions during the final. He wanted to provide an objective count of what took place in that match, away from opinions, biased recollections or a few single memorable events on the pitch. He produced a total of fifty pages of match drawings and feature dissection that took him over three months to complete.

Match between the domestic champions of England (Wolverhampton Wanderers) and Hungary league winners (Budapest Honved) in 1954. Stan Cullis declared his team as “champions of the world” after their 3-2 victory. This provoked a lot of criticism and inspired the creation of the official European Cup the following season - Source: These Football Times

The real-time notational system Charles Reep developed took him to Brentford in 1951. Manager Jackie Gibbons offered him a part-time adviser position to help the struggling side avoid relegation from the Second Division. With Reep’s help, Brentford managed to double their goals-per-match ratio and secure their place in the division by winning 13 of their last 14 matches.

The following season, his Royal Air Force duties moved Reep to Shropshire, near Birmingham. There he met Stan Cullis, at the time manager of the successful and exciting Wolverhampton Wanderers side. Cullis offered Charles Reep a similar advisory role at his club to the one he had successfully undertaken at Brentford. Reep brought with him not only the knowledge acquired from his analysis at Swindon and Brentford but also an innovative, real-time process that provided hand notations of every move of a football match, together with subsequent data transcription and analysis. As a strong believer in direct attacking football, Reep’s work only reinforced Cullis’ pre-established opinions of how the game should be played.

Stan Cullis, Wolverhampton Wanderers manager from 1948 to 1964 - Source: Solavanco

In his three and a half years at Wolves, Reep helped the club implement a direct, incisive style of play that involved very few aesthetics (i.e. skill moves) and instead took advantage of straightforward, fast wingers. Square passing by Wolves players became frowned upon by Cullis and the coaching team. During this time, the concept of the Position of Maximum Opportunity (POMO) began to emerge, describing the area of the opposition’s box into which crosses should be directed in order to increase the chances of scoring. Under the Reep–Cullis partnership, Wolves achieved European success in what was then the European Champions Cup competition.

In 1955, Charles Reep retired from the Royal Air Force and was offered £750 on a one-year renewable contract by Sheffield Wednesday to work as an analyst alongside manager Eric Taylor. He ended up spending three years at Sheffield Wednesday, achieving promotion from Division Two in his first season at the club. In his final season, his departure was triggered by the team’s disappointing results, with Reep pointing the finger at the club’s key player for refusing to buy into his long-ball playing system. During the remainder of his career, his direct involvement with clubs became far more sporadic. Nevertheless, he went on to help a total of twenty-three managers, from teams such as Wimbledon and Watford to the Norwegian national team, understand and adopt his football philosophy.

Over his years away from club roles, Charles Reep continued to investigate the relationships between passing movements, goals, games and championships, as well as the influence that random chance has on those variables. He was keen to continue developing his theory by summarising all the notes and records he had been collecting since 1950. During this analysis, Reep developed an interest in probability and the negative binomial distribution, which he applied to his dataset. His analytical methods eventually became public after he shared his notes with the News Chronicle and the magazine Match Analysis.

These publications demonstrated that Charles Reep had discovered insights into the game not previously analysed. Some of these suggested that teams usually scored on average one goal every nine shots, or that half of the goals scored came from balls recovered in the final third of the pitch. One of his most famous remarks was to suggest that teams are more efficient when they reduce the time spent passing the ball around and instead focus on lobbing the ball forward with as few passes as possible. He was a firm promoter of a quicker, more direct, long-ball playing style.

Reep followed a notational analysis method of dividing the pitch into four sections to identify a shooting area approximately 30 metres from the goal-line. This detailed in-event notation and post-event analysis enabled him to accurately measure the distance and trajectory of every pass. Amongst his findings, he discovered that:

  • It took 10 shots to score 1 goal

  • 50% of goals were scored from 0 or 1 passes

  • 80% of goals were scored from 3 or fewer passes

  • Regaining possession within the shooting area was a vital source of goal-scoring opportunities

  • 50% of goals came from breakdowns in a team’s own half of the pitch

Reep went on to publish his statistical analysis of patterns of play in football, based on data collected since the early 1950s, in the Journal of the Royal Statistical Society (Reep & Benjamin, 1968). In the paper, he analysed 578 matches to assess the distribution of passing movements and found that 99% of all plays consisted of fewer than six passes, while 95% of them consisted of fewer than four. These findings backed Reep’s belief in reducing the frequency of passing and possession time by moving the ball forwards as quickly as possible. He wanted the truth he believed he had discovered to dictate how teams played.

Manual notational analysis prior to the introduction of technology - Source: Keith Lyons

From his first analysis of the 1950 Swindon Town match against Bristol Rovers all the way to the mid-1990s, Charles Reep went on to notate and analyse a total of 2,200 matches. In 1973, Reep analysed England's 3-1 loss to West Germany in the 1972 European Championship to vigorously protest against the “pointless sideways” passing style adopted by the Germans. In that match, the Germans had outplayed the English with a smooth, passing style of football that was labelled at the time as “total football”. Reep attempted to argue against the praise this new passing style had received across the continent by implying that it lacked the attractiveness demanded by fans, as it made goal scoring a secondary objective in exchange for extreme elaboration of play. Instead, he pushed forward his own views on the use of long balls and suggested that, even though they found the intended player less frequently, they brought unquestionable gains. He stated that, based on his analysis, the chance-generation value of five long passes missed was equal to five of them made.

Swindon Town vs Bristol Rovers 1950 Programme - Source: Swindon Town FC

Most of Charles Reep’s analysis supported the effectiveness of a direct style of football, with wingers as high up the pitch as possible waiting for long balls. This approach to the game had a significant influence on the English national team between the 1970s and 1980s, when the debate over the importance of possession had become the central topic of conversation amongst FA directors. Reep, often described as an imperious individual intolerant of criticism, argued against the need for ball possession, contrary to the philosophy backed by the FA’s then technical director Allen Wade.

It was not until 1983, when Wade was replaced as technical director by his former assistant Charles Hughes – a strong believer in long-ball play – that Reep’s direct football ideology became the FA's explicit tactical philosophy for the English game. Hughes saw in Reep’s work an opportunity to redefine the outdated ideals of the amateur founders of the FA and introduce his own mandate across the whole English game. This mandate consisted of a style of play that focused on long diagonals and the physicality of players. As a result, technically gifted midfielders found themselves watching the ball fly over their heads as they struggled with overly physical challenges.

Charles Hughes, The FA’s former technical director of coaching - Source: The Times

Controversy And Criticism

Charles Reep’s simplistic methods have been, and continue to be, criticised by many football fans and analytics enthusiasts. One critic pointed out that while his study of passing distributions showed that almost 92% of moves consisted of fewer than 3 passes, only 80% of the goals in his dataset, not 92%, came from these short possessions. This contradicts Reep’s beliefs by illustrating that moves of 3 or fewer passes were in fact a less effective route to goal on a per-move basis. It also demonstrated that Charles Reep’s argument that most goals happened after fewer than four passes was simply due to the fact that most movements in football (92% in his dataset) are short possessions, so it is only to be expected that most goals would be scored in that manner.
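
A small worked example makes the critics’ point explicit. The counts below are illustrative rather than Reep’s actual figures, but they show why goal totals must be normalised by how common each sequence length is before judging which strategy is more efficient per attempt.

```python
# Illustrative counts only: 92% of moves are short, but they yield only 80% of the goals.
short_moves, long_moves = 9_200, 800
short_goals, long_goals = 80, 20

goals_per_short_move = short_goals / short_moves
goals_per_long_move = long_goals / long_moves

print(f"Goals per short move: {goals_per_short_move:.4f}")  # ~0.0087
print(f"Goals per long move:  {goals_per_long_move:.4f}")   # ~0.0250
# Per attempt, the longer sequences convert more often, even though most goals in
# absolute terms come from short sequences simply because short sequences dominate play.
```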

Similarly, his study did not appear to take into consideration differences in team quality. Evidence of this can be seen in the World Cup matches he analysed, which contained double the number of plays with seven or more passes compared with those he recorded from English league matches. This suggests that Reep missed the fact that a higher level of competition, such as the World Cup, with better players available, tends to produce longer passing moves than English league matches, where the average technical quality of players would be inferior. Furthermore, critics have also added that none of Reep’s analysis takes into consideration factors beyond playing style, such as the level of exhaustion imposed on the opposition by forcing them to chase the ball around through passing.

Reep’s character and very strong preconceived notions may have prevented him from investigating alternative hypotheses that did not agree with his philosophy of direct football. He was often described as an absolutist who wanted to push his one generic winning formula. As a result, most of Reep’s analysis ignored the numerous essential factors that can affect a match’s outcome. Critics have often labelled Reep’s influence on the philosophies applied to English football and coaching styles for over 30 years as “horrifying”, due to the fundamental misinterpretations Reep committed throughout his work. As previously stated, one of these consisted of applying the same considerations and level of weighting to a match involving an English Third Division team as to a match in the World Cup. He paid no attention to the quality of the teams involved, ignoring the potentially valid assumption that a technically poorer team may face greater risks when attempting to play possession football. Instead, he followed his own preconceptions, such as assuming that teams should always be trying to score, when in reality teams may decide to defend a scoreline advantage by holding possession.

Aside from the criticism of his poor methods and misinterpreted findings, Reep has also been recognised for the new approaches he introduced to the analysis of the game. He was one of the first pioneers to show that football had constant and predictable patterns, and that statistics give us a chance to identify what we would otherwise have missed. He initiated the thinking around the recreation of past performance through data collection, which could then inform strategies to achieve successful match outcomes. While he might not have been an outstanding data analyst, Charles Reep was a great accountant with great attention to detail and a remarkable ability to collect data.

The approaches he introduced have evolved significantly since Reep’s first notational analysis in 1950. Technologies and analytical frameworks developed since the 1990s have facilitated the emergence of video analysis and data collection systems to improve athlete performance. From the foundation of Prozone in 1995, offering high-quality video analysis, to the appearance of Opta Sports and StatsBomb as global data providers capturing millions of data points per match, the field of notational and performance analysis in football has evolved in line with the technological revolution of the last few decades. The popularity of big data and the growing desire for data-driven objectivity have become important priorities within professional clubs aiming to gain a competitive advantage in a game of increasingly tight margins. Reep’s work set in motion what is today an ecosystem of video analysis software, data providers, analysts, academia, data-influenced management decisions and redefined coaching processes that constitutes a key piece of modern football. While none of these elements can win a match on their own, they surely make crucial contributions in providing clubs with those smallest advantages that make the largest of differences.

Citations:

  • Instone, D. (2009). Reep: Visionary Or Detrimental Force? Spotlight On Man Whose Ideas Cullis Embraced. Wolves Heroes. Link.

  • Lyons, K. (2011). Goal Scoring in Association Football: Charles Reep. Keith Lyons Clyde Street. Link.

  • Medeiros, J. (2017). How data analytics killed the Premier League's long ball game. Wired. Link.

  • Menary, S. (2014). Maximum Opportunity; Was Charles Hughes a long-ball zealot, or pragmatist reacting to necessity? The Blizzard. The Football Quarterly. Link.

  • Pollard, R. (2002). Charles Reep (1904-2002): pioneer of notational and performance analysis in football. Journal of Sports Sciences, 20(10), 853-855. Link.

  • Pollard, R. (2019). Invalid Interpretation of Passing Sequence Data to Assess Team Performance in Football: Repairing the Tarnished Legacy of Charles Reep. The Open Sports Sciences Journal, 12, 17-21. Link.

  • Reep, C. & Benjamin, B. (1968). Skill and chance in association football. Journal of the Royal Statistical Society, 131, 581-585.

  • Sammonds, C. (2019). Charles Reep: Football Analytics’ Founding Father. How a former RAF Wing Commander set into motion football’s data revolution. Innovation Enterprise Channels. Link.

  • Sykes, J. & Paine, N. (2016). How One Man’s Bad Math Helped Ruin Decades Of English Soccer. FiveThirtyEight. Link.

A New Way Of Classifying Team Formations In Football

One of the most important tactical decisions in football is the choice of team formation, which determines the role of each player and the team’s playing style. Laurie Shaw and Mark Glickman from the Department of Statistics at Harvard University recently developed an innovative, data-driven way of identifying the different tendencies shown by managers when giving tactical instructions to their players, specifically around team formations. They measured and classified 3,976 observations of different spatial configurations of players on the pitch for teams with and without the ball. They then analysed how these formations changed throughout the course of a match.

While team formations in football have evolved over the years, they continue to rely heavily on a classification system that simply counts the number of defenders, midfielders and forwards (i.e. 4-3-3). However, Laurie and Mark argued that this system only provides a crude summary of player configurations within a team, ignoring the fluidity and nuances these formations may experience during specific circumstances of a match. For instance, when Jürgen Klopp prepares his formations at Liverpool, he creates a defensive version where all players know their roles and an offensive one that aims to exploit the best areas of the pitch. Liverpool therefore prepare different formations for different phases of the game; a detail that is lost when describing them as using a simple 4-3-3 formation.

Identifying Defensive And Offensive Formations

The researchers used tracking data to make multiple observations of team formations in the 100 matches analysed, separating formations with and without possession. By doing so, they identified a unique set of formations most frequently used by teams. These groups helped them classify new formation observations and then analyse major tactical transitions during the course of a match.

Source: Shaw, L. & Glickman, M. (2019) Dynamic analysis of team strategy in professional football. Barça Sports Analytics Summit.

The above diagram from Laurie and Mark’s study shows a defending team moving as a coherent block, with players retaining their relative positions; their formation is not defined by the positions of players on the pitch in absolute terms but by their positions relative to one another. Starting from the player in the densest part of the team, Laurie and Mark calculated the relative position of each player using the average angle and distance between that player and his nearest neighbour over a specific time period in the match, and then repeated the same process with the latter’s neighbour, and so on. By calculating the average vectors between all pairs of players in the team, they obtained the centre of mass of the team’s formation, which is then aligned with the centre of the pitch when plotting team formations.
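
As a rough illustration of this step, the sketch below (our reconstruction, not the authors’ code) averages each outfield player’s position over a window of hypothetical tracking frames, finds each player’s nearest neighbour and expresses the formation relative to the team’s centre of mass.

```python
# Hedged sketch of measuring a formation's shape from tracking data.
import numpy as np

# positions: (frames, players, 2) array of hypothetical x, y tracking coordinates.
rng = np.random.default_rng(0)
positions = rng.normal(size=(500, 10, 2)) + np.linspace(0, 40, 10).reshape(1, 10, 1)

mean_positions = positions.mean(axis=0)                # average spot of each player in the window
centre_of_mass = mean_positions.mean(axis=0)           # team centroid over the window
relative_formation = mean_positions - centre_of_mass   # formation shape, centred at the origin

# Nearest-neighbour relative vectors, the building block the authors chain together.
diffs = mean_positions[:, None, :] - mean_positions[None, :, :]
dists = np.linalg.norm(diffs, axis=-1)
np.fill_diagonal(dists, np.inf)
nearest = dists.argmin(axis=1)                          # index of each player's nearest neighbour
nn_vectors = mean_positions[nearest] - mean_positions   # relative vector to that neighbour
```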

Source: Shaw, L. & Glickman, M. (2019) Dynamic analysis of team strategy in professional football. Barça Sports Analytics Summit.

The researchers made multiple observations of a team’s defensive and offensive configurations throughout the match, aggregating the observations into two-minute intervals. For example, for the team in possession, they grouped all possessions within each two-minute window and measured the formation in each of those windows, and then repeated the same process for the team without possession over the same periods.
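
A minimal sketch of this windowing step, assuming tracking frames carry a timestamp in seconds and an in-possession flag (hypothetical column names), might look like this.

```python
# Hedged sketch: bin tracking frames into two-minute windows, split by possession state.
import numpy as np
import pandas as pd

n = 6_000
frames = pd.DataFrame({
    "t": np.arange(n) * 0.9,                                         # hypothetical timestamps (s)
    "in_possession": np.random.default_rng(1).integers(0, 2, n).astype(bool),
})
frames["window"] = (frames["t"] // 120).astype(int)                   # two-minute intervals

# One formation observation per (window, possession state); here we just count frames,
# but each group would be passed to the formation-measurement step sketched above.
observations = frames.groupby(["window", "in_possession"]).size()
print(observations.head())
```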

The diagram below shows a set of formation observations for a team during a single match, illustrating that the team defends with a 4-1-4-1 formation but attacks with three forwards and with the fullbacks aligning with the defensive midfielder. These observations also illustrate that while the defensive players remained compact, the movement of attacking players, such as the central striker, was more varied. The consistency across all the observations also suggests that the manager did not change formations significantly during the match.

Source: Shaw, L. & Glickman, M. (2019) Dynamic analysis of team strategy in professional football. Barça Sports Analytics Summit.

Grouping Similar Formations Together Into Five Clusters

Additionally, Laurie and Mark used agglomerative hierarchical clustering to identify the unique sets of formations that teams used in the 100 matches analysed, constituting 1,988 observations of defensive formations and 1,988 observations of offensive ones. To be able to group formations together, they first had to define a metric that established the level of similarity between two separate formations. The similarity between two players in two different formations is quantified using the Wasserstein distance between their two bivariate normal distributions, each with its own mean and covariance matrix; for two Gaussians this distance has a closed form that combines the squared L2 norm of the difference between their means with a term comparing their covariance matrices. However, an entire team’s formation consists of a set of 10 bivariate normal distributions, one for each outfield player. Therefore, to compare two different team formations the researchers calculated the minimum cost of moving from one set of distributions to the other using the total Wasserstein distance. The blue area in the diagram below indicates the number of players that deviate from the formation’s average position.
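
For illustration, the sketch below reconstructs the two ingredients just described: the closed-form squared 2-Wasserstein distance between two bivariate normals, and the minimum-cost comparison of two formations, here implemented as a linear assignment over the ten pairwise player distances. The function names are ours and this is not the authors’ implementation.

```python
# Hedged reconstruction of the formation-distance computation.
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import linear_sum_assignment

def w2_gaussians(mu1, cov1, mu2, cov2):
    """Squared 2-Wasserstein distance between two bivariate normal distributions."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    mean_term = np.sum((mu1 - mu2) ** 2)
    root = sqrtm(sqrtm(cov2) @ cov1 @ sqrtm(cov2))
    cov_term = np.trace(cov1 + cov2 - 2.0 * np.real(root))
    return float(mean_term + cov_term)

def formation_distance(means_a, covs_a, means_b, covs_b):
    """Minimum-cost matching of the 10 player distributions of two formations."""
    n = len(means_a)
    cost = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            cost[i, j] = w2_gaussians(means_a[i], covs_a[i], means_b[j], covs_b[j])
    rows, cols = linear_sum_assignment(cost)   # optimal pairing of players across formations
    return cost[rows, cols].sum()
```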

Laurie and Mark also found that two formations may be identical in shape, but one may be more compact than the other. In order to classify formations solely by shape and not by their degree of expansion across the pitch, they had to scale the formations so that compactness is no longer a discriminator in their clustering.

Once this was resolved, the hierarchical clustering applied to the dataset found the two most similar formation observations, based on the Wasserstein distance metric, and combined them into a group. It then merged the next most similar pair, forming more groups, and so on. This process identified 5 groups of formations, with each group containing 4 variant formations, producing a total of 20 unique formations.
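
Assuming a precomputed matrix of pairwise formation distances (for example from a function like formation_distance above), the grouping step can be sketched with standard agglomerative clustering tools; the placeholder distances and the choice of 20 clusters below are purely illustrative.

```python
# Hedged sketch: agglomerative clustering over a precomputed formation-distance matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Placeholder symmetric distance matrix standing in for real formation distances.
rng = np.random.default_rng(2)
pts = rng.normal(size=(40, 2))
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)

Z = linkage(squareform(dist, checks=False), method="average")   # hierarchical merge tree
labels = fcluster(Z, t=20, criterion="maxclust")                # cut into 20 formation clusters
print(np.bincount(labels)[1:])                                  # cluster sizes
```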

Source: Shaw, L. & Glickman, M. (2019) Dynamic analysis of team strategy in professional football. Barça Sports Analytics Summit.

The first group of formations corresponds to 17% of all observations in the sample of Laurie and Mark’s study. What the four variants in this first group have in common is that there are five defenders, with variations in the number of midfielders and forwards. This group of formations was most predominant in defensive situations, with between 73% and 88% of its observations coming from teams without possession.

Source: Shaw, L. & Glickman, M. (2019) Dynamic analysis of team strategy in professional football. Barça Sports Analytics Summit.

Group 2 and Group 3 share the commonality of having 4 defenders, with Group 2, in the second row, consisting of more compact midfields, as opposed to the more expanded midfields of Group 3 formations.

Source: Shaw, L. & Glickman, M. (2019) Dynamic analysis of team strategy in professional football. Barça Sports Analytics Summit.

Group 4 contained predominantly attacking formations consisting of three defenders, where the wingbacks push high up the pitch, with variations in the structure of the midfield and forward line.

Source: Shaw, L. & Glickman, M. (2019) Dynamic analysis of team strategy in professional football. Barça Sports Analytics Summit.

Group 5 formations contained two defenders, with the fullbacks pushed up the field, some variation in the forward line of either two or three forwards, and different structures in the midfield. This group consisted entirely of offensive formation observations.

As illustrated by these groupings, the hierarchical clustering Laurie and Mark applied was very efficient in separating offensive and defensive formation observations, even after excluding the area of the formation (i.e. how compact it is) as a discriminator. Additionally, while some of these formations aligned with traditional ways of describing formations, such as 4-4-2 or 4-1-4-1, others did not clearly fall within these historical classifications. Once the formation clusters were identified, the researchers developed a basic model selection algorithm to categorise any new formation observation into one of these groups by finding the maximum likelihood cluster.
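
As a stand-in for that classification step, the sketch below assigns a new observation to the closest cluster template, which approximates the maximum-likelihood assignment the authors describe; assign_cluster and the (means, covariances) representation are our own hypothetical constructs.

```python
# Hedged sketch: classify a new formation observation against cluster templates.
import numpy as np

def assign_cluster(new_obs, templates, distance_fn):
    """new_obs and each template are (means, covariances) pairs; return the closest cluster index."""
    dists = [distance_fn(new_obs[0], new_obs[1], t[0], t[1]) for t in templates]
    return int(np.argmin(dists))

# Usage (with the hypothetical formation_distance sketched earlier):
# cluster_id = assign_cluster((obs_means, obs_covs), cluster_templates, formation_distance)
```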

Transitions Between Offensive And Defensive Formations

Laurie and Mark took their research a step further by evaluating how coaches pair the various defensive and offensive formations. In the diagram below, they illustrate that teams that defend with Cluster 2 frequently transition into an offensive formation like the one in Cluster 16, with the wingbacks pushing up. Also, half of the teams with the defensive formation in Cluster 9 tend to use the offensive formation in Cluster 10, while the other half transition to a formation similar to Cluster 18. This tells a clear story of how players transition from their defensive roles to their attacking roles. Moreover, it showed that some defensive formations allow more variety in the offensive formations they pair with than others.
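
The pairing analysis itself boils down to a transition table between defensive and offensive cluster labels. The sketch below uses made-up labels purely to show the shape of the computation.

```python
# Hedged sketch: tally how often each defensive cluster pairs with each offensive cluster.
import pandas as pd

obs = pd.DataFrame({
    "defensive_cluster": [2, 2, 9, 9, 9, 2],        # illustrative labels only
    "offensive_cluster": [16, 16, 10, 18, 10, 16],
})
transitions = pd.crosstab(obs["defensive_cluster"], obs["offensive_cluster"], normalize="index")
print(transitions)   # each row sums to 1: share of offensive formations used from that defensive one
```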

Source: Shaw, L. & Glickman, M. (2019) Dynamic analysis of team strategy in professional football. Barça Sports Analytics Summit.
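
One simple way to summarise such pairing tendencies is a normalised cross-tabulation of defensive cluster against offensive cluster per observation. The sketch below is illustrative only; the cluster labels and rows are made up, not taken from the paper.

```python
# Illustrative sketch of tabulating defensive-to-offensive formation pairings.
# The cluster labels and rows below are made up, not the paper's data.
import pandas as pd

observations = pd.DataFrame({
    "team":              ["A", "A", "B", "B", "C", "C"],
    "defensive_cluster": [2,   2,   9,   9,   9,   2],
    "offensive_cluster": [16,  16,  10,  18,  10,  16],
})

# Share of offensive formations used when defending with each cluster
transitions = pd.crosstab(
    observations["defensive_cluster"],
    observations["offensive_cluster"],
    normalize="index",
)
print(transitions)
```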

Tactical Match Analysis Through This Methodology

The methodology developed by Laurie and Mark allows teams to measure and detect significant changes in formation throughout a match. They were able to produce diagrams such as the one below to illustrate the formation changes, both defensive (diamonds) and offensive (circles), annotated with goals (top lines) and substitutions (bottom lines). The story of the match in the diagram shows a red team conceding a goal in the first half and then making a significant tactical change at half time, as well as a substitution. Laurie and Mark found this situation quite common: whenever there was a major tactical change, it was often accompanied by a substitution. Comparing this with other matches, they found that this particular red team made major tactical changes at half time in around a quarter of their matches, providing insight into how their manager reacts to given situations.

Source: Shaw, L. & Glickman, M. (2019) Dynamic analysis of team strategy in professional football. Barça Sports Analytics Summit.

In another diagram, they demonstrated how their methodology can also help study how changes in formation impact the outcome of a match. In this match, the blue team were predominantly attacking down the wings in the first half, with most of their high-quality opportunities coming from the right wing. In the second half, the red team changed their formation from four defenders to five, which reduced the attacks down the blue team's right wing and pushed them instead through the centre, presumably less congested since the red team now had two midfielders rather than three.

Source: Shaw, L. & Glickman, M. (2019) Dynamic analysis of team strategy in professional football. Barça Sports Analytics Summit.

Finally, this methodology also allows teams to establish the link between chance creation and formation structure. They can measure how far opposing players are from their preferred defensive structure (i.e. how far they are out of position) and, at the same time, quantify the level of attacking threat by assessing how much high-value territory the attacking team controls near the defending team's goal. These pitch control models enable the measurement of threatening positions even when no shot took place. Laurie and Mark suggest that this kind of analysis allows teams to better understand how the attacking team manoeuvres defenders out of their positions, or how it takes advantage of the defending team being out of position after a high press or a counterattack.

Citations:

  • Shaw, L. & Glickman, M. (2019) Dynamic analysis of team strategy in professional football. Barça Sports Analytics Summit. Link to paper

Automated Tracking Of Body Positioning Using Match Footage

A team of image processing experts from the Universitat Pompeu Fabra in Barcelona has recently developed a technique that identifies a player's body orientation on the field over time simply by using video feeds of a football match. Adrià Arbués-Sangüesa, Gloria Haro, Coloma Ballester and Adrián Martín (2019) leveraged computer vision and deep learning techniques to compute three probability vectors that, when combined, estimate the orientation of a player's upper torso from their shoulder and hip positioning, field view and the ball's position.

The researchers argue that, as football has evolved, orientation has become increasingly important in adapting to the increasing pace of the game. Previously, players often benefited from sufficient time on the ball to control it, look up and pass. Now, a player needs to orientate their body prior to controlling the ball in order to reduce the time it takes to play the next pass. Adrià and his team defined orientation as the direction in which the upper body is facing, derived from the area bounded by the two shoulders and the two hips. Due to their dynamic and independent movement, the legs, arms and face were excluded from this definition.

To produce this orientation estimate, they first calculated three separate estimates of orientation based on different factors: pose orientation (using OpenPose together with super-resolution for image enhancement), field orientation (the field view of a player relative to their position on the pitch) and ball position (the effect of the ball's position on a player's orientation). These three estimates were then combined with different weightings to produce a player's final overall body orientation.

1. Body Orientation Calculated From Pose

The researchers used the open source library OpenPose. This library allows you to input a frame and retrieve a human skeleton drawn over the image of each person within that frame. It can detect up to 25 body parts per person, such as elbows, shoulders and knees, and specifies the level of confidence it has in identifying each part. It can also provide additional data points such as heat maps and directions.

However, unlike in a close-up video of a person, in sports footage such as a football match players can occupy very small portions of the frame, even in full HD broadcast footage. Adrià and his team solved this issue by upscaling the image through super-resolution, an algorithmic method that increases image resolution by extracting details from similar images in a sequence to reconstruct other frames. In their case, the research team applied a Residual Dense Network model to improve the image quality of faraway players. This deep learning image enhancement technique helped the researchers preserve enough image quality for OpenPose to detect the players' faces in the clearer images. They were then able to detect additional points of the player's body and accurately define the upper-torso position using the points of the shoulders and hips.

Source: Arbues-Sangüesa, A.; Haro, G.; Ballester C. & Martin A. (2019) Head, Shoulders, Hip and Ball... Hip and Ball! Using Pose Data to Leverage Football Player Orientation. Barça Sports Analytics Summit.

Once the image quality issue was solved and the player's pose data was extracted through OpenPose, the direction in which a player was facing was derived from the angle of the vector drawn from the centre of the upper torso (the shoulder and hip area). OpenPose provided the coordinates of both shoulders and both hips, indicating the position of these points on a player's body relative to each other. From these 2D vectors, the researchers could determine whether a player was facing right or left using the x and y coordinates of the shoulders and hips. For example, if the shoulder angle returned by OpenPose is 283 degrees with a confidence of 0.64, while the hip angle is 295 degrees with a confidence of 0.34, the researchers would use the shoulder angle to estimate the orientation of the player due to its higher confidence. In cases where a player is standing parallel to the camera and the angles of the hips or shoulders cannot be established because the keypoints collapse onto the same coordinates in the frame, the researchers used the facial features (nose, eyes and ears) as a reference for the player's orientation, using the neck as the x axis.

Source: Arbues-Sangüesa, A.; Haro, G.; Ballester C. & Martin A. (2019) Head, Shoulders, Hip and Ball... Hip and Ball! Using Pose Data to Leverage Football Player Orientation. Barça Sports Analytics Summit.
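
As a rough illustration of that keypoint logic, the snippet below derives a facing angle from hypothetical OpenPose-style keypoints, picking whichever of the shoulder or hip pair was detected with higher confidence. The perpendicular-to-the-line convention and the example coordinates are assumptions for this sketch, not the paper's exact formulation.

```python
# Simplified sketch of deriving a facing angle from OpenPose-style keypoints.
# Each keypoint is (x, y, confidence); the pair (shoulders or hips) with the
# higher mean confidence is used, and the facing direction is taken as the
# perpendicular to that line -- a simplifying assumption.
import math

def line_angle(left, right):
    """Angle in degrees of the vector from the right to the left keypoint."""
    dx, dy = left[0] - right[0], left[1] - right[1]
    return math.degrees(math.atan2(dy, dx)) % 360

def torso_orientation(l_shoulder, r_shoulder, l_hip, r_hip):
    shoulder_conf = (l_shoulder[2] + r_shoulder[2]) / 2
    hip_conf = (l_hip[2] + r_hip[2]) / 2
    if shoulder_conf >= hip_conf:
        line = line_angle(l_shoulder, r_shoulder)
    else:
        line = line_angle(l_hip, r_hip)
    return (line + 90) % 360          # facing direction, perpendicular to the line

# Made-up pixel coordinates and confidences
print(torso_orientation((410, 220, 0.64), (398, 223, 0.61),
                        (409, 260, 0.34), (401, 262, 0.30)))
```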

This player and ball 2D information was then projected onto a top-down view of the football pitch to visualise player direction. Using the four corners of the pitch, the researchers could compute a 2D mapping that matched pixels from the match footage to pitch coordinates, alongside the keypoints derived from OpenPose. They were then able to clearly observe whether a player in the footage was facing left or right according to their model's pose results.

Source: Arbues-Sangüesa, A.; Haro, G.; Ballester C. & Martin A. (2019) Head, Shoulders, Hip and Ball... Hip and Ball! Using Pose Data to Leverage Football Player Orientation. Barça Sports Analytics Summit.
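
The projection from image pixels to pitch coordinates can be expressed as a homography. Below is a minimal sketch using OpenCV with made-up corner coordinates; the paper's own calibration procedure may differ.

```python
# Minimal sketch of projecting image pixels onto a top-down pitch model using
# a homography computed from the four pitch corners. Corner pixel positions
# and the player position are made-up values.
import cv2
import numpy as np

image_corners = np.float32([[102, 80], [1180, 95], [1250, 700], [30, 690]])   # pixels
pitch_corners = np.float32([[0, 0], [105, 0], [105, 68], [0, 68]])            # metres

H = cv2.getPerspectiveTransform(image_corners, pitch_corners)

player_px = np.float32([[[640, 400]]])             # shape (1, 1, 2) as cv2 expects
player_pitch = cv2.perspectiveTransform(player_px, H)
print(player_pitch)                                # [[[x_metres, y_metres]]]
```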

To trade a little precision for accuracy, the researchers grouped similar angles into a total of 24 orientation bins (i.e. 0-15 degrees, 15-30 degrees and so on), since there is little practical difference between a player facing 0 degrees and one facing 5 degrees.
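
A sketch of that binning step, assuming 24 groups of 15 degrees each:

```python
# Binning a continuous angle into the 24 orientation groups of 15 degrees each
def orientation_bin(angle_degrees: float) -> int:
    """Return a bin index 0-23 for an angle in degrees."""
    return int(angle_degrees % 360) // 15

print(orientation_bin(283))   # -> 18, i.e. the 270-285 degree group
```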

2. Body Orientation Calculated From Field View Of A Player

The researchers then quantified a player's field orientation by setting their field of view during a match to around 225 degrees. This estimate was mainly used as a backup in case everything else fails, since it is a less effective way to derive orientation than the pose-based method described previously. The player's field of view was transformed into probability vectors with values comparable to those of the pose orientation, based on the player's position on the pitch. For example, a right back close to the touchline will have their field of view reduced to about 90 degrees, as they are very unlikely to be looking off the pitch.

Source: Arbues-Sangüesa, A.; Haro, G.; Ballester C. & Martin A. (2019) Head, Shoulders, Hip and Ball... Hip and Ball! Using Pose Data to Leverage Football Player Orientation. Barça Sports Analytics Summit.

3. Orientation Calculated From Ball Positioning

The third estimate of player orientation was related to the position of the ball on the pitch. It assumes that players are affected by their position relative to the ball: players closer to the ball tend to be oriented towards it, while the orientation of players further away is less influenced by the ball's position. Each player is therefore allocated not only an angle relative to the ball but also a distance to it, which is converted into probability vectors.

Source: Arbues-Sangüesa, A.; Haro, G.; Ballester C. & Martin A. (2019) Head, Shoulders, Hip and Ball... Hip and Ball! Using Pose Data to Leverage Football Player Orientation. Barça Sports Analytics Summit.

Combination Of All Three Estimates Into A Single Vector

Adrià and the research team contextualised these results by combining all three estimates into a single vector, applying a different weight to each metric. For instance, they found that field of view contributed a much smaller proportion of the orientation probability than the other two metrics. The sum of the weighted vectors from the three estimates corresponds to the final player orientation, the final angle of the player. By following the same process for each player and drawing their orientation onto the image of the field, player orientation can be tracked for the duration of the match while players remain in frame.

In terms of accuracy, the method managed to detect at least 89% of all required body parts through OpenPose, and its left/right orientation classification achieved a 92% accuracy rate when compared with sensor data. The initial weighting of the overall orientation became 0.5 for pose, 0.15 for field of view and under 0.5 for ball position, suggesting pose data is the strongest predictor of body orientation. Field of view was the least accurate estimate, with an average error of 59 degrees, and could arguably be excluded altogether. Ball position performs well in estimating orientation, but pose orientation is a stronger predictor in terms of degree of error. However, the combination of all three outperforms the individual estimates.
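
The weighted combination itself can be sketched as below, assuming each estimate has already been expressed as a probability vector over the 24 orientation bins. The pose and field-of-view weights follow the values quoted above; treating the ball-position weight as the remainder is an assumption.

```python
# Sketch of combining the three orientation estimates, assuming each has been
# expressed as a probability vector over 24 bins of 15 degrees. Pose and field
# weights follow the values quoted above; the ball weight is assumed to be the
# remainder.
import numpy as np

W_POSE, W_FIELD = 0.5, 0.15
W_BALL = 1.0 - W_POSE - W_FIELD            # 0.35, an assumption

def combine(pose_p, field_p, ball_p):
    """Weighted sum of the three probability vectors, renormalised."""
    combined = W_POSE * pose_p + W_FIELD * field_p + W_BALL * ball_p
    return combined / combined.sum()

def final_angle(combined):
    """Centre (in degrees) of the most likely 15-degree orientation bin."""
    return int(np.argmax(combined)) * 15 + 7.5

# Illustrative inputs: three random probability vectors over 24 bins
rng = np.random.default_rng(1)
pose_p, field_p, ball_p = (rng.dirichlet(np.ones(24)) for _ in range(3))
print(final_angle(combine(pose_p, field_p, ball_p)))
```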

Some limitations the researchers found in their approach are the varying camera angles and video quality available from club to club, or even between teams within the same club. For example, matches from youth teams had poor-quality footage and camera angles that made it impossible for OpenPose to detect players at certain times, even when they were on screen.

Source: Arbues-Sangüesa, A.; Haro, G.; Ballester C. & Martin A. (2019) Head, Shoulders, Hip and Ball... Hip and Ball! Using Pose Data to Leverage Football Player Orientation. Barça Sports Analytics Summit.

Finally, Adrià et al. suggest that video analysts could greatly benefit from this automated orientation detection when analysing match footage, with directional arrows printed on the frame making it easier to identify cases where orientation is critical to developing a player or a particular play. The highly visual nature of the solution makes it easily understood by players when presenting them with information about their body positioning during match play, for both the first team and the development of youth players. This metric could also be incorporated into the calculation of the conditional probability of scoring a goal in various game situations, such as its inclusion in Expected Goals models. Ultimately, these advances in automatic data collection can relieve many performance analysts of hours of manual coding of footage when tracking match events.

Citations:

Arbues-Sangüesa, A.; Haro, G.; Ballester C. & Martin A. (2019) Head, Shoulders, Hip and Ball... Hip and Ball! Using Pose Data to Leverage Football Player Orientation. Barça Sports Analytics Summit. Link to article.

StatsBomb: Advanced Football Analytics Through An Interactive Platform

STATSBOMB is a UK-based football analytics and data visualisation company introducing common data analytics practices seen in business and tech to the world of football analytics. Through their recently launched (February 2019) STATSBOMB IQ data visualisation platform, they offer immediate access to valuable football insights covering all major leagues and players across the globe.

The company was founded in January 2017, after self-described data geek Ted Knutson - now CEO and co-founder of STATSBOMB - traded a decade in the sports betting industry to partner with Charlotte Randall - Chief Operating Officer - and "produce the best possible analytic toolset for football clubs to use in player recruitment, team analysis, and opposition scouting". What started as a blog sharing ideas about applied statistics in football turned into a reputable business collecting vast amounts of football data and offering an interactive visualisation platform, enabling them to establish a global customer base including major clubs, federations, media, broadcasters and gambling organisations. In their ambition to establish themselves as an industry leader, STATSBOMB recently acquired the Egypt-based sports data collection company ArqamFC, which gathers over 5,000 data points per match. Ted Knutson claimed that this move will allow them to offer double the number of data points of any other provider.

STATSBOMB’s new data visualisation platform STATSBOMB IQ is the latest pioneering move by the company. Their dashboards, charts and graphs follow a similar aesthetic, clarity and data blending to those displayed by Tableau, possibly the largest data visualisation package in tech. While most, if not all, charts come already built out-of-the-box, their interactivity and filtering tools allow for sufficient customization to answer a wide range of analytical questions.

Salah’s 2018/19 STATSBOMB Profile

Messi’s 2018/19 STATSBOMB Profile

The platform performs impressively when switching between the various sections, quickly displaying vast amounts of data on screen. From player radars to shot maps, shot distributions, defensive activity, xG trendlines, corner maps and even player comparisons showing similar or complementary skill sets, STATSBOMB IQ is a reliable and robust tool offering immediate access to a complete picture of the latest football data at the click of a button.

STATSBOMB IQ match dashboards: Barcelona vs Liverpool (1 May 2019) and Liverpool vs Barcelona (7 May 2019)

The company also offers consultancy services to ease users into their data tools and give them the right assets to navigate the platform. This assistance in interpreting their large dataset - they collect more than twice as many events per match as their competitors - is key to making the service digestible. That said, the easy navigation through clearly defined themes makes the platform quick to grasp. Some of these themes include:

  • Pressure: analysing how players and teams press and how they perform under pressure

  • Shooting: including the location of attacking and defending players to provide both attacking and shot defending insights.

  • Goalkeeping: detailed actions down to goalkeeper positioning and movements that can be tied to the insights gathered from the quality of the shot.

STATSBOMB player profiles: Cristiano Ronaldo (Serie A 2018/19) and Lionel Messi (La Liga 2018/19)

While the company does not intend to replace video analysis, it does emphasise how its data visualisation features can reduce the time analysts and coaches spend reviewing player and team footage during performance evaluations. By spotting the right patterns and trends in the data, a more focused approach to video analysis can be adopted, narrowing down the areas to investigate further. One thing is certain: their stunning data visualisations bring a refreshing approach to football analytics, providing invaluable insights and introducing tools to applied sports analytics that are more closely aligned with today's available technologies.

The STATSBOMB IQ platform

Performance Indicators in Football

Michael Hughes et al discussed in their 2012 article "Moneyball and soccer - an analysis of the key performance indicators of elite male soccer players by position" how team sports like football offer ideal scope for analysis thanks to the numerous factors and combinations, from individuals to teams, that can be used to identify performance influencers.

READ HUGHES M.D. ET AL'S FULL ARTICLE HERE

The article suggests that, in a sport like football, for a team to be successful each player must effectively undertake a specific role and set of functions based on the position they play on the field. Through a study carried out with 12 experts and 51 sport science students, they aimed to identify the most common performance indicators that should be evaluated in a player's performance based on their playing profile. They started by defining the following playing positions in football:

  • Goalkeeper

  • Full Back

  • Centre Back

  • Holding Midfielder

  • Attacking Midfielder

  • Wide Midfielder

  • Striker

Each performance indicator identified by position was then categorised into the following five categories:

  • Physiological

  • Tactical

  • Technical - Defensive

  • Technical - Attacking

  • Psychological

Through group discussions between the experts and the sport science students, they came up with the following traits required for each of the above positions.

Source: Moneyball and soccer by Michael Hughes et al (2012)

The study identified that most performance indicators for outfield players were the same across positions, with only the order of priority of each PI varying by position. Only goalkeepers had a different set of PIs from every other position. While this classification of skills by position was arrived at subjectively (i.e. through group discussion), it is a good first step towards the creation of techno-tactical profiles based on a player's position and functions on the field, as pointed out by Dufour in 1993 in his book 'Computer-assisted scouting in soccer'. The above table provides a framework within which coaches and analysts can further evaluate the performance of players in relation to their position. However, tactics and coaching styles or preferences may cause the order of priority of each PI within each category to vary by team. The article also suggests that a qualitative way of measuring the level of each performance indicator should be used to evaluate a particular player.

The above suggests that positions may play a key role when assessing performance in football. From a quantitative perspective, when analysing performance indicators to determine success or failure, or even to establish a benchmark to aim for, there are several metrics an analyst will look to gather through notational analysis (a sketch of computing a few of these from raw event counts follows the list):

Technical:

  • Shooting game

    • Total number of goals

    • Total number of shots

    • Total number of shots on target

    • Total shot to goal scoring rate (%)

    • Total shot on target to goal scoring rate (%)

    • Shots to goal ratio

    • Shots on target to goal ratio

    • Total number of shots by shooting position (ie. inside the box)

    • Total number of shots by shot type (ie. header, set piece, right foot, etc.)

    • xG (read more)

  • Passing game

    • Total number of passes

    • Total pass completion rate (%)

    • Total number of short passes (under X metres away)

    • Total short pass completion rate (%)

    • Total number of long passes (over X metres away)

    • Total long pass completion rate (%)

    • Total number of passes above the ground

    • Total chip/cross pass completion rate (%)

    • Total number of passes into a particular zone (ie. 6 yard box)

    • Total zone pass completion rate (%)

    • Pass to Goal ratio

    • Total number of unsuccessful passes leading to turnovers (ie. interceptions)

    • Total pass turnover rate (%)

  • Defensive game

    • Total number of tackles

    • Total number of tackles won

    • Total tackle success rate (%)

    • Total number of tackles in the defensive third zone

    • Total number of tackles won in the defensive third zone

    • Total number of fouls conceded

    • Total number of fouls conceded leading to goals conceded (after X minutes of play without possession)

    • Total number of pass interceptions won

    • Total number of possession turnovers won

Tactical:

  • Attacking

    • Total number of set pieces

    • Total number of attacking corners

    • Total number of free-kicks (on the attacking third zone)

    • Total number of counterattacks (ie. based on X number of passes between possession start in own half to shot)

    • Average duration of attacking play (from possession start to shot)

    • Average number of passes per goal

  • Possession

    • Total percentage of match possession (%)

    • Total percentage of match possession in opposition's half

    • Total percentage of match possession in own half

    • Total number of possessions

    • Total number of non-shooting turnovers

    • Ratio of possessions to goals

    • Total number of passes per possession

    • Total number of long passes per possession

    • Total number of short passes per possession

  • Defensive

    • Total number of clearances

    • Total number of offsides by opponent team

    • Total number of corners conceded

    • Total number of shots conceded

    • Total number of opposition's passes in defensive third zone

    • Total number of opposition's possessions entering the defensive third zone

    • Average duration of opposition's possession
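
As mentioned before the list, many of these indicators are simple counts and ratios derived from notated events. The snippet below is a minimal sketch of computing a few of the shooting and passing metrics, assuming a hypothetical event list with `type` and `outcome` fields; no real provider's schema is implied.

```python
# Minimal sketch of computing a few of the above shooting and passing metrics
# from a hypothetical list of notated events with "type" and "outcome" fields.
# The event schema here is an assumption, not a real provider's format.
events = [
    {"type": "shot", "outcome": "goal"},
    {"type": "shot", "outcome": "on_target"},
    {"type": "shot", "outcome": "off_target"},
    {"type": "pass", "outcome": "complete"},
    {"type": "pass", "outcome": "incomplete"},
    {"type": "pass", "outcome": "complete"},
]

shots = [e for e in events if e["type"] == "shot"]
goals = [e for e in shots if e["outcome"] == "goal"]
on_target = [e for e in shots if e["outcome"] in ("goal", "on_target")]
passes = [e for e in events if e["type"] == "pass"]
completed = [e for e in passes if e["outcome"] == "complete"]

print("Total shots:", len(shots))
print("Shots on target:", len(on_target))
print("Shot-to-goal rate: %.1f%%" % (100 * len(goals) / len(shots)))
print("Pass completion rate: %.1f%%" % (100 * len(completed) / len(passes)))
```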

It is important to note that teams may adapt both their tactics and style of play based on the various circumstances they face in a game. For example, a team scoring a winning goal in the last 10 minutes may choose to give up possession in order to sit back in their defensive third for the remainder of the game. When using quantitative analysis to determine success or failure against a performance indicator, it is important to take context into consideration for a more complete and accurate analysis.

Notational analysis: a synonym of today's performance analysis

While motion analysis and biomechanics constitute important areas of performance analysis, one of the most popular and fundamental elements of performance analysis in sport is notational analysis. Notational analysis is the identification and analysis of critical patterns and events in a performance that lead to a successful outcome. Hughes (2004) defined notational analysis as "a procedure that could be used in any discipline that requires assessment and analysis of performance". The information used for notational analysis is usually gathered by observing a team's performance in a competitive environment. By notating numerous events that take place on the pitch, such as striker positioning, defenders' tackle success rate or midfielders' pass completion rate, an analyst can identify strengths and weaknesses and provide these results to coaches, who then use them to adapt training sessions or share accurate feedback with players and the entire team.

The importance of notational analysis comes from the limited recall that coaches, as human beings, have of the specifics of their teams' performances, and how these recollections can be biased by their beliefs and other motives. As Hughes and Franks described, by receiving objective data on what happened during a game, a coach can make more informed decisions, enhancing their ability to accurately assess the events of a game and improving the quality of feedback they are able to provide to the players. A big miss by a striker might be recalled by coaches and other players more vividly than the same striker's effective positioning or successful dribbling in the same game. At a professional level, we often hear pundits and fans rate a player's performance in a game based on a small number of noticeable actions, such as a missed penalty or a defender's mistake that led to a one-on-one chance for the opposition. However, through notational analysis, a more complete view of that player's performance may provide a more accurate perspective on their contribution to the game and inform any future decisions towards that player, such as training structure or upcoming match selection.

Different teams in different sports will define their own frameworks of performance indicators that allow them to identify the areas of the game they are most interested in evaluating. This means there is a wide range of information captured today through notational analysis, depending on the environment the analyst is working in. This is in part due to the lack of a common set of performance indicators identified as the key to sporting success, particularly in team sports where it is practically impossible to account for every single event that could lead to winning a match. One football team may consider percentage of shots on target, possession percentage and pass completion rate to be the performance indicators to benchmark themselves against for a game, while a different team in the same sport may prefer possession percentage in the opposition's final third, defensive tackles won and total number of shots. As Hughes stated in 2011, while all of these may be valid information to collect, the lack of a common framework across sport may be slowing down the research and analysis needed to develop notational analysis further.

There are certain challenges in notational analysis, particularly when it comes to live events. A single analyst notating events and patterns in real time may be subject to human error or miss certain actions. This is why most sports statistics companies and elite sporting organisations employ several analysts to collect the same performance indicators for a live game, allowing notated statistics to be compared between analysts in order to improve the accuracy of the data collected. Another challenge of the notational analysis process is subjectivity, where events with a certain degree of ambiguity may be captured differently by different analysts. While notational analysis aims to add objectivity to the evaluation of a team's performance by quantifying events, the definition of such events may shift depending on the interpretation the analyst capturing the event gives to that action.
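
Comparing notated statistics between analysts is essentially an inter-rater reliability check. One possible (assumed, not prescribed) way to quantify it is Cohen's kappa over matched event labels:

```python
# Hedged sketch of checking inter-analyst agreement on notated events using
# Cohen's kappa as one possible reliability measure (labels are made up).
from sklearn.metrics import cohen_kappa_score

analyst_a = ["tackle", "pass", "shot", "tackle", "pass", "clearance"]
analyst_b = ["tackle", "pass", "shot", "foul",   "pass", "clearance"]

print("Agreement (Cohen's kappa):", round(cohen_kappa_score(analyst_a, analyst_b), 2))
```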

Over the last two decades, a large number of new technologies have advanced the methods and effectiveness of notational analysis in sport. While traditional analysts often used a pen and a notepad to notate the various events they considered relevant, technologies like Opta, Dartfish and Sportscode have become central assets for notational analysts in the industry. The use of a video camera and video analysis software can now provide analysts with a wide range of features and tools to collect as much information as they require to assess performance against specific performance indicators.

 

Performance Indicators in Rugby Union

In 2012, Michael T Hughes, Michal M Hughes, Jason Williams, Nic James, Goran Vuckovic and Duncan Locke wrote an insightful journal article discussing performance indicators in rugby union during the 2011 World Cup. They gathered various materials from professional analysts working for coaches and players at the World Cup, and verified the reliability and accuracy of their data against video footage from different matches.

READ FULL JOURNAL HERE

This research study analyses the influence of the following key performance indicators on the final outcome of a game:

Scoring Indicators: 

  • Points scored

    • Total points scored in WWC 2011

    • Points scored per game

    • Points scored against Tier A teams

    • Points scored per game against Tier A teams

  • Tries scored

    • Total tries scored in WWC 2011

    • Tries scored per game

    • Tries scored from set pieces

    • Percentage of tries scored from set piece

    • Tries scored from set pieces per game

    • Tries scored from broken play

    • Percentage of tries scored from broken play

    • Tries scored from broken play per game

Quality Indicators:

  • Total Possession - Times and Productivity

    • Minutes that ball is in play in the match

    • Rest minutes in the match

    • Minutes with possession in the match

    • Percentage of time with possession

    • Number of possessions in the match

    • Minutes per possession

    • Minutes of possession per point scored

    • Number of possessions per point scored

    • Minutes of possession per try scored

    • Number of possessions per try scored

    • Total number of line breaks

    • Total number of line breaks per game

    • Minutes of possession per line break

    • Number of possessions per line break

    • Total number of set piece line breaks

    • Total number of set piece line breaks per game

    • Percentage of set piece line breaks

    • Total number of broken play line breaks

    • Total number of broken play line breaks per game

    • Percentage of broken play line breaks

    • Number of phases in the match

    • Percentage of phases per possession

    • Attacking penalties won

  • Attacking Possession

    • Number of possessions in opposition's 22 line

    • Number of converted possessions in opposition's 22 line

    • Percentage of converted possessions in opposition's 22 line

    • Number of points from opposition's 22 line

    • Number of points from opposition's 22 line per game

    • Number of points per possession in opposition's 22 line

  • Kicking game

    • Total number of kicks at goal

    • Total number of kicks converted

    • Percentage of kicks converted

    • Penalties conceded

While these key performance indicators of a rugby union game or tournament can be useful to summarise some elements of a team's performance, M. Hughes et al (2012) found that there was little correlation between each individual metric, or set of metrics, and the final outcome of the 2011 World Cup. For example, France was identified as one of the worst-performing teams on most of these metrics, yet they were the runners-up of the tournament.

The paper also touches on the challenges of individual player performance analysis in rugby union. Due to the nature of the sport, each specific position on the field requires its own set of performance indicators. The study suggests analysing an individual's performance against common key performance indicators and using that individual's performance profile to run intra-position comparisons (Hughes et al, 2012). This also leads to the creation of position profiles, where the strengths and weaknesses of players in each position can be identified. It is also suggested that individual player profiles should be set in the context of the team's profile as well as the opposition's strengths and weaknesses, as these elements will affect a player's performance profiling.

As in most team sports, randomness and luck can play a big part in the final outcome of a rugby union match. Therefore, predicting the performance of a team based on a few data points might not be enough to correlate it with the final performance achieved by that team. There are many complex interactions between teammates and opponents during a rugby union game which are difficult to account for with today's available statistics. However, studies like the one carried out by Hughes et al (2012) are another step towards narrowing down the best procedures for successfully applying analytics to rugby performance predictions and team sports in general.

The effect of GDPR on sports performance analysis

On 25th May 2018, the new General Data Protection Regulation (GDPR) came into force in the EU, significantly improving the control European citizens have over the personal data collected about them by third parties. While GDPR covers many complex areas around the collection, storage and transfer of personal data by third parties, the key points normally highlighted when discussing this regulation are that an individual must now provide consent before a third party collects data about them, and that the individual also has the right to request that the collected data be deleted at any point in time, as well as to revoke any consent previously given.

How does the new regulation affect sport organisations?

Like companies in any other industry, sports clubs and organisations are also required to reassess the data they collect from their fans, volunteers, employees and any other members of the club. No organisation that collects and stores the personal data of an EU citizen, even in sports, is exempt from fines of €20 million or 4% of annual turnover if found non-compliant.

One of the biggest changes a club now needs to manage is around data collected from fans, often used to increase fan engagement and deliver marketing campaigns to grow the club's fan base. Like marketing departments in many other organisations, clubs collect a wide variety of information about their customers, such as interests, personally identifiable information (PII), purchase history and any actions individuals take on websites or at physical events they attend, such as a football match. Clubs need to re-evaluate the level of consent they receive in order to continue storing and collecting all these data points about their fans and prospective supporters. Similarly, GDPR applies to the employer-employee relationship and data sharing, which means clubs will also require consent from players, coaches and members of staff.

Aside from evaluating their data management and applying new procedures, clubs will also need to be able to demonstrate compliance by updating and publishing their data privacy policies and the new processes they put in place for GDPR. This includes clearly explaining how individuals can request the data stored about them, update it or remove it altogether, as well as the steps to follow to revoke consent if they wish to do so.

And how does it affect Sport Performance Analysis?

Player profiling is one of the key tasks of a performance analyst. It can involve either evaluating your own players' performances or assessing the players of the rival club the team will face in their next fixture. An analyst gathers data on a player's recent performances, strengths, weaknesses and playing style to compile detailed reports to present to the coaching team.

In Article 22, GDPR tackles profiling directly, referring to it as building up a picture of the type of person someone is by evaluating certain personal aspects relating to a natural person's performance at work, economic situation, health, personal preferences, interests, reliability, behaviour, location or movements. However, the legislation also specifies that consent is only necessary if automated decision-making is applied based on this profiling, and only if such automated decision-making produces legal effects or similarly significantly affects the individual in question. This means that the simple task of profiling should not require the consent of the individual unless sensitive personal data, such as health, race or other special categories of data, is collected during profiling.

This suggests that player profiling in sports can be interpreted as not requiring the player's consent. Firstly, decision-making based on profiling in this scenario is not automated. Even though player profiles are compiled to make decisions on tactics, training session preparation or recruitment, there is always an element of human review of such profiles, usually by the coaching team, which could rule out classifying these processes as "automated decision-making" under the GDPR guidelines. Secondly, the profiling carried out by analysts should not have any legal effects or significantly affect the individual being profiled. The human intervention in reviewing these profiles also supports this argument, as no "automatic" effects are generated by this activity.

There is, however, a counter-argument worth considering, around the sensitive nature of the data used in profiling. Player profiling can include sensitive information about the player in question, particularly around his or her health. Injuries are bound to appear in the majority of player profiles generated by analysts, particularly if the goal is to optimise injury prevention. In such cases, the player is required to provide consent, as the profiling now contains sensitive data about that natural person. It is also worth considering the application of GDPR to the scouting of youth talent, where profiling is carried out by gathering data on minors and parental consent should be obtained. Collecting data from minors cannot be considered to have a legitimate basis without prior consent.

Navigating the complex world of GDPR is undoubtedly challenging for many teams and analysts. However, it is important to know the scenarios in which consent is required to produce a piece of analysis involving player data and when, as Article 6(f) states, there is a "legitimate interest" for collecting data without consent. Nevertheless, while consent might not always be required, it is always important to evaluate the scope, transparency and long-term purpose of the profiling process before assuming none is needed. This includes areas such as the player's right to object to their data being collected and to request the deletion of any previously collected data. One way or another, a performance analysis team now needs to consider the implementation of new processes around data management in their day-to-day roles.

What are Expected Goals (xG)?

Expected Goals, or xG, is the number of goals a player or team should have scored considering the number and type of chances they had in a match. It is a way of using statistics to provide an objective view of common commentary such as: "He shouldn't miss that!", "He's got to score those chances!", "He should have had a hat-trick!"

Goals in football are rare events, with just over 2.5 goals scored on average per game. Therefore, the historical number of goals does not provide a large enough sample to predict the outcome of a match. This means shots on target and total number of shots are used as the next closest statistics for predicting the number of goals. However, not all shots have the same likelihood of ending up in the back of the net.

This is where xG comes into play. Expected Goals uses various characteristics of the shots being taken together with historical data of such types of shots to predict the likelihood of a specific shot being scored. Since xG is simply an averaged probability of a shot being scored, a team or player may outperform or underperform their xG value. This means that they could be scoring chances that the average player would miss or that they could be missing chances that are often scored.

xG is often used to analyse various scenarios:

  • To predict the score of an upcoming match using historical data of the teams involved.

  • Assess a team’s or player’s “true” performance on a match or season, regardless of their short-term form or one-off actions on a pitch. It provides a data point on the number and quality of chances being created regardless of the final result.

  • Identify well-performing players in underperforming teams, or those who receive fewer playing minutes, by assessing which ones are more effective than the quality of the chances they receive would suggest.

  • Understand the defensive performance of a team by assessing how effectively they prevent the opposing team from scoring their chances.

Origin of the Expected Goals Model

In April 2012, Advanced Data Analyst Sam Green from sports statistics company Opta first explained his innovative approach to assessing the performance of Premier League goalscorers, inspired by similar models being used in American sports. However, it was not until the beginning of the 2017/18 season, when BBC's Match of the Day debuted the use of xG among its popular football pundits, that xG became a focal topic of conversation for many football fans.

Over the years, Opta has collected numerous data points on in-game actions across all of the top football leagues. When creating the xG model, Sam Green and the Opta team analysed more than 300,000 shots and a number of different variables using Opta's on-ball event data, such as the angle of the shot, assist type, shot location, the in-game situation, the proximity of opposition defenders and distance from goal. They were then able to assign an xG value, usually expressed as a percentage, to every goal attempt and determine how good a particular type of chance is. As new matches are played, new data is collected to continuously refine the xG model.

There is no single model for calculating xG. When looking at xG, it is important to remember that the value depends on the factors the analyst creating the model chooses to incorporate into the calculation. Since its release to the public, the xG concept has attracted considerable attention in the analytics community, with many enthusiasts adjusting the model in their own ways in an attempt to perfect it. This means there are now several different xG models out there, each considering different factors. Some consider whether the shot was taken with the feet or the head, others the situation that led to the shot, and so on, but the final predictions each model outputs have been shown to vary only slightly across models.

How is xG calculated?

Opta's xG model starts from the fact that the most basic requirement for scoring goals is taking shots. However, not all strikers score goals from the same number of shots. As Sam Green identified, in the 2011/12 season Van Persie needed only 5.4 shots to score a goal, while Luis Suarez took 13.8 shots for each goal he scored, even though both took the same number of shots per game they played.

This is why Opta decided to look deeper into the quality of the chances each striker received by adding the average location from which each shot was taken. However, they soon realised that location on its own was not enough. A chance from the penalty spot could come from a penalty kick, a header from a corner or a one-on-one against the goalkeeper, each with a very different likelihood of ending up as a goal. That is why Opta decided to incorporate additional data points into the model. Unfortunately, the exact model with all the factors Opta considers has not been made public, but a number of analysts have attempted to replicate or improve the model since its first release.
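
As an illustration of how an independent analyst might attempt something similar, below is a minimal sketch of an xG model fitted as a logistic regression over a few simple shot features. The features and synthetic data are assumptions; this is not Opta's model.

```python
# Minimal sketch (not Opta's model) of fitting an xG model as a logistic
# regression over simple shot features. Features, coefficients and the
# synthetic data below are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 5000
distance = rng.uniform(5, 35, n)            # metres from goal
angle = rng.uniform(5, 90, n)               # degrees of goal visible
header = rng.integers(0, 2, n)              # 1 if headed attempt

# Synthetic labels: closer, wider-angle, footed shots are scored more often
logit = 1.5 - 0.12 * distance + 0.02 * angle - 0.6 * header
goal = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([distance, angle, header])
model = LogisticRegression().fit(X, goal)

# xG of a 12-metre footed shot with a 40-degree view of goal
print(model.predict_proba([[12, 40, 0]])[0, 1])
```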

The xG model was designed to return an xG value for each player, team or chance, depending on the dimension the data is being analysed along: a full season, a particular match, a specific half of a game or a group of goal attempts. Say a player like Harry Kane takes 100 shots from chances that, based on historical Premier League data, have a probability of being scored of 0.202 (or 20.2%). Kane's xG value would be approximately 20 expected goals (100 shots x 0.202 = 20.2). This xG number would be an average over some 'big scoring chances' Kane took, such as penalties at 0.783 xG, other non-penalty shots inside the box with varying xG values such as 0.387 xG, and perhaps shots from outside the box at 0.036 xG. The model attempts to balance the number of shots a player takes with the quality of those chances. For example, a player may get into very dangerous attacking positions inside the box on 23 occasions, each with a high xG value, and score the same number of goals as a player who continuously tries his luck from outside the box with 81 shot attempts that carry a lower xG value.
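
The arithmetic behind those season totals is just a sum of per-shot probabilities, as in this tiny sketch with illustrative values:

```python
# Tiny worked example of the xG arithmetic described above: a season xG total
# is the sum of the scoring probabilities of each shot. Values are illustrative,
# mixing penalties, box shots and long-range attempts.
shots = [0.783] * 2 + [0.387] * 30 + [0.202] * 40 + [0.036] * 28

season_xg = sum(shots)
print(f"{len(shots)} shots, season xG = {season_xg:.1f}")

actual_goals = 25
print(f"Performance vs xG: {actual_goals - season_xg:+.1f}")
```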

Once an xG value has been calculated, a player's or team's performance can be evaluated by whether they over- or under-perform that value. In the above example, Harry Kane may actually score 25 goals during the full season, 5 goals above his 20 xG value, suggesting that his ability to convert chances is above average and that he can find the net in difficult scoring situations. Similarly, a player with a 20 xG value who has scored 15 goals is probably missing chances that he should have scored.

Opta took xG a step further and assessed the impact a player has on a specific chance through their shot quality. They did so by factoring into the xG calculation the propensity of the player's shot to hit the target, then comparing the original xG (Overall) value against this new xG (On Target) one. Their analysis showed that, at the time, Van der Vaart's shooting saw his xG increase from 6.9 xG to 10.3 xG(OT), suggesting that the shots he took were of higher quality than the xG calculated before he struck the ball would imply. Comparing xG(OT) with actual goals may also indicate how much a player was affected by the quality of the goalkeeping he faced. In the same season, Mikel Arteta scored 7 goals with just 3.5 xG(OT), suggesting he got 'luckier' in front of goal, as his shooting quality should have yielded only just over 3 goals.

xG(OT) can also be used in reverse to assess goalkeeping quality. Since it only considers shots on target, the keeper's involvement in these chances is crucial to the final outcome of the play. De Gea conceding 22 goals from 27 xG(OT) against him suggests he saved shots in situations where goals are normally conceded.
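
That goalkeeper reading of xG(OT) reduces to a simple difference, sketched below with the figures quoted above:

```python
# "Goals prevented" reading of xG(OT) for goalkeepers: expected goals on
# target faced minus goals actually conceded.
def goals_prevented(xg_on_target_faced: float, goals_conceded: int) -> float:
    return xg_on_target_faced - goals_conceded

print(goals_prevented(27.0, 22))   # +5.0 -> about five goals saved above expectation
```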

Why are Expected Goals important in today's football?

Luck and randomness influence results in football more than in almost any other sport. We have all seen teams dominated throughout a match still manage to score a last-minute winning goal despite having fewer chances than their opposition. But how sustainable is that? We have also seen world-class strikers lose form and go several games without finding the back of the net. Is the player failing to take advantage of the chances his teammates provide? xG allows us to assess the process behind the results of a match, or the performance of a player or team, by rating the quality of chances rather than just the actual outcome.

The most commonly used example of xG's usefulness is Juventus' 2015/16 season. Juventus won only 3 of their first 10 games, but the gap between their actual goals and their xG was considerable. This meant they were creating chances but not converting them, suggesting their negative run of results might not last if they got a little luckier in front of goal. Sacking manager Massimiliano Allegri at that point could have been a mistake: after matchday 12 their luck changed and they ended up winning the league title with games to spare.

xG gives us a more accurate way of predicting match outcomes than simply using individual stats. In the Premier League, only 71.6% of teams that had the most shots won the fixture, while close to 81% of teams that obtained a higher xG score won their games. It cuts through assumptions created by popular football tradition and provides a statistically grounded argument as to whether the performance of a player or team is above or below average, given a number of historical data points.

When using Expected Goals to see which players are scoring more or less than the numbers suggest they should, teams can scout promising, prolific goalscorers who consistently score more goals than the quality of their chances would predict. On the other hand, if a player surpasses his expected goals for a few games but has no history of doing so, it might come down to form and luck rather than goalscoring talent, and he might struggle to sustain it over a long period.

Limitations of the Expected Goals model

The xG model is only as good as the factors fed into its calculations, and those inputs are limited by the data we possess today from companies such as Opta. Other factors, such as shot power, curl or dip on the shot, or whether the goalkeeper is unsighted or off balance, might not be considered by most xG models out there. Because the model is based on averages, the random nature of a football match and the rarity of goals make it almost impossible to account, with sufficient statistical significance, for every historical factor that can lead to a goal. xG should be used as indicative, supportive information for decision-making and forming opinions rather than as a definitive answer on the performance of a team or player.

As the model’s creator Sam Green puts it: “a system like this will also fail to predict a high scoring game. Since it is based on averages and with around half of matches featuring fewer than 2.5 goals, this is to be expected”. We also need to consider that a shot taken by a Manchester United striker should have a higher xG than one taken by a Stoke City player, suggesting that on average Man Utd would outperform their xG on a chance by chance basis while Stoke City would underperform it if the xG is calculated using averages from all English teams' shot history.

Criticism and the Future of xG models

The recent misuse of Expected Goals as an analysis metric in punditry has attracted considerable criticism. A team may score one or two difficult chances early in a game and sit back for the remainder of the 90 minutes, allowing their opponents to take many shots from different positions and thus increasing the opponents' xG. One could then claim that the losing team achieved a higher xG and therefore deserved the win. This is why xG should always be taken together with additional context of the game before reaching a verdict. Statistics can tell us what happened in a game, but a wider view is necessary to show how it happened and give a clearer idea of what is yet to come. Certain in-game actions by players cannot be measured with a statistical model today, such as a defender's ability to get in front of a shot attempt despite never touching the ball.

There is also strong resistance from parts of the football community to the use of data. Football is a traditional and emotional sport by nature, with experience and accepted wisdom dominating people's opinions. Many fans see the use of statistics as intrusive and as challenging their popular and historic knowledge of "the beautiful game". After watching their team lose, most of them are not interested in listening to television pundits discuss how their team performed against their expected goals. Despite analytics having plenty to offer football performance analysis, there are still doubters. xG's debut on Match of the Day shook social media, with instant mentions of "stat nerds" and claims that numbers in football are "pointless" and "bollocks". However, Opta has made clear that xG is never intended to replace scouts and pundits but simply to aid them in their analysis of a game.

Despite this resistance and criticism from some pundits and football fans towards this new era of football analysis, Opta and various sports analysts continue to evolve the use of statistics to analyse performance in numerous areas of football. Models such as xG are the first round of statistical systems and will soon be followed by new ones such as Defensive Coverage, which will assess tackles, blocks, interceptions, man-marking and clearances. Football's data revolution has started and will continue to see developments every season.