Analyzing Football Players using Central Tendency and Dispersion
In this project we apply measures of Central Tendency & Dispersion to suggest two players for the Striker position!
Table of contents
- 00. Project Overview
- 01. Concept Overview
- 02. Data Overview & Preparation
- 03. Applying Measures of Central Tendency & Dispersion
- 04. Analysing The Results
Project Overview
Context
Manchester United football club wants to know which player they should sign for the Striker position from the list provided. We need to perform a comparative Analysis between players and suggest two players whom they should sign.
Additional Note:
- One of the players should be less than 25 years of age.
- One of the players should have preferably played in the English Premier League.
Actions
We started inspecting the data to see any null values are present with help of .info() method. Based on the additional note, we created a column “Age Group” assigning a condition to players based on their age. We then found the value counts with respect to the “Age Group”. We found there are 6 players who are below 25 years of Age. We also found value counts of players who played in English premier league. There are 3 players who played in EPL.
We then calculated the total contribution of each player based on total number of goals and total number of assists to get an understanding of player’s contribution. While this can be a useful thing with respect to a player, it will not help in concluding that this is the best player. Further, we went in detail to understand their performances.
With respect to the measures of central tendency(Mean, Median, Mode), we calculated the mean(average) of columns such as no.of shots, PS%, AerialsWon and Rating for each player but it turns out there are no clear differences between the players. Then we calculated the Mean, Median and Mode of “Ratings” column for each player but again most of the ratings are similar.
We finally used Measures of Central tendency in conjugation with Measures of Dispersion i.e using Variance and Standard Deviation to find the ratings of the players.
Results
We found two players Dusan Vlahovic & Harry Kane who are selected for the striker position.
Concept Overview
Measures of Central Tendency
- A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.
- As such, measures of central tendency are sometimes called measures of central location.
- Example : Mean, Median & Mode.
Mean
Mean is a single number taken as representative of a list of numbers. It can be represented with symbol mu (µ)
Median
Median is the middle number of series when ordered. Finding median in three steps:
- Arrange the numbers from largest to smallest.
- If you have an odd number of values, the median is the middle one.
- if you have an even number of values, the median is calculated by adding the two middle numbers together and divided by 2
Mode
The observation with highest frequency.
Measures of Dispersion
- Measures of Central Tendency yield information about center or middle part of a dataset.
- The measure of dispersion describes the spread of data.
- The measure of dispersion in conjuction with measures of central tendency makes possible a more complex numerical description of data
- For example : Range, Standard Deviation etc.
Range
Range is the difference between largest and smallest value.
Quartiles
The quartiles of a data set divide the data into four equal parts, with one-fourth of the data values in each part. The second quartile position is the median of the data set, which divides the data set in half.
To find the quartiles, you need to follow these steps:
- Sort the dataset in ascending order.
- Calculate the position of each quartile using the following formulas: a. Q1: (n + 1) / 4 b. Q2 (Median): (n + 1) / 2 c. Q3: 3 * (n + 1) / 4 where n is the total number of data points in the dataset.
- If the position calculated in Step 2 is a whole number, the corresponding quartile is the value at that position in the sorted dataset. If the position is a decimal, take the average of the two values at the positions on either side of the decimal to find the quartile.
Inter Quartile Range
IQR is especially useful in a situation where data users are more interested in values toward the middle and less interested in extremes. IQR is also used to identify outliers.
- To get around this, we divided the data into quarters, and we used the interquartile range to provide us with a cut-down range of the data.
- The range of the values in these two quartiles is called the interquartile range (IQR = Q3-Q1).
Outliers
Outliers are values which are extreme above or below a set of values. Outliers are occured in three Ways:
- Measurement Error
–> Human errors caused during data collection or recording
–> For example, while filling up a survey form, the respondent entered his age as 500 years instead of 50 years. The additional zero made him an outlier.
- Sampling Error
–> The error occurs when the sample does not represent the entire population.
–> We have to conduct a survey regarding the durability of iphones, our sample will consist of respondents who use iphones and not Android phones.
- Natural Error
–> When an outlier is not caused due to an error but occurs naturally.
–> In a Class of 50 Students, 2 Students passed the test and the rest 48 failed. Now the students who performed excellently and passed are outliers, but they are not caused due to an error.
Skewness
Skewness is a measure of the asymmetry or lack of symmetry in the data distribution. It tells us how the data is stretched out on one side compared to the other. Skewness can be positive, negative, or zero.
- Positive Skewness: In a positively skewed distribution, the tail on the righthand side is longer or stretched out, and most of the data is concentrated on the left. This is also known as a right-skewed distribution.
- Negative Skewness: In a negatively skewed distribution, the tail on the lefthand side is longer or stretched out, and most of the data is concentrated on the right. This is also known as a left-skewed distribution.
- Zero Skewness: If the data is symmetric and has a balanced distribution around the mean, it has zero skewness.
Percentiles
- Percentiles are measures of central tendency that divide a group of data into 100 parts.
- There are 99 percentiles because it takes 99 dividers to separate a group of data into 100 parts.
- Percentile is widely used in reporting test results.
Variance
- Variance is a way of measuring spread, and it’s the average of the distance of values from the mean squared.
- The problem with variance is that it can be quite difficult to think about the spread in terms of distances squared.
Standard Deviation
- Standard Deviation is the square root of Variance.
- The Standard Deviation is expressed in the same units as the mean is , whereas the variance is expressed in squared units.
Data Overview & Preparation
This dataset contains stats for 10 football players from Europe’s top leagues. I will use this data to solve the following problem statement.
In the code below, we:
- Load in the Python libraries we require for importing the data.
- Import the required data.
# install the required python libraries
import pandas as pd
import numpy as np
# import Players data
df = pd.read_excel(r'/Users/praju/Desktop/DADUO/projects/Players data.xlsx')
df.head()
A sample of this data (the first 5 rows) can be seen below:
Player Name | Age | Current Club | Opponent | competition | Date | Position | Mins | Goals | Assists | Yel | Red | Shots | PS% | AerialsWon | Rating |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Aleksandar Mitrovic | 28 | Fulham | Brentford (A)3 - 2 | Premier League | 2023-06-03 00:00:00 | FW | 90 | 0 | 0 | 1 | 0 | 1 | 54.8 | 2 | 6 |
Aleksandar Mitrovic | 28 | Fulham | Newcastle (A)1 - 0 | Premier League | 15-01-2023 | FW | 90 | 0 | 0 | 0 | 0 | 3 | 75.0 | 2 | 5 |
Aleksandar Mitrovic | 28 | Fulham | Tottenham (A)2 - 1 | Premier League | 2022-03-09 00:00:00 | FW | 90 | 1 | 0 | 1 | 0 | 5 | 50.0 | 2 | 7 |
Aleksandar Mitrovic | 28 | Fulham | Tottenham (H)0 - 1 | Premier League | 23-01-2023 | FW | 90 | 0 | 0 | 0 | 0 | 2 | 57.9 | 3 | 6 |
Aleksandar Mitrovic | 28 | Fulham | Southampton (H)2 - 1 | Premier League | 31-12-2022 | FW | 90 | 0 | 0 | 0 | 0 | 2 | 68.0 | 3 | 6 |
Data Dictionary:
- Player Name: Name of the player
- Age: The age of the player
- Current Club: Name of the club that the player currently plays for
- Opponent: Name of the team that the player played against
- Competition: Name of the competition.
- Date: Date of the match played
- Position: Playing position of the player
- Mins: Minutes played
- Goals: Total goals
- Assists: Total assists
- Yel: Yellow card
- Red: Red card
- Shots: Total shots
- PS%: Pass success percentage
- AerialsWon: Aerial duels won
- Rating: Rating per match
Applying Measures of Central Tendency & Dispersion
Datatypes and Null values
The very first thing we do is make sure the data types are correct and there are no null values present.
# Get more information about datatypes and null values in the dataframe
df.info()
Output:
Column | Non-Null | Count | Dtype |
---|---|---|---|
Player Name | 289 | non-null | object |
Age | 289 | non-null | int64 |
Current Club | 289 | non-null | object |
Opponent | 289 | non-null | object |
competition | 289 | non-null | object |
Date | 289 | non-null | object |
Position | 289 | non-null | object |
Mins | 289 | non-null | int64 |
Goals | 289 | non-null | int64 |
Assists | 289 | non-null | int64 |
Yel | 289 | non-null | int64 |
Red | 289 | non-null | int64 |
Shots | 289 | non-null | int64 |
PS% | 289 | non-null | float64 |
AerialsWon | 289 | non-null | int64 |
Rating | 289 | non-null | int64 |
We can see there are 289 non-null values in each column, indicating that there are no missing values.
Calculating Frequency
We will start with basic checks i.e count of values.
df['Age_Group']=np.where(df['Age']>=25,'>= 25 years','< 25 years')
df.groupby('Age_Group')['Player Name'].nunique()
Output:
Age_Group | Count |
---|---|
< 25 years | 6 |
>= 25 years | 4 |
We can see:
- There are 6 players who are below 25 years.
- There are 4 players who are above or equal 25 years.
Number of Players played in EPL
Checking the number of players played in English Premier League.
df.groupby('competition')['Player Name'].nunique()
Output:
competition | Count |
---|---|
Bundesliga | 2 |
Champions League | 5 |
Europa League | 1 |
Liga Portugal | 1 |
Ligue 1 | 1 |
Premier League | 3 |
Seria A | 3 |
We can see :
- 3 Players played in English Premier League.
Number of goals & assists
Checking the number of goals & assists scored by each player.
tc = df.groupby('Player Name')[['Goals','Assists']].agg('sum')
tc = pd.DataFrame(tc).reset_index()
tc
Output:
Player Name | Goals | Assists |
---|---|---|
Aleksandar Mitrovic | 11 | 1 |
Dusan Vlahovic | 11 | 4 |
Gonçalo Ramos | 20 | 3 |
Harry Kane | 24 | 4 |
Ivan Toney | 18 | 4 |
Jonathan David | 19 | 4 |
Marcus Thuram | 13 | 4 |
Randal Kolo Muani | 14 | 10 |
Rasmus Højlund | 7 | 2 |
Victor Osimhen | 25 | 4 |
Total Contribution
Calculating total contribution of each player.
tc['Total Contribution'] = tc['Goals']+tc['Assists']
tc.sort_values(by='Total Contribution',ascending=False)
Output:
Player Name | Goals | Assists | Total Contribution |
---|---|---|---|
Victor Osimhen | 25 | 4 | 29 |
Harry Kane | 24 | 4 | 28 |
Randal Kolo Muani | 14 | 10 | 24 |
Gonçalo Ramos | 20 | 3 | 23 |
Jonathan David | 19 | 4 | 23 |
Ivan Toney | 18 | 4 | 22 |
Marcus Thuram | 13 | 4 | 17 |
Dusan Vlahovic | 11 | 4 | 15 |
Aleksandar Mitrovic | 11 | 1 | 12 |
Rasmus Højlund | 7 | 2 | 9 |
We can find :
- Victor Osimhen & Harry Kane are top 2 in goals.
- Randal Kolo Muani is at top for assists.
- Six players with goal contribution more than 20.
So can we suggest the players to clients based on goals & Assists? No, We need to go in detail to understand their performance.
Apply Average
Check the average no.of shots, PS%, AerialsWon & Ratings per match by each player.
others = df.groupby('Player Name')[['Shots','PS%','AerialsWon','Rating']].agg('mean').round(0).astype('int')
others = pd.DataFrame(others).reset_index()
others
Output:
Player Name | Shots | PS% | AerialsWon | Rating |
---|---|---|---|---|
Aleksandar Mitrovic | 4 | 60 | 4 | 7 |
Dusan Vlahovic | 3 | 71 | 1 | 6 |
Gonçalo Ramos | 3 | 73 | 2 | 7 |
Harry Kane | 3 | 71 | 2 | 7 |
Ivan Toney | 3 | 62 | 4 | 7 |
Jonathan David | 3 | 83 | 0 | 7 |
Marcus Thuram | 3 | 73 | 1 | 7 |
Randal Kolo Muani | 2 | 66 | 2 | 7 |
Rasmus Højlund | 2 | 73 | 1 | 6 |
Victor Osimhen | 4 | 72 | 2 | 7 |
We can see:
- Average shots per match : Mitrovic & Osimhen
- Average PS% per match : David, Ramos, Thuram, Hojlund
- Average AerialsWon per match : Mitrovic,Toney
- Average Rating : Multiple Players , there are no clear difference in rating if you see , most of them got either 7 or 6 rating. Out of 10 players, 2 of them with 6 rating.
Average offers you only one-dimensional view of your data. They tell you what the center of your data is, but that’s it. While this can be useful, its often not enough.
Calculating Mean, Median & Mode of Rating
Checking mean, median & mode ratings by each player
mmm = df.groupby(['Player Name'])['Rating'].agg(['mean','median']).round(0).astype('int')
mmm = pd.DataFrame(mmm).reset_index()
mmm1 = df.groupby('Player Name')['Rating'].apply(pd.Series.mode).to_frame()
mmm1 = mmm1.reset_index()
del mmm1['level_1']
mmm1.drop_duplicates(subset=['Player Name'],keep='first',inplace=True)
mmm1=mmm1.reset_index()
mmm['mode']=mmm1['Rating']
mmm
Output:
Player Name | mean | median | mode |
---|---|---|---|
Aleksandar Mitrovic | 7 | 7 | 6 |
Dusan Vlahovic | 6 | 6 | 6 |
Gonçalo Ramos | 7 | 7 | 7 |
Harry Kane | 7 | 7 | 7 |
Ivan Toney | 7 | 6 | 6 |
Jonathan David | 7 | 6 | 6 |
Marcus Thuram | 7 | 7 | 6 |
Randal Kolo Muani | 7 | 7 | 6 |
Rasmus Højlund | 6 | 6 | 6 |
Victor Osimhen | 7 | 7 | 7 |
We can see:
- Mean, Median, Mode are same for most of the players. so it is difficult for us to identify players with a difference.
Note:
- Mean, Median, Mode can provide information about the “Center” of data.
- Mean is most commonly used, however median can sometimes be a better measure.
- Mean can be easily influenced by extreme values, so be careful.
- Look for large difference between mean and median. Could be a warning sign.
Quartiles & Percentiles
Using Quartiles & percentiles on data to see if there is a difference in players.
qp=df.groupby('Player Name')['Rating'].quantile([0.25,0.5,0.75,0.8,0.9])
qp=pd.DataFrame(qp).reset_index()
qp.head(20)
Output of first 20 rows:
Player Name | level_1 | Rating |
---|---|---|
Aleksandar Mitrovic | 0.25 | 6.0 |
Aleksandar Mitrovic | 0.50 | 7.0 |
Aleksandar Mitrovic | 0.75 | 7.0 |
Aleksandar Mitrovic | 0.80 | 7.0 |
Aleksandar Mitrovic | 0.90 | 8.0 |
Dusan Vlahovic | 0.25 | 6.0 |
Dusan Vlahovic | 0.50 | 6.0 |
Dusan Vlahovic | 0.75 | 7.0 |
Dusan Vlahovic | 0.80 | 7.0 |
Dusan Vlahovic | 0.90 | 7.2 |
Gonçalo Ramos | 0.25 | 6.0 |
Gonçalo Ramos | 0.50 | 7.0 |
Gonçalo Ramos | 0.75 | 7.0 |
Gonçalo Ramos | 0.80 | 7.8 |
Gonçalo Ramos | 0.90 | 8.0 |
Harry Kane | 0.25 | 6.0 |
Harry Kane | 0.50 | 7.0 |
Harry Kane | 0.75 | 7.0 |
Harry Kane | 0.80 | 7.0 |
Harry Kane | 0.90 | 8.0 |
We can see:
- Nothing much can be inferred.
- We need some other way of summarizing the data in addition to the measures of central tendency.
Applying measures of Dispersion
Range - uses smallest & largest number in a set. The rest of the values are ignored.
r1 = df.groupby('Player Name')['Rating'].agg(['max','min'])
r1 = pd.DataFrame(r1).reset_index()
r1['Range']=r1['max']-r1['min']
r1
Output:
Player Name | max | min | Range |
---|---|---|---|
Aleksandar Mitrovic | 9 | 5 | 4 |
Dusan Vlahovic | 8 | 5 | 3 |
Gonçalo Ramos | 10 | 5 | 5 |
Harry Kane | 9 | 6 | 3 |
Ivan Toney | 10 | 5 | 5 |
Jonathan David | 9 | 6 | 3 |
Marcus Thuram | 8 | 5 | 3 |
Randal Kolo Muani | 9 | 5 | 4 |
Rasmus Højlund | 9 | 5 | 4 |
Victor Osimhen | 9 | 5 | 4 |
We can see:
- Nothing can be inferred.
Note:
- Main problem with range is that it may include outliers
- If data has outliers, the range will include them, even though there may be only one or two extreme values, that could lead to a misleading result.
- The range can measure how far the values are spread out, but its difficult to get a real picture of how the data is distributed.
Calculating IQR
Checking IQR for each player
iqr = df.groupby('Player Name')['Rating'].describe().drop(columns = ['min','max','mean','count','std','50%']).round(0).astype('int')
iqr = pd.DataFrame(iqr)
iqr['IQR'] = iqr['75%']-iqr['25%']
iqr
Output:
Player Name | 25% | 75% | IQR |
---|---|---|---|
Aleksandar Mitrovic | 6 | 7 | 1 |
Dusan Vlahovic | 6 | 7 | 1 |
Gonçalo Ramos | 6 | 7 | 1 |
Harry Kane | 6 | 7 | 1 |
Ivan Toney | 6 | 7 | 1 |
Jonathan David | 6 | 7 | 1 |
Marcus Thuram | 6 | 7 | 1 |
Randal Kolo Muani | 6 | 7 | 1 |
Rasmus Højlund | 6 | 7 | 1 |
Victor Osimhen | 7 | 8 | 1 |
We can see:
- Nothing can be inferred.
Calculating Variance & Dispersion
In this step we need to find the ratings which vary the least
vsd = df.groupby('Player Name')['Rating'].agg(['var','std']).round(2)
vsd = pd.DataFrame(vsd)
vsd.sort_values(by='std')
Output:
Player Name | Var | Std |
---|---|---|
Dusan Vlahovic | 0.64 | 0.80 |
Harry Kane | 0.64 | 0.80 |
Marcus Thuram | 0.70 | 0.84 |
Rasmus Højlund | 0.79 | 0.89 |
Randal Kolo Muani | 0.92 | 0.96 |
Jonathan David | 0.98 | 0.99 |
Gonçalo Ramos | 1.02 | 1.01 |
Aleksandar Mitrovic | 1.13 | 1.06 |
Ivan Toney | 1.14 | 1.07 |
Victor Osimhen | 1.24 | 1.11 |
We can see:
- As the standard deviation is high, the values tend to be away from the mean
- As the standard deviation is low, the values tend to be closer to the mean
As per the insights obtained from the above output, we can say the players ‘Dusan Vlahovic’ & ‘Harry Kane’ are closer to the Mean i.e their performance is consistent which indicates the ratings of these two players vary the least.
Analysing The Results
Checking the players who played in different clubs
df1=df[['Player Name','Age','competition']]
df1.groupby(['Player Name','Age'])['competition'].unique()
Output:
Player Name | Age | Clubs |
---|---|---|
Aleksandar Mitrovic | 28 | [Premier League] |
Dusan Vlahovic | 23 | [Seria A, Champions League, Europa League] |
Gonçalo Ramos | 21 | [Liga Portugal, Champions League] |
Harry Kane | 29 | [Premier League, Champions League] |
Ivan Toney | 27 | [Premier League] |
Jonathan David | 23 | [Ligue 1] |
Marcus Thuram | 25 | [Bundesliga] |
Randal Kolo Muani | 24 | [Bundesliga, Champions League] |
Rasmus Højlund | 20 | [Seria A] |
Victor Osimhen | 24 | [Seria A, Champions League] |
We can see:
- Players who played in different clubs
As per the Client Requirements:
- One of the players should be less than 25 years of age - We can see ‘Dusan Vlahovic’ whose age is less than 25.
- One of the players should have preferably played in the English Premier League - We can see ‘Harry Kane’ who played in the English Premier League.
It is also important to note that we should also consider total number of matches played by each player, then only standard deviation will be more effective.
df1['Player Name'].value_counts()
Output:
Player Name | Count |
---|---|
Harry Kane | 38 |
Randal Kolo Muani | 33 |
Gonçalo Ramos | 32 |
Jonathan David | 29 |
Dusan Vlahovic | 29 |
Victor Osimhen | 28 |
Ivan Toney | 28 |
Marcus Thuram | 26 |
Rasmus Højlund | 25 |
Aleksandar Mitrovic | 21 |
We can see:
- Harry Kane - played 38 matches.
- Dusan Vlahovic - played 29 matches.