Analyzing Football Players using Central Tendency and Dispersion

15 minute read

In this project we apply measures of Central Tendency & Dispersion to suggest two players for the Striker position!

00. Project Overview
- Context
- Actions
- Results
01. Concept Overview
02. Data Overview & Preparation
03. Applying Measures of Central Tendency & Dispersion
04. Analysing The Results

Project Overview

Context

Manchester United football club wants to know which player they should sign for the Striker position from the list provided. We need to perform a comparative Analysis between players and suggest two players whom they should sign.

Additional Note:

One of the players should be less than 25 years of age.
One of the players should have preferably played in the English Premier League.

Actions

We started inspecting the data to see any null values are present with help of .info() method. Based on the additional note, we created a column “Age Group” assigning a condition to players based on their age. We then found the value counts with respect to the “Age Group”. We found there are 6 players who are below 25 years of Age. We also found value counts of players who played in English premier league. There are 3 players who played in EPL.

We then calculated the total contribution of each player based on total number of goals and total number of assists to get an understanding of player’s contribution. While this can be a useful thing with respect to a player, it will not help in concluding that this is the best player. Further, we went in detail to understand their performances.

With respect to the measures of central tendency(Mean, Median, Mode), we calculated the mean(average) of columns such as no.of shots, PS%, AerialsWon and Rating for each player but it turns out there are no clear differences between the players. Then we calculated the Mean, Median and Mode of “Ratings” column for each player but again most of the ratings are similar.

We finally used Measures of Central tendency in conjugation with Measures of Dispersion i.e using Variance and Standard Deviation to find the ratings of the players.

Results

We found two players Dusan Vlahovic & Harry Kane who are selected for the striker position.

Concept Overview

Measures of Central Tendency

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.
As such, measures of central tendency are sometimes called measures of central location.
Example : Mean, Median & Mode.

Mean

Mean is a single number taken as representative of a list of numbers. It can be represented with symbol mu (µ)

Median

Median is the middle number of series when ordered. Finding median in three steps:

Arrange the numbers from largest to smallest.
If you have an odd number of values, the median is the middle one.
if you have an even number of values, the median is calculated by adding the two middle numbers together and divided by 2

Mode

The observation with highest frequency.

Measures of Dispersion

Measures of Central Tendency yield information about center or middle part of a dataset.
The measure of dispersion describes the spread of data.
The measure of dispersion in conjuction with measures of central tendency makes possible a more complex numerical description of data
For example : Range, Standard Deviation etc.

Range

Range is the difference between largest and smallest value.

Quartiles

The quartiles of a data set divide the data into four equal parts, with one-fourth of the data values in each part. The second quartile position is the median of the data set, which divides the data set in half.

To find the quartiles, you need to follow these steps:

Sort the dataset in ascending order.
Calculate the position of each quartile using the following formulas: a. Q1: (n + 1) / 4 b. Q2 (Median): (n + 1) / 2 c. Q3: 3 * (n + 1) / 4 where n is the total number of data points in the dataset.
If the position calculated in Step 2 is a whole number, the corresponding quartile is the value at that position in the sorted dataset. If the position is a decimal, take the average of the two values at the positions on either side of the decimal to find the quartile.

Inter Quartile Range

IQR is especially useful in a situation where data users are more interested in values toward the middle and less interested in extremes. IQR is also used to identify outliers.

To get around this, we divided the data into quarters, and we used the interquartile range to provide us with a cut-down range of the data.
The range of the values in these two quartiles is called the interquartile range (IQR = Q3-Q1).

Outliers

Outliers are values which are extreme above or below a set of values. Outliers are occured in three Ways:

Measurement Error

–> Human errors caused during data collection or recording

–> For example, while filling up a survey form, the respondent entered his age as 500 years instead of 50 years. The additional zero made him an outlier.

Sampling Error

–> The error occurs when the sample does not represent the entire population.

–> We have to conduct a survey regarding the durability of iphones, our sample will consist of respondents who use iphones and not Android phones.

Natural Error

–> When an outlier is not caused due to an error but occurs naturally.

–> In a Class of 50 Students, 2 Students passed the test and the rest 48 failed. Now the students who performed excellently and passed are outliers, but they are not caused due to an error.

Skewness

Skewness is a measure of the asymmetry or lack of symmetry in the data distribution. It tells us how the data is stretched out on one side compared to the other. Skewness can be positive, negative, or zero.

Positive Skewness: In a positively skewed distribution, the tail on the righthand side is longer or stretched out, and most of the data is concentrated on the left. This is also known as a right-skewed distribution.
Negative Skewness: In a negatively skewed distribution, the tail on the lefthand side is longer or stretched out, and most of the data is concentrated on the right. This is also known as a left-skewed distribution.
Zero Skewness: If the data is symmetric and has a balanced distribution around the mean, it has zero skewness.

Percentiles

Percentiles are measures of central tendency that divide a group of data into 100 parts.
There are 99 percentiles because it takes 99 dividers to separate a group of data into 100 parts.
Percentile is widely used in reporting test results.

Variance

Variance is a way of measuring spread, and it’s the average of the distance of values from the mean squared.
The problem with variance is that it can be quite difficult to think about the spread in terms of distances squared.

Standard Deviation

Standard Deviation is the square root of Variance.
The Standard Deviation is expressed in the same units as the mean is , whereas the variance is expressed in squared units.

Data Overview & Preparation

This dataset contains stats for 10 football players from Europe’s top leagues. I will use this data to solve the following problem statement.

In the code below, we:

Load in the Python libraries we require for importing the data.
Import the required data.

# install the required python libraries
import pandas as pd
import numpy as np

# import Players data
df = pd.read_excel(r'/Users/praju/Desktop/DADUO/projects/Players data.xlsx')

df.head()

A sample of this data (the first 5 rows) can be seen below:

Player Name	Age	Current Club	Opponent	competition	Date	Position	Mins	Goals	Yel	Shots	PS%	AerialsWon	Rating
Aleksandar Mitrovic	28	Fulham	Brentford (A)3 - 2	Premier League	2023-06-03 00:00:00	FW	90	0	1	1	54.8	2	6
Aleksandar Mitrovic	28	Fulham	Newcastle (A)1 - 0	Premier League	15-01-2023	FW	90	0	0	3	75.0	2	5
Aleksandar Mitrovic	28	Fulham	Tottenham (A)2 - 1	Premier League	2022-03-09 00:00:00	FW	90	1	1	5	50.0	2	7
Aleksandar Mitrovic	28	Fulham	Tottenham (H)0 - 1	Premier League	23-01-2023	FW	90	0	0	2	57.9	3	6
Aleksandar Mitrovic	28	Fulham	Southampton (H)2 - 1	Premier League	31-12-2022	FW	90	0	0	2	68.0	3	6

Data Dictionary:

Player Name: Name of the player
Age: The age of the player
Current Club: Name of the club that the player currently plays for
Opponent: Name of the team that the player played against
Competition: Name of the competition.
Date: Date of the match played
Position: Playing position of the player
Mins: Minutes played
Goals: Total goals
Assists: Total assists
Yel: Yellow card
Red: Red card
Shots: Total shots
PS%: Pass success percentage
AerialsWon: Aerial duels won
Rating: Rating per match

Applying Measures of Central Tendency & Dispersion

Datatypes and Null values

The very first thing we do is make sure the data types are correct and there are no null values present.

# Get more information about datatypes and null values in the dataframe

df.info()

Output:

Column	Non-Null	Count	Dtype
Player Name	289	non-null	object
Age	289	non-null	int64
Current Club	289	non-null	object
Opponent	289	non-null	object
competition	289	non-null	object
Date	289	non-null	object
Position	289	non-null	object
Mins	289	non-null	int64
Goals	289	non-null	int64
Assists	289	non-null	int64
Yel	289	non-null	int64
Red	289	non-null	int64
Shots	289	non-null	int64
PS%	289	non-null	float64
AerialsWon	289	non-null	int64
Rating	289	non-null	int64

We can see there are 289 non-null values in each column, indicating that there are no missing values.

Calculating Frequency

We will start with basic checks i.e count of values.

df['Age_Group']=np.where(df['Age']>=25,'>= 25 years','< 25 years')
df.groupby('Age_Group')['Player Name'].nunique()

Output:

Age_Group	Count
< 25 years	6
>= 25 years	4

We can see:

There are 6 players who are below 25 years.
There are 4 players who are above or equal 25 years.

Number of Players played in EPL

Checking the number of players played in English Premier League.

df.groupby('competition')['Player Name'].nunique()

Output:

competition	Count
Bundesliga	2
Champions League	5
Europa League	1
Liga Portugal	1
Ligue 1	1
Premier League	3
Seria A	3

We can see :

3 Players played in English Premier League.

Number of goals & assists

Checking the number of goals & assists scored by each player.

tc = df.groupby('Player Name')[['Goals','Assists']].agg('sum')
tc = pd.DataFrame(tc).reset_index()
tc

Output:

Player Name	Goals	Assists
Aleksandar Mitrovic	11	1
Dusan Vlahovic	11	4
Gonçalo Ramos	20	3
Harry Kane	24	4
Ivan Toney	18	4
Jonathan David	19	4
Marcus Thuram	13	4
Randal Kolo Muani	14	10
Rasmus Højlund	7	2
Victor Osimhen	25	4

Total Contribution

Calculating total contribution of each player.

tc['Total Contribution'] = tc['Goals']+tc['Assists']
tc.sort_values(by='Total Contribution',ascending=False)

Output:

Player Name	Goals	Assists	Total Contribution
Victor Osimhen	25	4	29
Harry Kane	24	4	28
Randal Kolo Muani	14	10	24
Gonçalo Ramos	20	3	23
Jonathan David	19	4	23
Ivan Toney	18	4	22
Marcus Thuram	13	4	17
Dusan Vlahovic	11	4	15
Aleksandar Mitrovic	11	1	12
Rasmus Højlund	7	2	9

We can find :

Victor Osimhen & Harry Kane are top 2 in goals.
Randal Kolo Muani is at top for assists.
Six players with goal contribution more than 20.

So can we suggest the players to clients based on goals & Assists? No, We need to go in detail to understand their performance.

Apply Average

Check the average no.of shots, PS%, AerialsWon & Ratings per match by each player.

others = df.groupby('Player Name')[['Shots','PS%','AerialsWon','Rating']].agg('mean').round(0).astype('int')
others = pd.DataFrame(others).reset_index()
others

Output:

Player Name	Shots	PS%	AerialsWon	Rating
Aleksandar Mitrovic	4	60	4	7
Dusan Vlahovic	3	71	1	6
Gonçalo Ramos	3	73	2	7
Harry Kane	3	71	2	7
Ivan Toney	3	62	4	7
Jonathan David	3	83	0	7
Marcus Thuram	3	73	1	7
Randal Kolo Muani	2	66	2	7
Rasmus Højlund	2	73	1	6
Victor Osimhen	4	72	2	7

We can see:

Average shots per match : Mitrovic & Osimhen
Average PS% per match : David, Ramos, Thuram, Hojlund
Average AerialsWon per match : Mitrovic,Toney
Average Rating : Multiple Players , there are no clear difference in rating if you see , most of them got either 7 or 6 rating. Out of 10 players, 2 of them with 6 rating.

Average offers you only one-dimensional view of your data. They tell you what the center of your data is, but that’s it. While this can be useful, its often not enough.

Calculating Mean, Median & Mode of Rating

Checking mean, median & mode ratings by each player

mmm = df.groupby(['Player Name'])['Rating'].agg(['mean','median']).round(0).astype('int')
mmm = pd.DataFrame(mmm).reset_index()
mmm1 = df.groupby('Player Name')['Rating'].apply(pd.Series.mode).to_frame()
mmm1 = mmm1.reset_index()
del mmm1['level_1']
mmm1.drop_duplicates(subset=['Player Name'],keep='first',inplace=True)
mmm1=mmm1.reset_index()
mmm['mode']=mmm1['Rating']
mmm

Output:

Player Name	mean	median	mode
Aleksandar Mitrovic	7	7	6
Dusan Vlahovic	6	6	6
Gonçalo Ramos	7	7	7
Harry Kane	7	7	7
Ivan Toney	7	6	6
Jonathan David	7	6	6
Marcus Thuram	7	7	6
Randal Kolo Muani	7	7	6
Rasmus Højlund	6	6	6
Victor Osimhen	7	7	7

We can see:

Mean, Median, Mode are same for most of the players. so it is difficult for us to identify players with a difference.

Note:

Mean, Median, Mode can provide information about the “Center” of data.
Mean is most commonly used, however median can sometimes be a better measure.
Mean can be easily influenced by extreme values, so be careful.
Look for large difference between mean and median. Could be a warning sign.

Quartiles & Percentiles

Using Quartiles & percentiles on data to see if there is a difference in players.

qp=df.groupby('Player Name')['Rating'].quantile([0.25,0.5,0.75,0.8,0.9])
qp=pd.DataFrame(qp).reset_index()
qp.head(20)

Output of first 20 rows:

Player Name	level_1	Rating
Aleksandar Mitrovic	0.25	6.0
Aleksandar Mitrovic	0.50	7.0
Aleksandar Mitrovic	0.75	7.0
Aleksandar Mitrovic	0.80	7.0
Aleksandar Mitrovic	0.90	8.0
Dusan Vlahovic	0.25	6.0
Dusan Vlahovic	0.50	6.0
Dusan Vlahovic	0.75	7.0
Dusan Vlahovic	0.80	7.0
Dusan Vlahovic	0.90	7.2
Gonçalo Ramos	0.25	6.0
Gonçalo Ramos	0.50	7.0
Gonçalo Ramos	0.75	7.0
Gonçalo Ramos	0.80	7.8
Gonçalo Ramos	0.90	8.0
Harry Kane	0.25	6.0
Harry Kane	0.50	7.0
Harry Kane	0.75	7.0
Harry Kane	0.80	7.0
Harry Kane	0.90	8.0

We can see:

Nothing much can be inferred.
We need some other way of summarizing the data in addition to the measures of central tendency.

Applying measures of Dispersion

Range - uses smallest & largest number in a set. The rest of the values are ignored.

r1 = df.groupby('Player Name')['Rating'].agg(['max','min'])
r1 = pd.DataFrame(r1).reset_index()
r1['Range']=r1['max']-r1['min']
r1

Output:

Player Name	max	min	Range
Aleksandar Mitrovic	9	5	4
Dusan Vlahovic	8	5	3
Gonçalo Ramos	10	5	5
Harry Kane	9	6	3
Ivan Toney	10	5	5
Jonathan David	9	6	3
Marcus Thuram	8	5	3
Randal Kolo Muani	9	5	4
Rasmus Højlund	9	5	4
Victor Osimhen	9	5	4

We can see:

Nothing can be inferred.

Note:

Main problem with range is that it may include outliers
If data has outliers, the range will include them, even though there may be only one or two extreme values, that could lead to a misleading result.
The range can measure how far the values are spread out, but its difficult to get a real picture of how the data is distributed.

Calculating IQR

Checking IQR for each player

iqr = df.groupby('Player Name')['Rating'].describe().drop(columns = ['min','max','mean','count','std','50%']).round(0).astype('int')
iqr = pd.DataFrame(iqr)
iqr['IQR'] = iqr['75%']-iqr['25%']
iqr

Output:

Player Name	25%	75%	IQR
Aleksandar Mitrovic	6	7	1
Dusan Vlahovic	6	7	1
Gonçalo Ramos	6	7	1
Harry Kane	6	7	1
Ivan Toney	6	7	1
Jonathan David	6	7	1
Marcus Thuram	6	7	1
Randal Kolo Muani	6	7	1
Rasmus Højlund	6	7	1
Victor Osimhen	7	8	1

We can see:

Nothing can be inferred.

Calculating Variance & Dispersion

In this step we need to find the ratings which vary the least

vsd = df.groupby('Player Name')['Rating'].agg(['var','std']).round(2)
vsd = pd.DataFrame(vsd)
vsd.sort_values(by='std')

Output:

Player Name	Var	Std
Dusan Vlahovic	0.64	0.80
Harry Kane	0.64	0.80
Marcus Thuram	0.70	0.84
Rasmus Højlund	0.79	0.89
Randal Kolo Muani	0.92	0.96
Jonathan David	0.98	0.99
Gonçalo Ramos	1.02	1.01
Aleksandar Mitrovic	1.13	1.06
Ivan Toney	1.14	1.07
Victor Osimhen	1.24	1.11

We can see:

As the standard deviation is high, the values tend to be away from the mean
As the standard deviation is low, the values tend to be closer to the mean

As per the insights obtained from the above output, we can say the players ‘Dusan Vlahovic’ & ‘Harry Kane’ are closer to the Mean i.e their performance is consistent which indicates the ratings of these two players vary the least.

Analysing The Results

Checking the players who played in different clubs

df1=df[['Player Name','Age','competition']]
df1.groupby(['Player Name','Age'])['competition'].unique()

Output:

Player Name	Age	Clubs
Aleksandar Mitrovic	28	[Premier League]
Dusan Vlahovic	23	[Seria A, Champions League, Europa League]
Gonçalo Ramos	21	[Liga Portugal, Champions League]
Harry Kane	29	[Premier League, Champions League]
Ivan Toney	27	[Premier League]
Jonathan David	23	[Ligue 1]
Marcus Thuram	25	[Bundesliga]
Randal Kolo Muani	24	[Bundesliga, Champions League]
Rasmus Højlund	20	[Seria A]
Victor Osimhen	24	[Seria A, Champions League]

We can see:

Players who played in different clubs

As per the Client Requirements:

One of the players should be less than 25 years of age - We can see ‘Dusan Vlahovic’ whose age is less than 25.
One of the players should have preferably played in the English Premier League - We can see ‘Harry Kane’ who played in the English Premier League.

It is also important to note that we should also consider total number of matches played by each player, then only standard deviation will be more effective.

df1['Player Name'].value_counts()

Output:

Player Name	Count
Harry Kane	38
Randal Kolo Muani	33
Gonçalo Ramos	32
Jonathan David	29
Dusan Vlahovic	29
Victor Osimhen	28
Ivan Toney	28
Marcus Thuram	26
Rasmus Højlund	25
Aleksandar Mitrovic	21

We can see:

Harry Kane - played 38 matches.
Dusan Vlahovic - played 29 matches.

Prajivinn T C

Data Science Portfolio

Analyzing Football Players using Central Tendency and Dispersion

Table of contents

Project Overview

Context

Actions

Results

Concept Overview

Measures of Central Tendency

Measures of Dispersion

Data Overview & Preparation

Applying Measures of Central Tendency & Dispersion

Datatypes and Null values

Calculating Frequency

Number of Players played in EPL

Number of goals & assists

Total Contribution

Apply Average

Calculating Mean, Median & Mode of Rating

Quartiles & Percentiles

Applying measures of Dispersion

Calculating IQR

Calculating Variance & Dispersion

As per the insights obtained from the above output, we can say the players ‘Dusan Vlahovic’ & ‘Harry Kane’ are closer to the Mean i.e their performance is consistent which indicates the ratings of these two players vary the least.

Analysing The Results

Checking the players who played in different clubs

Two players ‘Dusan Vlahovic’ & ‘Harry Kane’ are the finalists for the striker position!!!