EDA project report

I would like to give a sense of the current state of men's professional soccer. The FIFA video game updates its comprehensive player ability stats every year based on performance over the past year, so the ratings should be indicative of players' real-world abilities.

I started by summarizing the top players at each position. One thing to note is that some players are versatile and can play several positions, so this needs to be handled carefully in any by-position analysis.

We can see that 9005 of the 19781 players are not tied to a single position: their “position” column lists every position they can play, separated by ‘|’. I extracted the 15 unique positions in the data and grouped the top players by the positions they play.
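The position handling described above can be sketched in pandas as follows. This is only an illustration under stated assumptions: the column names (`name`, `position`, `overall`) and the four toy rows are hypothetical stand-ins, not the actual dataset.

```python
import pandas as pd

# Toy rows for illustration only; names, values and column names are assumptions.
players = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "position": ["ST|LW", "GK", "CM|CDM|CAM", "ST"],
    "overall": [90, 88, 85, 80],
})

# Count players whose position field lists more than one position.
multi_position = players["position"].str.contains("|", regex=False)
n_multi = int(multi_position.sum())

# Expand to one row per (player, position) pair so that a multi-position
# player is counted under every position he can play.
by_position = (
    players.assign(position=players["position"].str.split("|"))
    .explode("position")
    .reset_index(drop=True)
)

# Distinct positions present in the data.
unique_positions = sorted(by_position["position"].unique())

# Highest-rated player for each position.
top_per_position = (
    by_position.sort_values("overall", ascending=False)
    .groupby("position")
    .head(1)
    .set_index("position")["name"]
)
```

Exploding the column means a versatile player appears in several position lists, which is one reasonable way to treat multi-position players; assigning each player only to his first listed position would be an alternative.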

Here is a glimpse of the summaries I made. If you are a soccer fan, you should be familiar with the names leading the lists for their positions, such as Lionel Messi, Cristiano Ronaldo, and Kevin De Bruyne.

I then attempted to discover some insights into professional clubs.

To measure the overall strength of pro clubs, I calculated the average player rating of each team; here are the Top 10 teams with the highest averages. Unsurprisingly, the teams on the list, such as Juventus and Bayern, are among the strongest in the real world today.

I then calculated the average age of each pro club; here is a list of the Top 10 youngest teams. Although most of them are unfamiliar, and some are even reserve teams (e.g. FC Bayern München II), we can be relatively optimistic about their performance in the coming years, on the assumption that younger players have more room to improve before reaching their peak. RB Leipzig, ranked 8th on the list, has already made some noise in European competition.
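Both club-level summaries above (average rating and average age) come down to a single group-by. A minimal sketch, assuming hypothetical column names `club`, `overall` and `age` and using made-up toy rows rather than the real data:

```python
import pandas as pd

# Toy rows for illustration only; clubs, ratings and ages are made up.
players = pd.DataFrame({
    "club": ["X", "X", "Y", "Y", "Z", "Z"],
    "overall": [90, 80, 70, 72, 88, 84],
    "age": [30, 28, 19, 21, 24, 26],
})

# One row per club with its mean rating, mean age and number of players.
club_summary = players.groupby("club").agg(
    avg_overall=("overall", "mean"),
    avg_age=("age", "mean"),
    squad_size=("overall", "size"),
)

# Top 10 clubs by average rating, and the 10 youngest clubs by average age.
strongest = club_summary.nlargest(10, "avg_overall")
youngest = club_summary.nsmallest(10, "avg_age")
```

Keeping `squad_size` next to the averages makes it easy to spot clubs whose average rests on only a handful of players.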

I also tried to assess the strength of national teams, again using average player ratings as the standard for comparison. This time, however, I narrowed the list to countries with at least 23 players in the data, the squad size required for major tournaments, and computed each country's average from its Top 23 players only. Although some top players no longer play for their national teams, for reasons such as age, the average should still be indicative of a country’s overall strength. The European countries Belgium, Italy, Portugal, and England lead the list, which is consistent with the current world rankings.
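The national-team calculation above (keep countries with at least 23 players, then average each country's top 23) could look roughly like the sketch below. The function name `national_strength` and the columns `nationality` and `overall` are assumptions for illustration, and the toy example uses a squad size of 2 so that the filter is visible on six made-up rows.

```python
import pandas as pd

SQUAD_SIZE = 23  # squad size used in the report


def national_strength(players: pd.DataFrame, squad_size: int = SQUAD_SIZE) -> pd.Series:
    """Mean rating of each country's top `squad_size` players.

    Countries with fewer than `squad_size` players are dropped.
    """
    # Number of players per country, aligned to the original rows.
    counts = players.groupby("nationality")["overall"].transform("size")
    eligible = players[counts >= squad_size]

    # Keep only the best `squad_size` players of each remaining country.
    top = (
        eligible.sort_values("overall", ascending=False)
        .groupby("nationality")
        .head(squad_size)
    )
    return top.groupby("nationality")["overall"].mean().sort_values(ascending=False)


# Toy rows for illustration only; countries and ratings are made up.
toy = pd.DataFrame({
    "nationality": ["A", "A", "A", "B", "B", "C"],
    "overall": [90, 80, 60, 85, 83, 95],
})
ranking = national_strength(toy, squad_size=2)
```

In the toy data, country C has the single best player but is excluded because it cannot fill a squad, which is exactly the effect of the minimum-size constraint.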

Finally, I tested the assumption that age is correlated with a player’s potential over his career, since younger players have more room to improve. I included the current overall rating in the regression as a control to measure the ceteris paribus effect of age on potential.

The estimated coefficient of age is -0.8790947. The negative sign agrees with the assumption that the older a player gets, the less room he has to improve: holding the overall rating fixed, each additional year of age is associated with about 0.88 fewer points of potential. The fitted values track the actual data well, and the R² is 0.81, suggesting a good fit for the model. The regression therefore provides statistical evidence of a negative association between a player’s age and his potential.
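For readers who want to reproduce this kind of model, the regression has the form potential = b0 + b1 * age + b2 * overall + error. Below is a minimal ordinary-least-squares sketch using NumPy. The data are synthetic, generated with arbitrary made-up coefficients (-0.5 for age, 0.8 for overall), so the numbers it produces are not the report's results; it only shows how the coefficient and R² are obtained.

```python
import numpy as np

# Synthetic data for illustration only; the coefficients below are arbitrary
# and unrelated to the estimates reported above.
rng = np.random.default_rng(0)
n = 500
age = rng.integers(17, 40, size=n).astype(float)
overall = rng.normal(65.0, 7.0, size=n)
potential = 50.0 - 0.5 * age + 0.8 * overall + rng.normal(0.0, 1.0, size=n)

# Design matrix: intercept, age, and overall rating as the control.
X = np.column_stack([np.ones(n), age, overall])

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, potential, rcond=None)
intercept, age_coef, overall_coef = beta

# R squared: share of the variance in potential explained by the fit.
fitted = X @ beta
ss_res = np.sum((potential - fitted) ** 2)
ss_tot = np.sum((potential - potential.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
```

Here `age_coef` plays the role of the age coefficient discussed above: the expected change in potential for one extra year of age with the overall rating held fixed.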