Peter McKeever

Dublin/Düsseldorf. Teacher. Language Consultant. Hobby Statistician

Football Analytics

How to Compile Odds for Football: A Beginner’s Guide Part II – More on xGoals and Our First Match

In Part I we looked at gathering and cleaning data from football-data.co.uk, and building our first basic table. In this part we’ll discuss xGoals in a bit more detail and run through our first method for calculating odds, using the upcoming game between Liverpool and Everton.

Expected Goals

The bread and butter of Fanalytics. Expected Goals (xGoals, xG) are everywhere. There are many ways of calculating them. Some are quite advanced – using location data and player lineups – while others are very primitive – like this method. However they are calculated, they all have the same objective: to assess/predict the deserved number of goals scored/conceded by football teams rather than the actual number of goals.

In the previous article, I showed a method of finding the xGoals scored and conceded for each team in the EPL. Here’s the table we made again:

cok8cr

First, let’s discuss some limitations of this data.  This table only takes this season’s data into account, as such it means that this method is not reliable in the early stages of the season. This is why I recommended making a second table using the past 5 seasons worth of data also. Here is that chart:

3nfk8o

Another is dealing with newly promoted teams. Much like dealing with our dataset that only takes this season into account, there is insufficient data to accurately analyse these teams early in the season. A workaround for this is to simply use the figures of the 3 relegated teams from the previous season until enough data is available. I disagree with this line of thinking though, as it assumes that those teams are of the same quality. For me, I rather use the average of each promoted and relegated team from the previous five seasons, which I will tackle in a later article in this series that will show you how to build a league table and analyse team over/under performance.

Let’s take a look at an upcoming match to see how we can get some kind of probabilities from our table. Before we jump into this I would like to show you some ways of automating the process.

Making an Odds Calculation Worksheet

On a new worksheet, select cells c2:e2 and click “Merge & Centre”.

footdata9

Next, in your toolbar, click DATA>Data Validation and select LIST from the dropdown menu.

footdata10

Click on the fb85pj to the right of the source box and enter the following:

=’Shots and Goal Data’!$A$2:$A$21

This links the team names in the first table we created in Part I and gives us a drop down list of teams.

footdata11

With this done once, we can simply select cell C2, press CTRL+C, and paste it in cell G2 by pressing CTRL+V. Now we have a drop down list for every possible match up in the current PL season. In our home box, select Liverpool, and in our away box, select Everton. I’ve added some extra cells which you can see here:

footdata12

In cell L2, we need to find a way of taking the xGoals For and Against for both Liverpool and Everton without having to back and forth between sheets. To do this we can either use a VLOOKUP function, or an INDEX MATCH function. The index function is a lot more flexible and it is an asset to be comfortable with using it, so we’ll use it for our odds calculation (an in-depth look at formulas is something others have done much better than I could, but I may outline some of them in future articles).

So, in L2 we enter:

=INDEX(‘Shots and Goal Data’!$E$2:$E$21,MATCH(‘Odds Calculation’!L$1,’Shots and Goal Data’!$A$2:$A$21,0))

In cell L3, we need to change that $E$2:$E$21 part of the formula. This references the xG Scored column in our first table, but we need it to reference xG Conceded.

when we have that done, copy cells L2:L3 across to column M.

footdata13.JPG

 This shows us that using this season’s data, Liverpool have an xG for per game of 2.110 and an xG against of 1.231 per game. In order to find the xG for their match versus Everton though, we need to run another formula.

In cell P2, multiply Liverpools xG for (L2) by Everton’s xG against (M3) and add the home advantage:

=L2*M3+N2

In P3 do the same for Everton (M2*L3) without the Home Advantage of course.

This gives us each team’s expected goals for the match:

  • Liverpool: 2.602 Goals
  • Everton: 2.155 Goals

Now we have that set up, each time we change the teams in cells C2 and G2, this data will automatically update.

Finding the Odds

Right so, on to probably the only part you wanted to read. Now that we have our xGoals for the match between Liverpool and Everton, we can use these two inputs to calculate almost all odds in the upcoming match. To do this we need to use what is called the Poisson Distribution.

In my opinion, there are better methods for calculating odds (ZIP, Negative Binomial Dist) which I hope to cover later in this series, but a fantastic starting point is the Poisson.

The Poisson Distribution gives the range of possible outcomes occurring in a football match. The outcomes we will use are the possible scores that this match could finish with.

From cells A9:A19 enter the numbers 0-10. Do the same for cells B8:L8 so we get an outline like so:

footdata14

In cell B9, enter the following:

=POISSON.DIST($A9,$P$2,0)*POISSON.DIST(B$8,$P$3,0)

Copy the formula across and down and set the number value to Percentage:

footdata15

This shows us the range of possible outcomes in this game from 0-0 to 10-10. You’ll notice that the chance of a 0-0 is very low – too low (less than 1%), while the chance of a 10-10 finish is virtually 0%.

We can clean up the data and use some conditional formatting to highlight the key information:

i9qa5f

The green area shows all of the scorelines where Liverpool win, the Orange area shows all scorelines where Everton win, and the yellow area shows the scorelines that end in a draw. By using conditional formatting, I have highlighted the top 5 results in the data. If we use Excels’s SUM function, we can add up all the probabilities of a Liverpool victory, of an Everton victory, and of the match ending in a draw.

This shows that Liverpool have a 48.65% chance of winning, there is an 18.50% chance of a draw, and Everton have a 32.84% chance of winning.

We can turn this into decimal odds by dividing the % chance by 1:

footdata16

This gives the odds for the match at:

  • Liverpool: 2.06
  • Draw: 5.21
  • Everton: 3.05

Currently, the average odds on offer show Liverpool as a 1.62 favourite, which is a much higher chance for Liverpool than our model. I have found that using 5 seasons of data works much better for well-established PL teams while using data from only this season works better for teams new to the Premier League, and teams that are clearly over/under acheiving – Leicester and Chelsea last season come to mind. When we use the xGoal data from our 5 season table, the odds are:

  • Liverpool: 1.82
  • Draw: 5.51
  • Everton 3.70

with Liverpool expected to score 2.67 goals and Everton 1.89 goals, which is probably closer to reality with the chance of the match finishing 0-0 being better represented.

i9osma

Here we can see that the 5 season figures work much better for this game, but there are a couple of games in the coming game week that will use our 1 season data instead:

5f8pwo

Issues & Limitations

While it is a great starting point for those new to the area, there are issues with this method.

Firstly, it does not take into consideration factors such as line-ups, injuries, new signings, weather, motivation, derbies etc.. etc…

Secondly, the Poisson Distribution tends to under-estimate low scores and draws.

Lastly, Bookmaker margin is not considered in Poisson calculations. This is something we’ll take a look at in Part III when we expand the markets beyond simple Home-Draw-Away to Over/Under, Both to Score, European and Asian Handicaps, and so on.

Have a go at it yourself and let me know how you get on. Feel free to play around with layout etc… the more of your own ideas you can put into this and develop the better. If you have any questions please feel free to contact me below or @petermckeever

 

**note: an earlier version of this article displayed incorrect xG against figures for the 5 season data. This has been updated since. Thank you to @willemvdbrink for pointing it out!

6 Comments

  1. Willem van den Brink

    Dear Mr. McKeever,

    Besides the earlier correction, I have a question:
    In cells P2 and P3 you calculate xG for both teams the match. However, the values in those cells exceed both Liverpool and Everton’s xG scored and conceded (the ones calculated in part I). [Generally speaking, as long as all values are >1, P2 and P3 will exceed xG from the Shots and Goals Data]

    Imagine that the dataset on which the calculations are based (The ‘Results’ tab) contains only one match; an earlier edition of Liverpool – Everton. This game ended in 2 – 1. As this is the only game in this hypothetical dataset, average xG scored would be 2 for Liverpool and 1 for Everton, with the xG conceded being the reverse.
    However, if, based on these data, you would make a prediction for the next Liverpool – Everton, you would expect 4-1 (not taking the home advantage into account, which would make it 6-1). Of course, this is purely hypothetical, and you should never base yourself on one data point, but my intuition tells me that the prediction should be 2-1.

    Alternatively, in an infinitely large underlying dataset where all the historical results were 2-2, you would predict an xG of 4 for every team in every match-up, whereas intuitively 2-2 would be the expected result / a 2 would be expected in both cells P2 and P3.

    It seems to me that the values in P2 and P3 are useful in determining who is the more likely winner in a match-up. However, I am not sure if you can call those values expected goals, nor am I certain that if you use them in your Poisson distribution, you can use the distribution to predict scores.

    It is quite possible that I have made a mistake in my reasoning, or that I am working from an incorrect assumption. If so, my apologies. However, I am wondering if and why my intuition is wrong.

    • Willem van den Brink

      P.s. What might also help to get the xG to be more fitting is to not add .4 to the xG of the home team, but by adding .2 to the xG of the hometeam, and subtracting .2 from the xG of the away team.

      • Hi Willem,

        Thanks for your comment. Let’s step back and break down each part of the xG numbers for Liverpool and Everton.

        In order to calculate Liverpool’s expected goals v Everton, we need use the formula: Liverpool xGfor * Everton xGconceded + Home Advantage. The xGoals calculation is 2.11*1.04+0.4 for Liverpool and 1.23*1.75 for Everton for our first example, This is just our starting point so we haven’t got into previous head to head games etc… yet

        But in order to find those numbers for xGoals for and against in our table we need to do more than simply average Actual goals scored and conceded. The method I outlined involves 3 steps:

        Step 1: Find the average SOT per game for each team
        Step 2: Find how many SOT result in a goal (shown as a %) for each team
        Step 3: Multiply these two figures.

        An alternative method is to use league averages. On average, it takes teams ~10 shots to score a goal, and ~3 SOT to score a goal. We can look at each game and compare the actual result with the “expected” result. For example, Last week West Ham and Leicester finished 1-3. Using this method and these match stats:

        West Ham
        Shots – 20
        SOT – 7
        Goals – 1

        Leicester
        Shots – 11
        SOT – 5
        Goals – 3

        we could assume that:

        West Ham “should” have scored 1.78 goals and Leicester 1.92. This makes different assumptions however, as it assumes that all teams shoot and hit the target at the same rate which of course is incorrect. Another flaw with both methods is that it is quite difficult to remove things such as blocked shots and penalties from such a large data set, which many location-based xGoal models do not include. It is a primitive method intended to get people started.

        If all matches ended 2-2 and all games had the exact same number of shots and SOTs and goal conversion rates that yes the xGoal method would be crazy, but the method outlined would not work, or any I suspect as we’d all be rich betting on a 2-2 result.

        I use 0.4 for Home Advantage. It’s the average of how many more goals a team scores when playing at home. I don’t reduce the xGoals for the away team because it tends to underestimate the opposition, but please feel free to do your own thing with it. These tutorials are intended as a stepping stone for people to develop their own ideas, and there are many many ways of approaching it. I’d be very interested in seeing some results if you experiment with it!

      • Willem van den Brink

        Hello Peter,

        Thank you for your reply (both here and via Twitter).

        I understand that my hypothetical scenario’s / datasets were so irrealistic that they are not a fair test to the methodology.

        However, the underlying ‘problem’ which I tried to fit into those simplified examples still is valid as far as I can tell. If we would use the xG and xG conceded (either from this season’s data or historical data) to predict a full season, then in the prediction all teams would on average score ~50% more goals per game than their xG. I would theoretically expect that the average xG based on the data should be approximately the same as the average goals scored in the prediction.

        My calculation in the Shots and Goals Data:
        =(2*E2*(SUM($I$2:$I$21)-I2)+19*0.4)/38 [Ex = xG, Ix = xG conceded)
        The 2 is for being a full competition (playing every team twice), Ex*SUM is the calculation for expected goals as used on the odds tab, and the 19*0.4 is the home advantage.

        In any case, I appreciate that your posts are only a stepping stone intended to get the reader on his/her way. They have helped me a lot already. I am also aware that with the data used so far, it is not possible to make a very accurate forecast regardless of one’s calculations. I will be looking forward to the next part(s) of this tutorial to see how reliable/accurate one can get using freely available data (and how to get those ‘reliable’ predictions).

  2. Ian

    Hi – great article – is there a tiny typo in the part telling how to convert implied chance % to decimal odds? S

    • Hey Ian,

      Cheers, hmm possibly. Just to be clear it is 1/% chance. so a 50% chance would be 1/0.50(50%) = 2.00 or EVENS in old money

Comments are Closed

Theme by Anders Norén