understat-db: creating a database of one's own

What is it?

understat-db is a project for creating your own database of football (soccer) data from understat.com.

You can find the project’s source code on GitHub.

A screenshot showing the project's GitHub page, a terminal window with the project's help text, and a SQL query (with results) run against the project database.

Why should you use it?

This project isn’t prescriptive, so there’s no “correct” (or “incorrect”) way to use it. But there are two main ways in which I think it might be useful:

  • As a project to create a standalone database of football data
    • For example, using the data to create one-off visualisations in a separate project
    • Or, as a way to get started with learning SQL, without having to create a database
  • As a base project to modify and extend to suit your needs
    • For example, if you wanted to store additional data or views in the database
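
To give a flavour of the SQL-learning use case, here is a minimal sketch of the kind of aggregate query you might practise. It is not the project's actual schema: the `shots` table, its columns, and its rows are all made up for illustration, and Python's built-in `sqlite3` stands in for the Postgres database so the snippet runs on its own.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only. The real understat-db
# tables live in Postgres and may be named and shaped differently.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shots (team TEXT, xg REAL, is_goal INTEGER)")
conn.executemany(
    "INSERT INTO shots VALUES (?, ?, ?)",
    [
        ("Team A", 0.5, 1),
        ("Team A", 0.25, 0),
        ("Team B", 0.75, 1),
    ],
)

# Total expected goals (xG) and actual goals per team
query = """
    SELECT team, SUM(xg) AS total_xg, SUM(is_goal) AS goals
    FROM shots
    GROUP BY team
    ORDER BY team
"""
rows = conn.execute(query).fetchall()
for team, total_xg, goals in rows:
    print(team, total_xg, goals)

conn.close()
```

Against the real database you would connect to Postgres instead and swap in whichever tables and columns the project actually creates; the `GROUP BY` pattern itself carries over unchanged.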

Why did I make it?

There are a few blog posts/experiments I’d like to write up, and I generally like to make my posts/projects repeatable. This means including any code and data required to reproduce the results. In the past, I’ve tended to embed the code inside the post, notebook-style. However, this has two drawbacks for me.

Firstly, it makes it difficult to build upon previous work - you have to go over any required code again, or ask the reader to refer back to a past post. Neither of these strikes me as a great option.

Secondly, the linear format of a blog post or notebook isn’t always the clearest way to communicate, especially when dealing with concepts in code, which doesn’t necessarily have a clear start, middle, and end.

Hopefully, having some form of packaging/code distribution will help separate the blog content from the code. One or two of these future projects will build on top of understat-db, but I thought it would be useful enough to release as a standalone project.

(I also wanted to try out nbdev, and this seemed like a good fit; see “nbdev”.)

What’s next?

If this project is useful to enough people, I’d like to flesh out the documentation and write one or two tutorials. If this sounds good to you or (even better) you’d like to help, let me know!

I’d also like to release something similar for working with StatsBomb’s public data, but that’s a little further down my list at the moment. In any case, if that sounds interesting to you, watch this space.

Other info

This project uses a few different bits of software:

  • Python (3.6+)
  • Postgres (12+ recommended)
  • nbdev
  • dbt
  • Docker (optional - used for the database)

You don’t necessarily need to know each of these tools to use the project, but it’s worth being aware of them if you want to extend or fork the project.