Maths for Data Engineering Jobs: The Only Topics You Actually Need (& How to Learn Them)

9 min read

If you are applying for data engineering jobs in the UK, maths can feel like a vague requirement hiding behind phrases like “strong analytical skills”, “performance mindset” or “ability to reason about systems”. Most of the time, hiring managers are not looking for advanced theory. They want confidence with the handful of maths topics that show up in real pipelines:

  • Rates, units & estimation (throughput, cost, latency, storage growth)

  • Statistics for data quality & observability (distributions, percentiles, outliers, variance)

  • Probability for streaming, sampling & approximate results (sketches like HyperLogLog++ & the logic behind false positives)

  • Discrete maths for DAGs, partitioning & systems thinking (graphs, complexity, hashing)

  • Optimisation intuition for SQL plans & Spark performance (joins, shuffles, partition strategy, "what is the bottleneck")

This article is written for UK job seekers targeting roles like Data Engineer, Analytics Engineer, Platform Data Engineer, Data Warehouse Engineer, Streaming Data Engineer or DataOps Engineer.

Who this is for

You will get the most value if you are in one of these groups:

  • Route A: Career changers from software engineering, IT, ops, analytics or finance who can code but want the “data platform” thinking

  • Route B: Students & grads who have done some stats or CS theory but want to translate it into job-ready pipelines

Same topics either way. The difference is whether you learn best by building first or by understanding the concept first.

Why maths matters in data engineering

Data engineering is applied maths dressed up as infrastructure. The work is full of questions like:

  • How many events per second can this pipeline handle before it falls behind?

  • How much storage will we need in 30 days if retention is 90 days?

  • Is this “drop” in conversions a real change or a data quality issue?

  • Is this join slow because of skew, shuffles or bad partitioning?

  • Are we OK with an approximate distinct count if it reduces cost dramatically? (Google Cloud documentation)

If you can answer those with clear assumptions, you come across as someone who can run pipelines in production, not just write transformations.

The only maths topics you actually need

1) Units, rates & back-of-the-envelope estimation

This is the most underrated “maths skill” in data engineering. Nearly every performance or cost decision comes down to unit conversion plus a simple model.

What you actually need

  • Bytes vs bits, KB/MB/GB/TB, rows vs events

  • Rate conversions: events/sec → events/day, MB/sec → GB/day

  • Growth thinking: daily growth × retention window ≈ steady-state storage

  • Order-of-magnitude estimation: you do not need perfect numbers, you need plausible numbers

Real data engineering examples

Example: event ingestion volume

  • 2,000 events/sec average

  • payload: 800 bytes per event

  • data per second ≈ 1.6 MB/sec

  • data per day ≈ 1.6 × 86,400 ≈ 138,240 MB ≈ 138 GB/day

  • if retention is 30 days, steady-state raw storage ≈ 4.1 TB, plus overhead, indexing & replicas
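
A minimal Python sketch of the same arithmetic, using the example figures above (2,000 events/sec, 800 bytes, 30-day retention) rather than measurements from any real system:

```python
# Back-of-the-envelope storage estimate for the example above.
events_per_sec = 2_000          # average ingest rate
bytes_per_event = 800           # average payload size
retention_days = 30

mb_per_sec = events_per_sec * bytes_per_event / 1_000_000   # ≈ 1.6 MB/sec
gb_per_day = mb_per_sec * 86_400 / 1_000                    # ≈ 138 GB/day
steady_state_tb = gb_per_day * retention_days / 1_000       # ≈ 4.1 TB raw

print(f"{mb_per_sec:.1f} MB/sec, {gb_per_day:.0f} GB/day, {steady_state_tb:.1f} TB at steady state")
```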

Example: pipeline lag

  • Backlog: 120 million events

  • Consumers: 12 workers

  • Each worker: 500 events/sec sustained

  • Total throughput = 6,000 events/sec

  • Drain time ≈ 120,000,000 / 6,000 ≈ 20,000 seconds ≈ 5.5 hours
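
The backlog drain sum, sketched the same way with the example figures:

```python
# How long to drain a backlog at a fixed sustained consumer throughput.
backlog_events = 120_000_000
workers = 12
events_per_worker_per_sec = 500

throughput = workers * events_per_worker_per_sec   # 6,000 events/sec
drain_seconds = backlog_events / throughput        # 20,000 seconds

print(f"drain time ≈ {drain_seconds:,.0f} s ≈ {drain_seconds / 3600:.1f} hours")  # roughly 5.5 hours
```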

If you can do this calmly in interviews, you look operationally ready.

2) Statistics for data quality, monitoring & “is this normal”

Most data engineering failures are not “the pipeline is down”. They are “the data is wrong”, “late”, “duplicated” or “quietly drifting”. That is stats.

What you actually need

  • Mean, median, variance, standard deviation

  • Percentiles (p50, p95, p99) for latency & skewed distributions

  • Outliers & heavy tails

  • Seasonality & baselines (daily/weekly patterns)

  • Simple control-chart style thinking: what counts as unusual

Where it shows up

  • Freshness, completeness & volume monitoring

  • Anomaly detection for row counts, null rates, duplicate rates

  • Latency monitoring for pipelines & SLAs

  • Deciding whether a metric change is a business signal or a data issue

A practical way to think about it

If a metric has natural variability, you do not alert on “any change”. You alert on a change that is statistically unusual compared to its baseline.

This is why data teams often define thresholds using standard deviation bands or percentile-based rules rather than fixed numbers.
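
As a hedged illustration, here is a minimal sketch of standard-deviation-band alerting on daily row counts; the 3-standard-deviation threshold & the numbers are arbitrary example choices, not recommendations from any particular tool:

```python
import statistics

# Example: daily row counts for a table, with one suspicious day at the end.
row_counts = [98_000, 101_500, 99_200, 102_300, 100_800, 97_600, 100_100, 62_000]
baseline, latest = row_counts[:-1], row_counts[-1]

mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)

# Control-chart style rule: alert if the latest value sits outside mean ± 3 stdev.
if abs(latest - mean) > 3 * stdev:
    print(f"ALERT: {latest:,} is unusual vs baseline mean {mean:,.0f} (stdev {stdev:,.0f})")
else:
    print("within normal variation")
```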

3) Probability for sampling, approximate answers & streaming reality

Probability matters in data engineering because an exact answer is sometimes expensive, slow or impossible in real time. You often use approximations intentionally, but you must understand the error trade-off.

HyperLogLog++ for approximate distinct counts

Many warehouses & platforms provide approximate distinct counts using sketching algorithms. BigQuery documents HyperLogLog++ as an algorithm that estimates cardinality from sketches & notes that approximate aggregation typically uses less memory than exact COUNT(DISTINCT) but introduces statistical error (Google Cloud documentation).

That is not “cheating”. It is the engineering choice when linear memory usage is impractical, or when the data is already approximate (Google Cloud documentation).
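
To make the trade-off concrete, here is a toy, self-contained sketch of the HyperLogLog idea in Python. It is illustrative only, not BigQuery's HLL++ implementation; the register count (p=14) & the hash choice are arbitrary:

```python
import hashlib
import math

def approx_distinct(items, p=14):
    """Toy HyperLogLog-style estimator (illustrative, not BigQuery's HLL++)."""
    m = 1 << p                          # number of registers
    registers = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - p)             # top p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)
        rho = (64 - p) - rest.bit_length() + 1   # position of leftmost 1-bit
        registers[idx] = max(registers[idx], rho)
    alpha = 0.7213 / (1 + 1.079 / m)
    estimate = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if estimate <= 2.5 * m and zeros:   # small-range correction: linear counting
        estimate = m * math.log(m / zeros)
    return estimate

ids = [f"user-{i % 250_000}" for i in range(1_000_000)]
print(f"exact = {len(set(ids)):,}   approx ≈ {approx_distinct(ids):,.0f}")
```

With 2^14 registers the estimate typically lands within about 1% of the exact count, while the sketch itself is a few kilobytes regardless of how many rows you feed it. That is the cost-accuracy trade-off the interview answer should name.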

How to talk about it in interviews

  • What you gain: speed & cost

  • What you accept: a quantifiable error bound

  • When it is acceptable: dashboards, large-scale trends, operational monitoring

  • When it is not: finance reporting, audits, billing, compliance-critical metrics

Streaming windows, watermarks & late data

Streaming pipelines force you into probabilistic thinking because data arrives late, out of order or never. Google Cloud Dataflow explicitly describes using windows, watermarks & triggers to aggregate elements in unbounded collections (Google Cloud documentation). Apache Beam explains that a watermark is a guess about when all data in a window is expected to have arrived, because data does not always arrive in time order or at predictable intervals (beam.apache.org).

That “guess” is not a flaw. It is how real streaming systems work. Your job is choosing the windowing strategy & allowed lateness that balance accuracy against timeliness.
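
A minimal sketch of that logic in plain Python (no Beam or Dataflow APIs): events carry an event time, a window closes once the watermark passes the window end plus allowed lateness, & anything arriving later is treated as too late. The window size, lateness & events are invented for the example:

```python
from collections import defaultdict

WINDOW_SECONDS = 60        # tumbling (fixed) windows of one minute
ALLOWED_LATENESS = 30      # seconds of lateness we still accept into a window

# (event_time, arrival_time) pairs in epoch seconds; the last one is very late.
events = [(10, 12), (35, 40), (70, 71), (55, 95), (20, 200)]

windows = defaultdict(list)
dropped = []

for event_time, arrival_time in events:
    window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
    window_close = window_start + WINDOW_SECONDS + ALLOWED_LATENESS
    # Crude watermark: assume it tracks arrival time exactly (real systems estimate it).
    watermark = arrival_time
    if watermark > window_close:
        dropped.append((event_time, arrival_time))   # too late, window already finalised
    else:
        windows[window_start].append(event_time)

print("window counts:", {w: len(v) for w, v in sorted(windows.items())})
print("dropped as too late:", dropped)
```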

4) Discrete maths for DAGs, partitioning & distributed systems

You do not need pure maths. You do need discrete “systems maths” because data engineering is distributed.

Graphs & DAGs

Orchestration tools represent pipelines as Directed Acyclic Graphs. Airflow’s core concepts describe tasks arranged into DAGs with dependencies, & the scheduler monitors tasks & DAGs, then triggers task instances once their dependencies are complete (Apache Airflow documentation).

Graph thinking helps you:

  • reason about critical paths

  • spot bottlenecks

  • design idempotent reruns

  • control failure blast radius
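
A small sketch of critical-path reasoning over a DAG; the task names & durations are hypothetical, & a real orchestrator gives you the graph, but the longest-path logic is the same:

```python
from functools import lru_cache

# Hypothetical pipeline: task -> (duration in minutes, upstream dependencies).
dag = {
    "extract_orders":    (5,  []),
    "extract_users":     (3,  []),
    "clean_orders":      (10, ["extract_orders"]),
    "clean_users":       (4,  ["extract_users"]),
    "join_orders_users": (8,  ["clean_orders", "clean_users"]),
    "build_report":      (2,  ["join_orders_users"]),
}

@lru_cache(maxsize=None)
def finish_time(task):
    """Earliest finish time = own duration + slowest upstream finish."""
    duration, deps = dag[task]
    return duration + max((finish_time(d) for d in deps), default=0)

critical_end = max(dag, key=finish_time)
print(f"pipeline takes {finish_time(critical_end)} min; critical path ends at {critical_end}")
```

In this made-up example the critical path runs extract_orders → clean_orders → join_orders_users → build_report, so speeding up clean_users changes nothing; that is the kind of reasoning graph thinking buys you.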

Partitioning as a scalability primitive

Streaming platforms use partitioning to scale. Kafka’s documentation explains that topics are partitioned, spreading a topic across multiple “buckets” on different brokers, which enables distributed placement of data (kafka.apache.org).

In batch engines, partitioning affects shuffles, joins & execution time. Spark SQL performance tuning docs describe optimisations that use existing storage layout to avoid shuffles, such as Storage Partition Join, under certain conditions (spark.apache.org).

Hashing basics

Hashing shows up in:

  • partition assignment

  • deduplication keys

  • consistent identifiers

  • probabilistic data structures (Bloom filters, sketches)

You do not need proofs. You need the intuition that a good hash spreads values evenly, making workloads more balanced.
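
A quick sketch of the “good hash spreads values evenly” intuition: hash made-up user IDs onto a fixed number of partitions & look at the resulting sizes. The partition count & key format are invented for the example:

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 8

def partition_for(key: str) -> int:
    """Hash the key, then map it onto a partition (conceptual hash partitioning)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

keys = [f"user-{i}" for i in range(100_000)]
sizes = Counter(partition_for(k) for k in keys)
print("rows per partition:", dict(sorted(sizes.items())))
# A well-behaved hash gives partition sizes close to 100_000 / 8 = 12_500 each.
```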

5) Optimisation intuition for SQL plans & Spark jobs

Data engineering interviews often include questions like:

  • why is this query slow?

  • what would you change to reduce cost?

  • how would you optimise a join?

  • how do you debug skew?

This is optimisation thinking, not calculus.

What you actually need

  • Cost models & query planners: what “cost” means conceptually

  • Join strategy intuition: broadcast vs shuffle joins, skew risk

  • Partitioning strategy: partition keys, clustering, bucketing as concepts

  • The habit of measuring: explain plans, job UI, stage metrics

The PostgreSQL documentation explains that EXPLAIN shows the query plan, including estimated execution cost, split into start-up vs total cost. Spark’s SQL performance tuning documentation covers how certain optimisations can avoid shuffles depending on storage layout & partitioning (spark.apache.org).

The interview skill is not memorising tricks. It is being able to say: “Here is the bottleneck. Here is what I would measure. Here is what I would change.”
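
A deliberately crude sketch of that reasoning, not a real cost model: compare roughly how much data each join strategy moves across the network, with table sizes & executor count invented for illustration:

```python
# Very rough data-movement model for two join strategies (illustrative only).
large_table_gb = 500
small_table_gb = 2
executors = 40

# Broadcast join: ship the small table to every executor; the large table stays put.
broadcast_moved_gb = small_table_gb * executors        # 80 GB

# Shuffle join: both sides are repartitioned by the join key across the network.
shuffle_moved_gb = large_table_gb + small_table_gb     # 502 GB

print(f"broadcast ≈ {broadcast_moved_gb} GB moved, shuffle ≈ {shuffle_moved_gb} GB moved")
# The cheaper plan flips as the "small" side grows or executor memory runs out,
# which is why measuring (EXPLAIN, the Spark UI) beats guessing.
```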

A 6-week maths plan for data engineering jobs

This plan assumes 4–5 sessions per week of 30–60 minutes. Each week produces one publishable output.

Week 1: Units, rates & pipeline estimation

Learn

  • unit conversions, throughput modelling, backlog drain maths

Build

  • a simple “pipeline calculator” notebook: events/sec → GB/day → retention storage

Output

  • GitHub repo with examples & assumptions

Week 2: Distributions & monitoring baselines

Learn

  • percentiles, variance, outliers, seasonality

Build

  • a notebook that takes a time series of row counts or latency then flags anomalies using simple baseline rules

Output

  • a short report explaining why you chose that alert logic

Week 3: Data quality as statistics

Learn

  • null rates, duplicate rates, referential integrity rates as measurable signals

Build

  • implement data tests in dbt: dbt’s docs describe data tests as SQL queries that select “failing” records & pass when zero failing rows are returned (docs.getdbt.com)

Output

  • a dbt project with tests plus a README explaining what each test protects
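
The underlying pattern — a test is a query that selects failing rows & passes when none come back — can be sketched outside dbt with plain SQLite; the table & test names here are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10, 25.0), (2, NULL, 9.5), (3, 11, -4.0);
""")

# Each "test" is a SELECT of failing records; zero rows back means the test passes.
tests = {
    "customer_id_not_null": "SELECT * FROM orders WHERE customer_id IS NULL",
    "amount_non_negative":  "SELECT * FROM orders WHERE amount < 0",
}

for name, sql in tests.items():
    failing = conn.execute(sql).fetchall()
    status = "PASS" if not failing else f"FAIL ({len(failing)} failing rows)"
    print(f"{name}: {status}")
```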

Week 4: Freshness & SLAs

Learn

  • freshness as a measurable promise: “latest loaded_at” vs “now”

Build

  • configure dbt source freshness: dbt documents freshness blocks with warn_after & error_after to define the acceptable time since the most recent record (docs.getdbt.com)

Output

  • a repo that shows freshness configuration plus a sample alerting approach
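
The same warn/error idea sketched as plain Python rather than dbt’s configuration; the thresholds & the loaded_at timestamp are invented for the example:

```python
from datetime import datetime, timedelta, timezone

WARN_AFTER = timedelta(hours=12)     # analogous to dbt's warn_after
ERROR_AFTER = timedelta(hours=24)    # analogous to dbt's error_after

# Pretend this came from something like SELECT MAX(loaded_at) FROM the source table.
latest_loaded_at = datetime.now(timezone.utc) - timedelta(hours=15)

age = datetime.now(timezone.utc) - latest_loaded_at
if age > ERROR_AFTER:
    status = "error: source is stale"
elif age > WARN_AFTER:
    status = "warn: source is getting stale"
else:
    status = "fresh"

print(f"latest record is {age} old -> {status}")
```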

Week 5: Probability & approximate analytics

Learn

  • why sampling & sketches exist, how to talk about error vs cost

Build

  • implement approximate distinct counting with HLL++ in BigQuery or reproduce the idea with a sketch library; BigQuery’s HLL++ functions are documented as approximate aggregate functions that introduce statistical error while reducing memory compared to exact COUNT(DISTINCT) (Google Cloud documentation)

Output

  • a notebook comparing exact vs approximate: time, cost & accuracy story

Week 6: Capstone optimisation & systems thinking

Learn

  • query planning, shuffles, skew, partition strategy, DAG bottlenecks

Build

  • pick one slow query or Spark job scenario & write an optimisation note; use EXPLAIN concepts for SQL plan reasoning (PostgreSQL documentation) & Spark SQL performance tuning guidance for shuffle-related thinking (spark.apache.org)

Output

  • a portfolio “investigation report” with a before/after plan

Portfolio projects that prove the maths on your CV

Project 1: Data quality testing suite

What you build

  • a dbt project with tests for not_null, unique & relationships, plus a couple of custom business-rule tests; dbt’s docs explain tests as assertions expressed as SQL select statements that return failing records (docs.getdbt.com)

Why it matters

  • UK employers love evidence you can prevent silent failures

Project 2: Freshness dashboard & SLA rules

What you build

  • dbt source freshness rules, a “freshness report” output & a simple alerting policy; dbt explicitly frames freshness snapshots as part of meeting SLAs (docs.getdbt.com)

Why it matters

  • “data on time” is often more valuable than “data perfect”

Project 3: Approximate distinct counts at scale

What you build

  • compare exact distinct vs HLL++ approximate distinct at different scales; BigQuery documents HLL++ as sketch-based cardinality estimation with a statistical error trade-off (Google Cloud documentation)

Why it matters

  • shows you understand cost-performance-accuracy trade-offs

Project 4: Airflow DAG with critical path analysis

What you build

  • an Airflow DAG with 6–10 tasks, clear dependencies & retries; Airflow docs define tasks arranged into DAGs with dependencies (Apache Airflow documentation)

Why it matters

  • shows you can reason about pipeline structure & failure handling

Project 5: Spark optimisation write-up

What you build

  • a small Spark job where you change partitioning or join strategy, then explain the impact; Spark’s SQL performance tuning docs describe how storage layout & partitioning can reduce shuffle under certain optimisations (spark.apache.org)

Why it matters

  • performance thinking is a common UK interview differentiator

How to write this on your CV

Instead of “strong maths” or “analytical”, use outcome statements:

  • Built dbt data tests as SQL assertions that select failing records, preventing regressions & enforcing quality gates (docs.getdbt.com)

  • Implemented source freshness SLAs using warn_after & error_after thresholds to detect stale upstream data (docs.getdbt.com)

  • Reduced distinct-count cost by using sketch-based approximate aggregation with quantified error trade-offs (Google Cloud documentation)

  • Diagnosed slow queries using EXPLAIN plan cost interpretation plus targeted indexing or rewrite decisions (PostgreSQL documentation)

  • Optimised Spark SQL workloads by reducing shuffle pressure through partition-aware strategies guided by Spark tuning concepts (spark.apache.org)

Resources

Orchestration & DAG fundamentals

  • Apache Airflow core concepts: tasks, DAGs, dependencies & the scheduler (Apache Airflow documentation)

Data quality testing

  • dbt data tests: SQL queries that select failing records & pass when zero rows return (docs.getdbt.com)

Freshness & SLAs

  • dbt source freshness: warn_after & error_after thresholds on time since the most recent record (docs.getdbt.com)

Streaming & partitioning

  • Kafka topic partitioning: topics spread across “buckets” on different brokers (kafka.apache.org)

Approximate analytics

  • BigQuery HyperLogLog++ functions: sketch-based approximate aggregate functions with a statistical error trade-off (Google Cloud documentation)

Query plans & optimisation

  • PostgreSQL EXPLAIN: query plans with estimated start-up & total cost (PostgreSQL documentation)

  • Spark SQL performance tuning: shuffle- & partition-aware optimisations (spark.apache.org)

Streaming windowing concepts

  • Dataflow windows, watermarks & triggers for unbounded collections (Google Cloud documentation); Beam’s watermark concept (beam.apache.org)
