Creating a custom aggregate in PostgreSQL
Why use an aggregate instead of returning all the data?
Sometimes you do not need all the data brought from the database into your normal programming language. Using normal aggregate functions such as SUM
, COUNT
or AVERAGE
you may save your database server, network and application server massive amounts of network activity, a lot of time and potentially yourself a lot of work.
But there is no function included for what I want!
While PostgreSQL has many different types of aggregate functions included, sometimes what you need is a little too custom for it to be included out of the box. For this there is the ability to create your own custom aggregates using a relatively easy to understand API.
The scenario
Suppose we want to find out how long (in time) a particular race is. We could just take the fastest time, but there may be an exceptional competitor far faster than everyone else. We could also take an average of all the finishers, but that may include some really slow competitors. Another option is to take the top n
competitors and find the average time. This is the method we will describe for this tutorial.
How to define a custom aggregate
A custom aggregate can be as easy as filling in the gaps of the following:
create aggregate race_time(comp_cnt smallint, fin_time bigint) (
stype = ...
initcond = ...
sfunc = ...
finalfunc = ...
);
We have a few missing elements here, but don't worry, I will explain what they are and then we will fill them in.
If we were writing this in JavaScript...
Using a reduce
function would get us a long way towards finding the answer. Reduce in Javascript has the following signature:
arr.reduce(callback[, initialValue])
And the callback
has the following signature:
callback(accumulator, currentValue)
Therefore if you wanted to get the average of the three highest numbers you may write the following
function get_race_time(comp_count) {
function callback(accumulator, currentValue) {
if (accumulator.length < comp_count) {
return accumulator.concat([currentValue]);
}
const r = accumulator.concat([currentValue]);
const min = accumulator.reduce((a, b) => Math.max(a, b), -1);
r.splice(r.indexOf(min), 1);
return r;
};
function finalizer(accumulator) { // imperative for the purpose
let result = 0; // of blog text readability.
for (let i = 0; i < accumulator.length; i++) {
result = result + accumulator[i];
}
return result / accumulator.length;
}
return function race_time_implementation(number_array) {
const initialValue = [];
let reduceResult = number_array.reduce(callback, initialValue);
return finalizer(reduceResult);
};
}
const race_time = get_race_time(3);
And get the answer 2
.
Converting the JavaScript back to SQL
stype
The stype
is the type of the accumulator
, initialValue
and reduceResult
. This would be called an "Array of Number" in JavaScript, number[]
in TypeScript, but in SQL it would be bigint[]
.
initcond
The initcond
is the value of initialValue
from the JavaScript code. However initcond
is weird as you must express it as a varchar
, as if it would be typecast into the stype
. If you were to run a select '{1,2,3}'::bigint[]
query you would select an array of bigint with numbers 1
, 2
and 3
within, so '{}'
is the correct value.
sfunc
We are now making progress because the sfunc
is the callback
from the reduce function. If we were only intending our aggregate to take one argument it would actually have an identical signature.
However did you notice that callback
really has three parameters, accumulator
, currentValue
as an actual parameters and comp_count
from the wrapping function. As callback
needs all these arguments, so does our aggregate (sfunc
), as below.
create function race_time_sfunc(acc bigint[], comp_count smallint, fin_time bigint)
returns bigint[] as
$$
with t (f) as (
select unnest(array_append(acc, fin_time))
order by 1 asc
limit comp_count
)
select array_agg(f) from t
$$
language sql immutable;
In this function we're creating a common table expression with the lowest n
values stored within the acc
array with row_time
appended using the unnest function.
After this we just use the array_agg function to re-transform that common table expression back into an array.
finalfunc
Lastly we have the finalfunc
which, in our JavaScript takes the result of the sfunc
(bigint[]
) and averages the values within. This is exactly the same as the finalizer()
function in the JavaScript implementation.
create function race_time_finalfunc(acc bigint[])
returns numeric as
$$
with t (f) as (select unnest(acc))
select avg(f) from t
$$
language sql immutable;
Tying it all together
We have now discussed, and hopefully understood all the component parts to create an aggregate so we should be able to substitute the values in. Once you have created the sfunc
and finalfunc
functions you should be able to run the following to create the aggregate.
create aggregate race_time(comp_count smallint, fin_time bigint) (
stype = bigint[],
initcond = '{}',
sfunc = race_time_sfunc,
finalfunc = race_time_finalfunc
);
Does it work?
create temporary table test_table (val int);
insert into test_table(val) values (2), (3), (5), (3), (4), (1);
select race_time(3::smallint, val) from test_table
Gives the answer 2
, so appears to.
I actually wrote tests while preparing this blog post. When I have a write up complete I will share them too.
Tags: postgresql