Matt at Keyboard Writes Code

A blog about software development, technology and other things that interest me

Hacky tools for PostgreSQL

June 14, 2019 — Matt Forrester

Hacky tools for PostgreSQL that makes interacting / extracting / analysis of data in PostgreSQL easier.

screenshot

NOTE: Examples are from a PostgreSQL version of ergast.

psql-out

Sometimes I find I have requests like the following:

Run this query and give me the results [in an Excel file so I can look at them in Excel].

Also sometimes I want to:

Run a complicated query and get the results as ndjson so I can process them in JavaScript.

But sometimes I'm:

Looking at data trying to figure out if a hypothesis makes sense and want to throw things into graphs very, very quickly.

It's not cool, it's not pretty, but this is what psql-out is for!

If you've got your environmental variables set up (see pgpass-env lower down) you can do something like the following:

echo 'select name, position from competitors' | psql-out

And get a well formed CSV file out into STDOUT.

To write it to a file you just need to add > your_file.csv to the end to pipe STDOUT to the desired file:

echo 'select name, position from competitors' | psql-out > your_file.csv

If you want it as an ndjson file you can run

echo 'select name, position from competitors' | psql-out -f ndjson

To get an ndjson file out. You can also get a TSV (Tab seperated CSV) out by passing -t tsv.

termgraph-runner

When you've got your nice CSV using the tools from above and you want to see if the data looks correct quickly one of the best ways I know to do this is to draw a quick graph. You could fire up Excel or LibreOffice and pointy-clicky to get your graph. This is a really bad solution if you're still finding out whether your data is correct because that feedback loop of command-line -> csv -> spreadsheet > graph is pretty long. What if you could quickly draw graphs right in your terminal... You can using termgraph-runner (which is backed by the awesome termgraph.


echo '
    select
        code,
        sum(CASE WHEN position = 1 THEN 1 ELSE 0 END) as win,
        sum(CASE WHEN position <= 3 THEN 1 ELSE 0 END) as podium
    from results
    natural join drivers
    group by code
    order by 2 desc limit 5' | psql-out -f csv | termgraph-runner --stacked

pgpass-env

I found that I have files that look like the following in the root of most of my source code repositories (and in my .gitignore of course):

export PGHOST="127.0.0.1"
export PGDATABASE="my_product"
export PGUSER="my_product"
export PGPASSWORD="a_good_password"
export PGPORT="5432"
export LISTEN_PORT="4040"

However for my database administration GUI I also have a ~/.pgpass file in my home directory with the following:

127.0.0.1:5432:my_product:my_product:a_good_password

This is a duplication of data and is kinda ridiculous.

Enter pgpass-env which is a simple bash script that converts the former into the latter

The idea is to store a name above the lines in the ~/.pgpass like the following

# local_my_product
127.0.0.1:5432:my_product:my_product:a_good_password
# local_another_product
127.0.0.1:5432:another_product:another_product:a_good_password

Running just pgpass-env it gives you a list of possible options:

$ pgpass-env
Pass one of the following
    * local_my_product
    * local_another_product

But if you would pass the name of a connection it would output:

$ . pgpass-env local_my_product
> postgres://my_product@127.0.0.1:5432/my_product

While at the same time performing the required EXPORT PGUSER=my_product etc. Using the preceding . means those environmental variables will be brought into the current environment, which is probably what you want.

It also adds adds $DATABASE_URL which I use in vim-dadbod but I also understand is used by Heroku.

NOTE: Look at the BASH source code, pgpass-env is quick, simple code to get the job done, not perfect code. You can see that the .pgpass fields are separated by : but I have put no thought in how to escape a : should one be included in a password. If your password includes a : it'll probably break.

Installation

Installation is simple with some BASH tomfoolery:

mkdir -p ~/.local/bin && find . -maxdepth 1 -type f -executable | parallel ln -s "$PWD/{/}" ~/.local/bin

Other interesting tools I've found to do portions of this...

I've not found anything that I can use to draw pie charts simply in the terminal - ideas welcome.

Software
  • graph-cli Can accept input from STDIN and will pop up your graph in a window.
  • Veusz Is a desktop app which looks like a mix of Tableau and a DTP application. It's file format is plain text and it even has a Python API. It should be possible to wrap this and spit out a great image file.
  • feedgnuplot looks like the thing I'd use if I wanted to draw a line graph. Because it's based on gnuplot it can draw your line graph right in the terminal or can popup a window.
Libraries
  • ggplot is an R library for drawing graphs, could probably whip something up again to output an image file.
  • vega and vega-lite are by far the best JavaScript graphing libraries I've ever come across. You can specify a graph using just json and they have a CLI interface that can spit out png's etc.

You can find this project at GitHub.

Tags: linux, unix, cli, command-line, postgresql

Code In PostgreSQL: Combining data from multiple tables with INNER JOIN

May 15, 2019 — Matt Forrester

This series of articles

This is the third of the Code in PostgreSQL series of articles.

Articles in this series
The reason why SQL is so important

When developing systems we often have a choice of writing code (NodeJS, C#, Python or PHP etc) or SQL. I believe that sometimes the decision to write code is taken without fully evaluating how much of the task could be offloaded to SQL.

In this series of articles I wish to show the huge benefits of using and learning SQL by examining progressively more difficult scenarios with increasing amounts of SQL knowledge. In doing this I hope to illustrate that sometimes large amounts of extra code is written for what SQL can achieve quicker, with less complexity and more readability.

To be more specific, we should try to follow the rule of least power more often.

About the Ergast data set

For this series of articles we will be using the Ergast data set, which is a provided under the Attribution-NonCommercial-ShareAlike 3.0 Unported Licence.

Setting up the Ergast database within PostgreSQL

To set up the Ergast database within PostgreSQL I did the following:

I allowed psql and friends to work without me having to put in a password all the time by configuring the PostgreSQL environmental variables.

export PGUSER=postgres PGPASSWORD=postgres PGDATABASE=postgres PGHOST=127.0.0.1

Then import the Ergast database. NOTE: At the time of writing I was unable to install the PostgreSQL version.

wget -O /tmp/f1db_ansi.sql.gz http://ergast.com/downloads/f1db_ansi.sql.gz
cat /tmp/f1db_ansi.sql.gz | \
    gzip -d | \
    sed 's/int(..)/int/' | \
    sed 's/ \+AUTO_INCREMENT//' | \
    sed "s/\\\'/\'\'/g" | \
    sed 's/UNIQUE KEY \"\(\w\+\)\"/UNIQUE /' | \
    sed 's/^ *KEY .*(\"\(.*\)\")/CHECK ("\1" > 0)/' | \
    sed 's/ date NOT NULL DEFAULT .0000.*,/ date,/'| psql

The Aim

At the end of the last article we had the following dataset:

pointsdriverIdyear
36312017
317202017
3058222017
20582017
2008172017
1688302017
1008152017
878392017
548322017
43132017
438072017
408402017
281542017
198252017
1742017
138382017
88352017
58262017
58362017
0182017
08142017
08282017
08412017
08422017
08432017

I would like to augment this dataset with the names of the drivers, so the results would look something like the following:

pointsdriverIdforenamesurnameyear
3631forenamesurname2017

Where forename and surname have the real values in.

Table Structure

content-assets/2019-05-15-code-in-postgresql-inner-join/erd.svg

Table Data

races
raceIdyearroundcircuitIdnamedatetimeurl
971201733Bahrain Grand Prix2017-04-1615:00:00https://en.wikipedia.org/wiki/2017_Bahrain_Grand_Prix
9772017970Austrian Grand Prix2017-07-0912:00:00https://en.wikipedia.org/wiki/2017_Austrian_Grand_Prix
driverStandings
driverStandingsIdraceIddriverIdpointspositionpositionTextwins
647958561196552
64810856367770
drivers
driverIddriverRefnumbercodeforenamesurnamedobnationalityurl
2heidfeldHEINickHeidfeld1977-05-10Germanhttp://en.wikipedia.org/wiki/Nick_Heidfeld
4alonso14ALOFernandoAlonso1981-07-29Spanishhttp://en.wikipedia.org/wiki/Fernando_Alonso

Implementing the JavaScript

If we think about what code we had in the previous article there are two peices of functionality we're missing. These are:

  1. The ability to find a row in the drivers table that matches a row in our current result set.
  2. The ability to mix/join this row from drivers with our current results.

I'm going to aim to write code that is highly reusable and also still performs well on very large data sets.

Finding drivers efficiently

The obvious answer to finding a driver from a list of drivers would be to use Array.find()... something lie the following?

const assert = require("assert");


const drivers = [
    { driverId: 2, forename: "Lewis", surname: "Hamilton" },
    { driverId: 14, forename: "Fernando", surname: "Alonso" }
];


/**
 * Find one `row` within rows that has `value` within the specified `column`.
 *
 * @param column string The property within the rows to look within.
 * @param value number|string The value that column (above) should be.
 * @param rows Row[] An array of objects to represent rows.
 * @return Row
 */
function arrayFind(column, value, rows) {
    return rows.find((row) => {
        return row[column] == value;
    });
}


assert.equal(arrayFind("driverId", 14, drivers).forename, "Fernando");


module.exports = arrayFind;

This is certainly a reusable piece of code and was easy to write and hopefully for you to understand.

I can see two problems here though.

The first problem is performance. When we need to look up a driverId we need to scan all the rows in drivers up until the point we find the correct one. We will be doing this for all of the (hypothetically millions of) driverId we want to look up. So I'm pretty sure the performance characteristics of this is not great.

The other short coming I can see is that it will only ever retreive one row. This is often what we want to acheive, but not always. An example of when this is not enough is when you have one customerId and you want to find / match / join it to all orders in another table.

The following would perform much better and allow returning multiple rows:

sql-spitting-image/_indexBySimple.js
const assert = require("assert");


/**
 * Given an array of Row, index them using a specific column so you can find a
 * Row quickly without having to `.find()` it.
 *
 * @param columnName keyof Row
 * @param rows Row[]
 * @return Map<Row[columnName],Row>
 */
function indexBySimple(columnName, rows) {
    return rows.reduce((acc, row) => {
        if (!row.hasOwnProperty(columnName)) { return acc; }

        const k = row[columnName];
        if (!acc.has(k)) {
            acc.set(k, []);
        }
        acc.get(k).push(row);

        return acc;
    }, new Map());
}


/**
 * Given an index, find all rows that have the value
 *
 * @param index Map<Row[columnName], Row>
 * @param value Row[columnName]
 * @return Row[]
 */
function findByIndex(value, index) {
    if (!index.has(value)) { return []; }
    return index.get(value);
}

const index = indexBySimple(
    "driverId",
    [
        { driverId: 2, forename: "Lewis", surname: "Hamilton" },
        { driverId: 14, forename: "Fernando", surname: "Alonso" }
    ]
);

assert.equal(
    findByIndex(14, index)[0].forename,
    "Fernando"
);


module.exports = { indexBySimple, findByIndex };

The indexBySimple function can scan through the whole set of drivers and fill up a Map with the key being driverId and the values are the actual rows with have that driverId. Once we have this Map looking up drivers by driverId will become very cheap.

Mixing a drivers record with our current results

Combining an Object of one type (driverRow) with another (currentResults) is really easy in ES6 because you can simply destruct the objects to create new one like the following

    const newObject = {...currentResults, ...driverRow};

Building the libraries

sql-spitting-image/innerJoinSimple.js

Because of all our planning the innerJoinSimple library has become really quite simple.

const { indexBySimple } = require('./_indexBySimple');

// interface LeftRow extends Row {
//     // Here there may be fields
// }
// interface RightRow extends Row {
//     // Here there may be fields
// }

/**
 * For every leftRow, combine it with as many as possible rightRow.
 *
 * @param leftRows LeftRow[]
 * @param joinColumns [keyof LeftRow, keyof RightRow] The fields to join
 * @param rightRows RightRow[]
 * @return Row[]
 */
function innerJoinSimple(leftRows, joinColumns, rightRows) {

    const [leftColumn, rightColumn] = joinColumns;

    /**
     * Join leftRow to all found foundRightRows
     *
     * @param leftRow LeftRow
     * @param foundRightRows RightRow[]
     */
    function joinRows(leftRow, foundRightRows) {
        return foundRightRows.map(rightRow => {
            return {...rightRow, ...leftRow};
        });
    }


    const rightRowIndex = indexBySimple(rightColumn, rightRows);

    let results = [];
    for (const leftRow of leftRows) {
        if (rightRowIndex.has(leftRow[leftColumn])) {
            results = results.concat(
                joinRows(
                    leftRow,
                    rightRowIndex.get(leftRow[leftColumn])
                )
            );
        }
    }

    return results;
}

module.exports = innerJoinSimple;

Reading through it you can see that the first thing it does is build an index for the right set of data.

After this it will read through all of the left set of data, checking if it can be joined to the right, if it can it will be.

Libraries

sql-spitting-image/select.js
function select(spec) {

    function mapper(row) {
        let r = {};
        spec.forEach(([theFrom, theTo]) => {
            r[theTo] = row[theFrom]
        });
        return r;
    }

    return function selectImpl(rows) {
        return rows.map(mapper);
    };
}


module.exports = select;
sql-spitting-image/orderBy.js
/**
 * Creates a `sortFn` for the JavaScript [Array.prototype.sort](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/sort) function.
 *
 * @param columnName string The name of the column to sort by.
 * @param direction string If this is 'ASC' sort ascending, otherwise descending.
 * @return (a: number, b: number) => number The `sortFn`.
 */
function getSingleCompareFunction(columnName, direction) {

    const flipper = direction.toLowerCase() == 'asc' ? 1 : -1;

    return function singleCompareFunction(rowA, rowB) {
        return (rowA[columnName] - rowB[columnName]) * flipper;
    }
}


/**
 * Orders a set of rows
 *
 * @param columnName string
 * @param direction string ( 'ASC' || 'DESC' )
 * @param rows Row[]
 * @return Row[]
 */
function orderBy(columnName, direction='ASC') {

    const compareFunction = getSingleCompareFunction(columnName, direction);

    return function(rows) {
        return rows.sort(compareFunction);
    };
}


orderBy.getSingleCompareFunction = getSingleCompareFunction;
module.exports = orderBy;
sql-spitting-image/orderByMulti.js
const { getSingleCompareFunction } = require('./orderBy');

/**
 * Gets a `sortFn` for [Array.prototype.sort](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/sort)
 * that will sort whole rows based on a list of `columnName` and `direction`
 * tuples.
 *
 * @param colDirectionTuples `[columnName: string, direction: string][]`
 * Ordering specification where `columnName` and `direction` are parameters from
 * `getSingleCompareFunction`.
 * @return (a: Row, b: Row) => number The `sortFn`.
 */
function orderByMulti(colDirectionTuples) {

    function compareFunction(rowA, rowB) {
        return colDirectionTuples.reduce((acc, [col, dir]) => {
            if (acc != 0) { return acc; }
            const cf = getSingleCompareFunction(col, dir);
            return cf(rowA, rowB);
        }, 0);
    }

    return function(rows) {
        return rows.sort(compareFunction);
    }

}

module.exports = orderByMulti;
sql-spitting-image/limit.js
/**
 * Gets the first n rows from a set of rows.
 *
 * @param n number
 * @return (rows: Rows[]) => Rows[]
 */
function limit(n) {
    return function(rows) {
        return rows.slice(0, n);
    }
}


module.exports = limit;
sql-spitting-image/qryTable.js
const { runQuery } = require('./../_utils');

/**
 * Gets results from a table
 *
 * @param table string The of a table to pull from.
 * @param column The colum to look at to decide when to include a row.
 * @param values number[] Retreive values where `column` is one of these values.
 * @return Promise<Row[]>
 */
function qryTable(table, column=null, values) {
    if (column === null) {
        return runQuery(`select * from "${table}"`);
    }
    if (values.length == 0) { return []; }
    const inClause = '(' + values.map((_, i) => '$' + (i + 1)).join(",") + ')';
    const sql = `
        select *
        from "${table}"
        where "${column}" in ${inClause}`;

    return runQuery(sql, values);
}


module.exports = qryTable;

Main Code

The last thing to do is glue all the code together. See below:

const { output } = require('./_utils');
const select = require('./sql-spitting-image/select');
const orderBy = require('./sql-spitting-image/orderBy');
const orderByMulti = require('./sql-spitting-image/orderByMulti');
const limit = require('./sql-spitting-image/limit');
const qryTable = require('./sql-spitting-image/qryTable');
const innerJoinSimple = require('./sql-spitting-image/innerJoinSimple');

function addStatic(data) {
    return function addStaticImpl(rows) {
        return rows.map(r => {
            return {...r, ...data};
        });
    };
}

qryTable('races', 'year', [2017])
    .then(orderBy('round', 'desc'))
    .then(limit(1))
    .then((races) => races.map(r => r.raceId))
    .then((raceIds) => {
        return Promise.all([
            qryTable('driverStandings', 'raceId', raceIds),
            qryTable('drivers') // might as well do in parallel!
        ]);
    })
    .then(([driverStandings, drivers]) => {
        return innerJoinSimple(
            driverStandings,
            ['driverId', 'driverId'],
            drivers
        );
    })
    .then(orderByMulti([['points', 'desc'], ['driverId', 'asc']]))
    .then(select([
        ['points', 'points'],
        ['driverId', 'driverId'],
        ['forename', 'forename'],
        ['surname', 'surname']
    ]))
    .then(addStatic({year: 2017}))
    .then(output)
    .catch(err => { console.log("ERROR:", err) });

Again we have a rather large amount of code, however I again think it is quite readable.

Even if you disagree and don't like this code I hope you will agree that this amount of code could easily be wrote quite badly.

Pro's
  • Broken down quite well into bite size peices.
  • A lot of this code is quite reusable, if you wish.
Con's
  • There's a lot of it.
  • We are again requesting more data than is required.

The SQL

WITH "lastRaceIn2017" as (
    SELECT "raceId" FROM races
    WHERE year = 2017
    ORDER BY "round" DESC
    LIMIT 1
)
SELECT
    "driverStandings".points,
    "driverStandings"."driverId",
    drivers.forename,
    drivers.surname,
    2017 as year
FROM "driverStandings"
INNER JOIN drivers ON drivers."driverId" = "driverStandings"."driverId"
WHERE "raceId" IN ( SELECT "raceId" FROM "lastRaceIn2017" )
ORDER BY
    "driverStandings".points DESC,
    "driverStandings"."driverId" ASC

Pro's

  • Shorter than the JavaScript.
  • If this were called by JavaScript we would need only one Promise, which is much easier to write and reason about.
  • The INNER JOIN relatively effortlessly mixes in data about drivers into what we had before.

Con's

  • There's not much here that's re-usable, other than the knowledge you've acquired.

Tags: code-in-postgresql, javascript, postgresql

Code In PostgreSQL: You can do lots with just IN, ORDER BY and LIMIT

March 03, 2019 — Matt Forrester

This series of articles

This is the first of the Code in PostgreSQL series of articles.

Articles in this series
The reason why SQL is so important

When developing systems we often have a choice of writing code (NodeJS, C#, Python or PHP etc) or SQL. I believe that sometimes the decision to write code is taken without fully evaluating how much of the task could be offloaded to SQL.

In this series of articles I wish to show the huge benefits of using and learning SQL by examining progressively more difficult scenarios with increasing amounts of SQL knowledge. In doing this I hope to illustrate that sometimes large amounts of extra code is written for what SQL can achieve quicker, with less complexity and more readability.

To be more specific, we should try to follow the rule of least power more often.

About the Ergast data set

For this series of articles we will be using the Ergast data set, which is a provided under the Attribution-NonCommercial-ShareAlike 3.0 Unported Licence.

Setting up the Ergast database within PostgreSQL

To set up the Ergast database within PostgreSQL I did the following:

I allowed psql and friends to work without me having to put in a password all the time by configuring the PostgreSQL environmental variables.

export PGUSER=postgres PGPASSWORD=postgres PGDATABASE=postgres PGHOST=127.0.0.1

Then import the Ergast database. NOTE: At the time of writing I was unable to install the PostgreSQL version.

wget -O /tmp/f1db_ansi.sql.gz http://ergast.com/downloads/f1db_ansi.sql.gz
cat /tmp/f1db_ansi.sql.gz | \
    gzip -d | \
    sed 's/int(..)/int/' | \
    sed 's/ \+AUTO_INCREMENT//' | \
    sed "s/\\\'/\'\'/g" | \
    sed 's/UNIQUE KEY \"\(\w\+\)\"/UNIQUE /' | \
    sed 's/^ *KEY .*(\"\(.*\)\")/CHECK ("\1" > 0)/' | \
    sed 's/ date NOT NULL DEFAULT .0000.*,/ date,/'| psql
Assumed level of SQL Knowledge

In this JavaScript example we will assume the writer has sufficient SQL knowledge to use a WHERE statement along with the ability to only return certain fields using SELECT. After this we will see how this can be accomplished in one single SQL statement using IN, ORDER BY and LIMIT.

The aim

Lets say you want to find out the final points for drivers in the 2017 Formula 1 World Championship. The schema for the tables we will be using is as follows:

content-assets/2019-03-03-code-in-postgresql-in-order-by-limit/erd.svg

An example of the data you would find in these tables is shown below:

races
raceIdyearroundcircuitIdnamedatetimeurl
969201711Australian Grand Prix2017-03-2605:00:00https://en.wikipedia.org/wiki/2017_Australian_Grand_Prix
9702017217Chinese Grand Prix2017-04-0906:00:00https://en.wikipedia.org/wiki/2017_Chinese_Grand_Prix
971201733Bahrain Grand Prix2017-04-1615:00:00https://en.wikipedia.org/wiki/2017_Bahrain_Grand_Prix
9722017471Russian Grand Prix2017-04-3012:00:00https://en.wikipedia.org/wiki/2017_Russian_Grand_Prix
973201754Spanish Grand Prix2017-05-1412:00:00https://en.wikipedia.org/wiki/2017_Spanish_Grand_Prix
driversStandings
driverStandingsIdraceIddriverIdpointspositionpositionTextwins
64782855363770
647958561196552
647978564212331
6480585623410100
64810856367770

The driverStandings table has the points for every driver in every race. The problem here is that there is no record in the driverStandings table for which season a raceId belongs to. So we need to get a bit creative... Here is one possible solution:

  1. Looking at the races table's year column, we can find all the raceId in 2017.
  2. If we can get all races in a given year we should be able to get the last race because the round will be the highest within that year.
  3. Find the points and driverId for the drivers who were in that raceId by reading the driverStandings table.
  4. Sort them by points descending.
  5. The very first row contains the driverId which has the most points in that season. This driverId is the world champion. The ones later on denote their final position (assuming the points differ).

Implementing the JavaScript

Libraries
sql-spitting-image/limit.js
/**
 * Gets the first n rows from a set of rows.
 *
 * @param n number
 * @return (rows: Rows[]) => Rows[]
 */
function limit(n) {
    return function(rows) {
        return rows.slice(0, n);
    }
}


module.exports = limit;
sql-spitting-image/select.js
function select(spec) {

    function mapper(row) {
        let r = {};
        spec.forEach(([theFrom, theTo]) => {
            r[theTo] = row[theFrom]
        });
        return r;
    }

    return function selectImpl(rows) {
        return rows.map(mapper);
    };
}


module.exports = select;
sql-spitting-image/orderBy.js
/**
 * Creates a `sortFn` for the JavaScript [Array.prototype.sort](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/sort) function.
 *
 * @param columnName string The name of the column to sort by.
 * @param direction string If this is 'ASC' sort ascending, otherwise descending.
 * @return (a: number, b: number) => number The `sortFn`.
 */
function getSingleCompareFunction(columnName, direction) {

    const flipper = direction.toLowerCase() == 'asc' ? 1 : -1;

    return function singleCompareFunction(rowA, rowB) {
        return (rowA[columnName] - rowB[columnName]) * flipper;
    }
}


/**
 * Orders a set of rows
 *
 * @param columnName string
 * @param direction string ( 'ASC' || 'DESC' )
 * @param rows Row[]
 * @return Row[]
 */
function orderBy(columnName, direction='ASC') {

    const compareFunction = getSingleCompareFunction(columnName, direction);

    return function(rows) {
        return rows.sort(compareFunction);
    };
}


orderBy.getSingleCompareFunction = getSingleCompareFunction;
module.exports = orderBy;
sql-spitting-image/orderByMulti.js
const { getSingleCompareFunction } = require('./orderBy');

/**
 * Gets a `sortFn` for [Array.prototype.sort](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/sort)
 * that will sort whole rows based on a list of `columnName` and `direction`
 * tuples.
 *
 * @param colDirectionTuples `[columnName: string, direction: string][]`
 * Ordering specification where `columnName` and `direction` are parameters from
 * `getSingleCompareFunction`.
 * @return (a: Row, b: Row) => number The `sortFn`.
 */
function orderByMulti(colDirectionTuples) {

    function compareFunction(rowA, rowB) {
        return colDirectionTuples.reduce((acc, [col, dir]) => {
            if (acc != 0) { return acc; }
            const cf = getSingleCompareFunction(col, dir);
            return cf(rowA, rowB);
        }, 0);
    }

    return function(rows) {
        return rows.sort(compareFunction);
    }

}

module.exports = orderByMulti;
Main Code
const { takeOne, runQuery, output } = require('./_utils');
const limit = require('./sql-spitting-image/limit');
const select = require('./sql-spitting-image/select');
const orderBy = require('./sql-spitting-image/orderBy');
const orderByMulti = require('./sql-spitting-image/orderByMulti');


/**
 * interface RaceResult { round: number; raceId: number; }
 * interface MainResult { points: number; driverId: number; year: number; }
 */


/**
 * Get data from the `results` table.
 *
 * @param year number
 * @return Promise<RaceResult[]>
 */
function qryRaces(year) {
    return runQuery('select "round", "raceId" from races where year = $1', [year]);
}


/**
 * Gets all driver standings at a given set of raceIds
 *
 * @param raceId number
 * @return Promise<MainResult[]>
 */
function qryStandings(raceId) {

    const sql = `
        select
        "driverStandings".points,
        "driverStandings"."driverId",
        2017 as year
        from "driverStandings"
        where "raceId" = $1
        `;
    return runQuery(sql, [raceId]);

}


qryRaces(2017)
    .then(orderBy('round', 'desc'))
    .then(limit(1))
    .then((rounds) => rounds.map(r => r.raceId))
    .then(takeOne)
    .then(qryStandings)
    .then(orderByMulti([['points', 'desc'], ['driverId', 'asc']]))
    .then(select([
        ["points", "points"],
        ["driverId", "driverId"],
        ["year" , "year"]
    ]))
    .then(output)
    .catch(err => {
        console.log("ERROR:", err);
        process.exit(1);
    });

This code, despite there being a lot of it is relatively straight forward. We get a list of raceId and round from the qryRaces function. Once we have this we will order by the round from largest to smallest and take the first one. This is the last race of the season.

After this we feed that raceId directly into the qryStandings functions to get the results from the last race. Finally we are forced to use a more complicated sorting function for stability, because some drivers have the same amount of points before presenting our desired columns.

Pro's
  • There's some nice re-usable functions here.
  • The main code is quite concise and easy to understand.
Con's
  • Longer than SQL
  • We downloaded more data than necessary, in this case it is not too bad but it could have been much worse.

SQL

SELECT
    "driverStandings".points,
    "driverStandings"."driverId",
    2017 as year
FROM "driverStandings"
WHERE "raceId" IN (
    SELECT "raceId" FROM races
    WHERE year = 2017
    ORDER BY "round" DESC
    LIMIT 1
)
ORDER BY
    "driverStandings".points DESC,
    "driverStandings"."driverId" ASC

Like the JavaScript we are using ordering (ORDER BY) and limiting (LIMIT) to get the highest raceId within 2017.

The IN clause can be used to match a column against a set of numbers. For example we may choose to write WHERE "raceId" IN (4, 5, 7) which would be the same thing as writing WHERE "raceId" = 4 OR "raceId" = 5 OR "raceId" = 7.

The smart thing here is that we are using a query to get the last raceId within the IN clause instead of a directly specified list of raceId.

Finally an ORDER BY statement is used to perform sorting of the final record set, you can sort by multiple fields or use many types of expressions.

Pro's
  • Shorter than the JavaScript.
  • If this were called by JavaScript we would need only one Promise, which is much easier to write and reason about.
  • The inside of the IN clause can be ran and understood individually.
Con's
  • Is the ORDER BY / LIMIT 1 a trick?
  • It seems in code you can give the contents of IN clause a name (raceIds) but this is not possible using SQL's IN, [or is it?]({% post_url 2019-03-12-code-in-postgresql-with %}).

Results

pointsdriverIdyear
36312017
317202017
3058222017
20582017
2008172017
1688302017
1008152017
878392017
548322017
43132017
438072017
408402017
281542017
198252017
1742017
138382017
88352017
58262017
58362017
0182017
08142017
08282017
08412017
08422017
08432017

Tags: code-in-postgresql, javascript, postgresql, postgresql

Code In PostgreSQL: You can do lots with just IN, ORDER BY and LIMIT

March 03, 2019 — Matt Forrester

This series of articles

This is the first of the Code in PostgreSQL series of articles.

Articles in this series
The reason why SQL is so important

When developing systems we often have a choice of writing code (NodeJS, C#, Python or PHP etc) or SQL. I believe that sometimes the decision to write code is taken without fully evaluating how much of the task could be offloaded to SQL.

In this series of articles I wish to show the huge benefits of using and learning SQL by examining progressively more difficult scenarios with increasing amounts of SQL knowledge. In doing this I hope to illustrate that sometimes large amounts of extra code is written for what SQL can achieve quicker, with less complexity and more readability.

To be more specific, we should try to follow the rule of least power more often.

About the Ergast data set

For this series of articles we will be using the Ergast data set, which is a provided under the Attribution-NonCommercial-ShareAlike 3.0 Unported Licence.

Setting up the Ergast database within PostgreSQL

To set up the Ergast database within PostgreSQL I did the following:

I allowed psql and friends to work without me having to put in a password all the time by configuring the PostgreSQL environmental variables.

export PGUSER=postgres PGPASSWORD=postgres PGDATABASE=postgres PGHOST=127.0.0.1

Then import the Ergast database. NOTE: At the time of writing I was unable to install the PostgreSQL version.

wget -O /tmp/f1db_ansi.sql.gz http://ergast.com/downloads/f1db_ansi.sql.gz
cat /tmp/f1db_ansi.sql.gz | \
    gzip -d | \
    sed 's/int(..)/int/' | \
    sed 's/ \+AUTO_INCREMENT//' | \
    sed "s/\\\'/\'\'/g" | \
    sed 's/UNIQUE KEY \"\(\w\+\)\"/UNIQUE /' | \
    sed 's/^ *KEY .*(\"\(.*\)\")/CHECK ("\1" > 0)/' | \
    sed 's/ date NOT NULL DEFAULT .0000.*,/ date,/'| psql
Assumed level of SQL Knowledge

In this JavaScript example we will assume the writer has sufficient SQL knowledge to use a WHERE statement along with the ability to only return certain fields using SELECT. After this we will see how this can be accomplished in one single SQL statement using IN, ORDER BY and LIMIT.

The aim

Lets say you want to find out the final points for drivers in the 2017 Formula 1 World Championship. The schema for the tables we will be using is as follows:

content-assets/2019-03-03-code-in-postgresql-in-order-by-limit/erd.svg

An example of the data you would find in these tables is shown below:

races
raceIdyearroundcircuitIdnamedatetimeurl
969201711Australian Grand Prix2017-03-2605:00:00https://en.wikipedia.org/wiki/2017_Australian_Grand_Prix
9702017217Chinese Grand Prix2017-04-0906:00:00https://en.wikipedia.org/wiki/2017_Chinese_Grand_Prix
971201733Bahrain Grand Prix2017-04-1615:00:00https://en.wikipedia.org/wiki/2017_Bahrain_Grand_Prix
9722017471Russian Grand Prix2017-04-3012:00:00https://en.wikipedia.org/wiki/2017_Russian_Grand_Prix
973201754Spanish Grand Prix2017-05-1412:00:00https://en.wikipedia.org/wiki/2017_Spanish_Grand_Prix
driversStandings
driverStandingsIdraceIddriverIdpointspositionpositionTextwins
64782855363770
647958561196552
647978564212331
6480585623410100
64810856367770

The driverStandings table has the points for every driver in every race. The problem here is that there is no record in the driverStandings table for which season a raceId belongs to. So we need to get a bit creative... Here is one possible solution:

  1. Looking at the races table's year column, we can find all the raceId in 2017.
  2. If we can get all races in a given year we should be able to get the last race because the round will be the highest within that year.
  3. Find the points and driverId for the drivers who were in that raceId by reading the driverStandings table.
  4. Sort them by points descending.
  5. The very first row contains the driverId which has the most points in that season. This driverId is the world champion. The ones later on denote their final position (assuming the points differ).

Implementing the JavaScript

Libraries
sql-spitting-image/limit.js
/**
 * Gets the first n rows from a set of rows.
 *
 * @param n number
 * @return (rows: Rows[]) => Rows[]
 */
function limit(n) {
    return function(rows) {
        return rows.slice(0, n);
    }
}


module.exports = limit;
sql-spitting-image/select.js
function select(spec) {

    function mapper(row) {
        let r = {};
        spec.forEach(([theFrom, theTo]) => {
            r[theTo] = row[theFrom]
        });
        return r;
    }

    return function selectImpl(rows) {
        return rows.map(mapper);
    };
}


module.exports = select;
sql-spitting-image/orderBy.js
/**
 * Creates a `sortFn` for the JavaScript [Array.prototype.sort](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/sort) function.
 *
 * @param columnName string The name of the column to sort by.
 * @param direction string If this is 'ASC' sort ascending, otherwise descending.
 * @return (a: number, b: number) => number The `sortFn`.
 */
function getSingleCompareFunction(columnName, direction) {

    const flipper = direction.toLowerCase() == 'asc' ? 1 : -1;

    return function singleCompareFunction(rowA, rowB) {
        return (rowA[columnName] - rowB[columnName]) * flipper;
    }
}


/**
 * Orders a set of rows
 *
 * @param columnName string
 * @param direction string ( 'ASC' || 'DESC' )
 * @param rows Row[]
 * @return Row[]
 */
function orderBy(columnName, direction='ASC') {

    const compareFunction = getSingleCompareFunction(columnName, direction);

    return function(rows) {
        return rows.sort(compareFunction);
    };
}


orderBy.getSingleCompareFunction = getSingleCompareFunction;
module.exports = orderBy;
sql-spitting-image/orderByMulti.js
const { getSingleCompareFunction } = require('./orderBy');

/**
 * Gets a `sortFn` for [Array.prototype.sort](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/sort)
 * that will sort whole rows based on a list of `columnName` and `direction`
 * tuples.
 *
 * @param colDirectionTuples `[columnName: string, direction: string][]`
 * Ordering specification where `columnName` and `direction` are parameters from
 * `getSingleCompareFunction`.
 * @return (a: Row, b: Row) => number The `sortFn`.
 */
function orderByMulti(colDirectionTuples) {

    function compareFunction(rowA, rowB) {
        return colDirectionTuples.reduce((acc, [col, dir]) => {
            if (acc != 0) { return acc; }
            const cf = getSingleCompareFunction(col, dir);
            return cf(rowA, rowB);
        }, 0);
    }

    return function(rows) {
        return rows.sort(compareFunction);
    }

}

module.exports = orderByMulti;
Main Code
const { takeOne, runQuery, output } = require('./_utils');
const limit = require('./sql-spitting-image/limit');
const select = require('./sql-spitting-image/select');
const orderBy = require('./sql-spitting-image/orderBy');
const orderByMulti = require('./sql-spitting-image/orderByMulti');


/**
 * interface RaceResult { round: number; raceId: number; }
 * interface MainResult { points: number; driverId: number; year: number; }
 */


/**
 * Get data from the `results` table.
 *
 * @param year number
 * @return Promise<RaceResult[]>
 */
function qryRaces(year) {
    return runQuery('select "round", "raceId" from races where year = $1', [year]);
}


/**
 * Gets all driver standings at a given set of raceIds
 *
 * @param raceId number
 * @return Promise<MainResult[]>
 */
function qryStandings(raceId) {

    const sql = `
        select
        "driverStandings".points,
        "driverStandings"."driverId",
        2017 as year
        from "driverStandings"
        where "raceId" = $1
        `;
    return runQuery(sql, [raceId]);

}


qryRaces(2017)
    .then(orderBy('round', 'desc'))
    .then(limit(1))
    .then((rounds) => rounds.map(r => r.raceId))
    .then(takeOne)
    .then(qryStandings)
    .then(orderByMulti([['points', 'desc'], ['driverId', 'asc']]))
    .then(select([
        ["points", "points"],
        ["driverId", "driverId"],
        ["year" , "year"]
    ]))
    .then(output)
    .catch(err => {
        console.log("ERROR:", err);
        process.exit(1);
    });

This code, despite there being a lot of it is relatively straight forward. We get a list of raceId and round from the qryRaces function. Once we have this we will order by the round from largest to smallest and take the first one. This is the last race of the season.

After this we feed that raceId directly into the qryStandings functions to get the results from the last race. Finally we are forced to use a more complicated sorting function for stability, because some drivers have the same amount of points before presenting our desired columns.

Pro's
  • There's some nice re-usable functions here.
  • The main code is quite concise and easy to understand.
Con's
  • Longer than SQL
  • We downloaded more data than necessary, in this case it is not too bad but it could have been much worse.

SQL

SELECT
    "driverStandings".points,
    "driverStandings"."driverId",
    2017 as year
FROM "driverStandings"
WHERE "raceId" IN (
    SELECT "raceId" FROM races
    WHERE year = 2017
    ORDER BY "round" DESC
    LIMIT 1
)
ORDER BY
    "driverStandings".points DESC,
    "driverStandings"."driverId" ASC

Like the JavaScript we are using ordering (ORDER BY) and limiting (LIMIT) to get the highest raceId within 2017.

The IN clause can be used to match a column against a set of numbers. For example we may choose to write WHERE "raceId" IN (4, 5, 7) which would be the same thing as writing WHERE "raceId" = 4 OR "raceId" = 5 OR "raceId" = 7.

The smart thing here is that we are using a query to get the last raceId within the IN clause instead of a directly specified list of raceId.

Finally an ORDER BY statement is used to perform sorting of the final record set, you can sort by multiple fields or use many types of expressions.

Pro's
  • Shorter than the JavaScript.
  • If this were called by JavaScript we would need only one Promise, which is much easier to write and reason about.
  • The inside of the IN clause can be ran and understood individually.
Con's
  • Is the ORDER BY / LIMIT 1 a trick?
  • It seems in code you can give the contents of IN clause a name (raceIds) but this is not possible using SQL's IN, [or is it?]({% post_url 2019-03-12-code-in-postgresql-with %}).

Results

pointsdriverIdyear
36312017
317202017
3058222017
20582017
2008172017
1688302017
1008152017
878392017
548322017
43132017
438072017
408402017
281542017
198252017
1742017
138382017
88352017
58262017
58362017
0182017
08142017
08282017
08412017
08422017
08432017

Tags: code-in-postgresql, javascript, postgresql, postgresql

Creating a custom aggregate in PostgreSQL

November 27, 2018 — Matt Forrester

Why use an aggregate instead of returning all the data?

Sometimes you do not need all the data brought from the database into your normal programming language. Using normal aggregate functions such as SUM, COUNT or AVERAGE you may save your database server, network and application server massive amounts of network activity, a lot of time and potentially yourself a lot of work.

But there is no function included for what I want!

While PostgreSQL has many different types of aggregate functions included, sometimes what you need is a little too custom for it to be included out of the box. For this there is the ability to create your own custom aggregates using a relatively easy to understand API.

The scenario

Suppose we want to find out how long (in time) a particular race is. We could just take the fastest time, but there may be an exceptional competitor far faster than everyone else. We could also take an average of all the finishers, but that may include some really slow competitors. Another option is to take the top n competitors and find the average time. This is the method we will describe for this tutorial.

How to define a custom aggregate

A custom aggregate can be as easy as filling in the gaps of the following:

create aggregate race_time(comp_cnt smallint, fin_time bigint) (
    stype = ...
    initcond = ...
    sfunc = ...
    finalfunc = ...
);

We have a few missing elements here, but don't worry, I will explain what they are and then we will fill them in.

If we were writing this in JavaScript...

Using a reduce function would get us a long way towards finding the answer. Reduce in Javascript has the following signature:

arr.reduce(callback[, initialValue])

And the callback has the following signature:

callback(accumulator, currentValue)

Therefore if you wanted to get the average of the three highest numbers you may write the following

function get_race_time(comp_count) {

    function callback(accumulator, currentValue) {
        if (accumulator.length < comp_count) {
            return accumulator.concat([currentValue]);
        }
        const r = accumulator.concat([currentValue]);
        const min = accumulator.reduce((a, b) => Math.max(a, b), -1);
        r.splice(r.indexOf(min), 1);
        return r;
    };

    function finalizer(accumulator) { // imperative for the purpose
        let result = 0;               // of blog text readability.
        for (let i = 0; i < accumulator.length; i++) {
            result = result + accumulator[i];
        }
        return result / accumulator.length;
    }


    return function race_time_implementation(number_array) {
        const initialValue = [];
        let reduceResult = number_array.reduce(callback, initialValue);
        return finalizer(reduceResult);
    };

}

const race_time = get_race_time(3);

And get the answer 2.

Converting the JavaScript back to SQL

stype

The stype is the type of the accumulator, initialValue and reduceResult. This would be called an "Array of Number" in JavaScript, number[] in TypeScript, but in SQL it would be bigint[].

initcond

The initcond is the value of initialValue from the JavaScript code. However initcond is weird as you must express it as a varchar, as if it would be typecast into the stype. If you were to run a select '{1,2,3}'::bigint[] query you would select an array of bigint with numbers 1, 2 and 3 within, so '{}' is the correct value.

sfunc

We are now making progress because the sfunc is the callback from the reduce function. If we were only intending our aggregate to take one argument it would actually have an identical signature.

However did you notice that callback really has three parameters, accumulator, currentValue as an actual parameters and comp_count from the wrapping function. As callback needs all these arguments, so does our aggregate (sfunc), as below.

create function race_time_sfunc(acc bigint[], comp_count smallint, fin_time bigint)
    returns bigint[] as
$$
    with t (f) as (
        select unnest(array_append(acc, fin_time))
        order by 1 asc
        limit comp_count
    )
    select array_agg(f) from t
$$
language sql immutable;

In this function we're creating a common table expression with the lowest n values stored within the acc array with row_time appended using the unnest function.

After this we just use the array_agg function to re-transform that common table expression back into an array.

finalfunc

Lastly we have the finalfunc which, in our JavaScript takes the result of the sfunc (bigint[]) and averages the values within. This is exactly the same as the finalizer() function in the JavaScript implementation.

create function race_time_finalfunc(acc bigint[])
    returns numeric as
$$
    with t (f) as (select unnest(acc))
    select avg(f) from t
$$
language sql immutable;
Tying it all together

We have now discussed, and hopefully understood all the component parts to create an aggregate so we should be able to substitute the values in. Once you have created the sfunc and finalfunc functions you should be able to run the following to create the aggregate.

create aggregate race_time(comp_count smallint, fin_time bigint) (
    stype = bigint[],
    initcond = '{}',
    sfunc = race_time_sfunc,
    finalfunc = race_time_finalfunc
);

Does it work?

create temporary table test_table (val int);
insert into test_table(val) values (2), (3), (5), (3), (4), (1);
select race_time(3::smallint, val) from test_table

Gives the answer 2, so appears to.

I actually wrote tests while preparing this blog post. When I have a write up complete I will share them too.

Tags: postgresql