Skip to content

johnsonc/aggs_for_arrays

 
 

Repository files navigation

aggs_for_arrays

This Postgres extension provides various functions for operating on arrays, for instance taking the histogram of an array of numbers.

These functions are useful because if you have a lot values you want to aggregate, queries that fetch each value from a separate row can have poor performance. Storing all the values in a single row as a Postgres array can drastically improve query performance. For instance, computing a 1000-bucket histogram on one million float values stored in separate rows took 12 seconds in a simple benchmark, compared to 27 milliseconds with the array_to_hist function.

Even if you store all the values together in an array column, these functions still outperform aggregations done in SQL or plpgsql, because the Postgres C API lets you skip a lot of the work for interfacing at those higher levels. For instance, the same benchmark gave 398ms for a SQL solution and 12 seconds for a plpgsql solution. We show further benchmark results below.

Installing

This package installs like any Postgres extension. First say:

make && sudo make install

You will need to have pg_config in your path, but normally that is already the case. You can check with which pg_config.

Then in the database of your choice say:

CREATE EXTENSION aggs_for_arrays;

Functions

The available functions are described below. In general, these functions accept arrays of any integer or floating-point type, namely SMALLINT, INTEGER, BIGINT, REAL, or DOUBLE PRECISION (aka FLOAT). The return value will either be the same type (e.g. for a minimum), a FLOAT (e.g. for a mean), or an INTEGER type (e.g. for histogram bucket counts). If a function can take any numeric type, its types are shown as T.

INTEGER[] array_to_hist(values T[], bucket_start T, bucket_width T, bucket_count INTEGER)

Returns the bucket count based on the values and bucket characteristics you request.

FLOAT array_to_mean(values T[])

Returns the mean of all the values in the array.

FLOAT array_to_median(values T[])

Returns the median. Does not require a pre-sorted input. If there are an even number of values, returns the mean of the two middle values.

FLOAT sorted_array_to_median(values T[])

Just like array_to_median, but assumes values is already sorted.

FLOAT array_to_mode(values T[])

Returns the mode. Does not require a pre-sorted input. If there are several values tied for most common, returns their mean.

FLOAT sorted_array_to_mode(values T[])

Just like array_to_mode, but assumes values is already sorted.

FLOAT array_to_percentile(values T[], percentile FLOAT)

Returns the percentile you request, where percentile is a number from 0 to 1 inclusive. Asking for 0 will always give the minimum, 1 for maximum, and 0.5 the median. If you ask for a percentile that lands between two data points, we return a linear interpolation between them.

FLOAT sorted_array_to_percentile(values T[], percentile FLOAT)

Just like array_to_percentile, but assumes values is already sorted.

FLOAT[] array_to_percentiles(values T[], percentiles FLOAT[])

Just like array_to_percentile, but you can pass several percentiles and get the result for each in a single call.

FLOAT[] sorted_array_to_percentiles(values T[], percentiles FLOAT[])

Just like array_to_percentiles, but assumes values is already sorted.

T array_to_max(values T[])

Returns the greatest value in the array.

T array_to_min(values T[])

Returns the least value in the array.

T[] array_to_min_max(values T[])

Returns a tuple with the min in position 1 and the max in position 2.

FLOAT array_to_skewness(values T[])

Computes the skewness of the given values.

FLOAT array_to_kurtosis(values T[])

Computes the kurtosis of the given values.

Benchmarks

Assume you have two tables:

CREATE TABLE samples (
  id INTEGER PRIMARY KEY,
  measurement_id INTEGER NOT NULL,
  value FLOAT NOT NULL
);

CREATE TABLE sample_groups {
  id INTEGER PRIMARY KEY,
  measurement_id INTEGER NOT NULL,
  values FLOAT[] NOT NULL
};

These tables store the same information, but samples stores each sample in a separate row, and sample_groups stores a whole group in just one row.

You can run bench.sh to test the performance of various approaches:

* SQL on `samples`.
* SQL on `sample_groups`.
* PLPGSQL on `sample_groups`.
* The `aggs_for_arrays` function on `sample_groups`.

The sorted_array_to_* methods use sorted_samples and sorted_sample_groups instead.

function SQL row-based SQL array-based PLPGSQL array-based aggs_for_arrays
array_to_hist 12218.1 ms 398.235 ms 12310.800 ms 26.936 ms
array_to_mean 10630.0 ms 121.677 ms 390.983 ms 25.226 ms
array_to_median 33587.0 ms 1163.070 ms 1258.160 ms 47.996 ms
sorted_array_to_median 23239.5 ms 30.107 ms 41.225 ms 14.835 ms
array_to_mode 13724.1 ms 1505.310 ms 1552.610 ms 201.943 ms
sorted_array_to_mode 13195.2 ms 1474.130 ms 1577.770 ms 45.171 ms
array_to_percentile 24218.2 ms 2591.240 ms 1698.570 ms 179.916 ms
sorted_array_to_percentile 24305.5 ms 2102.520 ms 1204.140 ms 21.947 ms
array_to_percentiles 32367.0 ms 10735.300 ms 3608.800 ms 188.752 ms
sorted_array_to_percentiles 32294.3 ms 10153.300 ms 3120.830 ms 22.227 ms
array_to_max 10613.2 ms 115.094 ms 398.791 ms 17.321 ms
array_to_min 10600.5 ms 113.859 ms 400.926 ms 17.204 ms
array_to_min_max 10727.9 ms 169.226 ms 824.539 ms 23.922 ms
array_to_skewness 22267.2 ms 802.463 ms 1077.630 ms 120.925 ms
array_to_kurtosis 22253.1 ms 806.296 ms 1075.960 ms 112.210 ms

About

A Postgres extension with various functions that compute statistics on arrays of numbers

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Shell 52.2%
  • C 43.3%
  • PLpgSQL 3.5%
  • Other 1.0%