`aggs_for_arrays`

This Postgres extension provides various functions for operating on arrays, for instance taking the histogram of an array of numbers.

These functions are useful because if you have a lot values you want to aggregate, queries that fetch each value from a separate row can have poor performance. Storing all the values in a single row as a Postgres array can drastically improve query performance. For instance, computing a 1000-bucket histogram on one million float values stored in separate rows took 12 seconds in a simple benchmark, compared to 27 milliseconds with the array_to_hist function.

Using arrays in this way is a bit like a poor-man's column-store database: it lets you keep all the values for one attribute in one place. (It's also a bit like how in R and Pandas you often see parallel arrays rather than arrays of objects.) To simplify a little, imagine pulling one million floats off disk in a single 8MB chunk instead of asking the drive for one million separate reads. I wouldn't use this pattern if your arrays get updated a lot, lest you take a hit on writes (At least test first!), but if they are pretty stable then using arrays can greatly speed up reads.

With such an approach you could still use SQL or PLPGSQL, but these functions outperform such code, because the Postgres C API lets you skip a lot of the work for interfacing at those higher levels. For instance, the same benchmark gave 398ms for a SQL solution and 12 seconds for a plpgsql solution. We show further benchmark results below.

Note that despite the name, these functions are not true aggregate functions (summarizing multiple rows). Rather they do aggregate-like calculations on a single input array. If you want actual aggregates that take multiple input arrays, then you might be looking for my other extension, aggs_for_vecs. If this extension takes a column-store approach to your data, that one takes a row-store approach.

Installing

This package installs like any Postgres extension. First say:

make && sudo make install

You will need to have pg_config in your path, but normally that is already the case. You can check with which pg_config.

Then in the database of your choice say:

CREATE EXTENSION aggs_for_arrays;

Functions

The available functions are described below. In general, these functions accept arrays of any integer or floating-point type, namely SMALLINT, INTEGER, BIGINT, REAL, or DOUBLE PRECISION (aka FLOAT). The return value will either be the same type (e.g. for a minimum), a FLOAT (e.g. for a mean), or an INTEGER type (e.g. for histogram bucket counts). If a function can take any numeric type, its types are shown as T.

`INTEGER[] array_to_hist(values T[], bucket_start T, bucket_width T, bucket_count INTEGER)`

Returns the bucket count based on the values and bucket characteristics you request.

`INTEGER[] array_to_hist_2d(x_values T[], y_values T[], x_bucket_start T, y_bucket_start T, x_bucket_width T, y_bucket_width T, x_bucket_count INTEGER, y_bucket_count INTEGER)`

Returns the bucket count as a 2-D array based on the values and bucket characteristics you request. The data arrays x_values and y_values must be the same length. We compare each array's first element and plot it, then their second element, etc. If either x_values or y_values is NULL, the whole result is NULL. If either contains a NULL, then that position isn't plotted.

`FLOAT array_to_mean(values T[])`

Returns the mean of all the values in the array.

`FLOAT array_to_median(values T[])`

Returns the median. Does not require a pre-sorted input. If there are an even number of values, returns the mean of the two middle values.

`FLOAT sorted_array_to_median(values T[])`

Just like array_to_median, but assumes values is already sorted.

`FLOAT array_to_mode(values T[])`

Returns the mode. Does not require a pre-sorted input. If there are several values tied for most common, returns their mean.

`FLOAT sorted_array_to_mode(values T[])`

Just like array_to_mode, but assumes values is already sorted.

`FLOAT array_to_percentile(values T[], percentile FLOAT)`

Returns the percentile you request, where percentile is a number from 0 to 1 inclusive. Asking for 0 will always give the minimum, 1 for maximum, and 0.5 the median. If you ask for a percentile that lands between two data points, we return a linear interpolation between them.

`FLOAT sorted_array_to_percentile(values T[], percentile FLOAT)`

Just like array_to_percentile, but assumes values is already sorted.

`FLOAT[] array_to_percentiles(values T[], percentiles FLOAT[])`

Just like array_to_percentile, but you can pass several percentiles and get the result for each in a single call.

`FLOAT[] sorted_array_to_percentiles(values T[], percentiles FLOAT[])`

Just like array_to_percentiles, but assumes values is already sorted.

`T array_to_max(values T[])`

Returns the greatest value in the array.

`T array_to_min(values T[])`

Returns the least value in the array.

`T[] array_to_min_max(values T[])`

Returns a tuple with the min in position 1 and the max in position 2.

`FLOAT array_to_skewness(values T[])`

Computes the skewness of the given values.

`FLOAT array_to_kurtosis(values T[])`

Computes the kurtosis of the given values.

Benchmarks

Assume you have two tables:

CREATE TABLE samples (
  id INTEGER PRIMARY KEY,
  measurement_id INTEGER NOT NULL,
  value FLOAT NOT NULL
);

CREATE TABLE sample_groups {
  id INTEGER PRIMARY KEY,
  measurement_id INTEGER NOT NULL,
  values FLOAT[] NOT NULL
};

These tables store the same information, but samples stores each sample in a separate row, and sample_groups stores a whole group in just one row.

You can run bench.sh to test the performance of various approaches:

SQL on samples.
SQL on sample_groups.
PLPGSQL on sample_groups.
The aggs_for_arrays function on sample_groups.

The sorted_array_to_* methods use sorted_samples and sorted_sample_groups instead.

function	SQL row-based	SQL array-based	PLPGSQL array-based	`aggs_for_arrays`
`array_to_hist`	12218.1 ms	398.235 ms	12310.800 ms	26.936 ms
`array_to_mean`	10630.0 ms	121.677 ms	390.983 ms	25.226 ms
`array_to_median`	33587.0 ms	1163.070 ms	1258.160 ms	47.996 ms
`sorted_array_to_median`	23239.5 ms	30.107 ms	41.225 ms	14.835 ms
`array_to_mode`	13724.1 ms	1505.310 ms	1552.610 ms	201.943 ms
`sorted_array_to_mode`	13195.2 ms	1474.130 ms	1577.770 ms	45.171 ms
`array_to_percentile`	24218.2 ms	2591.240 ms	1698.570 ms	179.916 ms
`sorted_array_to_percentile`	24305.5 ms	2102.520 ms	1204.140 ms	21.947 ms
`array_to_percentiles`	32367.0 ms	10735.300 ms	3608.800 ms	188.752 ms
`sorted_array_to_percentiles`	32294.3 ms	10153.300 ms	3120.830 ms	22.227 ms
`array_to_max`	10613.2 ms	115.094 ms	398.791 ms	17.321 ms
`array_to_min`	10600.5 ms	113.859 ms	400.926 ms	17.204 ms
`array_to_min_max`	10727.9 ms	169.226 ms	824.539 ms	23.922 ms
`array_to_skewness`	22267.2 ms	802.463 ms	1077.630 ms	120.925 ms
`array_to_kurtosis`	22253.1 ms	806.296 ms	1075.960 ms	112.210 ms

Name		Name	Last commit message	Last commit date
Latest commit History 54 Commits
bench		bench
test		test
.gitattributes		.gitattributes
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
META.json		META.json
Makefile		Makefile
README.md		README.md
aggs_for_arrays--1.0.0.sql		aggs_for_arrays--1.0.0.sql
aggs_for_arrays--1.1.1.sql		aggs_for_arrays--1.1.1.sql
aggs_for_arrays--1.2.0.sql		aggs_for_arrays--1.2.0.sql
aggs_for_arrays--1.3.2.sql		aggs_for_arrays--1.3.2.sql
aggs_for_arrays.c		aggs_for_arrays.c
aggs_for_arrays.control		aggs_for_arrays.control
array_to_count.c		array_to_count.c
array_to_hist.c		array_to_hist.c
array_to_hist_2d.c		array_to_hist_2d.c
array_to_kurtosis.c		array_to_kurtosis.c
array_to_max.c		array_to_max.c
array_to_mean.c		array_to_mean.c
array_to_median.c		array_to_median.c
array_to_min.c		array_to_min.c
array_to_min_max.c		array_to_min_max.c
array_to_mode.c		array_to_mode.c
array_to_percentile.c		array_to_percentile.c
array_to_percentiles.c		array_to_percentiles.c
array_to_skewness.c		array_to_skewness.c
sorted_array_to_median.c		sorted_array_to_median.c
sorted_array_to_mode.c		sorted_array_to_mode.c
sorted_array_to_percentile.c		sorted_array_to_percentile.c
sorted_array_to_percentiles.c		sorted_array_to_percentiles.c
util.c		util.c

License

pjungwir/aggs_for_arrays

Folders and files

Latest commit

History

Repository files navigation

aggs_for_arrays

Installing

Functions

INTEGER[] array_to_hist(values T[], bucket_start T, bucket_width T, bucket_count INTEGER)

INTEGER[] array_to_hist_2d(x_values T[], y_values T[], x_bucket_start T, y_bucket_start T, x_bucket_width T, y_bucket_width T, x_bucket_count INTEGER, y_bucket_count INTEGER)

FLOAT array_to_mean(values T[])

FLOAT array_to_median(values T[])

FLOAT sorted_array_to_median(values T[])

FLOAT array_to_mode(values T[])

FLOAT sorted_array_to_mode(values T[])

FLOAT array_to_percentile(values T[], percentile FLOAT)

FLOAT sorted_array_to_percentile(values T[], percentile FLOAT)

FLOAT[] array_to_percentiles(values T[], percentiles FLOAT[])

FLOAT[] sorted_array_to_percentiles(values T[], percentiles FLOAT[])

T array_to_max(values T[])

T array_to_min(values T[])

T[] array_to_min_max(values T[])

FLOAT array_to_skewness(values T[])

FLOAT array_to_kurtosis(values T[])