aggregate
Overview
The aggregate
statement is used to reduce the number of rows in a DSET while preserving required information within them
Syntax
aggregate
[dset.id]
[notime|daily] [offset
offset]
[nudge] [default_function
function] colname function [... colname function]
Details
The aggregate
statement is a powerful tool for reducing the number of rows in a DSET. Aggregation is based on the concept of matching rows. Any two rows that match may be merged into a single row which selectively retains information from both of the original rows. Any further rows that match may also be merged into the same result row.
A quick introduction
A match is determined by comparing all the columns which have a function of match
associated with them (further information regarding this can be found below). If all the column values match, then the rows are merged.
Merging involves examining all the columns in the data that were not used in the matching process. For each of those columns, it applies a function to the values in the two rows and updates the result row with the computed result of that function. For a full list of functions, please refer to the table further down in this article.
To illustrate this consider the following two row dataset:
If we don't care about the colour
value in the above records, we can combine them together. We do care about the quantity
however, so we'll add the two values together to get the final result.
The statement to do this is:
aggregate notime id match location match quantity sum
id match means that the values in the
id
columns must be the samelocation match means that the values in the
location
columns must be the samequantity sum means that the resulting value should be the sum of the two existing values
by default, a function of
first
is applied to the columns, such that the original row retains its value
Applying these rules to the above example we get the following single result record:
A column calledEXIVITY_AGGR_COUNT
is automatically created by the aggregate
statement and for each row in the output it will contain the number of source rows that were merged together to create that result row
Parameters
The aggregate
statement accepts a range of parameters as summarised in the table below:
If two records are deemed suitable for merging then the function determines the resulting value in each column. The available functions are as follows:
Non time-sensitive aggregation
When the notime parameter is specified, the aggregation process treats any columns flagged as start and end times in the data as data columns, not timestamp columns.
In this case when comparing two rows to see if they can be merged, the aggregation function simply checks to see if all the columns with a function of match
are the same, and if they are the two rows are merged into one by applying the appropriate function to each column in turn.
De-duplication
The following illustrates the aggregate
statement being used to remove duplicate rows from a DSET:
The analysis of the statement above is as follows:
notime - we are not interested in timestamps
default_function match - by default every column has to match before records can be aggregated
subscription_id match - this is effectively redundant as the default_function is
match
but needs to be present because at least one pair of colname function parameters is required by theaggregate
statement
The resulting DSET will have no duplicate data rows, as each group of rows whose column values were the same were collapsed into a single record.
Row reduction while preserving data
The example shown at the top of this article used the sum
function to add up the two quantity
values, resulting in the same total at the expense of being able to say which source record contributed which value to that total.
The sum
function can therefore accurately reflect the values in a number of source rows, albeit with the above limitation. By using a function of sum
, max
or min
, various columns can be processed by aggregate
in a meaningful manner, depending on the specific use case.
Time-sensitive aggregation
When aggregating, columns containing start time and end time values in UNIX epoch format can be specified. Each record in the DSET therefore has start and end time markers defining the period of time that the usage in the record represents. As well as taking the start times and end times into account, time-sensitive aggregation can perform additinal manipulations on these start and end times.
A quick example
Consider the following CSV file called aggregate_test.csv
:
It is possible to aggregate these into 3 output records with adjusted timestamps using the following Transcript task:
Resulting in:
As can be seen, for each unique combination of the values in the id
,subscription-id
and service
columns, the start and end times have been adjusted as described above and the quantity
column contains the sum of all the values in the original rows.
When performing time-sensitive aggregation, any records with a start or end time falling outside the current data date will be discarded.
Further notes
The daily parameter to aggregate
means that the START_TIME and END_TIME columns are now recognised as containing timestamps. When aggregating with the daily option, timestamps within the current dataDate are combined to result in an output record which has the earliest start time and the latest end time seen within the day.
Optionally, following daily
an offset may be specified as follows:
aggregate aggr.test daily offset 2 id match subscription_id match quantity sum
In this case the start and end timestamps are adjusted by the number of hours specified after the word offset before aggregation is performed. This permits processing of data which has timestamps with timezone information in them, and which may start at 22:00:00 of the first day and end at 21:59:59 of the second day, as an offset can be applied to realign the records with the appropriate number of hours to compensate.
The nudge parameter shaves 1 second off end times before aggregating in order to avoid conflicts where hourly records start and end on 00:00:00 9the last second of the current hour is the same as the first second of the next hour)
Last updated