aggregate

Overview

The aggregate statement is used to reduce the number of rows in a DSET while preserving required information within them.

Syntax

aggregate [counter_column colname] [dset.id] [notime|daily] [offset offset] [nudge] [default_function function] colname function [... colname function]

Details

The aggregate statement is a powerful tool for reducing the number of rows in a DSET. Aggregation is based on the concept of matching rows. Any two rows that match may be merged into a single row which selectively retains information from both of the original rows. Any further rows that match may also be merged into the same result row.

A quick introduction

A match is determined by comparing all the columns which have a function of match associated with them (further information regarding this can be found below). If all the column values match, then the rows are merged.

Merging involves examining all the columns in the data that were not used in the matching process. For each of those columns, it applies a function to the values in the two rows and updates the result row with the computed result of that function. For a full list of functions, please refer to the table further down in this article.

To illustrate this consider the following two-row dataset:

id,colour,location,quantity
1234,blue,europe,4.5
1234,green,europe,5.5

If we don't care about the colour value in the above records, we can combine them. We do care about the quantity however, so we'll add the two values together to get the final result.

The statement to do this is:

aggregate notime id match location match quantity sum

  • id match means that the values in the id columns must be the same

  • location match means that the values in the location columns must be the same

  • quantity sum means that the resulting value should be the sum of the two existing values

  • by default, a function of first is applied to all other columns (colour in this example), so the result row retains the value from the first of the merged rows

Applying these rules to the above example we get the following single result record:

id,colour,location,quantity,EXIVITY_AGGR_COUNT
1234,blue,europe,10,2
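
For reference, a minimal Transcript sketch that reproduces this result might look as follows, assuming the two-row dataset above has been saved as system/extracted/colours.csv (the file name, source and alias are hypothetical):

# Import the two-row dataset shown above
import system/extracted/colours.csv source colours alias data

# Merge rows with identical id and location values, summing the quantity
aggregate notime id match location match quantity sum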

Parameters

The aggregate statement accepts a range of parameters, summarised below:

  • counter_column colname - the name of the column used to record how many source rows were merged into each result row (optional; defaults to EXIVITY_AGGR_COUNT)

  • dset.id - the DSET to aggregate (optional)

  • notime | daily - determines whether columns flagged as start and end times are treated as ordinary data columns (notime) or as timestamps within the current data date (daily)

  • offset offset - the number of hours by which start and end timestamps are adjusted before aggregation (used with daily)

  • nudge - shave 1 second off end times before aggregating (used with daily)

  • default_function function - the function applied to any column not explicitly named in a colname function pair

  • colname function - one or more pairs, each associating a column with the function used to merge its values

If two records are deemed suitable for merging then the function associated with each column determines the resulting value in that column. The functions used throughout this article include match, first, sum, max and min.

The counter column

Each of the rows in a DSET that has been aggregated may be the result of merging multiple rows that existed prior to aggregation. The aggregate statement therefore creates a new column (or overwrites an existing column if told to) which for each row contains a count of the number of rows that were merged to create that row.

By default, this column is called EXIVITY_AGGR_COUNT but an alternative name may be specified using the optional counter_column parameter.

Any of counter_column, counter_col, count_column or count_col may be used as the parameter name. They all work in the same way.

For example, the following statement will perform the aggregation and place the counts for each row into a column called merged_row_count:

aggregate counter_column merged_row_count test.data category match quantity sum

If the column specified by counter_column does not exist then it will be created.

If the column does exist then the contents will be overwritten.

Note that the column named by counter_column may not be assigned an aggregation function. If it is, then an error will be logged and the transform will fail for the current data date.
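
As an illustration, the first statement below is valid, while the second (hypothetical) variant would be rejected because the counter column also appears in a colname function pair:

# Valid: merged_row_count is only used as the counter column
aggregate counter_column merged_row_count test.data category match quantity sum

# Invalid: the counter column may not also be given an aggregation function
# aggregate counter_column merged_row_count test.data category match merged_row_count sum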

Non time-sensitive aggregation

When the notime parameter is specified, the aggregation process treats any columns flagged as start and end times in the data as data columns, not timestamp columns.

In this case when comparing two rows to see if they can be merged, the aggregation function simply checks to see if all the columns with a function of match are the same, and if they are the two rows are merged into one by applying the appropriate function to each column in turn.

De-duplication

The following illustrates the aggregate statement being used to remove duplicate rows from a DSET:

# The first column in the DSET being aggregated
# is called subscription_id

aggregate notime default_function match subscription_id match

The analysis of the statement above is as follows:

  • notime - we are not interested in timestamps

  • default_function match - by default every column has to match before records can be aggregated

  • subscription_id match - this is effectively redundant as the default_function is match but needs to be present because at least one pair of colname function parameters is required by the aggregate statement

The resulting DSET will have no duplicate data rows, as each group of rows whose column values were all the same has been collapsed into a single record.
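
As a sketch with hypothetical data, three identical input rows such as:

subscription_id,service,quantity
SUB_abcd,Large VM,2
SUB_abcd,Large VM,2
SUB_abcd,Large VM,2

would be collapsed into a single row, with the counter column recording how many rows were merged:

subscription_id,service,quantity,EXIVITY_AGGR_COUNT
SUB_abcd,Large VM,2,3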

Row reduction while preserving data

The example shown at the top of this article used the sum function to add up the two quantity values, resulting in the same total at the expense of being able to say which source record contributed which value to that total.

The sum function can therefore accurately reflect the values in a number of source rows, albeit with the above limitation. By using a function of sum, max or min, various columns can be processed by aggregate in a meaningful manner, depending on the specific use case.
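
As a sketch using hypothetical column names, several functions can be mixed in a single statement, summing one column while retaining the highest and lowest values seen in two others:

aggregate notime id match quantity sum peak_usage max lowest_usage min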

Time-sensitive aggregation

When aggregating, columns containing start time and end time values in UNIX epoch format can be specified. Each record in the DSET therefore has start and end time markers defining the period of time that the usage in the record represents. As well as taking these start and end times into account, time-sensitive aggregation can perform additional manipulations on them.

A quick example

Consider the following CSV file called aggregate_test.csv:

startUsageTime,endUsageTime,id,subscription_id,service,quantity
2017-11-03:00.00.00,2017-11-03:02.00.00,ID_1234,SUB_abcd,Large VM,2
2017-11-03:00.00.00,2017-11-03:03.00.00,ID_1234,SUB_abcd,Large VM,2
2017-11-03:00.00.00,2017-11-03:06.00.00,ID_3456,SUB_efgh,Medium VM,2
2017-11-03:00.00.00,2017-11-03:04.00.00,ID_1234,SUB_abcd,Large VM,2
2017-11-03:00.00.00,2017-11-03:05.00.00,ID_1234,SUB_abcd,Large VM,2
2017-11-03:00.00.00,2017-11-03:06.00.00,ID_1234,SUB_abcd,Large VM,2
2017-11-03:00.00.00,2017-11-03:07.00.00,ID_1234,SUB_abcd,Large VM,2
2017-11-03:00.00.00,2017-11-03:02.00.00,ID_3456,SUB_efgh,Large VM,2
2017-11-03:00.00.00,2017-11-03:03.00.00,ID_3456,SUB_efgh,Medium VM,2
2017-11-03:00.00.00,2017-11-03:04.00.00,ID_3456,SUB_efgh,Large VM,2
2017-11-03:00.00.00,2017-11-03:05.00.00,ID_3456,SUB_efgh,Large VM,2
2017-11-03:00.00.00,2017-11-03:07.00.00,ID_3456,SUB_efgh,Large VM,2
2017-11-03:00.00.00,2017-11-03:06.00.00,ID_3456,SUB_efgh,Medium VM,2

It is possible to aggregate these into 3 output records with adjusted timestamps using the following Transcript task:

# Import the CSV shown above into a DSET called aggr.test
import system/extracted/aggregate_test.csv source aggr alias test

# Convert the textual usage times into UNIX epoch timestamp columns
# and flag them as the start and end times for each record
var template = YYYY.MM.DD.hh.mm.ss
timestamp START_TIME using startUsageTime template ${template}
timestamp END_TIME using endUsageTime template ${template}
timecolumns START_TIME END_TIME
delete columns startUsageTime endUsageTime

# Aggregate per day, summing the quantity for each unique
# combination of id, subscription_id and service
aggregate aggr.test daily nudge default_function first id match subscription_id match service match quantity sum

# Render the epoch timestamps in a human-readable form
timerender START_TIME as FRIENDLY_START
timerender END_TIME as FRIENDLY_END

Resulting in:

id,subscription_id,service,quantity,START_TIME,END_TIME,EXIVITY_AGGR_COUNT,FRIENDLY_START,FRIENDLY_END
ID_1234,SUB_abcd,Large VM,12,1509667200,1509692399,6,20171103 00:00:00,20171103 06:59:59
ID_3456,SUB_efgh,Medium VM,6,1509667200,1509688799,3,20171103 00:00:00,20171103 05:59:59
ID_3456,SUB_efgh,Large VM,8,1509667200,1509692399,4,20171103 00:00:00,20171103 06:59:59

As can be seen, for each unique combination of the values in the id, subscription_id and service columns, the start and end times have been adjusted as described above and the quantity column contains the sum of all the values in the original rows.

When performing time-sensitive aggregation, any records with a start or end time falling outside the current data date will be discarded.

Further notes

The daily parameter to aggregate means that the START_TIME and END_TIME columns are now recognised as containing timestamps. When aggregating with the daily option, timestamps within the current data date are combined such that each output record has the earliest start time and the latest end time seen within the day.

Optionally, following daily an offset may be specified as follows:

aggregate aggr.test daily offset 2 id match subscription_id match quantity sum

In this case the start and end timestamps are adjusted by the number of hours specified after the word offset before aggregation is performed. This permits the processing of data whose timestamps carry timezone information and which may, for example, start at 22:00:00 on the first day and end at 21:59:59 on the second day; applying an offset of the appropriate number of hours realigns such records with the current data date.
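
Based on the syntax summary above, offset and nudge can also be combined in a single statement, for example (a sketch using the same DSET as before):

aggregate aggr.test daily offset 2 nudge id match subscription_id match quantity sum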

The nudge parameter shaves 1 second off end times before aggregating, in order to avoid conflicts where consecutive records start and end on the same timestamp (the last second of the current hour would otherwise be the same as the first second of the next hour). The effect can be seen in the example output above, where the FRIENDLY_END values end at 59 minutes and 59 seconds past the hour rather than exactly on the hour.
