import
Overview
The import
statement is used to read a CSV file (which must conform to Dataset standards) from disk in order to create a DSET which can then be processed by subsequent Transcript statements.
To import CSV files that do not use a comma as the delimiter, please refer to the Quote and Separator Options section further down in this article.
Syntax
Import from CSV files
import
filename
from
source
[alias
alias]
[options { ... }]
import
filename
source
custom_source
[alias
alias]
[options { ... }]
Import from CCR files
import
filename.ccr
source
custom_source
[alias
alias]
Import from database tables
import ACCOUNT source
custom_source
[alias
alias]
import USAGE for
date
from
db_name
source
custom_source
[alias
alias]
Details
If using the Windows path delimiter - \
- it is advisable to put the path and filename in double quotes to avoid the backslash being interpreted as an escape character
Options
Data imported from a CSV file can be filtered as it is being read in order to reduce post-import processing time and memory overhead.
The import filters are configured via the options parameter, which if present is imemdiately followed by one or more name = value pairs enclosed within braces, for example:
The =
sign is optional, but if present it must be surrounded by whitespace.
Multiple options may be specified and each option must be placed on separate line, for example:
The options supported by import
are as follows:
Name | Type/possible values | Description |
skip | numeric | Number of leading non-blank rows to skip |
omit | numeric | Number of trailing non-blank rows to omit |
heading_dupe_check header_dupe_check |
| Skip rows that are a duplicate of the header row |
select include | column list | A space-separated list of column names to include in the import |
ignore exclude | column list | A space-separated list of column names to ignore when importing |
filter | expression | If specified, only rows that match the expression will be imported |
escaped escape |
| Treat backslash ( |
pattern |
| Permits the filename to be specified as a regular expression |
encoding | encoding | Specifies the encoding of the file to be imported. Any encoding supported by libiconv may be used. |
skip_corrupted_file |
| If this option is enabled and pattern is set, any file that is not formatted correctly is skipped. If this option is not enabled then any file that is not formatted correctly will cause the script to terminate with an error. |
skip_corrupted_record |
| If this option is enabled, corrupted or malformed records (rows) are skipped |
filename_column |
| If enabled, a column called |
If the skip option is specified then the column headings in the imported data will be set to those of the first row following the skipped rows
The filter expression, if specified, is of identical format to that used by the where statement and it must reference at least one column name, enclosed within square brackets:
To specify a parameter or column name that contains spaces in an expression, use quotes within the square brackets as follows:
File name / pattern
If a pattern is specified in the import options, then the filename parameter is treated as an ECMAScript-type regular expression, and all files matching that pattern are imported and appended to one another to result in a single DSET.
Only filenames may contain a regular expression, directory names are always treated literally.
If using a regular expression, the import options are applied to all files matching that expression. All files must have the same structure (after any select/ignore options have been applied).
If any file in the matching set being imported has different columns to the first file that matched then an error will be generated and the task will fail.
Database tables
Account data
To import all accounts from Exivity the following statement is used:
import ACCOUNT source
custom_source
[alias
alias]
This functionality is intended for advanced use cases where it is necessary to correlate information about existing accounts into the new data being processed as part of the Transform step.
Usage data
To import usage data previously written to an RDF using finish the following statement is used:
import USAGE for
date
from
dset
source
custom_source
[alias
alias]
This functionality may be useful in the event that the original data retrieved from an API has been deleted or is otherwise unavailable.
The date must be inyyyyMMdd
format. The RDF from which the usage will be imported is:
<basedir>/system/report/<year>/<month>/<day>_<dset>_usage.rdf
.
To illustrate this in practice, the statement ...
import USAGE for 20180501 from azure.usage source test alias data
... will load the usage data from ...
<basedir>/system/report/2018/05/01_azure.usage_usage.rdf
... into a DSET calledtest.data
.
Source and Alias tags
A file imported from disk contains one or more named columns which are used to create an index for subsequent processing of the data in that file. As multiple files may be imported it is necessary to use namespaces to distinguish between the DSETs created from the files (each imported file is converted into a memory-resident DSET before it can be processed further). This is accomplished through the use of source and alias tags. Each file imported is given a unique source and alias tag, meaning that any column in the data imported from that file can be uniquely identified using a combination of source.alias.column_name.
There are two main variations of the import
statement which are as follows:
Automatic source tagging
import
filename
from
source
[alias
alias]
By convention, data retrieved by the USE extractor from external sources is located in a file called
<basedir>/system/extracted/<source>/<yyyy>/<MM>/dd>_.csv
where <yyyy>
is the year, <MM>
is the month and <dd>
is the day of the current data date.
Typically, the <source> portion of that path will reflect the name or identity of the external system that the data was collected from.
By default the alias tag will be set to the filename, minus the `_ portion of that filename and _
.csv` extension but an alias can be manually specified using the optional _alias parameter.
As an example, assuming the data date is 20170223
, the statement:
import usage from Azure
will create a DSET called Azure.usage
from the file /system/extracted/Azure/2017/02/23_usage.csv
.
The statement:
import usage from Azure alias custom
will create a DSET called Azure.custom
from the file
<basedir>/system/extracted/Azure/2017/02/23_usage.csv
.
Manual source tagging
import
filename
source
custom_source
[alias
alias]
This form of the statement will import filename and the source tag will be set to the value of the custom_source parameter.
By default the alias tag will be set to the filename minus the .csv
extension but an alias can be manually specified using the optional alias parameter.
As an example, assuming the data date is 20170223
, the statement:
import "system/extracted/Azure/${dataDate}.csv" source Azure alias usage
will create a DSET called Azure.usage
fom the file
<basedir>/system/extracted/Azure/20170223.csv
Importing CCR files
Options will be ignored when importing CCR files
When performing an import using manual tagging it is possible to import a Cloud Cruiser CCR file. This is done by specifying the .ccr
filename extension as follows:
Full information regarding CCR files is beyond the scope of this article. Cloud Cruiser documentation should be consulted if there is a requirement to know more about them.
In order to create a DSET from a CCR file, Transcript coverts the CCR file to a Dataset in memory as an interim step. The resulting DSET will have blank values in any given row for any heading where no dimension or measure in the CCR file existed that matched that heading. No distinction is made between dimensions and measures in the CCR file, they are imported as columns regardless.
When importing a CCR file, the quote and separator characters are always a qouble quote - "
- and a comma - ,
- respectively. Fields in the CCR file may or may not be quoted. The conversion process will handle this automatically and any quotes at the start and end of a field will be removed.
Quote and Separator options
The fields in a dataset file may or may not be quoted and the separator character used to delineate fields can be any character apart from an ASCII NUL (0) value. It is therefore important to ensure that the correct options are set before importing a dataset in order to avoid unwanted side effects.
The options relating to import
are quote and separator (or delimiter). By default, these are set to a double quote and a comma respectively.
If defined the quote character will be stripped from any fields beginning and ending with it. Any additional quote characters inside the outer quotes are preserved. Fields that do not have quotes around them will be imported correctly, unless they contain a quote character at only one end of the field.
If no quote character is defined, then quote characters are ignored during import, but any separator characters in the column headings will be converted to underscores.
Embeds
If the embed
option is set, then a curly bracket in the data - {
- will cause all characters (including newlines) up until the closing bracket - }
- to be imported. Nested brackets are supported, although if there are an uneven number of brackets before the end of the line an error will be generated in the logfile and the task will fail.
Embeds are not supported in CCR files
Examples
Create a DSET called Azure.usage
from the file /system/extracted/Azure/2017/02/23_usage.csv
import usage from Azure
Create a DSET called Azure.custom
from the file /system/extracted/Azure/2017/02/23_usage.csv
import usage from Azure alias custom
Create a DSET called Azure.usage
fom the file /system/extracted/Azure/20170223.csv
import "system/extracted/Azure/${dataDate}.csv" source Azure alias usage
Create a DSET called Azure.usage
from the file /system/extracted/Azure/20170223.csv
import "system/extracted/Azure/${dataDate}.csv" source Azure alias usage
Last updated