Data Project Processing
Overview
The following sections provide more detail on the activities undertaken by RPI during data project definition and execution.
File Analysis
The current RPI release supports the processing of delimited and fixed-width files. RPI attempts to determine the nature of a file during file analysis. Checks are undertaken in the following sequence:
If all lines are the same length, and noticeable spacing patterns are present (at least 50% of lines contain a space at one or more common positions), the file is deemed to be fixed-width.
If the file can be parsed as delimited, it is defined as such.
If the file has variable line lengths, but also contains noticeable spacing patterns, it is determined to be fixed-width.
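For illustration only, this detection sequence might be sketched as follows. The helper function, its name, and the use of Python's csv.Sniffer to stand in for RPI's delimited-parsing check are assumptions, not the actual implementation.

    import csv

    def detect_file_type(lines, common_space_threshold=0.5):
        # Sketch of the detection sequence described above.
        def has_common_spacing(sample, threshold):
            # True if at least `threshold` of the lines share a space at one column.
            if not sample:
                return False
            width = min(len(line) for line in sample)
            for col in range(width):
                spaces = sum(1 for line in sample if line[col] == " ")
                if spaces / len(sample) >= threshold:
                    return True
            return False

        # 1. Uniform line length plus a shared spacing pattern: fixed-width.
        if len({len(line) for line in lines}) == 1 and has_common_spacing(lines, common_space_threshold):
            return "fixed-width"

        # 2. Parseable as delimited: delimited.
        try:
            csv.Sniffer().sniff("\n".join(lines))
            return "delimited"
        except csv.Error:
            pass

        # 3. Variable line lengths but a shared spacing pattern: fixed-width.
        if has_common_spacing(lines, common_space_threshold):
            return "fixed-width"

        return "unknown"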
Delimited File Analysis
If the file is delimited, the system analyzes the first [x] records, then 1 record in every [y] thereafter (where [x] is defined by system configuration setting NumberFirstLinesToRead, and [y] is defined by system configuration setting AnalyzeEveryNLines).
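This sampling pattern can be expressed roughly as follows; the generator is an illustrative assumption, with the two settings passed in as parameters.

    def sample_lines(lines, number_first_lines_to_read, analyze_every_n_lines):
        # Analyze the first [x] lines, then 1 line in every [y] thereafter.
        for index, line in enumerate(lines):
            if index < number_first_lines_to_read:
                yield line
            elif (index - number_first_lines_to_read) % analyze_every_n_lines == 0:
                yield line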
RPI is able to determine the following high-level information during analysis of a delimited file:
Delimiter: supported delimiters are defined by the system configuration settings FileAnalysisDelimiters and FileAnalysisDelimitersSeparator (the latter is used to parse the list of delimiters provided in the former). If the file happens to be delimited using another character, RPI is unable to analyze it. In this case, it is necessary to define the delimiter manually, and invoke Re-analyze. Note that the Tab character is defined using “\t”.
Header row: RPI determines whether the file has a header row by testing the following condition:
If all of the first row's fields contain string values, and at least one other row contains a non-string value, the file is determined to contain a header row.
Skip lines: RPI can make a rudimentary determination of the number of rows at the beginning of a file to disregard as non-data in nature. If a header row is present, Skip lines is set to 1; if a header row is not present, it is set to 0.
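A simplified sketch of the header-row test and the resulting Skip lines value follows; the helper names and the numeric test used to decide whether a value is a string are assumptions.

    def looks_like_string(value):
        # Treat anything that cannot be parsed as a number as a string value.
        try:
            float(value)
            return False
        except ValueError:
            return True

    def has_header_row(rows):
        # A header row is assumed when every first-row field is a string and
        # at least one later row contains a non-string value.
        if not rows:
            return False
        first_row_all_strings = all(looks_like_string(v) for v in rows[0])
        later_rows_have_non_string = any(
            not looks_like_string(v) for row in rows[1:] for v in row
        )
        return first_row_all_strings and later_rows_have_non_string

Skip lines then follows directly from the result: 1 when has_header_row returns True, otherwise 0.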
Fixed-width File Analysis
If the file is fixed-width, RPI undertakes initial determination of its field boundaries. The first [x] rows of the file are analyzed (where [x] is defined by system configuration setting NumberFirstLinesToRead). Note that the file’s first line is not skipped.
RPI looks for the occurrence of spaces at the same location within more than [y]% of the file’s rows (where [y] is defined by system configuration setting FixedWidthSpaceThreshold) to determine the locations of likely field boundaries. No initial schema or content analysis is performed; the file’s schema is only analyzed during re-analysis.
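The boundary search can be pictured as a column scan of the sampled rows; this is an illustrative sketch only, with FixedWidthSpaceThreshold supplied as a percentage.

    def find_boundary_candidates(lines, fixed_width_space_threshold):
        # Columns in which a space appears in more than the threshold
        # percentage of sampled lines are treated as likely field boundaries.
        if not lines:
            return []
        width = max(len(line) for line in lines)
        boundaries = []
        for col in range(width):
            spaces = sum(1 for line in lines if col < len(line) and line[col] == " ")
            if spaces * 100 / len(lines) > fixed_width_space_threshold:
                boundaries.append(col)
        return boundaries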
Schema Analysis
As well as determining high-level information relating to the file as a whole, RPI is able to make an “educated guess” as to the fields the file contains.
Note that a file’s schema is only analyzed at initial analysis when the file is delimited. If the file is fixed-width, its schema is analyzed at re-analysis (post-confirmation of the file’s field boundaries).
Note also that there exist a few subtle differences between the actions undertaken during delimited and fixed-width schema analysis – these are described at the end of this section.
For each field in the file, RPI can determine:
Name: if the file has a header row, the field names it contains are used. Any database-incompatible characters are replaced with underscores. If any non-unique names are present, a space and integer (starting with 2, and incremented if necessary) are appended to ensure field name uniqueness.
If a field name in the header is blank, a “filler” name is used. The filler name consists of a prefix (defined by system configuration setting DefaultFillerFieldNames) and an integer with leading zeroes (starting at 001 and incremented as required).
If no header row is present in the file, fields are named automatically. Automatic field names consist of a prefix (defined by system configuration setting DefaultFieldNames) and an integer with leading zeroes (starting at 001 and incremented as required).
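The naming rules can be summarized in the following sketch. The sanitization pattern and the default prefixes are placeholders; in RPI the prefixes come from the DefaultFillerFieldNames and DefaultFieldNames settings.

    import re

    def build_field_names(header, field_count, filler_prefix="Filler", auto_prefix="Field"):
        # header: list of header-row values, or None when no header row is present.
        names = []
        filler_index = 0
        for i in range(field_count):
            raw = header[i].strip() if header and i < len(header) else ""
            if header is None:
                # No header row: auto-generated names such as Field001, Field002, ...
                name = f"{auto_prefix}{i + 1:03d}"
            elif raw == "":
                # Blank header cell: filler names such as Filler001, Filler002, ...
                filler_index += 1
                name = f"{filler_prefix}{filler_index:03d}"
            else:
                # Replace database-incompatible characters with underscores.
                name = re.sub(r"[^A-Za-z0-9_]", "_", raw)
            # Ensure uniqueness by appending a space and 2, 3, ... as needed.
            candidate, suffix = name, 2
            while candidate in names:
                candidate = f"{name} {suffix}"
                suffix += 1
            names.append(candidate)
        return names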
Data type: RPI is able to determine a field’s data type using the following rules:
String: a field is determined to be a string if it contains alphanumeric data or a mixture of data types. A completely blank column is determined to be String(64).
Integer: an integer field contains only integer data. In addition, the field name must end with one of the following:
CNT
COUNT
PRICE
AMOUNT
NUMBER
AMT
If a field appears to be Integer but contains one or more leading zeroes, it is typed as String.
Decimal: all data within the field must be either integer or decimal. Integers with commas or currency symbols are parsed as decimal. Leading zeroes are disregarded. No field name checking is performed.
Date: all values must conform to one of the following date formats:
M/d/yyyy
M/d/yy
dddd, MMMM d, yyyy
d.M.yyyy
d-M-yyyy
d/M/yyyy
d.M.yy
d-M-yy
d/M/yy
d MMM yyyy
d MMMM yyyy
d MMMM yy
d MMM yy
d-MMM-yy
yyyy-M-d
If two competing formats are found, the field defaults to String.
Time: values must conform to one of the following time formats:
H:m:s.fff
H:m:s
h:m:s.ffftt
h:m:stt
h:m:s.fff tt
h:m:s tt
H:m
h:mtt
h:m tt
If two competing formats are found, the field defaults to String.
DateTime: all data conforms to a combination of one of the supported Date and one of the Time formats, separated by a space. If two competing formats are found, the field defaults to String.
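Pulling these rules together, a heavily condensed type-inference pass might look like the sketch below. The format lists are abbreviated, the helper names are illustrative, and the handling of competing formats is reduced to a comment.

    from datetime import datetime

    INTEGER_NAME_SUFFIXES = ("CNT", "COUNT", "PRICE", "AMOUNT", "NUMBER", "AMT")
    DATE_FORMATS = ("%m/%d/%Y", "%d-%b-%y", "%Y-%m-%d")    # abbreviated list
    TIME_FORMATS = ("%H:%M:%S", "%H:%M", "%I:%M:%S %p")    # abbreviated list

    def matches_any(value, formats):
        for fmt in formats:
            try:
                datetime.strptime(value.strip(), fmt)
                return True
            except ValueError:
                pass
        return False

    def is_integer(value):
        return value.strip().lstrip("-").isdigit()

    def is_decimal(value):
        try:
            float(value.strip().replace(",", "").lstrip("$"))
            return True
        except ValueError:
            return False

    def infer_type(field_name, values):
        # Condensed form of the rules above; the full analysis also tracks
        # competing Date/Time formats, which force the field to String.
        non_blank = [v for v in values if v.strip() != ""]
        if not non_blank:
            return "String(64)"                               # completely blank column
        if all(is_integer(v) for v in non_blank):
            if any(len(v.strip()) > 1 and v.strip().startswith("0") for v in non_blank):
                return "String"                               # leading zeroes force String
            if field_name.upper().endswith(INTEGER_NAME_SUFFIXES):
                return "Integer"
        if all(is_decimal(v) for v in non_blank):
            return "Decimal"
        if all(matches_any(v, DATE_FORMATS) for v in non_blank):
            return "Date"
        if all(matches_any(v, TIME_FORMATS) for v in non_blank):
            return "Time"
        if all(any(matches_any(v, (d + " " + t,)) for d in DATE_FORMATS for t in TIME_FORMATS)
               for v in non_blank):
            return "DateTime"
        return "String"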
Size: only relevant for data types String, Integer and Decimal.
String: size is set in accordance with the maximum observed string length within the file. If the size exceeds 4000 characters, data is truncated. If greater than 4, the size is rounded up to the nearest power of 2 (8, 16, 32, 64, 128, 256, 512, 1024, 2048), capped at 4000.
Integer: size is set in accordance with maximum number of digits observed.
Decimal: size is set to the field’s precision – the maximum number of pre-decimal digits plus the maximum number of post-decimal digits.
Scale: only relevant for Decimal data. Scale is set to the maximum number of post-decimal digits.
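The size and scale calculations might be sketched as follows; the rounding helper and the cleanup of currency symbols are illustrative assumptions.

    def string_size(values):
        # Maximum observed length, rounded up to a power of two when above 4
        # and capped at 4000 characters.
        longest = max((len(v) for v in values), default=0)
        if longest <= 4:
            return longest
        size = 8
        while size < longest and size < 4000:
            size *= 2
        return min(size, 4000)

    def decimal_precision_and_scale(values):
        # Precision: max pre-decimal digits plus max post-decimal digits.
        # Scale: max post-decimal digits.
        max_whole = max_fraction = 0
        for v in values:
            cleaned = v.strip().replace(",", "").lstrip("$-")
            whole, _, fraction = cleaned.partition(".")
            max_whole = max(max_whole, len(whole.lstrip("0") or "0"))
            max_fraction = max(max_fraction, len(fraction))
        return max_whole + max_fraction, max_fraction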
Schema analysis performed in respect of fixed-width files differs from delimited files as follows:
Fields cannot be analyzed as Integer because, in the absence of a header record, field names are generated automatically and so never end with the suffixes required by the Integer rule.
A numerical field is determined to be a decimal if it contains at least one decimal value (i.e. one value with at least one digit to the right of the decimal point).
An empty string field’s size is determined by the actual field width, rather than being set to 64.
A string field’s size is set to the actual field width, as opposed to using the “power of two” rule.
Validation
RPI performs validation to ensure that records within the file to be loaded conform to the rules specified within the data project’s definition.
The first [n] records from the file are validated, followed by 1 record in every [x] thereafter ([n] is defined by system configuration setting NumberFirstLinesToRead; [x] is defined by system configuration setting ReadEveryXLines).
A number of types of validation are performed:
Row-level validation: RPI must be able to parse the row. Validation fails if too many or too few fields are present.
Data type validation: the validation checks undertaken are determined by the type of data:
String: the data must fit the defined size
Integer: the data must be a valid integer, and must fit the defined size
Decimal: the data must be a valid decimal or integer, and must fit the defined size
Date: the data must be a valid Date, and no Time component may be present
Time: the data must be a valid Time, and no Date component may be present
DateTime: the data must be a valid DateTime, Date or Time.
Key constraint validation: a record must not violate any user-specified primary key constraint.
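A compressed sketch of these three validation passes follows. The per-type checks are reduced to a minimal stand-in (Date, Time and DateTime parsing is omitted), and the function and parameter names are assumptions.

    def validate_value(value, data_type, size):
        # Minimal stand-in for the per-type checks listed above.
        v = value.strip()
        if data_type == "String":
            return len(value) <= size
        if data_type == "Integer":
            return v.lstrip("-").isdigit() and len(v.lstrip("-")) <= size
        if data_type == "Decimal":
            try:
                float(v.replace(",", ""))
                return True
            except ValueError:
                return False
        return True    # Date/Time/DateTime checks omitted in this sketch

    def validate_row(raw_row, schema, key_fields, seen_keys):
        # schema: list of (name, data_type, size) tuples.
        # key_fields: field names making up the user-specified primary key.
        # seen_keys: set of key values already encountered.
        # Row-level validation: the row must parse to the expected field count.
        if len(raw_row) != len(schema):
            return ["unexpected number of fields"]
        errors = []
        # Data type validation.
        for value, (name, data_type, size) in zip(raw_row, schema):
            if not validate_value(value, data_type, size):
                errors.append(name + ": not a valid " + data_type)
        # Key constraint validation.
        if key_fields:
            key = tuple(v for v, (name, _, _) in zip(raw_row, schema) if name in key_fields)
            if key in seen_keys:
                errors.append("primary key constraint violated")
            seen_keys.add(key)
        return errors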
Load
When RPI loads a file into the data warehouse, a number of steps are undertaken:
When loading the data project’s first file:
A database table is created, named “UD_” followed by [table name] ([table name] being supplied by the user in the Definition tab’s File Analysis panel). All fields within the table are created as nullable. Fields are created in accordance with the data type and size stipulated at definition. If a table of the same name already exists, it is dropped and replaced.
Data is loaded into the table.
To support future Redpoint Data Management functionality, the following fields are always added to the custom table into which data is loaded:
BRAND_FILE (Nvarchar (128))
SEQ_NUM (Bigint)
BRAND_DT (DateTime)
JOBID (Nvarchar (64))
A SQL Server primary key is created in accordance with any specified Key settings.
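For illustration only, the resulting DDL could resemble the statement built below. The column definitions, constraint name and example usage are placeholders; as an assumption, key columns are emitted as NOT NULL so that the primary key can be created, while all other fields remain nullable as described above.

    def build_create_table_sql(table_name, fields, key_fields):
        # fields: list of (name, sql_type) pairs from the data project definition.
        # The four management columns are always appended, per the list above.
        columns = [
            f"[{name}] {sql_type} " + ("NOT NULL" if name in key_fields else "NULL")
            for name, sql_type in fields
        ]
        columns += [
            "[BRAND_FILE] NVARCHAR(128) NULL",
            "[SEQ_NUM] BIGINT NULL",
            "[BRAND_DT] DATETIME NULL",
            "[JOBID] NVARCHAR(64) NULL",
        ]
        sql = f"CREATE TABLE [UD_{table_name}] (\n  " + ",\n  ".join(columns)
        if key_fields:
            key_cols = ", ".join(f"[{name}]" for name in key_fields)
            sql += f",\n  CONSTRAINT [PK_UD_{table_name}] PRIMARY KEY ({key_cols})"
        return sql + "\n)"

    # Hypothetical example usage with placeholder fields and key.
    print(build_create_table_sql(
        "Customers",
        [("Email", "NVARCHAR(128)"), ("Purchases", "INT")],
        ["Email"],
    ))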
If loading a second or subsequent file within a data project:
The actions undertaken are dependent on the value selected for Post-initial load behavior in the Definition tab’s File Analysis panel:
Insert new, update existing: new records are inserted into the table. Existing records (as defined by the data project’s key) are updated.
Insert only: new records are inserted. Any existing records are treated as errors.
Update only: existing records are updated. Any new records are treated as errors.
Complete refresh: the existing contents of the table are erased and replaced by the records in the file.
Note that, in all cases, only valid records are loaded. Any leading and trailing spaces within the data are trimmed.
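Conceptually, the four behaviors amount to the dispatch sketched below, expressed against an in-memory dictionary rather than the warehouse table itself; this is an illustrative sketch of the semantics, not the load implementation.

    def apply_post_initial_load(behavior, valid_records, existing, errors):
        # existing: dict of rows keyed by the data project's key (stands in for the table).
        # errors: list collecting records treated as errors.
        if behavior == "Complete refresh":
            existing.clear()                                  # erase existing contents
            for record in valid_records:
                existing[record["key"]] = record
            return
        for record in valid_records:
            key = record["key"]
            if behavior == "Insert new, update existing":
                existing[key] = record                        # insert or update
            elif behavior == "Insert only":
                if key in existing:
                    errors.append(record)                     # existing record is an error
                else:
                    existing[key] = record
            elif behavior == "Update only":
                if key in existing:
                    existing[key] = record
                else:
                    errors.append(record)                     # new record is an error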
The data file is archived, being copied to the folder defined by system configuration setting DataIntakeArchiveDirectory. Each data file is stored in a uniquely-named subfolder.