Parquet tools
Overview
Apache Parquet is a columnar storage format used by many query engines for analytics workloads. Parquet features per-column compression and encoding schemes that offer significant performance benefits compared to traditional row-oriented formats.
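For illustration only, here is a minimal sketch of the columnar layout using the pyarrow Python library (an assumption for the example; it is not part of Data Management). Writing a small table and then reading back a single column shows why columnar formats suit analytics: a query touches only the columns it needs.

```python
# Minimal pyarrow sketch (pyarrow is assumed here, not a Data Management API).
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table and write it as Parquet.
table = pa.table({
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
})
pq.write_table(table, "example.parquet", compression="snappy")

# Reading a single column touches only that column's data pages.
names = pq.read_table("example.parquet", columns=["name"])
print(names)
```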
Parquet Input
The Parquet Input tool is used to read data from the Parquet storage format.
When reading from cloud storage (such as AWS S3 or Azure Blob Storage), Data Management cannot analyze Parquet files larger than 100MB in size.
Parquet is not supported on MapR 4.0.x.
Configuration parameters
The Parquet Input tool has two sets of configuration parameters in addition to the standard execution options.
Configuration
Parameter | Description |
---|---|
Input file | An HDFS file that includes the metadata for the file. It does not need to actually contain the data. |
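Because the schema and row-group statistics live in the Parquet file footer, a file can supply all of the metadata the tool needs without its data being read. A rough pyarrow sketch of that idea (pyarrow and the file name are assumptions for the example):

```python
# Inspect Parquet metadata without loading any column data (pyarrow sketch).
import pyarrow.parquet as pq

pf = pq.ParquetFile("example.parquet")
print(pf.schema_arrow)       # field names and types, read from the footer
print(pf.metadata.num_rows)  # row count, also read from the footer only
```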
Options
Parameter | Description |
---|---|
Limit records | If selected, limits the number of records read. |
Read only the first | If Limit records is selected, specifies the number of records to be read. |
JSON/BSON conversion | The Parquet file format supports JSON and BSON data types. This option controls how Parquet values are converted to corresponding Data Management fields. |
Use local timezone | Parquet files can store timestamps, which are usually interpreted as UTC time instants. Data Management has no Timestamp data type and instead uses a DateTime data type, which is timezone-free. This option controls how Parquet timestamp values are converted to Data Management DateTime fields (see the sketch following this table). |
Produce file name field | If selected, the file name will be output as a record field. |
Output full path | If Produce file name field is selected, optionally outputs the entire path to the record file name field. |
Output URI path | If Output full path is selected, express path as a Uniform Resource Identifier (URI). |
Field name | If Produce file name field is selected, name of the column to be used for the file name. This is optional and defaults to |
Field size | If Produce file name field is selected, size of the field to be used for the file name. This is optional and defaults to |
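As a sketch of the two timestamp interpretations behind Use local timezone: a Parquet timestamp is conventionally a UTC instant, while a Data Management DateTime is timezone-free, so the same instant can yield two different DateTime values. The example below uses plain Python and an arbitrary example zone (America/New_York); the actual tool uses the server's local timezone.

```python
# Sketch: one UTC instant, two possible timezone-free DateTime values.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

instant = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)

# Option cleared: keep the UTC wall-clock value, drop the zone.
as_utc = instant.replace(tzinfo=None)

# Option selected: shift to local time first, then drop the zone.
# "America/New_York" is only an example; the tool uses the server's zone.
as_local = instant.astimezone(ZoneInfo("America/New_York")).replace(tzinfo=None)

print(as_utc)    # 2024-06-01 12:00:00
print(as_local)  # 2024-06-01 08:00:00
```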
Configure the Parquet Input tool
Select the Parquet Input tool.
Go to the Configuration tab.
Specify the Input file. Once you specify a file, Data Management displays a sample of the input data, automatically detecting field definitions and record layout. Field definitions and details are read-only; you cannot edit them.
Optionally, select the Options tab and configure advanced options:
To include the name of the input file as a new field, select Produce file name field and specify a Field name and Field size. Select Output full path to include the complete file specification. This can be useful when reading a wildcarded set of files. Select Output URI path to express the complete file specification as a Uniform Resource Identifier.
To specify how Parquet values are converted to corresponding Data Management fields, configure JSON/BSON conversion.
To specify how Parquet timestamp values are converted to Data Management DateTime fields, configure Use local timezone.
Optionally, go to the Execution tab and Enable trigger input, configure reporting options, or set Web service options.
Parquet Output
The Parquet Output tool is used to write data to the Parquet storage format.
Note that Parquet is not supported on MapR 4.0.x.
Configuration parameters
The Parquet Output tool has a single set of configuration parameters in addition to the standard execution options.
Parameter | Description |
---|---|
Output file | The output file name. |
Open file time | Specifies when the output file will be opened: Default, When project is started, When the first record is read, or After the last record is read. These options are detailed under Configure the Parquet Output tool. |
Write empty file if no records are read | If selected, writes an output file even when no records are read. |
Document conversion | The Parquet file format supports JSON and BSON data types. This option controls how Data Management fields are converted to the corresponding Parquet values. |
Use local timezone | If selected, Data Management DateTime fields are not converted to UTC before being stored as Parquet timestamps. Data Management DateTime values are timezone-free, meaningful only when interpreted in the context of an implied timezone. If Use local timezone is not selected, Data Management DateTime fields are converted to UTC and stored as Parquet timestamps (see the sketch following this table). |
Compression codec | The compression codec used to compress blocks: Null, Deflate, Snappy (the default), or LZO. |
Split files | If selected, splits the output file by Size, Record count, or Data. |
Split size (MB) | If Split files by size is selected, specifies the maximum size of the split files. Output file names are appended with a sequence number between the file root and the extension. Defaults to 1024 MB (1 GB). Note that the Parquet file format has an internal block structure that may prevent file splits from occurring on files of less than 100 MB. |
Split count | If Split files by record count is selected, specifies the maximum number of records in the split files. Output file names are appended with a sequence number between the file root and the extension. Defaults to 1,000,000. |
Split field | If Split files by data is selected, name of the field to be used to split the data. A separate file will be created for each unique value of the specified field. Data must be grouped by the split field. |
Suppress split field | If selected, the Split field is omitted from files created using Split files by data. |
Replication factor | Number of copies of each block that will be stored (on different nodes) in the distributed file system. The default is 1. |
Block size (MB) | The minimum size of a file division. The default is 128 MB. |
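The output-side Use local timezone option runs the input-side conversion in reverse: a timezone-free DateTime must be pinned to some zone before it can become a UTC timestamp. A sketch of the two behaviors, again using plain Python with an arbitrary example zone standing in for the server's local timezone:

```python
# Sketch: storing a timezone-free DateTime as a Parquet timestamp.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

naive = datetime(2024, 6, 1, 8, 0)  # Data Management DateTime: no zone attached

# Use local timezone cleared: treat the value as local time, convert to UTC.
# "America/New_York" stands in for the server's local zone.
as_utc_instant = naive.replace(tzinfo=ZoneInfo("America/New_York")).astimezone(timezone.utc)

# Use local timezone selected: the wall-clock value is stored unconverted.
stored_as_is = naive

print(as_utc_instant)  # 2024-06-01 12:00:00+00:00
print(stored_as_is)    # 2024-06-01 08:00:00
```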
Configure the Parquet Output tool
Select the Parquet Output tool.
Go to the Configuration tab.
Specify the Output file.
Optionally, specify Open file time.
Option | Description |
---|---|
Default | Use the site/execution server setting. If you select this, you can optionally select Write empty file if no records are read. A warning will be issued if the tool setting conflicts with the site/execution server setting. |
When project is started | Open output file when the project is run. |
When the first record is read | Open output file when the first record is read. If you select this, you can optionally select Write empty file if no records are read. |
After the last record is read | Output records are cached and not written to the output file until the tool receives the final record. If you select this, you can optionally select Write empty file if no records are read. |
Optionally, select Write empty file if no records are read to write an output file even when no records are read.
To specify how Data Management fields are converted to the corresponding Parquet values, configure Document conversion.
To specify how Data Management DateTime fields are converted to Parquet timestamp values, configure Use local timezone.
Optionally, select the Compression codec.
Null: passes through data uncompressed.
Deflate: writes the data block using the deflate algorithm, which results in smaller files but is slower than Snappy.
Snappy: (default) writes the data block using Google's Snappy compression library, which is faster than Deflate.
LZO: writes the data block using the Lempel-Ziv-Oberhumer data compression protocol.
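To get a feel for the size/speed trade-off between codecs, you can experiment outside Data Management with pyarrow (an assumption for the example; note that pyarrow spells the uncompressed codec "none" and that gzip, whose underlying algorithm is deflate, serves as the closest stand-in for Deflate here):

```python
# Compare on-disk sizes of the same table under different codecs (pyarrow sketch).
import os
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"n": list(range(100_000))})

for codec in ["none", "snappy", "gzip"]:  # gzip's algorithm is deflate
    path = f"out_{codec}.parquet"
    pq.write_table(table, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")
```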
Optionally, you can split the output file into smaller, more manageable pieces. Check the Split files box, and then select By size, By record count, or By data.
If you select Split files by size, specify Split size as the maximum file size (in megabytes). The resulting output files will have the name you specified, with a sequential number appended.
If you select Split files by record count, specify Split count as the maximum number of records. The resulting output files will have the name you specified, with a sequential number appended.
If you select Split files by data, select the desired Split field name from the drop-down list. Data must be grouped by the split field. The resulting output files will have the name you specified, augmented by inserting the value of the specified field before the extension. For example, splitting output by ZIP Code produces file names of the form output_file01234.parquet.
To generate file names where the entire name is determined by the value of the specified field, select Treat output file as folder and specify the output directory in the Output file box, using the form F:\output_directory. If you do not want the specified field to appear in the output, select Suppress split field.
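The Split files by data behavior is analogous to a partitioned write. This pyarrow sketch (column names and paths are assumptions for the example) creates one output file per distinct value of a field; note that pyarrow likewise drops the partition column from the written files, which parallels Suppress split field:

```python
# One output file per distinct "zip" value (pyarrow analogue, not the tool itself).
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "zip":  ["01234", "01234", "56789"],
    "name": ["a", "b", "c"],
})

# Rows grouped by the split column, as the tool also requires.
pq.write_to_dataset(table, root_path="out_by_zip", partition_cols=["zip"])
```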
Optionally, go to the Execution tab and Enable trigger output, configure reporting options, or set Web service options.