Parquet tools
Overview
Apache Parquet is a columnar storage format used by many query engines for analytics workloads. Parquet features per-column compression and encoding schemes that offer significant performance benefits compared to traditional row-oriented formats.
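For illustration only, here is a minimal sketch of the columnar layout using the pyarrow Python library (an assumption for the example; it is not part of Data Management). Writing a small table and then reading back a single column shows why columnar formats suit analytics: a query touches only the columns it needs.

```python
# Minimal pyarrow sketch (pyarrow is assumed here, not a Data Management API).
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table and write it as Parquet.
table = pa.table({
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
})
pq.write_table(table, "example.parquet", compression="snappy")

# Reading a single column touches only that column's data pages.
names = pq.read_table("example.parquet", columns=["name"])
print(names)
```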
Parquet Input
The Parquet Input tool is used to read data from the Parquet storage format.
When reading from cloud storage (such as AWS S3 or Azure Blob Storage), Data Management cannot analyze Parquet files larger than 100MB in size.
Parquet is not supported on MapR 4.0.x.
Configuration parameters
The Parquet Input tool has two sets of configuration parameters in addition to the standard execution options.
Configuration
Parameter | Description |
---|---|
Input file | An HDFS file that includes the metadata for the file. It does not need to actually contain the data. |
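Because the schema and row-group statistics live in the Parquet file footer, a file can supply all of the metadata the tool needs without its data being read. A rough pyarrow sketch of that idea (pyarrow and the file name are assumptions for the example):

```python
# Inspect Parquet metadata without loading any column data (pyarrow sketch).
import pyarrow.parquet as pq

pf = pq.ParquetFile("example.parquet")
print(pf.schema_arrow)       # field names and types, read from the footer
print(pf.metadata.num_rows)  # row count, also read from the footer only
```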
Options
Parameter | Description |
---|---|
Limit records | If selected, limits the number of records read. |
Read only the first | If Limit records is selected, specifies the number of records to be read. |
JSON/BSON conversion | The Parquet file format supports JSON and BSON data types. This option controls how Parquet values are converted to corresponding Data Management fields. |
Use local timezone | Parquet files can store timestamps, which are usually interpreted as UTC time instants. Data Management has no Timestamp data type and instead uses a DateTime data type, which is timezone-free. This option controls how Parquet timestamp values are converted to Data Management DateTime fields (see the sketch following this table). |
Produce file name field | If selected, the file name will be output as a record field. |
Output full path | If Produce file name field is selected, optionally outputs the entire path to the record file name field. |
Output URI path | If Output full path is selected, express path as a Uniform Resource Identifier (URI). |
Field name | If Produce file name field is selected, name of the column to be used for the file name. This is optional and defaults to |
Field size | If Produce file name field is selected, size of the field to be used for the file name. This is optional and defaults to |
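As a sketch of the two timestamp interpretations behind Use local timezone: a Parquet timestamp is conventionally a UTC instant, while a Data Management DateTime is timezone-free, so the same instant can yield two different DateTime values. The example below uses plain Python and an arbitrary example zone (America/New_York); the actual tool uses the server's local timezone.

```python
# Sketch: one UTC instant, two possible timezone-free DateTime values.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

instant = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)

# Option cleared: keep the UTC wall-clock value, drop the zone.
as_utc = instant.replace(tzinfo=None)

# Option selected: shift to local time first, then drop the zone.
# "America/New_York" is only an example; the tool uses the server's zone.
as_local = instant.astimezone(ZoneInfo("America/New_York")).replace(tzinfo=None)

print(as_utc)    # 2024-06-01 12:00:00
print(as_local)  # 2024-06-01 08:00:00
```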
Configure the Parquet Input tool
Select the Parquet Input tool.
Go to the Configuration tab.
Specify the Input file. Once you specify a file, Data Management displays a sample of the input data, automatically detecting field definitions and record layout. Field definitions and details are read-only; you cannot edit them.
Optionally, select the Options tab and configure advanced options:
To include the name of the input file as a new field, select Produce file name field and specify a Field name and Field size. Select Output full path to include the complete file specification. This can be useful when reading a wildcarded set of files. Select Output URI path to express the complete file specification as a Uniform Resource Identifier.
To specify how Parquet values are converted to corresponding Data Management fields, configure JSON/BSON conversion.
To specify how Parquet timestamp values are converted to Data Management DateTime fields, configure Use local timezone.
Optionally, go to the Execution tab and Enable trigger input, configure reporting options, or set Web service options.
Parquet Output
The Parquet Output tool is used to write data to the Parquet storage format.
Note that Parquet is not supported on MapR 4.0.x.
Configuration parameters
The Parquet Output tool has a single set of configuration parameters in addition to the standard execution options.
Parameter | Description |
---|---|
Output file | The output file name. |
Open file time | Specifies when the output file will be opened: Default, When project is started, When the first record is read, or After the last record is read. These options are detailed under Configure the Parquet Output tool. |
Write empty file if no records are read | If selected, writes an output file even when no records are read. |
Document conversion | The Parquet file format supports JSON and BSON data types. This option controls how Data Management fields are converted to the corresponding Parquet values. |
Use local timezone | If selected, Data Management DateTime fields are not converted to UTC before being stored as Parquet timestamps. Data Management DateTime values are timezone-free, meaningful only when interpreted in the context of an implied timezone. If Use local timezone is not selected, Data Management DateTime fields are converted to UTC and stored as Parquet timestamps (see the sketch following this table). |
Compression codec | The compression codec used to compress blocks: Null, Deflate, Snappy (the default), or LZO. |
Split files | If selected, splits the output file by Size, Record count, or Data. |
Split size (MB) | If Split files by size is selected, specifies the maximum size of the split files. Output file names are appended with a sequence number between the file root and the extension. Defaults to 1024 MB (1 GB). Note that the Parquet file format has an internal block structure that may prevent file splits from occurring on files of less than 100 MB. |
Split count | If Split files by record count is selected, specifies the maximum number of records in the split files. Output file names are appended with a sequence number between the file root and the extension. Defaults to 1,000,000. |
Split field | If Split files by data is selected, name of the field to be used to split the data. A separate file will be created for each unique value of the specified field. Data must be grouped by the split field. |
Suppress split field | If selected, the Split field is omitted from files created using Split files by data. |
Replication factor | Number of copies of each block that will be stored (on different nodes) in the distributed file system. The default is 1. |
Block size (MB) | The minimum size of a file division. The default is 128 MB. |
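The output-side Use local timezone option runs the input-side conversion in reverse: a timezone-free DateTime must be pinned to some zone before it can become a UTC timestamp. A sketch of the two behaviors, again using plain Python with an arbitrary example zone standing in for the server's local timezone:

```python
# Sketch: storing a timezone-free DateTime as a Parquet timestamp.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

naive = datetime(2024, 6, 1, 8, 0)  # Data Management DateTime: no zone attached

# Use local timezone cleared: treat the value as local time, convert to UTC.
# "America/New_York" stands in for the server's local zone.
as_utc_instant = naive.replace(tzinfo=ZoneInfo("America/New_York")).astimezone(timezone.utc)

# Use local timezone selected: the wall-clock value is stored unconverted.
stored_as_is = naive

print(as_utc_instant)  # 2024-06-01 12:00:00+00:00
print(stored_as_is)    # 2024-06-01 08:00:00
```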
Configure the Parquet Output tool
Select the Parquet Output tool.
Go to the Configuration tab.
Specify the Output file.
Optionally, specify Open file time.
Option | Description |
---|---|
Default | Use the site/execution server setting. If you select this, you can optionally select Write empty file if no records are read. A warning will be issued if the tool setting conflicts with the site/execution server setting. |
When project is started | Open output file when the project is run. |
When the first record is read | Open output file when the first record is read. If you select this, you can optionally select Write empty file if no records are read. |
After the last record is read | Output records are cached and not written to the output file until the tool receives the final record. If you select this, you can optionally select Write empty file if no records are read. |
Optionally, select Write empty file if no records are read to write an output file even when no records are read.
To specify how Data Management fields are converted to the corresponding Parquet values, configure Document conversion.
To specify how Data Management DateTime fields are converted to Parquet timestamp values, configure Use local timezone.
Optionally, select the Compression codec.
Null: passes through data uncompressed.
Deflate: writes the data block using the deflate algorithm, which results in smaller files but is slower than Snappy.
Snappy: (default) writes the data block using Google's Snappy compression library, which is faster than Deflate.
LZO: writes the data block using the Lempel-Ziv-Oberhumer data compression protocol.
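To get a feel for the size/speed trade-off between codecs, you can experiment outside Data Management with pyarrow (an assumption for the example; note that pyarrow spells the uncompressed codec "none" and that gzip, whose underlying algorithm is deflate, serves as the closest stand-in for Deflate here):

```python
# Compare on-disk sizes of the same table under different codecs (pyarrow sketch).
import os
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"n": list(range(100_000))})

for codec in ["none", "snappy", "gzip"]:  # gzip's algorithm is deflate
    path = f"out_{codec}.parquet"
    pq.write_table(table, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")
```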
Optionally, you can split the output file into smaller, more manageable pieces. Check the Split files box, and then select By size, By record count, or By data.
If you select Split files by size, specify Split size as the maximum file size (in megabytes). The resulting output files will have the name you specified, with a sequential number appended.
If you select Split files by record count, specify Split count as the maximum number of records. The resulting output files will have the name you specified, with a sequential number appended.
If you select Split files by data, select the desired Split field name from the drop-down list. Data must be grouped by the split field. The resulting output files will have the name you specified, augmented by inserting the value of the specified field before the extension. For example, splitting output by ZIP Code produces file names of the form output_file01234.parquet.
To generate file names where the entire name is determined by the value of the specified field, select Treat output file as folder and specify the output directory in the Output file box, using the form F:\output_directory. If you do not want the specified field to appear in the output, select Suppress split field.
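The Split files by data behavior is analogous to a partitioned write. This pyarrow sketch (column names and paths are assumptions for the example) creates one output file per distinct value of a field; note that pyarrow likewise drops the partition column from the written files, which parallels Suppress split field:

```python
# One output file per distinct "zip" value (pyarrow analogue, not the tool itself).
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "zip":  ["01234", "01234", "56789"],
    "name": ["a", "b", "c"],
})

# Rows grouped by the split column, as the tool also requires.
pq.write_to_dataset(table, root_path="out_by_zip", partition_cols=["zip"])
```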
Optionally, go to the Execution tab and Enable trigger output, configure reporting options, or set Web service options.