Document tools
Overview
Data Management is optimized for efficient processing of flat, strongly-typed data structures, while document databases employ flexible hierarchical data structures. Use the Document Extractor and Document Injector tools to translate Document type data into and out of Data Management's DLD format or the MongoDB connection tools.
Document Extractor
The Document Extractor tool "flattens" Document type fields into structured records, and can optionally output arrays as series of child records with relational linkage fields on separate output connections. Use this tool to make data embedded in Document fields accessible to Data Management processing, which is designed to handle flat records.
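As a rough illustration of what "flattening" means here, the following Python sketch turns a nested document into a single-level record keyed by dotted paths. It is not the tool's implementation: the function name `flatten`, the `multi_level` flag (loosely mirroring the Multi-level option), and the use of JSON text as a stand-in for unflattened sub-documents and arrays are all assumptions made for this example.

```python
import json

def flatten(doc, multi_level=True, prefix=""):
    """Flatten a nested dict into a single-level dict keyed by dotted paths."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and multi_level:
            # Descend into the sub-document and prefix its keys with the parent path.
            flat.update(flatten(value, multi_level, path))
        elif isinstance(value, (dict, list)):
            # Assumption: JSON text stands in for a value that is left unflattened.
            flat[path] = json.dumps(value)
        else:
            flat[path] = value
    return flat

doc = {"name": "john", "address": {"city": "Austin", "zip": "78701"}, "tags": ["a", "b"]}
record = flatten(doc)
print(record)
```

With `multi_level=False`, the `address` sub-document stays a single value rather than becoming `address.city` and `address.zip`.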
Document Extractor tool configuration parameters
The Document Extractor tool has two sets of configuration parameters in addition to the standard execution options.
Configuration
Parameter | Description |
---|---|
Input document field | Field from the input connection that contains document values. This field must be of type Document. |
Analyze from | Method for creating a recommended field mapping for the document: Sample entered below, Captured input, or None. |
Capture size | If Analyze from is Captured input, the number of input documents to analyze. If your documents are uniform and densely populated, the default setting of 100 is usually sufficient. If your documents tend to be sparsely populated, with missing field values, you may need to capture more documents. Be aware that captured documents are stored in memory, so a large value may exhaust the memory available to the JVM or require you to increase the JVM memory setting. |
Multi-level | Select this to look for nested fields in the document during analysis and create sub-documents. Clearing this will leave sub-documents mapped as type Document. |
Detect arrays | Select this to look for fields with array-typed values during analysis, analyze the contents of the array elements, and generate array output mappings and associated output connector types. If arrays are detected, Enable array outputs and Linkage field will be enabled, and Array mapping will be populated. Clearing this will map any array-typed fields to Unicode, and the result will be a JSON-formatted representation of the array rather than separate array outputs. |
Enable array outputs | Select this to send extracted document arrays to child connectors as a series of records. Data Management has no direct support for an "array" field type, so documents containing arrays must be broken apart into separate records related by Linkage field. |
Linkage field | Relational linkage field generated to link array records to their parent records. This field will contain sequentially generated numbers. You may optionally change the default |
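Field mapping | Mapping between document input values and output fields and types. Each row specifies a Path, an Output field, and a Type. |
Array mapping | Optionally, mapping between document input array values and output fields and types. Each row specifies a Path, an Output field, and a Type. |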
Options
Parameter | Description |
---|---|
Include input fields | Select this to copy all fields from the main input connection to the output. |
Ignore timezone when converting String to DateTime | Select this to prevent time zones present in string values from affecting conversions to type DateTime. |
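The effect of ignoring a time zone during a string-to-DateTime conversion can be sketched in Python. This is only an illustration of the general idea: the function `to_datetime` is hypothetical, and the behavior shown when the option is cleared (normalizing to UTC) is an assumption, not a statement of how Data Management converts values.

```python
from datetime import datetime, timezone

def to_datetime(text, ignore_timezone):
    """Convert an ISO 8601 string to a naive datetime."""
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        return parsed
    if ignore_timezone:
        # Drop the offset and keep the wall-clock time exactly as written.
        return parsed.replace(tzinfo=None)
    # Assumed alternative: apply the offset by normalizing to UTC first.
    return parsed.astimezone(timezone.utc).replace(tzinfo=None)

stamp = "2024-03-01T12:00:00+05:00"
ignored = to_datetime(stamp, ignore_timezone=True)
shifted = to_datetime(stamp, ignore_timezone=False)
print(ignored, shifted)
```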
Configure the Document Extractor tool
To configure a Document Extractor tool, you need the following information:
The name of the input field that contains document values.
Either an upstream connection to a document source, or a sample document.
If you have... | Do this |
---|---|
An upstream connection to a source of documents (for example, you can connect to a MongoDB database and read from a collection). | Configure the Document Extractor tool by analyzing input. |
A sample document. The document must be complete, with all fields populated with representative data. If your sample document has any empty or null fields, you must replace these with actual values before proceeding. | Configure the Document Extractor tool by analyzing a sample. |
...by analyzing a sample
To configure the Document Extractor tool by analyzing a sample document:
Select the Document Extractor tool.
Go to the Configuration tab on the Properties pane.
Select Input document field and select the field from the input connection that contains document values.
This field must be of type “Document”.
In the Analysis section, select Analyze from and choose Sample entered below.
Optionally, select analysis options:
Select Multi-level to check sub-documents for nested fields.
Select Detect arrays to look for fields with array-typed values, analyze the contents of the array elements, and generate array output mappings and associated output connector types.
Open your sample document in a text editor, and copy it to the Clipboard.
The document must be complete, with all fields populated with representative data.
In Data Management, select Enter sample source document, and then paste the document into the text box.
Select Analyze.
Examine the Field mappings grid (and optionally the Array mapping grid) to add, remove, or adjust mappings.
When you are satisfied with the mappings, select Analyze from and choose None.
Select the Options tab and configure options.
Optionally, select the Execution tab, and then set Report options and Web service options.
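Because the sample document must have every field populated, it can help to check the sample before pasting it. The Python sketch below is one possible pre-check, run outside Data Management; the helper name `find_unpopulated` and the sample content are hypothetical.

```python
import json

def find_unpopulated(value, path=""):
    """Return dotted paths whose values are null, empty strings, or empty containers."""
    problems = []
    if value is None or value == "" or value == {} or value == []:
        problems.append(path or "<root>")
    elif isinstance(value, dict):
        for key, child in value.items():
            problems += find_unpopulated(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            problems += find_unpopulated(child, f"{path}[{index}]")
    return problems

sample = json.loads(
    '{"name": "john", "email": null, "address": {"city": "", "zip": "78701"}, "orders": []}'
)
missing = find_unpopulated(sample)
print(missing)
```

Any path reported here is a field to fill with representative data before running the analysis.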
...by analyzing input
To configure the Document Extractor tool by analyzing the input schema:
Connect the Document Extractor tool to a source of Document values, such as a configured MongoDB Input tool.
Select the Document Extractor tool, and then go to the Configuration tab on the Properties pane.
Select Input document field and select the field from the input connection that contains document values. This field must be of type “Document”.
In the Analysis section, select Analyze from and choose Captured input.
Optionally, select analysis options:
Select Multi-level to check sub-documents for nested fields.
Select Detect arrays to look for fields with array-typed values, analyze the contents of the array elements, and generate array output mappings and associated output connector types.
Select Analyze.
Examine the Field mappings grid (and optionally the Array mapping grid) to add, remove, or adjust mappings.
When you are satisfied with the mappings, select Analyze from and choose None.
Select the Options tab and configure options.
Optionally, go to the Execution tab, and then set Web service options.
Edit field mapping
Sometimes the Document Extractor tool's Analyze function does not produce the correct field mapping. Common issues include:
There are sparsely-populated fields.
The analyzer doesn't choose the desired field type.
You want finer control over which sub-documents result in fine structure, and which map to fields of type Document.
To edit the field mapping, use the buttons above the grid to add, remove, and move rows. For each row:
Path: the path into the Document structure where values are found. Separate nested structure fields with a period (.).
Output field: the name of the Data Management output field.
Type: the type of the Data Management output field. It must be possible to convert the document values to the target field type, or the field will get an ERROR value. Note that everything is convertible to Unicode. Textvar is not recommended because of its code page limitations.
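The relationship between Path, Output field, and Type can be sketched in Python as a lookup followed by a conversion. This is a minimal illustration under stated assumptions: the helper names, the `Integer` and `Float` type names, and the use of the string `"ERROR"` as a stand-in for the tool's error value are all hypothetical.

```python
def get_path(doc, path):
    """Follow a period-separated path into nested dicts; return None if absent."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current

# Hypothetical type names mapped to Python converters for this sketch only.
CONVERTERS = {"Integer": int, "Float": float, "Unicode": str}

def extract(doc, mappings):
    """Apply (path, output_field, type_name) mappings to one document."""
    record = {}
    for path, output_field, type_name in mappings:
        value = get_path(doc, path)
        if value is None:
            record[output_field] = None
            continue
        try:
            record[output_field] = CONVERTERS[type_name](value)
        except (TypeError, ValueError):
            # Stand-in for the error value produced when conversion is impossible.
            record[output_field] = "ERROR"
    return record

doc = {"name": "john", "address": {"zip": "78701"}, "age": "unknown"}
mappings = [
    ("name", "NAME", "Unicode"),
    ("address.zip", "ZIP", "Integer"),
    ("age", "AGE", "Integer"),
]
record = extract(doc, mappings)
print(record)
```

Mapping `age` to Unicode instead would avoid the failed conversion, since any value can be rendered as text.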
Edit array mapping
If your Document Extractor tool is configured to Enable array outputs, the array mappings are displayed in the Array mapping grid. If the Document Extractor tool's Analyze function does not produce the correct array mapping, or you have no arrays in your sample document, you may wish to edit the array mapping.
To edit the array mapping, use the buttons above the grid to add, remove, and move rows. For each row:
Path: the path into the Document structure where values are found. Separate nested structure fields with a period (.).
Output field: the name of the Data Management output field in each child record where the array element will be stored.
Type: the type of the Data Management output field. This is usually Document because most arrays contain documents, like this: [{"name":"john"},{"name":"betty"},{"name":"sally"}]. However, Type may be another type if the array contains scalar values. For example, the following array would map to child records whose field type is Unicode: ["john","betty","sally"].
If you want to further flatten the Documents contained in child records, connect another Document Extractor tool to the child connector and repeat the analysis and configuration process.
Data Management cannot directly extract the contents of two-dimensional or higher arrays. In such cases the nested arrays will be mapped to Unicode and formatted as JSON.
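The way an array is broken into child records related by a linkage field can be sketched as follows. This Python example is illustrative only: the function `split_arrays`, the linkage field name `LINK_ID`, and the restriction to a top-level array key are assumptions. Elements that are themselves arrays are serialized to JSON text, mirroring the limitation described above.

```python
import json
from itertools import count

def split_arrays(documents, array_path, linkage_field="LINK_ID"):
    """Split each document into one parent record and one child record per array element."""
    parents, children = [], []
    ids = count(1)  # sequentially generated linkage values
    for doc in documents:
        link = next(ids)
        parent = {k: v for k, v in doc.items() if k != array_path}
        parent[linkage_field] = link
        parents.append(parent)
        for element in doc.get(array_path, []):
            if isinstance(element, list):
                # Nested (two-dimensional) arrays are kept as JSON text.
                element = json.dumps(element)
            children.append({linkage_field: link, array_path: element})
    return parents, children

docs = [
    {"family": "smith", "members": [{"name": "john"}, {"name": "betty"}]},
    {"family": "jones", "members": [{"name": "sally"}]},
]
parents, children = split_arrays(docs, "members")
print(parents)
print(children)
```

Each child record carries the same linkage value as its parent, so the two outputs can be joined again downstream.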
Document Injector
The Document Injector tool is the opposite of the Document Extractor: it reads records from its input or inputs and copies field values from input records to a new or existing document. It can also inject arrays into the document by integrating documents from additional "child" inputs. Use this tool to transform flat Data Management records into structured documents.
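Conceptually, injection is the reverse of flattening: each input field is written to a path inside a new or existing document. The Python sketch below illustrates that idea only; the function `inject`, its `existing` parameter, and the sample field names are hypothetical and do not describe the tool's internals.

```python
import copy

def inject(record, mappings, existing=None):
    """Write record fields into a document at period-separated paths."""
    # Start from a copy of the existing document, or from an empty one.
    document = copy.deepcopy(existing) if existing is not None else {}
    for path, input_field in mappings:
        *parents, leaf = path.split(".")
        target = document
        for part in parents:
            # Create intermediate sub-documents as needed.
            target = target.setdefault(part, {})
        target[leaf] = record[input_field]
    return document

record = {"NAME": "john", "CITY": "Austin", "ZIP": "78701"}
mappings = [("name", "NAME"), ("address.city", "CITY"), ("address.zip", "ZIP")]

created = inject(record, mappings)
updated = inject(
    record,
    [("address.zip", "ZIP")],
    existing={"name": "betty", "address": {"city": "Dallas"}},
)
print(created)
print(updated)
```

The two calls loosely correspond to creating a new document and writing to an existing one.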
Document Injector tool configuration parameters
The Document Injector tool has two sets of configuration parameters in addition to the standard execution options.
Configuration
Parameter | Description |
---|---|
Mode | Processing mode: Create new document or Write to existing document. |
Output document field | Field on the output connection to receive document values. |
Analyze from | Method for creating a recommended field mapping for the document: Sample, Input, or None. |
Enable array inputs | Select this to enable child array connectors on input. |
Linkage field | Relational linkage field linking child array records to their parent records. |
Field mapping | Mapping between input values and document outputs. Each row specifies a Path and an Input field. |
Array mapping | Optionally, mapping between document input array values and output fields and types. |
Options
Parameter | Description |
---|---|
Include input fields | Select this to copy all fields from the main input connection to the output. |
Include nulls in document |
Configure the Document Injector tool
To configure a Document Injector tool, you need the following information:
The name of the input field that contains document values.
Either an upstream connection to a document source, or a sample document that you want to produce.
If you have... | Do this |
---|---|
An upstream connection to a source of documents (for example, you can connect to a MongoDB database and read from a collection). | Configure the Document Injector tool by analyzing input. |
A sample document. The document must be complete, with all fields populated with representative data. If your sample document has any empty or null fields, you must replace these with actual values before proceeding. | Configure the Document Injector tool by analyzing a sample. |
...by analyzing a sample
To configure the Document Injector tool by analyzing a sample document:
Select the Document Injector tool.
Go to the Configuration tab on the Properties pane.
Select Mode, and then choose Create new document or Write to existing document.
Select Output document field and specify the field that will receive document values.
In the Analysis section, select Analyze from and choose Sample.
Optionally, select analysis options:
Select Multi-level to check sub-documents for nested fields.
Select Detect arrays to look for fields with array-typed values, analyze the contents of the array elements, and generate array output mappings and associated output connector types.
Open your sample document in a text editor, and copy it to the Clipboard.
The document must be complete, with all fields populated with representative data.
In Data Management, select Enter sample source document, and then paste the document into the text box.
Select Analyze.
Examine the Field mappings grid (and optionally the Array mapping grid) to add, remove, or adjust mappings.
When you are satisfied with the mappings, select Analyze from and choose None.
Optionally, select the Options tab and configure output options:
Select Include input fields to copy all fields from the main input connection to the output.
Optionally, select the Execution tab, and then set Report options and Web service options.
...by analyzing input
With this configuration method, you can automatically configure a Document Injector tool to match the fields available on the upstream connection. This is useful when you do not know the target document format.
To configure the Document Injector tool by analyzing the input schema:
Connect the Document Injector tool to a source of Document values.
Select the Document Injector tool.
Go to the Configuration tab on the Properties pane.
Select Mode, and then choose Create new document or Write to existing document.
Select Output document field and specify the field that will receive document values.
In the Analysis section, select Analyze from and choose Input.
Select Analyze.
Examine the Field mappings grid to add, remove, or adjust mappings.
When you are satisfied with the mappings, select Analyze from and choose None.
Optionally, select the Options tab and configure output options:
Select Include input fields to copy all fields from the main input connection to the output.
Optionally, go to the Execution tab, and then set Web service options.
Edit field mapping
Sometimes the Document Injector tool's Analyze function does not produce the correct field mapping. Common issues include:
There are sparsely-populated fields.
The analyzer doesn't choose the desired field type.
You want finer control over which sub-documents result in fine structure, and which map to fields of type Document.
To edit the field mapping, use the buttons above the grid to add, remove, and move rows. For each row:
Path: the path into the Document structure where values will be injected. Separate nested structure fields with a period (.).
Input field: the name of the Data Management input field.
Edit array mapping
If your Document Injector tool is configured to Enable array inputs, the array mappings are displayed in the Array mapping grid. If the Document Injector tool's Analyze function does not produce the correct array mapping, or you have no arrays in your sample document, you may wish to edit the array mapping.
To edit the array mapping, use the buttons above the grid to add, remove, and move rows. For each row:
Path: the path into the Document structure where values will be injected. Separate nested structure fields with a period (.).
Output field: the name of the Data Management output field in each child record where the child record will be stored.
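Injecting arrays from a child input amounts to grouping child records by the linkage field and attaching each group to its parent document. The following Python sketch shows that grouping under stated assumptions: the function `inject_arrays`, the linkage field name `LINK_ID`, and the child field name `MEMBER` are hypothetical.

```python
from collections import defaultdict

def inject_arrays(parents, children, array_path, child_field, linkage_field="LINK_ID"):
    """Attach child values to their parent documents as an array, matched by linkage field."""
    grouped = defaultdict(list)
    for child in children:
        grouped[child[linkage_field]].append(child[child_field])
    documents = []
    for parent in parents:
        # Leave the linkage field out of the resulting document.
        document = {k: v for k, v in parent.items() if k != linkage_field}
        document[array_path] = grouped.get(parent[linkage_field], [])
        documents.append(document)
    return documents

parents = [{"LINK_ID": 1, "family": "smith"}, {"LINK_ID": 2, "family": "jones"}]
children = [
    {"LINK_ID": 1, "MEMBER": {"name": "john"}},
    {"LINK_ID": 1, "MEMBER": {"name": "betty"}},
    {"LINK_ID": 2, "MEMBER": {"name": "sally"}},
]
documents = inject_arrays(parents, children, "members", "MEMBER")
print(documents)
```

A parent with no matching child records receives an empty array in this sketch; whether the tool does the same is not stated here.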