Document tools

Overview

Data Management is optimized for efficient processing of flat, strongly-typed data structures, while document databases employ flexible hierarchical data structures. Use the Document Extractor and Document Injector tools to translate Document type data into and out of Data Management's DLD format or the MongoDB connection tools.

Document Extractor

The Document Extractor tool "flattens" Document type fields into structured records, and can optionally output arrays as series of child records with relational linkage fields on separate output connections. Use this tool to make data embedded in Document fields accessible to Data Management processing, which is designed to handle flat records.
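
As a rough illustration of what "flattening" means here, the following Python sketch turns a nested document into a single-level record keyed by dotted paths. It is not Data Management code; the `flatten` function and the sample document are hypothetical and only mirror the behavior described above.

```python
from typing import Any, Dict


def flatten(document: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into a single-level dict keyed by dotted paths."""
    flat: Dict[str, Any] = {}
    for key, value in document.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into sub-documents (comparable to the Multi-level option).
            flat.update(flatten(value, path))
        else:
            # Scalars and arrays are kept as-is in this sketch.
            flat[path] = value
    return flat


# Hypothetical sample document.
sample = {"name": "john", "address": {"city": "Boston", "geo": {"lat": 42.36}}}
flat_record = flatten(sample)
print(flat_record)
# {'name': 'john', 'address.city': 'Boston', 'address.geo.lat': 42.36}
```

The dotted keys correspond to the Path column of the Field mapping grid, where nested structure fields are separated by a period.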

Document Extractor tool configuration parameters

The Document Extractor tool has two sets of configuration parameters in addition to the standard execution options.

Configuration

Parameter	Description
Input document field	Field from the input connection that contains document values. This field must be of type Document.
Analyze from	Method for creating a recommended field mapping for the document. Sample entered below: If you have a sample document, you can analyze that document and create a recommended Field mapping. This method can analyze arrays and create configurations for child array connectors. The document must be complete, with all fields populated with representative data. If your sample document has any empty or null fields (for example, `{ "A":"", "B":null }`), you must replace them with actual values before proceeding. Captured input: If your upstream connector already has a source of documents (for example, you can connect to a MongoDB database and read from a collection), you can analyze that input.
Capture size	If Analyze from is Captured input, the number of input documents to analyze. If your documents are uniform and densely populated, the default setting of 100 should be fine. If your documents tend to be sparsely populated, with missing field values, you may need to capture more documents. Be aware that captured documents are stored in memory, and setting this value large may exhaust memory available to the JVM and/or require that you increase the JVM memory setting.
Multi-level	Select this to look for nested fields in the document during analysis and create sub-documents. Clearing this will leave sub-documents mapped as type Document.
Detect arrays	Select this to look for fields with array-typed values during analysis, analyze the contents of the array elements, and generate array output mappings and associated output connector types. If arrays are detected, Enable array outputs and Linkage field will be enabled, and Array mapping will be populated. Clearing this will map any array-typed fields to Unicode, and the result will be a JSON-formatted representation of the array rather than separate array outputs.
Enable array outputs	Select this to send extracted document arrays to child connectors as a series of records. Data Management has no direct support for an "array" field type, so documents containing arrays must be broken apart into separate records related by Linkage field.
Linkage field	Relational linkage field generated to link array records to their parent records. This field will contain sequentially generated numbers. You may optionally change the default `__LINKAGE` field name if it collides with other fields (for example, in the second level of an extraction hierarchy where `__LINKAGE` was already used at the parent level). This linkage field will be added to all main records and array records.
Field mapping	Mapping between document input values and output fields and types: Path: the path into the Document structure where values are found. Separate nested structure fields with a period (`.`). Output field: the name of the Data Management output field. Type: the type of the Data Management output field. It must be possible to convert the document values to the target field type, or the field will get an `ERROR` value. Note that everything is convertible to Unicode. Textvar is not recommended because of its code page limitations.
Array mapping	Optionally, mapping between document input array values and output fields and types: Path: the path into the Document structure where values are found. Separate nested structure fields with a period (`.`). Output field: the name of the Data Management output field in each child record where the child record will be stored. Type: the type of the Data Management output field. This is usually Document.
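
The relationship between Enable array outputs and Linkage field can be pictured with a small Python sketch. It is an illustration only, not the tool's implementation: `extract_arrays` is a hypothetical helper, it handles a single top-level array, and starting the sequence at 1 is an assumption (the description above only says the numbers are sequentially generated).

```python
from itertools import count
from typing import Any, Dict, Iterable, List, Tuple

LINKAGE = "__LINKAGE"  # default linkage field name from the table above


def extract_arrays(
    documents: Iterable[Dict[str, Any]], array_path: str
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split each document into one parent record plus one child record per array element."""
    ids = count(1)  # assumption: numbering starts at 1
    parents: List[Dict[str, Any]] = []
    children: List[Dict[str, Any]] = []
    for doc in documents:
        link = next(ids)
        # The parent record keeps everything except the array itself.
        parent = {k: v for k, v in doc.items() if k != array_path}
        parent[LINKAGE] = link
        parents.append(parent)
        # Each array element becomes a child record carrying the same linkage value.
        for element in doc.get(array_path) or []:
            children.append({LINKAGE: link, array_path: element})
    return parents, children


# Hypothetical input documents.
docs = [
    {"name": "acme", "contacts": [{"name": "john"}, {"name": "betty"}]},
    {"name": "zenith", "contacts": []},
]
parents, children = extract_arrays(docs, "contacts")
print(parents)
print(children)
```

Joining the child records back to the parent records on the linkage field recovers which array elements belonged to which document.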

Options

Parameter	Description
Include input fields	Select this to copy all fields from the main input connection to the output.
Ignore timezone when converting String to DateTime	Select this to prevent timezones present in string values from affecting conversions to type `DateTime`.
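
To show why a timezone in a string value can change the converted result, here is a hedged Python sketch. It is an assumption about the general idea, not a description of how Data Management converts values: `parse_datetime` is hypothetical, and normalizing to UTC when the timezone is honored is chosen only for illustration.

```python
from datetime import datetime, timezone


def parse_datetime(text: str, ignore_timezone: bool) -> datetime:
    """Parse an ISO 8601 string and return a naive datetime."""
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        return parsed
    if ignore_timezone:
        # Drop the offset and keep the wall-clock time exactly as written.
        return parsed.replace(tzinfo=None)
    # Assumption for this sketch: honoring the timezone means shifting to UTC.
    return parsed.astimezone(timezone.utc).replace(tzinfo=None)


text = "2024-03-01T12:00:00+05:00"
ignored = parse_datetime(text, ignore_timezone=True)
honored = parse_datetime(text, ignore_timezone=False)
print(ignored)  # 2024-03-01 12:00:00
print(honored)  # 2024-03-01 07:00:00
```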

Configure the Document Extractor tool

To configure a Document Extractor tool, you need the following information:

The name of the input field that contains document values.
Either an upstream connection to a document source, or a sample document.

If you have...	Do this
An upstream connection to a source of documents (for example, you can connect to a MongoDB database and read from a collection).	Configure the Document Extractor tool by analyzing input.
A sample document. The document must be complete, with all fields populated with representative data. If your sample document has any empty or null fields, you must replace these with actual values before proceeding.	Configure the Document Extractor tool by analyzing a sample.
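
Before pasting a sample document, it can help to check it for empty or null fields. The following Python sketch is one possible pre-check run outside Data Management; `find_empty_paths` is a hypothetical helper, and it flags only `null` and empty strings (empty arrays and empty sub-documents are not reported).

```python
import json
from typing import Any, List


def find_empty_paths(value: Any, path: str = "") -> List[str]:
    """Return the dotted paths of all null or empty-string values."""
    if value is None or value == "":
        return [path]
    empty: List[str] = []
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            empty.extend(find_empty_paths(child, child_path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            empty.extend(find_empty_paths(child, f"{path}[{index}]"))
    return empty


# Hypothetical sample with the kind of gaps that must be filled in first.
sample_text = '{ "A": "", "B": null, "C": {"D": 1, "E": null} }'
empty_paths = find_empty_paths(json.loads(sample_text))
print(empty_paths)  # ['A', 'B', 'C.E']
```

Any path this reports should be given a representative value in the sample before you analyze it.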

...by analyzing a sample

To configure the Document Extractor tool by analyzing a sample document:

Select the Document Extractor tool.
Go to the Configuration tab on the Properties pane.
Select Input document field and select the field from the input connection that contains document values.
- This field must be of type “Document”.
In the Analysis section, select Analyze from and choose Sample entered below.
Optionally, select analysis options:
- Select Multi-level to check sub-documents for nested fields.
- Select Detect arrays to look for fields with array-typed values, analyze the contents of the array elements, and generate array output mappings and associated output connector types.
Open your sample document in a text editor, and copy it to the Clipboard.
- The document must be complete, with all fields populated with representative data.
In Data Management, select Enter sample source document, and then paste the document into the text box.
Select Analyze.
Examine the Field mappings grid (and optionally the Array mapping grid) to add, remove, or adjust mappings.
When you are satisfied with the mappings, select Analyze from and choose None.
Select the Options tab and configure options.
Optionally, select the Execution tab, and then set Report options and Web service options.

...by analyzing input

To configure the Document Extractor tool by analyzing the input schema:

Connect the Document Extractor tool to a source of Document values, such as a configured MongoDB Input tool.
Select the Document Extractor tool, and then go to the Configuration tab on the Properties pane.
Select Input document field and select the field from the input connection that contains document values. This field must be of type “Document”.
In the Analysis section, select Analyze from and choose Captured input.
Optionally, select analysis options:
- Select Multi-level to check sub-documents for nested fields.
- Select Detect arrays to look for fields with array-typed values, analyze the contents of the array elements, and generate array output mappings and associated output connector types.
Select Analyze.
Examine the Field mappings grid (and optionally the Array mapping grid) to add, remove, or adjust mappings.
When you are satisfied with the mappings, select Analyze from and choose None.
Select the Options tab and configure options.
Optionally, go to the Execution tab, and then set Web service options.
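
The effect of Capture size on sparsely populated documents can be illustrated with a short Python sketch. This is not the analyzer's actual algorithm; `observed_paths` and `paths_seen` are hypothetical helpers that simply collect the populated paths found in the first few documents.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Set


def observed_paths(document: Dict[str, Any], prefix: str = "") -> Set[str]:
    """Collect dotted paths that hold a non-empty value in one document."""
    paths: Set[str] = set()
    for key, value in document.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            paths |= observed_paths(value, path)
        elif value is not None and value != "":
            paths.add(path)
    return paths


def paths_seen(documents: Iterable[Dict[str, Any]], capture_size: int) -> Set[str]:
    """Union of populated paths across the first capture_size documents."""
    seen: Set[str] = set()
    for doc in islice(documents, capture_size):
        seen |= observed_paths(doc)
    return seen


# Hypothetical sparse input: the nested field appears only in the third document.
docs = [{"a": 1}, {"a": 2, "b": None}, {"a": 3, "c": {"d": "x"}}]
print(sorted(paths_seen(docs, 2)))  # ['a']
print(sorted(paths_seen(docs, 3)))  # ['a', 'c.d']
```

In this sketch a capture of two documents never sees `c.d`, which is the kind of gap a larger capture, or a manual edit of the field mapping, would close.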

Edit field mapping

Sometimes the Document Extractor tool's Analyze function does not produce the correct field mapping. Common issues include:

There are sparsely-populated fields.
The analyzer doesn't choose the desired field type.
You want finer control over which sub-documents are expanded into individual fields, and which are mapped to fields of type Document.

To edit the field mapping, use the buttons above the grid to add, remove, and move rows. For each row:

Path: the path into the Document structure where values are found. Separate nested structure fields with a period (.).
Output field: the name of the Data Management output field.
Type: the type of the Data Management output field. It must be possible to convert the document values to the target field type, or the field will get an ERROR value. Note that everything is convertible to Unicode. Textvar is not recommended because of its code page limitations.
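
The three columns of a field mapping row can be modeled as a (path, output field, converter) triple. The Python sketch below is a conceptual stand-in rather than the tool's behavior: `get_path` and `apply_mapping` are hypothetical, the string `"ERROR"` merely represents Data Management's ERROR value, and emitting a null for a missing path is an assumption.

```python
from typing import Any, Callable, Dict, List, Tuple

ERROR = "ERROR"  # stand-in for the ERROR value described above

Mapping = Tuple[str, str, Callable[[Any], Any]]  # (Path, Output field, Type converter)


def get_path(document: Dict[str, Any], path: str) -> Any:
    """Follow a period-separated path; return None if any step is missing."""
    current: Any = document
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def apply_mapping(document: Dict[str, Any], mappings: List[Mapping]) -> Dict[str, Any]:
    """Build a flat record, marking values that cannot be converted."""
    record: Dict[str, Any] = {}
    for path, output_field, convert in mappings:
        value = get_path(document, path)
        if value is None:
            record[output_field] = None  # assumption: missing values become null
            continue
        try:
            record[output_field] = convert(value)
        except (TypeError, ValueError):
            record[output_field] = ERROR
    return record


# Hypothetical mapping rows and input document.
mappings: List[Mapping] = [("name", "NAME", str), ("address.zip", "ZIP", int), ("age", "AGE", int)]
doc = {"name": "john", "address": {"zip": "02110"}, "age": "unknown"}
record = apply_mapping(doc, mappings)
print(record)  # {'NAME': 'john', 'ZIP': 2110, 'AGE': 'ERROR'}
```

The `AGE` row shows the failure case: the value cannot be converted to the integer target type, whereas mapping it to a text type would always succeed.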

Edit array mapping

If your Document Extractor tool is configured to Enable array outputs, the array mappings are displayed in the Array mapping grid. If the Document Extractor tool's Analyze function does not produce the correct array mapping, or you have no arrays in your sample document, you may wish to edit the array mapping.

To edit the array mapping, use the buttons above the grid to add, remove, and move rows. For each row:

Path: the path into the Document structure where values are found. Separate nested structure fields with a period (.).
Output field: the name of the Data Management output field in each child record where the child record will be stored.
Type: the type of the Data Management output field. This is usually Document because most arrays contain documents like this: [{"name":"john"},{"name":"betty"},{"name":"sally"}].

However, Type may be another type if the array contains scalar values. For example, the following array would map to child records whose field type is Unicode: ["john","betty","sally"].

If you want to further flatten the Documents contained in child records, connect another Document Extractor tool to the child connector and repeat the analysis and configuration process.

Data Management cannot directly extract the contents of two-dimensional or higher arrays. In such cases the nested arrays will be mapped to Unicode and formatted as JSON.
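
The three array shapes discussed in this section (arrays of documents, arrays of scalars, and nested arrays) can be told apart with a simple check. The Python sketch below is illustrative only; `classify_array` is a hypothetical helper, and treating mixed content as scalar is a simplification.

```python
import json
from typing import Any, List


def classify_array(values: List[Any]) -> str:
    """Classify an array by the kind of elements it holds."""
    if not values:
        return "empty"
    if any(isinstance(v, list) for v in values):
        # Two-dimensional or higher: not extracted into child records.
        return "nested"
    if all(isinstance(v, dict) for v in values):
        # Elements are documents, so the child field type is usually Document.
        return "document"
    # Scalar elements can map to a scalar type such as Unicode.
    return "scalar"


people = [{"name": "john"}, {"name": "betty"}, {"name": "sally"}]
names = ["john", "betty", "sally"]
grid = [[1, 2], [3, 4]]  # hypothetical two-dimensional array

kinds = [classify_array(a) for a in (people, names, grid)]
print(kinds)  # ['document', 'scalar', 'nested']

# A nested array kept as JSON-formatted text rather than as child records.
grid_as_text = json.dumps(grid)
print(grid_as_text)  # [[1, 2], [3, 4]]
```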

Document Injector

The Document Injector tool is the opposite of the Document Extractor—it reads records from its input(s) and copies field values from input records to a new or existing document. It can also inject arrays into the document by integrating documents from additional "child" inputs. Use this tool to transform flat Data Management records into structured documents.
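
Conceptually, injection is the reverse of flattening: each input field is written to a dotted path inside a document. The Python sketch below illustrates both modes under stated assumptions; `set_path` and `inject` are hypothetical helpers, and the sketch assumes that every intermediate path segment is either absent or already a sub-document.

```python
from typing import Any, Dict, List, Optional, Tuple


def set_path(document: Dict[str, Any], path: str, value: Any) -> None:
    """Write value at a period-separated path, creating sub-documents as needed."""
    parts = path.split(".")
    current = document
    for part in parts[:-1]:
        current = current.setdefault(part, {})
    current[parts[-1]] = value


def inject(
    record: Dict[str, Any],
    mappings: List[Tuple[str, str]],  # (Path, Input field) rows
    document: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Create a new document, or write into an existing one (modified in place)."""
    target: Dict[str, Any] = {} if document is None else document
    for path, input_field in mappings:
        set_path(target, path, record[input_field])
    return target


# Hypothetical flat record and mapping rows.
record = {"NAME": "john", "CITY": "Boston", "LAT": 42.36}
mappings = [("name", "NAME"), ("address.city", "CITY"), ("address.geo.lat", "LAT")]

new_doc = inject(record, mappings)  # comparable to "Create new document"
existing = {"_id": 7, "address": {"state": "MA"}}
updated = inject(record, mappings, existing)  # comparable to "Write to existing document"
print(new_doc)
print(updated)
```

In the second call the fields already present in the existing document (`_id` and `address.state`) are preserved and the mapped values are added alongside them.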

Document Injector tool configuration parameters

The Document Injector tool has two sets of configuration parameters in addition to the standard execution options.

Configuration

Parameter	Description
Mode	Processing mode: Create new document or Write to existing document.
Output document field	Field on the output connection to receive document values.
Analyze from	Method for creating a recommended field mapping for the document. Sample: If you have a sample document, you can analyze that document and create a recommended Field mapping. This method can analyze arrays and create configurations for child array connectors. The document must be complete, with all fields populated with representative data. If your sample document has any empty or null fields (for example, `{ "A":"", "B":null }`), you must replace them with actual values before proceeding. Input: If your upstream connector already has a source of documents, you can analyze that input schema and create a matching Field mapping.
Enable array inputs	Select this to enable child array connectors on input.
Linkage field	Relational linkage field linking child array records to their parent records.
Field mapping	Mapping between input values and document outputs: Path: the path into the Document structure where values will be injected. Separate nested structure fields with a period ( `.` ). Input field: the name of the Data Management input field.
Array mapping	Optionally, mapping between document input array values and output fields and types: Path: the path into the Document structure where values will be injected. Input field: the name of the Data Management Input field in each child record where the child record will be stored.
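
How child array inputs are reassembled through the linkage field can be sketched in Python as follows. This is an illustration, not the tool's implementation: `inject_arrays` is a hypothetical helper, the field names are invented, and it handles only a single top-level array path.

```python
from collections import defaultdict
from typing import Any, Dict, List


def inject_arrays(
    parents: List[Dict[str, Any]],
    children: List[Dict[str, Any]],
    array_path: str,
    input_field: str,
    linkage: str = "__LINKAGE",
) -> List[Dict[str, Any]]:
    """Attach child values to their parent as an array, matched on the linkage field."""
    grouped: Dict[Any, List[Any]] = defaultdict(list)
    for child in children:
        grouped[child[linkage]].append(child[input_field])
    documents: List[Dict[str, Any]] = []
    for parent in parents:
        # The linkage field is relational plumbing, so it is left out of the document here.
        doc = {k: v for k, v in parent.items() if k != linkage}
        doc[array_path] = grouped.get(parent[linkage], [])
        documents.append(doc)
    return documents


# Hypothetical parent and child records.
parents = [{"__LINKAGE": 1, "name": "acme"}, {"__LINKAGE": 2, "name": "zenith"}]
children = [
    {"__LINKAGE": 1, "CONTACT": {"name": "john"}},
    {"__LINKAGE": 1, "CONTACT": {"name": "betty"}},
]
documents = inject_arrays(parents, children, "contacts", "CONTACT")
print(documents)
```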

Options

Parameter	Description
Include input fields	Select this to copy all fields from the main input connection to the output.
Include nulls in document, DateTime conversion, Date conversion, Decimal conversion, Interpret `binary _id` field as `ObjectID`	See Document database field conversion options.

Configure the Document Injector tool

To configure a Document Injector tool, you need the following information:

The name of the input field that contains document values.
Either an upstream connection to a document source, or a sample document that you want to produce.

If you have...	Do this
An upstream connection to a source of documents (for example, you can connect to a MongoDB database and read from a collection).	Configure the Document Injector tool by analyzing input.
A sample document. The document must be complete, with all fields populated with representative data. If your sample document has any empty or null fields, you must replace these with actual values before proceeding.	Configure the Document Injector tool by analyzing a sample.

...by analyzing a sample

To configure the Document Injector tool by analyzing a sample document:

Select the Document Injector tool.
Go to the Configuration tab on the Properties pane.
Select Mode, and then choose Create new document or Write to existing document.
Select Output document field and specify the field that will receive document values.
In the Analysis section, select Analyze from and choose Sample.
Optionally, select analysis options:
- Select Multi-level to check sub-documents for nested fields.
- Select Detect arrays to look for fields with array-typed values, analyze the contents of the array elements, and generate array output mappings and associated output connector types.
Open your sample document in a text editor, and copy it to the Clipboard.
- The document must be complete, with all fields populated with representative data.
In Data Management, select Enter sample source document, and then paste the document into the text box.
Select Analyze.
Examine the Field mappings grid (and optionally the Array mapping grid) to add, remove, or adjust mappings.
When you are satisfied with the mappings, select Analyze from and choose None.
Optionally, select the Options tab and configure output options:
- Select Include input fields to copy all fields from the main input connection to the output.
- Configure document database field conversion options.
Optionally, select the Execution tab, and then set Report options and Web service options.

...by analyzing input

With this configuration method, you can automatically configure a Document Injector tool to match the fields available on the upstream connection. This is useful when you do not know the target document format.

To configure the Document Injector tool by analyzing the input schema:

Connect the Document Injector tool to a source of Document values.
Select the Document Injector tool.
Go to the Configuration tab on the Properties pane.
Select Mode, and then choose Create new document or Write to existing document.
Select Output document field and specify the field that will receive document values.
In the Analysis section, select Analyze from and choose Input.
Select Analyze.
Examine the Field mappings grid to add, remove, or adjust mappings.
When you are satisfied with the mappings, select Analyze from and choose None.
Optionally, select the Options tab and configure output options:
- Select Include input fields to copy all fields from the main input connection to the output.
- Configure document database field conversion options.
Optionally, go to the Execution tab, and then set Web service options.

Edit field mapping

Sometimes the Document Injector tool's Analyze function does not produce the correct field mapping. Common issues include:

There are sparsely-populated fields.
The analyzer doesn't choose the desired field type.
You want finer control over which sub-documents are expanded into individual fields, and which are mapped to fields of type Document.

To edit the field mapping, use the buttons above the grid to add, remove, and move rows. For each row:

Path: the path into the Document structure where values will be injected. Separate nested structure fields with a period ( . ).
Input field: the name of the Data Management input field.

Edit array mapping

If your Document Injector tool is configured to Enable array inputs, the array mappings are displayed in the Array mapping grid. If the Document Injector tool's Analyze function does not produce the correct array mapping, or you have no arrays in your sample document, you may wish to edit the array mapping.

To edit the array mapping, use the buttons above the grid to add, remove, and move rows. For each row:

Path: the path into the Document structure where values will be injected. Separate nested structure fields with a period ( . ).
Input field: the name of the Data Management input field in each child record where the child record will be stored.