Reference: Equivalents to SQL queries
Overview
SQL queries are a kind of "kitchen sink" approach to data extraction—everything you need is in one place. This can lead to large, complex SQL queries that are difficult to understand. With Data Management, the components of an SQL query are divided into functional steps and addressed with separate tools. This results in an understandable and self-documenting process.
Many SQL query components have equivalent Data Management tools:
SELECT
= Select toolFROM
= Input toolsORDER BY
= Sort toolGROUP BY
= Summarize toolHAVING
= Filter tool
SELECT
Data Management's Select tool is roughly equivalent to the SELECT
clause of an SQL query. The Select tool specifies which input fields (columns) you wish you send through the project. It also allows you to change the names of selected fields.
Select can be connected directly downstream from an input tool. Connect a Select tool to each input tool to reduce the set of fields. You can also connect a Select tool to the output of a Join tool to simulate a "natural" Join. The sample project sql_equivalent2 demonstrates this technique.
FROM
Data Management's input tools do the work of SQL's FROM
clause. In the SQL statement SELECT * FROM CUSTOMERS, TRANSACTIONS
, the FROM clause names the database tables from which the raw records for the query are obtained. In Data Management, this corresponds to the data-input tools, such as DBF Input and Flat File Input. You configure an input tool for each table you wish to use as input to a Data Management project.
If you are extracting data from an SQL system to use as input data for a Data Management project, use query and filter settings to minimize the amount of data. You should select only the data you need for the tasks you plan to perform. Reducing the amount of data will speed up your Data Management processing.
WHERE
The SQL WHERE
clause has serves both relational and filtering functions. Two Data Management tools perform the work of the WHERE
clause: the Join tool and the Filter tool.
Join
The Join tool replaces the relational portions of the SQL WHERE
clause. As the name implies, it joins together two tables by matching the values of a single field.
Unlike the join operation implemented by the SQL WHERE
clause, Data Management's Join tool simultaneously computes the left outer, inner, and right outer joins, supplying those tables on outputs labeled "L", "J" and "R" respectively. The availability of all three outputs makes a variety of powerful operations possible. You can:
Find out which records don't match between two tables
Perform a different operation on matching records than non-matching records.
Perform a "table lookup", and supply a default value for non-matching records.
To join more than one table, connect a cascading series of Join tools, where each Join tool adds another table to the results. The sample project publishers_with_fem_authors
demonstrates this technique.
To join on more than one text field, attach a Calculate tool upstream of the Join tool on both the "L" and "R" inputs, and create a new field by combining all of the Join fields. For example, you could create a new field called FULLNAME
and assign the expression FIRST_NAME+LAST_NAME
. After the Join tool, use Select to remove the temporary joining fields.
Filter
The Filter tool filters records from source tables or join results. You can:
Apply filters involving a single table (as in the SQL query
WHERE company.state = "AZ")
.Apply filters involving multiple tables (as in the SQL query
WHERE company.state = "AZ" OR employee.hire_year < 1990)
.
To filter input records with expressions involving the fields of a single table, place a Filter tool immediately "downstream" of the Input tool (before any joins). Note that some database Input tools may have built-in query settings that you can use instead of an explicit Filter tool. If you are extracting a small subset of a large database, this will usually be faster than reading all of the data and then using a separate Filter tool.
To filter input records with expressions involving the fields of multiple tables, place a Filter tool immediately "downstream" of the Join tool that matches the tables involved in the expression.
GROUP BY
Data Management's Summarize tool replaces the SQL GROUP BY
clause. It groups records by a one or more fields, and reduces each group to a single representative record. You can specify how records in the groups are combined by selecting an operation for each desired field. Operations include Total, Average, Maximum, Minimum, and Count.
To group and aggregate your records, connect a Summarize tool downstream of any Join tool.
ORDER BY
Data Management's Sort tool performs the work of the SQL ORDER BY
clause. It globally sorts the input records by one or more fields, each of which may be ordered ascending or descending.
To replicate the SQL ORDER BY
clause, connect a Sort tool downstream of any Join or Filter tool.
HAVING
The SQL HAVING
clause is similar to WHERE, but is applied to groups rather than fields.
To replicate the HAVING
clause, connect a Filter tool immediately "downstream" of the Summarize tool implementing the GROUP BY clause.