Concept¶
Python’s okay, but my mind just seems to work in SQL
That’s nice, but that’s not a good reason to build temporary SQLite databases full of transient data that’s going to be reorganized or summarized.
The cruft of table schemas, bulk inserts, queries, and bulk extracts is a heavy, extraneous burden to permit the use of a SELECT
statement.
SQL is a good way to start the design process. There’s a lot to be said the Essential SQL Algorithm. It’s not the best target implementation.
We can leverage SQL thinking to create Python code that avoids SQL database overheads and complications.
An Example¶
Some SQL
SELECT n.name, v.c2
FROM names_table n, values_table v
WHERE n.code = v.c1
Some Python that does the same thing.
Select(name=lambda cr: cr.n.name, value=lambda cr: cr.v.c2)
.from_(n=names_table, v=values_table)
.where(lambda cr: cr.n.code == cr.v.c1)
Yes. The Python is longer. Yes it has lambda cr: cr.
scattered around.
The Python produces the same results as the SQL query, using essentially the same algorithm.
However, It operates directly on native Python data structures. This means that all Python functions and object classes are available. There’s no looking for a SQL equivalent, or – worse – a hybrid semi-SQL/semi-Python monstrosity.
The lambda
objects are necessary for building expressions that work on the “composite-row” objects fetched from tables.
We don’t want the Python run-time to immediately compute an answer to n.code == v.c1
.
We want to provide a version of this expression to be evaluated for each row of the underlying tables.
Since a lambda is a function, any one-argument function can be used. That’s a real any – not an anything you can put into a wrapper or some other hack – since this code is native Python.
What’s The Point?¶
The SQL algorithm is a helpful mental model. Folks who find it easier to think in SQL are leveraging an elegant algorithm that has a number of related design patterns. The point of this module is not to stop using SQL. The point is to stop using a database.
What follows is a deep dive into the SQL Algorithm, and some details of SQL features.
Feel free to skip to Demonstration Code to see a few examples of how this works.
The SQL Algorithm¶
We’ll decompose the hugely complicated SELECT
statement into three parts:
The
ORDER BY
andLIMIT
options that are appended to the result of the set operations.The set operations like
UNION
that apply to multiple sub-select statements.The sub-select statement – the essential
SELECT FROM WHERE GROUP BY HAVING
part.
We’ll focus on the sub-select, becuase (a) it’s central, and (b) the other parts are add-ons.
We’ll work with a collection of tables, \(T_1\) through \(T_{n}\). Each table, \(T_{x}\), is a set of rows, \(T_x = \{ r_1, r_2, \dots, r_{m} \}\). Each row, \(r\), is a tuple of atomic values, \(r = \langle v_1, v_2, \dots \rangle\).
A Query, \(Q(\mathbb{T})\), over the collection of tables, \(\mathbb{T} = \{T_1, T_2, \dots, T_{n}\}\), is the following:
There are five higher-order functions, \(H_f(R)\), \(G_{r}(T)\), \(S_a(T)\), \(S_m(T)\), and \(W_f(T)\). There’s one ordinary function \(F(T)\). The higher-order functions are parameterized by other functions. These parameter functions are applied to the individual rows of table collections.
This requires rethinking the order of the clauses.
The order FROM t WHERE C_w SELECT e_1, e_2, a_1, a_2, ... GROUP BY k HAVING C_h
reflects the order in which processing occurs.
The algorithm has the following stages:
Create a “from product”, \(\mathbb{T}^{*} = F(\mathbb{T})\), from the collection of tables \(\mathbb{T}\).
This joins all the tables. It’s a kind of \(\prod \mathbb{T}\), sometimes stated as \(\mathbb{T}^{*} = T_1 \bowtie T_2 \bowtie \dots \bowtie T_n\).
Apply a “where filter”, \(\mathbb{T}^{*^\prime} = W_{C_w(r)}(\mathbb{T}^{*})\).
This applies a condition, \(C_w(r)\), to each row in the joined tables, \(r \in \mathbb{T}^{*}\), to pass rows matching the condition. This creates a subset, \(\mathbb{T}^{*^\prime} \subseteq \mathbb{T}^{*}\). \(\mathbb{T}^{*^\prime} = \{ r \mid r \in \mathbb{T}^{*} \textbf{ and } C_w(r) \}\). In effect, omitting a
WHERE
clause makes \(C_w(r) = \mathtt{True}\), leading to a cartesian product result.A
JOIN ON
clause introduces additional logic to \(C_w(r)\). In principle, it might imply a sequence of pair-wise joins as the algorithm. While this might be a nice optimization, it doesn’t materially alter the processing. It is often easier to read than a massiveWHERE
clause.Select mapping, \(R = S_{E_m(r), \dots}(\mathbb{T}^{*^\prime})\).
This transforms the filtered rows of \(\mathbb{T}^{*^\prime}\) to a new table structure, \(R\), by computing values or “projecting” values from the source tables. Each \(E_m(r)\) function maps one composite row of values to a single target value. The original table identities are lost at this point. \(R = \{ \langle E_0(r), \dots, E_m(r) \rangle \mid r \in \mathbb{T}^{*^\prime} \}\).
Note that the \(E_m(r)\) collection of expressions is a subset of the expressions in the SQL
SELECT
clause. SQLSELECT
syntax also includes aggregate expressions, \(A_p(g)\), even though they’re properly part ofGROUP BY
processing.Each expression, \(E_m(r)\), can be viewed as a one-argument lambda, \(E_m(r) = \lambda r.E_m\). Using this alternate notation might simplify the overall presentation.
Group By reduction, \(\mathbb{R} = \{R_k \mid k \in \mathbb{K}\}\), where \(R_k = \{ r \mid r \in R \textbf{ and } K(r) = k \}\).
The given \(K(r)\) function maps each row, \(r\), to a key value. (This may be a tuple of atomic values.) This function has some universe of values over the rows in \(R\), \(\mathbb{K} = \{K(r) \mid r \in R\}\). This universe of actual values defines the groups that are created.
Absent any explicit group-by clause, all the data belongs to a single group, and the universe of key values is a single, anonymous value, \(\mathbb{K} = \{\bot\}\). This means there’s only one sub-table, which is the original table, \(\mathbb{R} = \{R_\bot\} = \{R\}\).
Select aggregate mapping, \(\biguplus\mathbb{R} = \{ \langle A_0(R_k), A_1(R_k), \dots, A_p(R_k) \rangle \mid R_k \in \mathbb{R} \}\). This creates a new table, \(\biguplus\mathbb{R}\), with rows built from applying aggregate functions to each sub-table, \(R_k\). Additionally, the key value, \(k\), is also available aspart of the constructed row. (This can be a tuple of atomic values.)
While the aggregate functions are specified in the SQL
SELECT
clause, they are applied to each sub-table created by the group-by function. The \(A_0(R_k), \dots, A_p(R_k)\) functions compute the collection of aggregate values for all rows in a sub-table.Each expression, \(A_p(R_k)\), is a lambda over a sub-table, \(\lambda R_k.A_p(R_k)\). Consider summation, for example: \(\lambda R_k.\sum\limits_{r \in R_k}r\). The count, as another example, is \(\lambda R_k.\sum\limits_{r \in R_k} 1\).
Aggregate functions include sum, mean, count, min, max, among others. There are two variants: the default behavior creates a list of values; the
DISTINCT
variant creates a set of only the distinct values.When there’s no group-by, the resulting table has a single row computed by the aggregate functions.
Important. If there is no
GROUP BY
clause and no aggregate functions, no transformation occurs: \(\biguplus\mathbb{R} = R\). (Since there can be noHAVING
clause without aGROUP BY
clause, with noGROUP BY
, processing is effectively finished when there’s noGROUP BY
.)Note. If there’s a
GROUP BY
clause and no aggregate functions, the SQL seems invalid.Apply a “having filter”, \(\biguplus\mathbb{R}^{\prime} = H_{C_h(r)}(\biguplus\mathbb{R})\). This is identical to where-clause processing.
This applies a condition, \(C_h(r)\), to each row in the group-by result, \(\biguplus\mathbb{R}\), to pass matching rows. This creates a subset, \(\biguplus\mathbb{R}^{\prime} \subseteq \biguplus\mathbb{R}\). \(\biguplus\mathbb{R}^{\prime} = \{ r \mid r \in \biguplus\mathbb{R} \textbf{ and } C_h(r) \}\).
Absent a
HAVING
clause, \(C_h(r) = \mathtt{True}\), and all rows are kept.
Yes, the order these functions are applied is not the order SELECT
statements are commonly written.
We can think of this as a composite function.
We can rearrange things so it is closer to SELECT
syntax.
The above expression is similar to the more commonly-used order of clauses.
What’s important are these features:
The sequence of operations is based on higher-order functions
filter()
,map()
reduce()
, and one ordinaryproduct()
function.The sequence applies to “composite” rows from a number of tables prior to the
SELECT
and new rows from a single table after theSELECT
.All SQL expressions are functions that apply to rows of a table. In the case of the
SELECT
expressions that are scalar, and theWHERE
expression, the “row” is a composite object from the \(T^{*}\) interim result. In the case of the aggregateSELECT
epxressions and theHAVING
expression, the row is a simple row of values.
Here’s the way higher-order functions apply to SQL clauses:
Clause |
Function |
---|---|
|
|
|
|
|
|
|
Partition using \(K(r)\) to create \(\mathbb{R}\). |
|
|
|
|
Therefore,:
filter(h,
reduce(s_a,
parition(k,
map(s_e,
filter(w,
itertools.product(T))))))
This is a conceptual overview of the SQL operation.
The tricky part is the partition()
function isn’t built-in to itertools
.
This function can be more usefully created around a defaultdict(list)
structure to create lists of rows in each partition.
Each of these lists can then be aggregated and filtered.
The Group By Alternatives¶
There are four cases for GROUP BY
and aggregate functions in the SELECT
clause:
Neither
GROUP BY
, nor aggregates inSELECT
. The results of \(R = (S_E \circ W \circ F)(\mathbb{T})\) are complete.No
GROUP BY
, but one or more aggregates inSELECT
. The result is a single summary row. It’s \(R = (S_A \circ S_E \circ W \circ F)(\mathbb{T})\), but the group-by operation is a kind of degenerate case; it creates a single group and therefore a single result row from the aggregate computation, \(S_A\). There can be noHAVING
without aGROUP BY
.A
GROUP BY
clause, and aggregates inSELECT
. The result is a new table of summary rows, \(R = (S_A \circ G \circ S_E \circ W \circ F)(\mathbb{T})\). This can then be processed by theHAVING
clause.A
GROUP BY
clause, but no aggregates inSELECT
. Not sure what this means. SQLite3 appears to ignore theGROUP BY
key definition and produce all rows. This doesn’t seem completely sensible; it seems more sensible to emit the distinct combinations of key values.
The Subqueries and the Exists Function¶
A subquery can appear in a number of places:
FROM
SELECT
WHERE
HAVING
See https://www.w3resource.com/sql/subqueries/understanding-sql-subqueries.php for examples.
For the FROM
clause, the subquery provies a table. This is consistent with the definition of \(Q(\mathbb{T})\) above.
For the other clauses, there are three kinds of results: a set of values, a single value, or a boolean.
The subquery produces a set of values used for collection operators like
IN
orNOT IN
. This suggests a value selector function can pick values from a single column of the result. The value selector, \(V_c(Q(\mathbb{T}))\), will pick one column, \(c\) from all rows, \(V_c(Q(\mathbb{T})) = \{r_c \mid r \in Q(\mathbb{T}) \}\), to create a set of values.The subquery produces a single value for scalar operators. This suggests a wrapper function to pick a row from the result and then pick one value from the chosen row. The resulting table has one or more rows, \(T = \{r_1, r_2, \dots, r_n\}\). The first row has one or more values, \(r_1 = \langle v_1, v_2, \dots, v_n \rangle\). The value selector function, \(V_{n, c}(Q(\mathbb{T}))\), can pick row \(n\), and column \(c\) of the table to retrieve the scalar value. Most of the time, this is \(V_{1, c}\) to pick a column from the first (or only) row.
In an
EXISTS()
context the subquery producing any result at all meansEXISTS()
isTrue
. Failing to produce a result meansEXISTS()
isFalse
. A function, \(\exists(Q(\mathbb{T}))\) is applied to see if there was at least one row in the subquery result.
These wrapper functions to get all values from a columns or a specific value from a row and a column are implicit in SQL.
The EXISTS()
function is the only one that’s explicit.
The implicit value-extraction is a handy assumption that simplifies SQL slightly.
There are two subquery contexts:
Independent. In this case, the subquery has no expressions that reference tables from the parent query. The subquery must be executed first, and the resulting value provided to the parent query.
Bound. In this case, a subquery has one or more expressions that reference tables from the context query. This means the \(Q(\mathbb{T})\) function requires a second argument value: \(Q(\mathbb{T}; r)\), where \(r\) is the current row in the context query. This also means any of the functions \(E_m(r)\), \(C_w(r)\), or \(C_h(r)\) may include the results of a subquery. For example, \(E_m(x) = V_{1,1}(Q_b(\mathbb{T}; x))\), extracts a scalar result of a bound subquery, \(Q_b\), when applied to a collection of tables with a context row, \(x\). This can happen when a
SELECT
clause expression, \(E_m(x)\) has a reference to a subquery.One commonly-used function for the
WHERE
andHAVING
clauses is theEXISTS
test, \(\exists(Q(\mathbb{T}; r)\)). This may include a function of a scalar result of a bound subquery, \(V_{1,1}(Q_b(\mathbb{T}; r))\), in an more complex condition.
Rows from the context query are available in a bound subquery, \(Q_b\), in SQL without any special syntax. What’s essential here is the subquery processing has very handy implicit behavior.
Common Table Expressions¶
A Common Table Expression (CTE) has a creation query, \(Q_w\), prior to a target query.
These are specified in a WITH
clause, prior to the target select.
The creation query prepares a table-like structure that can be incorporated into another query.
Additionally, the creation query can involve recursion.
Note there are two variants of the subquery: an initialization clause, \(Q_{W0}\), and a recursion clause, \(Q_{W*}\).
Often the initialization clause is a VALUES
clause providing literal values.
The recursion, \(Q_{W*}\), is specified as a UNION
or UNION ALL
clause that’s syntactically part of the initial VALUES
clause.
The choice between breadth-first and depth-first traversal of the query results is specified with an ORDER BY
clause.
The default is breadth-first.
Other Query Features¶
Some “other” features of SQL queries include the following:
Order BY. This is best handled by Python’s native
sorted()
function.sorted(fetch(Q), key=lambda row: ...)
.Limit. This can be handled by Python’s native list slicing.
list(fetch(Q))[start:stop]
.However, this can be handled better by applying a filter to the enumerated rows.
(r for n, r in enumerate(fetch(Q), start=1) if start <= n < stop)
. We can imagine alimit(query, slice(start, stop, step))
iterator that yields the expected subset.Union, Intersect, etc. There are set operations that are part of Python. The complication here is that the underlying
sqlful.Row
objects are mutable dictionaries. To do set operations, it’s best to make immtutable, frozen dataclasses.
These can all be done with relative ease. There isn’t any SQL-like syntax for these features.