- How important SQL is to the professional Java world.
- How common it is to forget about some basic SQL things.
- How well SQL-centric libraries such as jOOQ or MyBatis are responding to market needs, by embracing SQL. An amusing fact is that users have even mentioned my blog post on SLICK’s mailing list. SLICK is a non-SQL-centric database access library in Scala. Like LINQ (and LINQ-to-SQL) it focuses on language integration, not on SQL code generation.
1. Not using PreparedStatements
Interestingly, this mistake or misbelief still surfaces blogs, forums and mailing lists many years after the appearance of JDBC, even if it is about a very simple thing to remember and to understand. It appears that some developers refrain from using PreparedStatements for any of these reasons:- They don’t know about PreparedStatements
- They think that PreparedStatements are slower
- They think that writing a PreparedStatement takes more effort
- You can omit syntax errors originating from bad string concatenation when inlining bind values.
- You can omit SQL injection vulnerabilities from bad string concatenation when inlining bind values.
- You can avoid edge-cases when inlining more “sophisticated” data types, such as TIMESTAMP, binary data, and others.
- You can keep open PreparedStatements around for a while, reusing them with new bind values instead of closing them immediately (useful in Postgres, for instance).
- You can make use of adaptive cursor sharing (Oracle-speak) in more sophisticated databases. This helps prevent hard-parsing SQL statements for every new set of bind values.
- DELETED = 1
- STATUS = 42
- FIRST_NAME LIKE “Jon%”
- AMOUNT > 19.95
More background info:
- Caveats of bind value peeking: An interesting blog post by Oracle Guru Tanel Poder on the subject
- Cursor sharing. An interesting Stack Overflow question.
By default, always use PreparedStatements instead of static statements. By default, never inline bind values into your SQL.
2. Returning too many columns
This mistake is quite frequent and can lead to very bad effects both in your database’s execution plan and in your Java application. Let’s look at the second effect first:Bad effects on the Java application:
If you’re selecting * (star) or a “default” set of 50 columns, which you’re reusing among various DAOs, you’re transferring lots of data from the database into a JDBC ResultSet. Even if you’re not reading the data from the ResultSet, it has been transferred over the wire and loaded into your memory by the JDBC driver. That’s quite a waste of IO and memory if you know that you’re only going to need 2-3 of those columns.
This was obvious, but beware also of…
Bad effects on the database execution plan:
These effects may actually be much worse than the effects on the Java application. Sophisticated databases perform a lot of SQL transformation when calculating the best execution plan for your query. It may well be that some parts of your query can be “transformed away”, knowing that they won’t contribute to the projection (SELECT clause) or to the filtering predicates. I’ve recently blogged about this in the context of schema meta data:
How schema meta data impacts Oracle query transformations
Now, this is quite a beast. Think about a sophisticated SELECT that will join two views:
1
2
3
4
| SELECT * FROM customer_view c JOIN order_view o ON c.cust_id = o.cust_id |
1
2
3
4
| SELECT c.first_name, c.last_name, o.amount FROM customer_view c JOIN order_view o ON c.cust_id = o.cust_id |
The Cure:
Never execute SELECT *. Never reuse the same projection for various queries. Always try to reduce the projection to the data that you really need.
Note that this can be quite hard to achieve with ORMs.
3. Thinking that JOIN is a SELECT clause
This isn’t a mistake with a lot of impact on performance or SQL correctness, but nevertheless, SQL developers should be aware of the fact that the JOIN clause is not part of the SELECT statement per se. The SQL standard 1992 defines atable reference
as such:6.3 <table reference> <table reference> ::= <table name> [ [ AS ] <correlation name> [ <left paren> <derived column list> <right paren> ] ] | <derived table> [ AS ] <correlation name> [ <left paren> <derived column list> <right paren> ] | <joined table>The
FROM
clause and also the joined table
can then make use of such table references
:7.4 <from clause> <from clause> ::= FROM <table reference> [ { <comma> <table reference> }... ] 7.5 <joined table> <joined table> ::= <cross join> | <qualified join> | <left paren> <joined table> <right paren> <cross join> ::= <table reference> CROSS JOIN <table reference> <qualified join> ::= <table reference> [ NATURAL ] [ <join type> ] JOIN <table reference> [ <join specification> ]Relational databases are very table-centric. Many operations are performed on physical, joined or derived tables in one way or another. To write SQL effectively, it is important to understand that the
SELECT .. FROM
clause expects a comma-separated list of table references in whatever form they may be provided.Depending on the complexity of the table reference, some databases also accept sophisticated table references in other statements, such as INSERT, UPDATE, DELETE, MERGE. See Oracle’s manuals for instance, explaining how to create updatable views.
The Cure:
Always think of your
FROM
clause to expect a table reference as a whole. If you write a JOIN
clause, think of this JOIN
clause to be part of a complex table reference:
1
2
3
4
5
6
| SELECT c.first_name, c.last_name, o.amount FROM customer_view c JOIN order_view o ON c.cust_id = o.cust_id |
4. Using pre-ANSI JOIN syntax
Now that we’ve clarified how table references work (see the previous point), it should become a bit more obvious that the pre-ANSI JOIN syntax should be avoided at all costs. To execution plans, it usually makes no difference if you specify join predicates in theJOIN .. ON
clause or in the WHERE
clause. But from a readability and maintenance perspective, using the WHERE
clause for both filtering predicates and join predicates is a major quagmire. Consider this simple example:
1
2
3
4
5
6
| SELECT c.first_name, c.last_name, o.amount FROM customer_view c, order_view o WHERE o.amount > 100 AND c.cust_id = o.cust_id AND c.language = 'en' |
The Cure:
Always use the ANSI JOIN syntax. Never put JOIN predicates into the
WHERE
clause. There is absolutely no advantage to using the pre-ANSI JOIN syntax.5. Forgetting to escape input to the LIKE predicate
The SQL standard 1992 specifies thelike predicate
as such:8.5 <like predicate> <like predicate> ::= <match value> [ NOT ] LIKE <pattern> [ ESCAPE <escape character> ]The
ESCAPE
keyword should be used almost always when
allowing for user input to be used in your SQL queries. While it may be
rare that the percent sign (%) is actually supposed to be part of the
data, the underscore (_) might well be:
1
2
3
| SELECT * FROM t WHERE t.x LIKE 'some!_prefix%' ESCAPE '!' |
Always think of proper escaping when using the LIKE predicate.
6. Thinking that NOT (A IN (X, Y)) is the boolean inverse of A IN (X, Y)
This one is subtle but very important with respect to NULLs! Let’s review whatA IN (X, Y)
really means:A IN (X, Y) is the same as A = ANY (X, Y) is the same as A = X OR A = YWhen at the same time,
NOT (A IN (X, Y))
really means:NOT (A IN (X, Y)) is the same as A NOT IN (X, Y) is the same as A != ANY (X, Y) is the same as A != X AND A != YThat looks like the boolean inverse of the previous predicate, but it isn’t! If any of
X
or Y
is NULL
, the NOT IN
predicate will result in UNKNOWN
whereas the IN
predicate might still return a boolean value.Or in other words, when
A IN (X, Y)
yields TRUE
or FALSE
, NOT(A IN (X, Y))
may still yield UNKNOWN
instead of FALSE
or TRUE
. Note, that this is also true if the right-hand side of the IN
predicate is a subquery.Don’t believe it? See this SQL Fiddle for yourself. It shows that the following query yields no result:
1
2
3
4
5
| SELECT 1 WHERE 1 IN ( NULL ) UNION ALL SELECT 2 WHERE NOT (1 IN ( NULL )) |
The Cure:
Beware of the
NOT IN
predicate when nullable columns are involved!7. Thinking that NOT (A IS NULL) is the same as A IS NOT NULL
Right, so we remembered that SQL implements three-valued logic when it comes to handling NULL. That’s why we can use the NULL predicate to check for NULL values. Right? Right.But even the NULL predicate is subtle. Beware that the two following predicates are only equivalent for row value expressions of degree 1:
NOT (A IS NULL) is not the same as A IS NOT NULLIf A is a row value expression with a degree of more than 1, then the truth table is transformed such that:
- A IS NULL yields true only if all values in A are NULL
- NOT(A IS NULL) yields false only if all values in A are NULL
- A IS NOT NULL yields true only if all values in A are NOT NULL
- NOT(A IS NOT NULL) yields false only if all values in A are NOT NULL
The Cure:
When using row value expressions, beware of the NULL predicate, which might not work as expected.
8. Not using row value expressions where they are supported
Row value expressions are an awesome SQL feature. When SQL is a very table-centric language, tables are also very row-centric. Row value expressions let you describe complex predicates much more easily, by creating local ad-hoc rows that can be compared with other rows of the same degree and row type. A simple example is to query customers for first names and last names at the same time.
1
2
3
| SELECT c.address FROM customer c, WHERE (c.first_name, c.last_name) = (?, ?) |
1
2
3
4
| SELECT c.first_name, c.last_name, a.street FROM customer c JOIN address a ON (c.id, c.tenant_id) = (a.id, a.tenant_id) |
The Cure:
Use row value expressions whenever you can. They will make your SQL more concise and possibly even faster.
9. Not defining enough constraints
So, I’m going to cite Tom Kyte and Use The Index, Luke again. You cannot have enough constraints in your meta data. First off, constraints help you keep your data from corrupting, which is already very useful. But to me, more importantly, constraints will help the database perform SQL transformations, as the database can decide that- Some values are equivalent
- Some clauses are redundant
- Some clauses are “void” (i.e. they will not return any values)
The Cure:
Define as many constraints as you can. They will help your database to perform better when querying.
10. Thinking that 50ms is fast query execution
The NoSQL hype is still ongoing, and many companies still think they’re Twitter or Facebook in dire need of faster, more scalable solutions, escaping ACID and relational models to scale horizontally. Some may succeed (e.g. Twitter or Facebook), others may run into this:For the others who are forced (or chose) to stick with proven relational databases, don’t be tricked into thinking that modern databases are slow. They’re hyper fast. In fact, they’re so fast, they can parse your 20kb query text, calculate 2000-line execution plans, and actually execute that monster in less than a millisecond, if you and your DBA get along well and tune your database to the max.
They may be slow because of your application misusing a popular ORM, or because that ORM won’t be able to produce fast SQL for your complex querying logic. In that case, you may want to chose a more SQL-centric API like JDBC, jOOQ or MyBatis that will let you get back in control of your SQL.
So, don’t think that a query execution of 50ms is fast or even acceptable. It’s not. If you get these speeds at development time, make sure you investigate execution plans. Those monsters might explode in production, where you have more complex contexts and data.
Conclusion
SQL is a lot of fun, but also very subtle in various ways. It’s not easy to get it right as my previous blog post about 10 common mistakes has shown. But SQL can be mastered and it’s worth the trouble. Data is your most valuable asset. Treat data with respect and write better SQL.Source: http://blog.jooq.org/2013/08/12/10-more-common-mistakes-java-developers-make-when-writing-sql/