|
|
-
Since I saw a request, I'll post a short overview of the outer join vs inner join. There is nothing really complex about them and you can read about them in a million SQL books, but I'll take a stab at it anyway. The most common scenario where outer joins are needed is when you have a "one-to-many" relationship between two tables. Take "authors" and "titleauthor" in the pubs database. You can join them on au_id column to figure out all titles that any particular author has: select * from authors a join titleauthor t on a.au_id = t.au_id This query returns 25 rows. Some of the authors will be repeated - those that have more than one title in the database. By the way, you'll get exactly the same result from the following query: select * from authors a, titleauthor t where a.au_id = t.au_id This is because the "comma" in the FROM clause is considered a "cross product" operator, that will give you all possible combinations of "authors" rows and "titleauthor" rows. Filter those where au_id is the same on two sides - and you get your join. There is a slight problem with it - those authors that did not get to writing any titles yet will not show up in the results. It may be OK, or it may be not - depends on the requirements. If you'd like to see the outers without titles, you need to change your query somehow. This is where outer join comes in. If you write: select * from authors a left outer join titleauthor t on a.au_id = t.au_id you'll see 3 more rows returned. They correspond to authors that don't have corresponding rows in the titleauthor table. For each author, the "titleauthor" part of the result will contain fake rows that consist of NULLs in all columns. Outer joins have the following interesting properties: all rows from "outer" side of the join will be present in the result, even those that don't satisfy the "ON" condition. - columns coming from the inner side will always be nullable, even those that are declared as "not nullable" in the base table. This is needed to accomodate "fake" NULL values.
- outer joins cannot be expressed using comma and WHERE clause (as opposed to inner joins).
- as opposed to inner joins, in case of multiple-table joins, it matters for the result in what order the joins are performed. For inner joins, you may change the order of execution freely.
Note that SQL Server still supports an older (non-ANSI) notation for expressing outer joins that uses operators "*=" and "=*" to designate the outer side with a star. Unfortunately, this notation is ambiguous, because it does not specify the join order, and does not fully specify what conditions are part of the ON clause and what conditions are in the WHERE clause. Handling these "old-style" outer join operators puts a serious maintenance burden on the product, so beware that they will be deprecated very soon.
|
-
I am going to proceed to some lengthy and boring esoterica of ORDER BY, but as a reminder, you can influence the choice of topics pretty easily by posting comments with your favorites :-). This posting is provided "AS IS" with no warranties, and confers no rights.
|
-
So, when your query specification has an explicit GROUP BY, your choices in what to use in post-grouping clauses (SELECT, HAVING, ORDER BY) are limited. You can only use aggregate functions, expressions in the GROUP BY list, or constants. An interesting possibility is that you can combine them. For example, the following is valid: SELECT a+b FROM t GROUP BY a, b Here, each of "a" and "b" are allowed in the SELECT list because they are present in the GROUP BY list, and so their combination as "a+b" is also allowed. However, the same combination rule becomes a bit less trivial when applied to expressions in the GROUP BY list (and we allow arbitrary expressions, remember?). For example, the following is allowed: SELECT a+b FROM t GROUP BY a+b, because the "a+b" in the SELECT list is an exact replica of the "a+b" in the GROUP BY clause. However, if I try to apply the same "combination" technique I used for columns, it won't work: SELECT a+b+c+d FROM t GROUP BY a+b, c+d The reason this is not allowed is that a+b+c+d is parsed as ((a+b) + c) + d, and the fact that this is a combination of "a+b" and "c+d" won't be recognized. In fact, in all SQL Server versions up to SQL 2000, any combination of *expressions* (not simple variables) from the GROUP BY list is not recognized as valid, so that even if you write SELECT (a+b) + (c+d) FROM t GROUP BY a+b, c+d, or even simpler: SELECT (a+b) + 1 FROM t GROUP BY a+b, we won't recognize that as valid. This however changed in SQL 2005 Beta2, where both of the above examples will work. You should be careful however to compose your expressions so that the match is "obvious". For example, we are not smart enough to figure out that the following is valid: SELECT a+1+b FROM t GROUP BY a+b, but you can always structure your expressions so that the match is obvious (use parentheses if necessary). This posting is provided "AS IS" with no warranties, and confers no rights.
|
-
Earlier, I explained what exactly a query specification is. It is important to note that query specifications come in two very different varieties: grouped and not. A query specification is grouped if it has any of the following: GROUP BY clause, HAVING clause, or aggregates. Roughly speaking, this would mean that the result of the FROM and WHERE clauses gets "compressed": the query may deliver only one, aggregated row for each group of several rows coming out of FROM. What is important to note is that even without an explicit GROUP BY clause, a query specification may still be grouped if HAVING or aggregates are present. If there is no GROUP BY clause, the entire rowset is considered a single group, and the query delivers exactly one row (if not filtered away completely by HAVING). The SELECT clause is governed by quite different rules in grouped and non-grouped query specifications. The difference of course is that in the grouped query specifications, the SELECT clause is executed after a GROUP BY (and remember, there is *always* a GROUP BY in the grouped query specifications, even if not explicitly specified!). And after your grouping is done, you are in a very different rowset than before the grouping. Generally speaking, all your original columns are gone. An exception is made for columns that you grouped by, because their value within a group is guaranteed to be the same, so they "survive" a GROUP BY operator. Another exception is aggregates. Inside an aggregate, you can use any column freely. SQL standard (of '92 vintage) allowed only GROUP BY individual columns, i.e. you could say "GROUP BY a, b", but not "GROUP BY a+b". As an implementor, I wish it remained that way :-), because the semantics is pretty clear, and you can still achieve "GROUP BY a+b" functionality by using a derived table. However, it was perceived so convenient, that the ability to group by an arbitrary expression was added as a language feature. Seemingly benign, it actually brings a couple of esoteric issues with it. First, what would be a valid GROUP BY expression. The standard is pretty clear that the column you are grouping by should not be an outer reference. This makes sense, because an outer reference would be constant across the rowset, and the result of grouping by a constant is a single group - just as though the GROUP BY was not specified at all. While this is not so terrible, it's more likely that this is a coding bug than the real intent. So basically, the idea is to not allow grouping on a constant. So, how do you tell a constant from a non-constant when you have a complex expression? Here is the rule that Yukon Beta2 enforces: each individual GROUP BY expression has to contain at least one column reference, and that column has to be from the local scope (i.e. not an outer reference). Note that this is a little bit tighter than in SQL 2000, where any column reference (even from an outer scope) was enough to make a GROUP BY expression valid. As an additional restriction, GROUP BY expressions do not allow scalar subqueries or aggregates. Ok, that should be enough of esoterica for today, I'll leave the other GROUP BY issue for the next post :-) This posting is provided "AS IS" with no warranties, and confers no rights.
|
-
Thanks to everyone who commented on parameterization in TSQL. It was a pretty eye-opening experience to see the lack of "extended" parameterization (beyond simple runtime values) being such a pain point for the users. I'll make sure your voices are heard when new language features are planned. This posting is provided "AS IS" with no warranties, and confers no rights.
|
-
Here is an interesting comment from chrisbro about some features he finds sorely missing in TSQL: * Can't specify a parameter to be used as an IN clause * Can't easily specify parameters to be used in the ORDER BY clause; especially painful to do multiple order by clauses I actually hear quite a bit of requests in the same "parameterization" vein, such as "how come I cannot specify a table name as a parameter?". So let's look under the hood for a moment. Although it looks to the user as a "pure interpreter", TSQL is actually a compiler. During the compilation phase, all tables/columns are fetched from the catalogs, the statement gets type checked, and, most importantly, the query processor comes up with a plan for executing the query. Compilation is a pretty expensive process, so we keep the resulting plan in cache and reuse it if we see the same statement again. So, here is a rule: during compilation, we cannot look at the actual variable or parameter values. Otherwise, we would have to compile the query every time it is executed, because the parameter value may be different. So, we restrict parameters to the relatively "benign" positions where they cannot really influence the data types involved in a statement and the execution plan. This allows us to compile a statement once and reuse the plan for different parameter values (of course I am simplifying a bit here, let me know if more details would be of interest). Now, if you look at, say, ORDER BY clause, the execution plan is actually different (in some cases radically different) depending on what columns you sort on. That's why we don't allow you to parameterize this aspect of the query (IN clause is roughly the same deal). Good workarounds for parameterizing the IN clause are described here. When you want to parameterize your ORDER BY clause, you have multiple options, and which one you choose depends on how flexible you want to be on your ordering options and what your performance requirements are. The most performant solution would be to write several separate queries, each one with the particular flavor of ORDER BY that you want, and then choose one of them using IF statement. If you want to be absolutely flexible, you can use dynamic SQL (i.e. compose your statement on-the-fly and execute it using EXEC). This is slower, and requires you to be careful to avoid SQL injection attacks if you take the ORDER BY list directly from the user input (tip: check that this is indeed a simple ORDER BY list before composing the statement). So, why cannot SQL Server just implement the above "workarounds" as features and allow parameterizing of ORDER BY clause for the user? One reason is performance. Any generic solution would require essentially generating dynamic SQL under the covers. This would incur serious performance overhead in excess of "regular" parameterization, which would be "unexpected" as opposed to the explicit use of dynamic SQL where you are aware that it has perf penalty. Another reason is that we just cannot implement all features in the world, we have to prioritize them. So, keep posting your wishes, and they may come true in upcoming releases. This posting is provided "AS IS" with no warranties, and confers no rights.
|
-
Let's start from basics, if it looks boring, feel free skip to the next post :-). I just feel I need to establish some terminology first. The main building block of SQL is what the standard calls "query specification". It consists of "clauses": SELECT, FROM, WHERE, GROUP BY, HAVING (the last three are optional). They are executed in that order, with one exception: SELECT clause is executed last. How is this different from a "SELECT statement"? First, UNION. Several query specifications may be combined using "set operators" (we only support UNION and UNION ALL at this point), and that will still be a single SELECT statement. Second, subqueries. Single SELECT statement (or "query") may contain several query specifications, some serving as subqueries (table or scalar). So, "query specification" is an all-encompassing term, meaning that it contains general-purpose stuff that would be good either in a top-level query or in a subquery. Top-level query, aka SELECT statement, also contains some specialized syntax that is not accepted in a subquery. I'll mention ORDER BY clause, COMPUTE clause, and FOR BROWSE as examples. While I am on the topic of subqueries, table subqueries (aka "derived tables") deserve a special mention. This feature is what makes SQL "fully composable". This means that you can take any SQL query that delivers a relational stream and use it as a source for another query. It's amazing how many "hard to write" queries that don't really fit into a single query specification become possible to express if you use derived tables. So, here is the test: we can only allow something in a query specification if it produces a regular relational stream, easily composable with other relational operators. So, the three example clauses I mentioned above as "top-level-only" features all have something that is not really relational. ORDER BY introduces ordering, which the relational model does not support. COMPUTE returns multiple result sets. FOR BROWSE adds "hidden columns" to the result set, something that does not completely fit into the relational data model. This posting is provided "AS IS" with no warranties, and confers no rights.
|
-
My name is not actually TSQL language :-), I am Eugene Zabokritski, a Technical Lead working in the TSQL front-end team of the SQL Engine group. My plan is to post a series of overviews of some TSQL areas that I think are complicated enough to warrant some explanation or discussion. Please feel free to suggest topics of interest. I'll try to provide an implementor's viewpoint and explain some rationale behind even your least favorite features :-). I may also try to answer particular questions about TSQL syntax or semantics, we'll see if the question-answer format will actually work out. This posting is provided "AS IS" with no warranties, and confers no rights.
|
|
|
|