Conor Cunningham’s blog on SQL Server, data-driven applications, and pretty much whatever other random stuff he decides to post.
I am giving a talk tomorrow (Tuesday) night at the Austin PASS group on how Update Queries are processed/optimized in SQL Server.
The talk will be at the Microsoft office in Austin (which recently moved).
More details can be found at the CACTUS web page. (I don’t know if they have my talk announcement up there yet, but hopefully it will be there soon)
(I believe that I will be giving a longer version of this talk at the national PASS conference in Seattle in November, for those of you not lucky enough to be in Austin :)).
I hope you see you there!
I had asked some questions this afternoon about doing large amounts of updates without commits. I’m not sure I was able to ask the question I wanted, so I’ll put it here.
I had been given the schema for a mainframe application (IMS pointer array tables) that was the sum of 160 tables times, say 300 lines each. The table contained roughly 50,000 rows.
Roughly speaking, each row contained a table name, line number, column name, size information, and format information of the original database column to the right, various intermediate working columns in the middle, then SQL server names, data types, and size parameters, followed by Visual Basic property names, data types, and dimensions, followed by complete lines of SQL DDL statements, following complete lines of VB.NET classes. In short, it was possible to see, in one row, how each column was transformed from mainframe to client/server.
I wrote a single stored procedure to execute this transformation, which, if I recall, was about 1200 lines long. This took about 5 minutes to execute on 3.6Ghz computer with 2GB of RAM. I had SQL Server 2005 Developer Edition installed on the machine.
Its safe to say that I probably ran about one million updates on the table during that one execution. The specific question is: should I have put periodic BEGIN TRANS and COMMITS to flush pending transactions? Would this have made any difference in performance? Is there
any other design decisions that would have been better?
I did run out of room on my drive from time to time, and got good at shrinking the database. I had gotten really good at writing stored procedures, but only had cursory knowledge at that point about execution plans. The people with Oracle experience had told me that they wouldn’t have done it the way I did it, but they didn’t say how they would have done it differenty.
If the semantics of your application need all-or-nothing, then do txns at that scope.
If the desired semantics is to get reasonable performance but allow some work to commit before the whole job is done, smaller transactions mean that you hold locks for less time and generally promote concurrency.
I don't know enough about your app to know whether the application NEEDS isolation. I can tell you that most batch jobs have other notions of backup/recovery (ETL may go into a staging table and the whole thing can just be dropped at the logical level instead of worrying about txns). I would start with the business requirement and then work towards the "how does it get implemented w.r.t. txns?" question.
While not in this particular code situation, we were running another code transformation process where we had 4 million rows of mainframe source code in a table. My stored procedures did updates to translate all occurrences of (for example) VAR-ABC to VarAbc. Based on what you were saying in the meeting, this would have generated two spoolers to force the updates to operate sequentially. In this case, since the updates were occurring to different columns of the same table, so 'sequential' would occur by default.
When I tried creating Oracle stored procedures that did updates on hundreds of thousands of rows, it failed with a message to the effect that I had exceeded my rollback segment (or some term to that effect). To complete this update, I had to create a cursor, scroll through 1000 records applying my updates, then commit the 1000 update transaction. On the one hand, I was horrified at the bother, on the other hand, I could see that this probably forced better performance.
On a developer machine, one doesn't worry about concurrency or deadlocks. The local instance was, in effect, the production instance. Based on the fact that each row update creates a new record, however, this means our 4 million row table was putting 40 million rows in tempdb, under the assumption that there is no periodic commit. So the question becomes: is it best to have fewer columns in the table and run separate stored procedures on separate tables to generate (for example) the SQL schema and the VB classes? Better to begintrans and commit around each group of closely related updates (i.e., updates on string fields followed by updates on numeric fields followed by updates on date fields)? Or, better to reduce rowcounts if possible? We could do that down to the IMS table level, say 250 database columns with 20 repeating subgroups.
Sometimes someone can look at stuff like this and know instantly how they would do it better. On the one hand, you may not be able to remark on that in the capacity of a Microsoft employee, on the other, the approach I described may be as efficient as any alternative.
Personally, I would probably do this in batches. It likely would improve performance and also would give me the chance to restart the operation mid-way. However, it is not a requirement in SQL Server to do so. A lof of other factors would play into my ultimate decision, including load on the source and target machines, how long the operation took on the hardware, etc. It sounds like batching would be very reasonable in this case given that the Oracle equivalent was batching.