|
|
-
For the past few months, a few of us have had the pleasure of working with Ordinal to create an extremely performant sort component.
Chris, from Ordinal, sent an email earlier today announcing that they're done with a production ready version.
This is really great news, especially if you know the history of Ordinal and their reputation. Chris has been one of the most solid developers I've worked with either inside or outside Microsoft.
Anyways, check out http://www.ordinal.com/ssis.html if you're interested.
Enjoy,
ash
|
-
Hi again,
Over the last few months a lot of you have expressed a need to do impact analysis and lineage investigations across the different BI components we've got.... if I change this table here what packages, cubes, reports will be affected.
Well, earlier this week the BI team released a metadata pack that you can use for this. It includes sample projects, sample reports, etc. Right now it handles getting metadata from SSIS and SSAS, we hope to get to SSRS in the future.
We decided to release the source code for it as well so that:
- you have a piece of code that demonstrates using the object models for SSIS and SSAS, and
- you can extend the work we've done for integrating other technologies in the metadata repository as well.
While the tool is useful as it is, we're hoping there'll be a community effort to drive this forward as well... I can easily imagine a workspace on gotdotnet where active participation will make this tool more comprehensive than we could ever manage to do.
Anyways, the link is http://www.microsoft.com/downloads/details.aspx?FamilyId=11DAA4D1-196D-4F2A-B18F-891579C364F4&displaylang=en
There's also a whitepaper here: http://www.microsoft.com/downloads/details.aspx?FamilyId=182BD330-0189-450C-A2FE-DF5C132D9DA9&displaylang=en
As usual, I look forward to your feedback.
enjoy
ash
|
-
-
As you know, we just declared SQL RTM earlier today.
For those familiar with the changes in this release, my goodness. This has been a huge release. I've had a kid along the way, some have had two. Shipping software of this magnitude is probably one of the most gratifying experiences, and I had to go around shake the hand of everyone that I met today and congratulate them. 5 years is a long time.
If you're one of our regular contributors and haven't received a thanks from me directly, apologies. Wish I could shake your hand.
Now, on to the cigar.
|
-
Ever want to do a lookup on a text file without putting it in a database or using a merge join?
Grant from the Project REAL team found a great nugget:
Openrowset now allows the BULK clause to be specified which accepts a filename and format specification. An OLE-DB driver for flat-files or XML would also do the job but this is a nice work-around for the native stack. The only difficulty is getting your head around the format for the .fmt file but after that it 'just works':
select * from openrowset(bulk N'c:\temp\test.txt', formatfile = 'c:\temp\test.fmt') as x
thanks, Grant!
|
-
Earlier today Fred S. asked a great question.
"Using the following data I’m trying to get each node Element populated into their own variables. XML: <DataBases> <DataBase> <Name>Sales</Name> <Action>DBCC</Action> <FLOP>1</FLOP> </DataBase> <DataBase> <Name>Inventory</Name> <Action>DBCC</Action> <FLOP>1</FLOP> </DataBase> </DataBases>
Want to do: Name -> vsDataBase Action -> vsAction FLOP -> vnFLOP Where vsDataBase, vsAction and vnFlop are variables of type string, string, int32. "
He was using For each loop's ForEach NodeList enumerator, lets see how we can do this.
There might be other ways to do it, I wanted to do it using two steps... first get a database, then gets its element values.
The result of the first step would be the nodelist, so the EnumerationType should be ElementCollection. Then there were the Outer* and Inner* properties that I confess I didn't understand.
Here's how this enumerator currently works:
- The OuterXPathString will be used to get the list of elements collection from the XML document specified earlier. In this case, our XPath string is '/DataBases/*' which would return all DataBase elements.
- For each element in the collection we now have (i.e. for each DataBase element), the InnerXPath is applied. We can set this property to be '*' (or the equivalent 'child::*') and InnerElementType of 'NodeText' so that we get the value of the child elements. Since there're multiple children per DataBase, the result of the this XPath will be an array of strings (each string is the value of the element).
For the first time through the loop, then, the current value of the enumerator is an array of three strings: {"Sales", "DBCC", "1"}. The next time through the loop the current value will be an array of three strings again: {"Inventory", "DBCC", "1"}.
Per the discussion above the current value of the enumerator during an enumeration should be an array of strings, but I wanted to see the array to make sure that was right. Turns out I had to use an undocumented feature to do this. (As soon as I post this message I'll be sending a note to the doc folks to make sure this isn't undocumented for too long because it's really useful!).
If you map the index '-1' to a variable of type 'Object', that variable gets populated with the complete collection (which is current value of the enumerator). Then I used a script task in the loop with simple code like so:
Imports System.Collections . . .
Dim o As IEnumerator Dim o2 As Object o = CType(Dts.Variables("Variable").Value, IEnumerator) o.Reset() o.MoveNext() o2 = o.Current and put a breakpoint at the last line.
When you execute the package now, the breakpoint is hit and the first time through you can see that o2 is the string "Sales". Just step through the iterator to confirm the array contents are what you expect them to be.
Knowing the contents of the array, then, it's fairly straight forward to do the mapping. Create the variables, set their types to string, then use the indices like this:
User::vsDataBase 0 User::vsAction 1 User::vsFLOP 2
Note that vsFLOP is not vnFLOP... the text is string, so you'll have to convert it to a number.
Also note that this example assumes a well formed document. Specifically that each DataBase element will have three children in the right order.
Think that's about it.
Is there a better way to do this in a loop? Let me know!
Thanks Fred for a great question.
ash
|
-
This has come up a few times recently, and I couldn't find the information for them in the usual sources.
The desire that some folks have is to support parameters in their SQL statements. Sounds reasonable. The OLEDB source, for example, supports parameters. However, some providers don't support derivation of parameters from a statement. The question is how to make parameterized queries work for those providers.
For the OLEDB Source, the trick is to use the 'SQL Command from variable' access mode. What that means is that the sql command used during runtime will come from the contents of a variable. As you may know, in addition to just storing values, Variables can also store expressions which are evaluated to return the variable's value.
Lets work through an example. Lets say you want to execute this statement:
SELECT Income from Customer where Age > ?
But your provider doesn't support finding out there's one parameter in this statement.
What we can do here is to create two variables. Below we have the properties you'd set for the two variables displayed in a '<prop>:<value>' format.
#1:
Name: AgeVariable
Type: String
Value: 10
#2:
Name: SqlStatementVariable
Type: String
Expression: "SELECT Income from Customer where Age > " + @AgeVariable
EvaluateAsExpression: True
In this case I've hard coded the variable value to 10, but of course you can use whatever mechanism you want to (result of an Execute SQL task, the current value of a loop enumerator, etc.) to set that variable value.
Now we can use the 'SqlStatementVariable' as the source for the OLEDB Source.
How about on the destination side? What if I'm inserting into the customer table and want to set Age to 10, lets say, for all the rows.
The trick there is to introduce the variables as columns inside your data flow using something like a Derived Column transform and then map this column to the colunms of your table.
Hope this helps. If not, let me know and I can edit this post to make it more useful.
regards,
ash
|
-
We've been slowly creating a list of best practices. Simon asked me about these at the PDC a couple of weeks back. It's time to share them out not just to make folks aware of them but also to solicit other goodness you'd like us to include in the official list.
First off, Performance:
- [Oledb Source, Lookup, Fuzzy Lookup]: Remove columns which’re not used downstream.
- [Oledb Source, Lookup, Fuzzy Lookup]: Use a select statement when fetching data from a view instead of using the table drop down.
- [OLEDB Command Destination]: for large number of rows, this component might not scale because it sends one SQL statement per row. Persist data to a temporary table and use set based SQL.
- [SCD]: for large number of rows that will not exist in the dimension table, consider using a lookup before the SCD.
- [OLE DB Destination]: if the server is local, consider using SQL Server destination.
-
[Flat file source]: turn on fast parse for columns that have types that fast parse understands. Look at BOL for more info on Fast Parse.
- [Conditional Split]: For transforms that use conditions based on columns coming straight from OLEDB or ADO.Net sources, consider using the native filtering from the relational database to remove rows before they come in to the pipeline.
- [Flat file Source]: For transforms that have all columns with the default column type and size, consider using the ‘Suggest Type’ functionality.
- [OLEDB Destination]: Review the fast load settings on the destination adapter.
- [SQL Server]: for OLEDB Destinations/SQL Server Destinations that perform bulk insert into a database, verify that the logging mode is appropriate for performance. Refer to BOL for more info.
|
-
One of our ISVs ran into an issue that others writing managed code interoping with COM (nothing to do with SSIS directly) might run into. At some unpredictable time, he got an error message with the title of this blog entry while working in the designer. I couldn't find good information on the net on this topic, and have plagiarized something Michael, our tech lead, had written up. Any errors in the information below are solely mine:
The cause of the problem is that since the pipeline is free threaded, and most of the UI executes on the main VS thread, the call is marshaled to the free-threaded apartment using COM. While the main thread is waiting for such a call to complete, it can receive and handle other messages. That might result in reentrancy.
To avoid this, VS implements a message filter – a message filter that keeps most messages in a queue and prevent them from handling until the COM call is completed.
However, certain events (like WM_PAINT… see http://support.microsoft.com/default.aspx?scid=kb;en-us;176399) are allowed to go through to allow repaints. A restriction of this code is that it should not make another COM call to a free threaded COM component.
The simplest way to avoid this is to avoid code that calls free-threaded objects in the drawing code. Ideally, the UI should cache any data it needs for drawing and use it. Another method is to avoid the COM marshalling and invoke the COM method from a non-UI (usually thread pool) thread, doing the marshalling yourself.
If you see this consistently (we've made quite a bit of an effort to avoid this), do drop me a line.
regards
|
-
Over the last few days I've learnt a lot in how to debug SQL hangs thanks to Sunil from the SQL Server engine team.
For one of the customers last week, the problem seemed to be that their DTS2000 package worked great with SQL2000. However, upgrading the relational engine to SQL2005 and making the DTS2000 packages go against that version caused hangs during some of the dimension loads. The dimension load, to be clear, was doing an outer join between the source and destination table, and pushing to the destination table.
Turns out the problem was the table lock setting that's turned on by default in DTS2000. In SQL2000, if you were doing a bulk insert to a table with a clustered index and had asked for a table lock, the engine would silently ignore. Now (in SQL2005), we don't refuse, so the SELECT and X update caused the package to hang. The fix was to remove table lock.
Earlier today we got to the bottom of another hang with SSIS going against a SQL2005 database. This time, it turns out, one of the destinations that was getting tablock on a table with a clustered index and inserting about 10K rows. The locks escalated to a table lock which prevented other writers. Not having used SQL Server trace flags myself as yet, it was great to see the problem go away by turning on trace flags 1211 and 1224, narrowing the problem down to lock escalation. There were several fixes that Sunil mentioned, one was to remove the clustered index, another was to limit the batch size to be around 4500, under the ~5000 that triggers the lock escalation.
The DMVs in SQL 2005 were invaluable in this investigation, specifically sys.dm_os_waiting_tasks.
While all of this is in BOL, and probably trivial to most folks familiar with SQL, the logic and debugging abilities of locking in SQL2005 are truly impressive to me. Of course, it doesn't hurt at all if someone knowledgeable is making sense of all this!
If you're running into such unexplicable hangs, undestand the output of sys.dm_os_waiting_tasks (and sp_who2, of course); check if you have table locks and clustered indices (remember that turning on the primary key in the table designer might create a clustered index on that column for you); and if you have readers or multiple writers against the same table. If you run into an interesting case, drop me a line!
regards
|
-
It's typical in data warehouse loads to have lookups that go against dimension tables. If the key doesn't exist, either a new one is created, or a default value is used.
There're two ways to do this:
1. Lookup is configured to redirect rows that have no match in the reference table to a separate output (error output), then use a derived column to specify a default value, and finally merge both the lookup success output and the output of the derived column using a union all transform.
2. Lookup is configured to ignore lookup failures and pass the row out with null values for reference data. A derived column downstream of the lookup then checks for null reference data using 'ISNULL' and replaces the value with a default value.
As you may know Union All is an async transform due to implementation issues. This means its going to create a new type of buffer on the output and copy the input buffer data to the output.
Intuitively, #2 has been the way to go because it doesn't incur the cost of a memcopy, but I didn't get a chance to quantify the difference until now. Today, I was finally able to.
The test harness includes a script component that feeds 100M rows to the lookup, and the output of the union all in option #1 and derived column in #2 gets consumed by a script component that doesn't do anything with the rows. The AMD guys have given us a 64bit dual core machine with a couple of procs to play with and that's what I ran this on. There's enough (32GB) of memory on the box so that we can focus on just the perf of this isolated scenario. Since I was only interested in this specific scenario, the lookup statement itself is 'select * from whatevertable where 1=0' which means all keys fail lookups.
Average timings were:
Option #1: 104 seconds.
Option #2: 83 seconds.
For folks that have 12-13 dimension lookups, using approach #2 might significantly aid performance.
The other interesting fact here is that if there're lots of lookups, the execution tree that contains the lookups will be busy. Since there's one thread (currently) per execution tree, that thread will be doing a lot of work. Therefore, there will be a point where having X lookups on one thread would be slower than having X/2 lookups on it and a union all to introduce a new execution tree. I didn't get a chance to do that investigation as yet, hope to in the future.
regards,
ash
|
-
It's finally that part of the release when you see the end of the tunnel and there seems a good reason (or close to it) that your hair has lost more color this cycle.... we're close to shipping!
With that, of course, the marketing machine churns faster. Some results (IMO) are less than stellar: the Dinosaur ads for Office, for example. I don't associate with them and am afraid we're using motivation-by-shame to force Office upgrades. The focus studies that were done seem to have participants that had more humorous connotations and we haven't heard of a groundswell of angry faithful customers, so it must not be too bad. Oh well. Not my kind of humor.
I am, however, a fan of dubbed over movies. If they're from the far east and have folks flying in air or bouncing off of water, all the better. In fact I missed a party at WWPC earlier this month because there was such a movie playing in my hotel room and I couldn't resist. MXC and MST3K have been my favorite shows in large part due to voice overs.
In the MST tradition comes the latest installment of VS/SQL 2005 ads... if you haven't seen them already: http://www.escapeyesterworld.com.
Not bad!
Weren't we supposed to have flying cars and colonies on Mars by now? Oh well, looking forward to having my shipping cigar in the meantime!
|
-
At PDC this year, Sergei and I will be presenting a session on how to extend Integration Services. This is a 400 level course so will go into more details than the 300 level webcast I did earlier this week for MSDN.
Sergei has been the developer who gave us a lot of the extensibility that I talked about... the script task, script transform, managed interfaces for writing components, and of course given the fact that he holds the title of the Technical Lead of the Runtime, that's his baby too.
So, I'll probably end up being the T-shirt thrower for this talk and try to get out of Sergei's way.
It's been a while since I went to PDC, should be fun! I'm told Oleg from Ivolva will be there too. Is it a coincidence that both he and Sergei have the same last name? Perhaps, but I better brush up on my Russian!
The PDC sessions list is finally up: http://commnet.microsoftpdc.com/content/sessions.aspx
Sergei and I will also plan to stick around little bit longer if you want to have a deep dive on parts of the product. Drop me a line if you'll be at the conference, it'd be great to associate names with faces.
later,
ash
|
-
At 1PM PST today, I'll be doing an MSDN Webcast on how to extend Integration Services including how to write new tasks and create new components. Why would you want to do that? Well, the simplest example would be if you have binary files that only you know how to read or have other existing business logic, you can integrate that with the product fairly easily.
In our presentations, we've been talking about the fact that Integration Services is a change in mindset from DTS in that while DTS was a useful tool, SSIS is really a platform for enterprise level data integration. This presentation is where the rubber hits the road and we talk about how do you plug into the platform yourself.
The webcast will be recorded for posterity.
http://msevents.microsoft.com/cui/WebCastEventDetails.aspx?EventID=1032276860&EventCategory=4&culture=en-US&CountryCode=US
regards,
ash
|
-
Hi there,
I'm a member of the SQL Server Integration Services team and previously had a blog here: sqljunkies.
Last week I ran into issues posting there and thought I'd try out this site instead. Since I work to integrate technologies, getting blogs over from sqljunkies over to this site is a great problem to have. Of course since I'm lazy I want to use an existing tool to do that but haven't been able to find anything as yet.
The first part of moving data is to get access to it. Since sqljunkies uses CommunityServer as does this site, and CommunityServer supports MetaWebLog so I should be all set, right? Sorry, no cigar. While MSDN has MetaWebLog API enabled, I haven't been able to find the URL for the sqljunkies service so here we are twiddling our thumbs.
The other way to get data out would be to use RSS or Atom but that only goes back 4 months for sqljunkies. I want to get all the content!
Hmm.
That sent me down the path of googling my options and I ran into several blogging tools and technologies.
Man, I've been out of it! w.bloggar seemed pretty good, and that's what I used for posting the first version of this blog. I was naive and didn't realize the editor was HTML, not wysiwyg. oh well.
Well, no answer to the migration as yet. If you have any wisdom to share in this area, let me know. I would like to stay away from scraping HTML contents to get the content out.
Regards,
ash
|
|
|
|