Automating the world one-liner at a time…
How do I join two lists of objects into a single list? Or the equivalent: How do I join two CSV files into a single one?
Note that joining makes sense only when the two lists or CSV files have something in common.
If you know why you need join would not enjoy a review of basic database concepts, feel free to skip this section.
Say you own a company and want to keep track of the time when your employees arrive. A simplified table of employees could look like this:
Employee
Id
Name
1
John
2
Mark
3
Hanna
The id uniquely identifies an employee because two employees can have the same name. A full table would have more employees. It would also have last name, first name, the date the employee joined the company, and other columns.
The Entrance table that contains each employee's arrival time might look like this:
Entrance
Employee Id
When
6/12/2012 08:05:01 AM
6/13/2012 07:59:12 AM
6/14/2012 07:49:10 AM
6/12/2012 10:33:00 AM
6/13/2012 10:15:00 AM
44
2/29/2012 01:00:00 AM
The “When” column has both the date and the time. Types like DateTime are common for .Net Objects and databases.
We can conclude the following from the data:
Wait a minute, why do we have two tables instead of one?
A simple one-table solution might look like this:
Employee Entrance
Before throwing stones at this solution, I should say this could work for the simplest of cases. As we are going to see below, it doesn’t work if things get a bit more complex, so it is not future-proof. Now, for more specific stones:
In the table above, John’s 3 entrances and Mark’s 2 entrances are recorded, but I omitted the Entrance for id 44 that we used to have, because he doesn’t have a name. If we consider that we can have 2 Johns working for the company, we will need an ID, as seen below. We can also have more data for each employee, like their Last Name:
First Name
Last Name
Smith
West
This repetition of first name and last name is one of the reasons we have two tables. Having two tables allows us to have an Entrance Table, where we repeat just the Employee Id, a small amount of data. This way, the tables can evolve separately. If we start adding columns for Employees, like Address, Phone, etc., nothing changes in the Entrance table.
There is one more fundamental reason to have two tables. Say we want to start keeping track of employee payments. If the employees are in the same table as their entrance, how do add the payments? It clearly does not fit in the same table, because there would be multiple payments per employee and there are already multiple employee entries for their entrances.
The proper way to express the concept of one of Employee and multiple entrances is to have two tables. Then, it is going to be simple to add columns to the Employee table or add another table for their multiple payments. Another way to say this is that a one-to-many relationship, like one employee to many payments, or one employee to many entrances, needs a separate table.
It turns out that problems we can easily describe conceptually can often require many tables.
For example, let’s consider keeping track of sales at a fruit stand. That seems simple enough conceptually, but let’s see what is involved in this task. We have customers, fruit (our product), and sales. Each sale receipt has one or more lines, in which each line describes a product and its quantity. The tables we have are:
The information on a customer receipt would be gathered from all 4 tables and we would need to join this information from multiple tables.
If even simple conceptual problems require many tables, joining tables becomes a fundamental need, which brings us to the next section.
Let’s look at our entrance data again.
Now that the tables are separate and future-proof, we might want a view of the data that joins the name and entrance date/time for each employee. This can be done in 4 different ways:
1) Only the Employees with Entrances and Entrances with an Employee (SQL Inner Join)
This join shows all the employee entrances except for the Entrance for Employee 44 that has no corresponding Id. It also does not show a line for Hanna because there is no entry for her in the Entrance table.
An example of a SQL statement that would perform this join looks like this:
SELECT Employee.Name,Entrance.[When]
FROM EmployeeINNER JOIN Entrance ON Employee.id = Entrance.EmployeeId
Select enumerates the columns, which are Name from the Employee table, and When from the Entrance table. The square brackets surrounding “[When]” distinguish the “When” column name from the "When" SQL command. This is required only for SQL language keywords.
Employee is “inner joined” to Entrance by the statement: Employee.id = Entrance.EmployeeId.
Inner Join is the name SQL gives to this kind of join where an item appears in the result only if the join statement (Employee.id = Entrance.EmployeeId) is true in both tables being joined. Hanna has her ID of 3, but there is no 3 in the Entrance table to make this statement true. Employee 44 has his Entrance but there is no ID in the Employee table corresponding to 44 that would make the statement true.
2) All Employees (SQL Left Outer Join)
This join shows all employees, including Hanna who has no corresponding entry in the Entrance table.
The SQL Statement looks like:
SELECT Employee.Name, Entrance.[When]
FROM EmployeeLEFT OUTER JOIN Entrance ON Employee.id = Entrance.EmployeeId
The only change in this second SQL statement is the use of “LEFT OUTER JOIN” (instead of inner join).
Left refers to the fact that all elements of the left table in “Employee LEFT OUTER JOIN Entrance” (which is Employee) will appear in the result. “
OUTER is in opposition to the INNER join and it allows Hanna to be listed even though her Employee.id is not in any of the Entrance’s EmployeeIds.
Note: Some SQL variants will not use “OUTER”.
3) All Entrances (SQL Right Outer Join)
This join has all of the Entrances, including the entry for Employee 44 even though there is no corresponding employee Id for that in the Employee’s table.
FROM EmployeeRIGHT OUTER JOIN Entrance ON Employee.id = Entrance.EmployeeId
The only change to the previous join SQL statement is the use of “RIGHT” instead of “LEFT”.
Right refers to the fact that all elements of the left table in “Employee RIGHT OUTER JOIN Entrance” (which is Entrance) will appear in the result. “
4) All Entrances and Employees (SQL Full Join)
The join below shows all employees and all entrances, including Employees with no corresponding entrances (Hanna) or entrances with no corresponding employee (the entrance for Employee 44).
FROM EmployeeFULL OUTER JOIN Entrance ON Employee.id = Entrance.EmployeeId
The only change to the previous join the use of “FULL” instead of “RIGHT” or “LEFT”.
Full refers to the fact that all elements of both join sides (Employee and Entrance specified in “Employee FULL OUTER JOIN Entrance”) will appear in the result.
There is no built-in cmdlet for joining in Windows PowerShell, so the remainder of this post will be about building a cmdlet called Join-Object that performs the four join operations in lists of objects in memory. Those objects can come from different places, including csv files.
For example, lets create a couple of CSV files that contain the sample data we have been using:
@"Id,Name1,John2,Mark3,Hanna"@ > c:\temp\employee.csv
@"EmployeeId,When1,6/12/2012 08:05:01 AM1,6/13/2012 07:59:12 AM1,6/14/2012 07:49:10 AM2,6/12/2012 10:33:00 AM2,6/13/2012 10:15:00 AM44,2/29/2012 01:00:00 AM"@ > c:\temp\entrance.csv
$employee = Import-Csv c:\temp\employee.csv$entrance = Import-Csv c:\temp\entrance.csv
After running these commands, we have:
PS C:\temp> $employee
Id Name -- ---- 1 John 2 Mark 3 Hanna
And:
PS C:\temp> $entrance
EmployeeId When ---------- ---- 1 6/12/2012 08:05:01 AM1 6/13/2012 07:59:12 AM1 6/14/2012 07:49:10 AM2 6/12/2012 10:33:00 AM2 6/13/2012 10:15:00 AM44 2/29/2012 01:00:00 AM
I chose CSV because you can find data to be joined in CSV files and also because it’s a compact way to show the sample data.
Below are 4 calls to the Join-Object that produce the equivalent of the 4 types of join.
PS C:\temp> Join-Object -Left $employee -Right $entrance -Where {$args[0].Id -eq $args[1].EmployeeId} –LeftProperties "Name" –RightProperties "When" -Type OnlyIfInBoth
Name When ---- ---- John 6/12/2012 08:05:01 AM John 6/13/2012 07:59:12 AM John 6/14/2012 07:49:10 AM Mark 6/12/2012 10:33:00 AM Mark 6/13/2012 10:15:00 AM
Join-Object’s first two parameters are the left and right lists of objects.
The “Where” parameter is the join condition, and it uses $args[0] to represent an item from the left list and $args[1] for an item in the right list.
The RightProperties and LeftProperties parameters are the names of the properties to appear in the output from the left list and right list.
The Type parameter is one of four names: AllInLeft, AllInRight, OnlyIfInBoth and AllInBoth. I thought those four names were simpler to relate to their functionality than SQL’s full, inner and outer names, but here is a mapped correspondence:
Join-Object
SQL
AllInLeft
Left Outer
AllIInRight
Right Outer
OnlyIfInBoth
Inner
AllInBoth
Full Outer
Here is an example of AllInLeft:
PS C:\temp> Join-Object -Left $employee -Right $entrance -Where {$args[0].Id -eq $args[1].EmployeeId} -LeftProperties "Name" -RightProperties "When" -Type AllInLeft
Name When ---- ---- John 6/12/2012 08:05:01 AM John 6/13/2012 07:59:12 AM John 6/14/2012 07:49:10 AM Mark 6/12/2012 10:33:00 AM Mark 6/13/2012 10:15:00 AM Hanna
Here is the output of AllInRight:
PS C:\temp> Join-Object -Left $employee -Right $entrance -Where {$args[0].Id -eq $args[1].EmployeeId} -LeftProperties "Name" -RightProperties "When" -Type AllInRight
Name When ---- ---- John 6/12/2012 08:05:01 AM John 6/13/2012 07:59:12 AM John 6/14/2012 07:49:10 AM Mark 6/12/2012 10:33:00 AM Mark 6/13/2012 10:15:00 AM 2/29/2012 01:00:00 AM
And finally, the output of AllInBoth:
PS C:\temp> Join-Object –Left $employee –Right $entrance –Where {$args[0].Id -eq $args[1].EmployeeId} –LeftProperties "Name" –RightProperties "When" -Type AllInBoth
Name When ---- ---- John 6/12/2012 08:05:01 AM John 6/13/2012 07:59:12 AM John 6/14/2012 07:49:10 AM Mark 6/12/2012 10:33:00 AM Mark 6/13/2012 10:15:00 AM Hanna 2/29/2012 01:00:00 AM
Join is an essential tool for data centric problems because, usually, multiple tables are necessary.
Data centric problems are normally handled by database tools, but if you happen to have CSV files with this kind of data, Join-Object might be useful.
Lucio Silveira [MSFT]
function AddItemProperties($item, $properties, $output){ if($item -ne $null) { foreach($property in $properties) { $propertyHash =$property -as [hashtable] if($propertyHash -ne $null) { $hashName=$propertyHash["name"] -as [string] if($hashName -eq $null) { throw "there should be a string Name" } $expression=$propertyHash["expression"] -as [scriptblock] if($expression -eq $null) { throw "there should be a ScriptBlock Expression" } $_=$item $expressionValue=& $expression $output | add-member -MemberType "NoteProperty" -Name $hashName -Value $expressionValue } else { # .psobject.Properties allows you to list the properties of any object, also known as "reflection" foreach($itemProperty in $item.psobject.Properties) { if ($itemProperty.Name -like $property) { $output | add-member -MemberType "NoteProperty" -Name $itemProperty.Name -Value $itemProperty.Value } } } } }}
function WriteJoinObjectOutput($leftItem, $rightItem, $leftProperties, $rightProperties, $Type){ $output = new-object psobject
if($Type -eq "AllInRight") { # This mix of rightItem with LeftProperties and vice versa is due to # the switch of Left and Right arguments for AllInRight AddItemProperties $rightItem $leftProperties $output AddItemProperties $leftItem $rightProperties $output } else { AddItemProperties $leftItem $leftProperties $output AddItemProperties $rightItem $rightProperties $output } $output}
<#.Synopsis Joins two lists of objects.DESCRIPTION Joins two lists of objects.EXAMPLE Join-Object $a $b "Id" ("Name","Salary")#>function Join-Object{ [CmdletBinding()] [OutputType([int])] Param ( # List to join with $Right [Parameter(Mandatory=$true, Position=0)] [object[]] $Left,
# List to join with $Left [Parameter(Mandatory=$true, Position=1)] [object[]] $Right,
# Condition in which an item in the left matches an item in the right # typically something like: {$args[0].Id -eq $args[1].Id} [Parameter(Mandatory=$true, Position=2)] [scriptblock] $Where,
# Properties from $Left we want in the output. # Each property can: # - Be a plain property name like "Name" # - Contain wildcards like "*" # - Be a hashtable like @{Name="Product Name";Expression={$_.Name}}. Name is the output property name # and Expression is the property value. The same syntax is available in select-object and it is # important for join-object because joined lists could have a property with the same name [Parameter(Mandatory=$true, Position=3)] [object[]] $LeftProperties,
# Properties from $Right we want in the output. # Like LeftProperties, each can be a plain name, wildcard or hashtable. See the LeftProperties comments. [Parameter(Mandatory=$true, Position=4)] [object[]] $RightProperties,
# Type of join. # AllInLeft will have all elements from Left at least once in the output, and might appear more than once # if the where clause is true for more than one element in right, Left elements with matches in Right are # preceded by elements with no matches. This is equivalent to an outer left join (or simply left join) # SQL statement. # AllInRight is similar to AllInLeft. # OnlyIfInBoth will cause all elements from Left to be placed in the output, only if there is at least one # match in Right. This is equivalent to a SQL inner join (or simply join) statement. # AllInBoth will have all entries in right and left in the output. Specifically, it will have all entries # in right with at least one match in left, followed by all entries in Right with no matches in left, # followed by all entries in Left with no matches in Right.This is equivallent to a SQL full join. [Parameter(Mandatory=$false, Position=5)] [ValidateSet("AllInLeft","OnlyIfInBoth","AllInBoth", "AllInRight")] [string] $Type="OnlyIfInBoth" )
Begin { # a list of the matches in right for each object in left $leftMatchesInRight = new-object System.Collections.ArrayList
# the count for all matches $rightMatchesCount = New-Object "object[]" $Right.Count
for($i=0;$i -lt $Right.Count;$i++) { $rightMatchesCount[$i]=0 } }
Process { if($Type -eq "AllInRight") { # for AllInRight we just switch Left and Right $aux = $Left $Left = $Right $Right = $aux }
# go over items in $Left and produce the list of matches foreach($leftItem in $Left) { $leftItemMatchesInRight = new-object System.Collections.ArrayList $null = $leftMatchesInRight.Add($leftItemMatchesInRight)
for($i=0; $i -lt $right.Count;$i++) { $rightItem=$right[$i]
if($Type -eq "AllInRight") { # For AllInRight, we want $args[0] to refer to the left and $args[1] to refer to right, # but since we switched left and right, we have to switch the where arguments $whereLeft = $rightItem $whereRight = $leftItem } else { $whereLeft = $leftItem $whereRight = $rightItem }
if(Invoke-Command -ScriptBlock $where -ArgumentList $whereLeft,$whereRight) { $null = $leftItemMatchesInRight.Add($rightItem) $rightMatchesCount[$i]++ } } }
# go over the list of matches and produce output for($i=0; $i -lt $left.Count;$i++) { $leftItemMatchesInRight=$leftMatchesInRight[$i] $leftItem=$left[$i] if($leftItemMatchesInRight.Count -eq 0) { if($Type -ne "OnlyIfInBoth") { WriteJoinObjectOutput $leftItem $null $LeftProperties $RightProperties $Type }
continue }
foreach($leftItemMatchInRight in $leftItemMatchesInRight) { WriteJoinObjectOutput $leftItem $leftItemMatchInRight $LeftProperties $RightProperties $Type } } }
End { #produce final output for members of right with no matches for the AllInBoth option if($Type -eq "AllInBoth") { for($i=0; $i -lt $right.Count;$i++) { $rightMatchCount=$rightMatchesCount[$i] if($rightMatchCount -eq 0) { $rightItem=$Right[$i] WriteJoinObjectOutput $null $rightItem $LeftProperties $RightProperties $Type } } } }}
Great function! I added it to my default modules list. Do you think will get included with the default PS 3.0 package?
updated the script part of this blog today (7/26/2012) to support the hashtable syntax for the Property parameter of Join-Object. Please see the property documentation for details.
Regarding Jeremy's question: Join-Object will not be in the default PS 3.0 package.
Thanks for putting this out here. This is really a needed function for PowerShell and it is to bad it wasn't included in V3. There is one change I would suggest with it though. I seems to make sense for the most part except for the Where parameter. When you call the function you have a Left and Right parameter, but in the Where clause you reference those with $args[0] and $args[1].
So a where clause looks like:
-Where { $args[0].Id -eq $args[1].EmployeeId }
It would be nicer if it were to use $Left and $Right. Then it would be easier to understand. It shouldn't be too hard to do if you just recreate the Where scriptblock inside your function with something like the following.
$NewWhere = $ExecutionContext.InvokeCommand.NewScriptBlock( 'param ( $Left, $Right)' + $Where.ToString())
Then the Where parameter would be used like:
-Where { $Left.Id -eq $Right.EmployeeId }
This would be more intuitive than it currently is, although I could see some benifit if the parameters were $LeftItem and $RightItem, but I prefer it to have the same name as the parameters used in the function even if there is some room for confusion ( one is a collection and one is a single item).
Excellent stuff. This is part of my workshop content now...
Thanks,
Raimund
Mark!
This section is very useful to me.