Sunday, December 16, 2007

Liking LINQ: The Learning Curve

This is the second post in a series on Language Integrated Query.  I'm pushing LINQ's buttons and bumping into some of its boundaries. 

Every abstraction leaks at some point, and LINQ to SQL is no exception.  Consider the following code:

NorthwindDataContext d = new NorthwindDataContext(); 
int? quantityThreshold = null;
var sales = from p in d.Products
join od in d.Order_Details on p.ProductID equals od.ProductID
where !p.Discontinued && (quantityThreshold.HasValue ? od.Quantity >= quantityThreshold.Value : true)
select p;



So, when you begin fetching data out of "sales" you'll see the problem.  A run-time error is thrown because the expression tree visitor attempts to greedily evaluate quantityThreshold.Value.  Let's try to move the evaluation out of the LINQ expression.



Predicate<Order_Detail> hasSufficientQuantity = o => quantityThreshold.HasValue ? o.Quantity >= quantityThreshold : true;
var sales = from p in d.Products
join od in d.Order_Details on p.ProductID equals od.ProductID
where !p.Discontinued && hasSufficientQuantity.Invoke(od)
select p;


Well, that doesn't work either.  "The method or operation is not implemented." The expression tree visitor has no idea what this hasSufficientQuantity method is... changing it to hasSufficientQuantity.Invoke(od) reveals that we are barking up the wrong tree, no pun intended.  The error given then is that our Predicate function cannot be translated.  Okay... let's look at why.



This fun LINQ expression syntax in C# is just syntactic sugar for a bunch of extension methods with signatures so jam-packed with Generics, you'd think it was Wal-Mart.  So, we are grateful to our C# language team for the sugar.  But, it does tend to hide what is really going on, making it difficult to figure out why the syntax seems so finicky.  Our LINQ expression above would translate into imperative code similar to the following:



var sales = d.Products.Join(d.Order_Details, p => p.ProductID, o => o.ProductID, (p, o) => new { Product = p, Order_Detail = o }).Where(p => !p.Product.Discontinued && hasSufficientQuantity.Invoke(p.Order_Detail)).Select(p => p.Product);



This isn't exactly pretty, and it doesn't really help us to understand why our function can't be translate, or does it?  Consider what these function calls are doing.  They are taking arguments, primarily Func<...> objects, and storing them internal in an expression tree.  We know from stepping through the code that the execution of our supplied Func<...> objects (the lambda expressions above) is deferred until we start accessing values from "sales".  So, there must be some internal storage of our intent.  Further, the code above must be translated to SQL by the System.Data.Linq libraries, and we can gather from they call stack on our exception that they are using the Visitor pattern to translate the nodes of the expression tree into SQL statements.



What happens when they visit the node that calls invokes the hasSufficientQuantity Predicate?  Well, that code--the Preciate object instance itself--is not available in SQL, so the translation fails.  This seems obvious, but consider that if we were using LINQ to Objects here, any of these approaches would work fine, as the predicate would be available in the execution environment of the translated expression tree, where it wasn't for SQL.



This is a contrived example, of course, and we could "code around" this in any number of ways, e.g.



where !p.Discontinued && od.Quantity >= (quantityThreshold ?? 0)


However, we are still seeing the LINQ to SQL abstraction leak pretty severely.



There are some gotchas out there as well, of course.  Consider the following SQL statement that answers the question, "How many orders have my customers had for each of my products?"



SELECT o.CustomerID, od.ProductID, COUNT(*) as [Number of Orders] 
FROM dbo.Orders o JOIN dbo.[Order Details] od
ON o.OrderID = od.OrderID
GROUP BY od.ProductID, o.CustomerID


How might we attempt to answer the same question with LINQ to SQL?  Notice that we are specifying two columns to group by in our query.  Here's what we might like to write in LINQ:



NorthwindDataContext d = new NorthwindDataContext();
var results = from o in d.Orders
join od in d.Order_Details on o.OrderID equals od.OrderID
group by o.CustomerID, od.ProductID into g
select new {g.CustomerID, g.ProductID, g.Count()};


Of course, this doesn't even come close to compiling.  Here's the right way to do use multiple columns in a groupby: use a tuple!



var results = from od in d.Order_Details
group od by new {od.Order.CustomerID, od.ProductID} into orders
select new { orders.Key.CustomerID, orders.Key.ProductID, NumberOfOrders = orders.Count() };



Once you start getting the gestalt of LINQ, you'll find yourself creating tuples all over the place.  Consider this query expression to retrieve the total sales of each product in each territory:



var territorySales = from p in d.Products
join od in d.Order_Details on p.ProductID equals od.ProductID
join o in d.Orders on od.OrderID equals o.OrderID
join e in d.Employees on o.EmployeeID equals e.EmployeeID
join et in d.EmployeeTerritories on e.EmployeeID equals et.EmployeeID
join t in d.Territories on et.TerritoryID equals t.TerritoryID
where !p.Discontinued
group new { od.ProductID, p.ProductName, t.TerritoryID, t.TerritoryDescription, od.Quantity }
by new { od.ProductID, t.TerritoryID, p.ProductName, t.TerritoryDescription } into sales
orderby sales.Key.TerritoryDescription descending, sales.Key.ProductName descending
select new
{ Product = sales.Key.ProductName.Trim(), Territory = sales.Key.TerritoryDescription.Trim(), TotalSold = sales.Sum(s => s.Quantity) };



The interesting part of that expression is that I created a tuple in my group...by to "select" the data to pass on to the next expression.



What if what we really wanted were the top ten best-selling products in each territory?  Well, there's no "top" LINQ query expression keyword.  The standard query operators include a couple of methods that look interesting: Take(int) and TakeWhile(predicate).  Unfortunately, TakeWhile is among the standard query operators that is not supported in LINQ to SQL.  Why?  Well, it's because you couldn't write equivalent SQL, I imagine.  And, while Take(int) is supported, its not immediately useful in a situation like this where you want to apply it to subsets of your results.  Therefore, a more procedural result seems warranted.  I'll investigate this further in my next post on the topic.



It is interesting to note the situation that arises with certain standard LINQ query operators not being supported by various flavors of LINQ.  Because the standard query operators are implemented using extension methods, every LINQ provider must handle them all, including those they cannot support.  This means throwing the NotSupportedException from the implementation of those methods.  The System.Linq.Queryable static class is where the standard query operators are implemented, defining the operators on IQueryable<T>.  LINQ to SQL classes like Table implement this interface, as do all class that participate in LINQ expressions.



Despite using the same syntax, the LINQ providers will each have their own significant learning curve due to variations in the operators they support and their own quirks. Next time we'll try to implement a top(x) query in LINQ to SQL.