
Tuesday, December 15, 2009

Premature Optimization and LINQ to SQL

Don’t count your chickens before they hatch.  Let’s call that adage the “count-no” principle.  I promise that will be funny later.

The common interpretation of this adage is that you should not rely on future results when making decisions in the present.  It is generally an admonishment not to be too cocky about the future.  What might the corollary be?  Perhaps to concern ourselves with the details of the present moment rather than fretting over the future.  We’ll call this corollary “nocount”.  One specific example of this corollary is the programmer’s maxim about premature optimization.  In fact, it was Donald Knuth who said, “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil,” in his response to Dijkstra’s “Go To Statement Considered Harmful”.

So, I’m debugging a particularly pernicious problem today.  In production (and only in production) we were getting a ChangeConflictException when updating a row in the database.  Unfortunately, the logs indicated that there were no MemberChangeConflict objects; in other words, we weren’t violating optimistic concurrency.  So I profiled the SQL statements, and the update appeared to be just what I expected: a simple UPDATE supporting optimistic concurrency.
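
For the curious, the conflict logging looked roughly like the sketch below.  It isn’t our production code (db and Log are stand-ins for our DataContext and logging helper), but ChangeConflicts, ObjectChangeConflict, and MemberChangeConflict are the real System.Data.Linq types:

try
{
    db.SubmitChanges(ConflictMode.ContinueOnConflict);
}
catch (ChangeConflictException)
{
    // In our case this loop logged nothing at all: the exception fired,
    // but there were no member-level conflicts to report.
    foreach (ObjectChangeConflict conflict in db.ChangeConflicts)
    {
        foreach (MemberChangeConflict member in conflict.MemberConflicts)
        {
            Log(string.Format("{0}: current={1}, database={2}",
                member.Member.Name, member.CurrentValue, member.DatabaseValue));
        }
    }
}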

Did you know you can eke out an exceedingly minuscule amount of performance by using the “SET NOCOUNT ON” option with SQL Server?  Did you know that you can actually set this option at the server level? Do you know what happens when your LINQ to SQL DataContext uses the row count returned from your optimistically concurrent update statement to determine if one and only one row was updated?
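
Here is the failure mode in miniature, sketched with plain ADO.NET (System.Data.SqlClient) rather than the LINQ to SQL internals; the table, columns, connection string, and version value are made up, but the -1 return value is real:

// Hedged sketch; Widgets, RowVersion, connectionString, and expectedVersion
// are hypothetical.
using (var connection = new SqlConnection(connectionString))
{
    connection.Open();
    var command = new SqlCommand(
        "SET NOCOUNT ON; " +    // or inherited from a server-level default
        "UPDATE Widgets SET Name = @name WHERE Id = @id AND RowVersion = @version;",
        connection);
    command.Parameters.AddWithValue("@name", "foo");
    command.Parameters.AddWithValue("@id", 1);
    command.Parameters.AddWithValue("@version", expectedVersion);

    // With NOCOUNT on, SQL Server never sends a "rows affected" count, so this
    // returns -1 even though the row was updated. LINQ to SQL makes the same
    // check after its optimistic-concurrency UPDATE, and anything other than
    // exactly one affected row surfaces as a ChangeConflictException.
    int rowsAffected = command.ExecuteNonQuery();
}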

Yes, dear reader, premature optimization can cause the most insidious and difficult-to-discover problems.  Please, don’t ever violate the “nocount” principle!

Friday, December 21, 2007

SQL Optimization: Substring Joins

The DBA at my current client has some mad T-SQL skills.  He took a naïve join implementation I had written and improved its performance by three orders of magnitude.

My initial query looked something like this:

SELECT A.*, B.* FROM A JOIN B ON A.Message LIKE '%' + B.Identifier + '%'

Obviously this isn't the best situation in the first place. We don't have a clean relation between these two tables; instead, we've got to pick out an identifier (a GUID/uniqueidentifier) from a larger text field. Notwithstanding some obvious ETL or trigger possibilities, the statement above seems the simplest solution. The only problem is that it is slooow.

The DBA had a great optimization, which was essentially to tease out the identifier from the field just as you would if you were writing a trigger to create a join column. Putting this in your query lets you join on the derived column. The performance implications are well above what I would have guessed them to be, operating, as I am, from naïveté. Here's the generalized solution based on our example above:

SELECT B.*, J.* FROM B JOIN
(
  SELECT A2.*,
  CAST(SUBSTRING(A2.Message, A2.GuidStart, A2.GuidEnd - A2.GuidStart) AS uniqueidentifier) AS JoinGuid
  FROM
  (
    SELECT A1.*,
    CHARINDEX('TokenAfterGuid', A1.Message, A1.GuidStart) as GuidEnd
    FROM
    (
      SELECT A.*,
      CHARINDEX('TokenBeforeGuid:', A.Message) + LEN('TokenBeforeGuid:') + 1 AS GuidStart
      FROM A
    ) AS A1
  ) AS A2
) AS J ON B.Guid = J.JoinGuid

Of course, you would expand the wildcard (*) references. This is really a great technique considering the performance ramifications. Certainly in our case, where the query was for a view, it was a wonderful improvement.  The best option from a performance standpoint would still be to tease out the uniqueidentifier in an INSERT/UPDATE trigger, create an index on the new column, and join the tables on that; however, in situations where you don't have the option of using ETL or triggers, this approach can be very useful.

Sunday, December 16, 2007

Liking LINQ: The Learning Curve

This is the second post in a series on Language Integrated Query.  I'm pushing LINQ's buttons and bumping into some of its boundaries. 

Every abstraction leaks at some point, and LINQ to SQL is no exception.  Consider the following code:

NorthwindDataContext d = new NorthwindDataContext(); 
int? quantityThreshold = null;
var sales = from p in d.Products
join od in d.Order_Details on p.ProductID equals od.ProductID
where !p.Discontinued && (quantityThreshold.HasValue ? od.Quantity >= quantityThreshold.Value : true)
select p;



So, when you begin fetching data out of "sales" you'll see the problem.  A run-time error is thrown because the expression tree visitor attempts to greedily evaluate quantityThreshold.Value.  Let's try to move the evaluation out of the LINQ expression.



Predicate<Order_Detail> hasSufficientQuantity = o => quantityThreshold.HasValue ? o.Quantity >= quantityThreshold : true;
var sales = from p in d.Products
join od in d.Order_Details on p.ProductID equals od.ProductID
where !p.Discontinued && hasSufficientQuantity(od)
select p;


Well, that doesn't work either.  "The method or operation is not implemented." The expression tree visitor has no idea what this hasSufficientQuantity method is... changing it to hasSufficientQuantity.Invoke(od) reveals that we are barking up the wrong tree, no pun intended.  The error given then is that our Predicate function cannot be translated.  Okay... let's look at why.



This fun LINQ expression syntax in C# is just syntactic sugar for a bunch of extension methods with signatures so jam-packed with Generics, you'd think it was Wal-Mart.  So, we are grateful to our C# language team for the sugar.  But, it does tend to hide what is really going on, making it difficult to figure out why the syntax seems so finicky.  Our LINQ expression above would translate into imperative code similar to the following:



var sales = d.Products
    .Join(d.Order_Details,
          p => p.ProductID,
          o => o.ProductID,
          (p, o) => new { Product = p, Order_Detail = o })
    .Where(p => !p.Product.Discontinued && hasSufficientQuantity.Invoke(p.Order_Detail))
    .Select(p => p.Product);



This isn't exactly pretty, and it doesn't really help us understand why our function can't be translated... or does it?  Consider what these function calls are doing.  They are taking arguments, primarily Func<...> objects, and storing them internally in an expression tree.  We know from stepping through the code that the execution of our supplied Func<...> objects (the lambda expressions above) is deferred until we start accessing values from "sales".  So, there must be some internal storage of our intent.  Further, the code above must be translated to SQL by the System.Data.Linq libraries, and we can gather from the call stack on our exception that they are using the Visitor pattern to translate the nodes of the expression tree into SQL statements.



What happens when the visitor reaches the node that invokes the hasSufficientQuantity predicate?  Well, that code, the Predicate object instance itself, is not available in SQL, so the translation fails.  This seems obvious, but consider that if we were using LINQ to Objects here, any of these approaches would work fine, because the predicate would be available in the execution environment of the translated expression tree, which it isn't for SQL.
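
To see the difference, pull the rows into memory first; the very same delegate is then perfectly happy to run.  A minimal sketch (it fetches every Order_Detail, so it is illustrative rather than advisable):

// LINQ to Objects: no translation step, so the delegate simply executes.
List<Order_Detail> details = d.Order_Details.ToList();
var bigEnough = details.Where(od => hasSufficientQuantity(od)).ToList();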



This is a contrived example, of course, and we could "code around" this in any number of ways, e.g.



where !p.Discontinued && od.Quantity >= (quantityThreshold ?? 0)


However, we are still seeing the LINQ to SQL abstraction leak pretty severely.
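
Two DataContext members make that leak easier to live with: Log echoes the generated T-SQL as queries execute, and GetCommand forces the translation immediately, so an expression that can't be translated should fail right away instead of waiting for enumeration.  A quick sketch:

d.Log = Console.Out;                     // echo the generated T-SQL as queries run
var command = d.GetCommand(sales);       // translation happens here, not at enumeration
Console.WriteLine(command.CommandText);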



There are some gotchas out there as well, of course.  Consider the following SQL statement that answers the question, "How many orders have my customers had for each of my products?"



SELECT o.CustomerID, od.ProductID, COUNT(*) as [Number of Orders] 
FROM dbo.Orders o JOIN dbo.[Order Details] od
ON o.OrderID = od.OrderID
GROUP BY od.ProductID, o.CustomerID


How might we attempt to answer the same question with LINQ to SQL?  Notice that we are specifying two columns to group by in our query.  Here's what we might like to write in LINQ:



NorthwindDataContext d = new NorthwindDataContext();
var results = from o in d.Orders
join od in d.Order_Details on o.OrderID equals od.OrderID
group by o.CustomerID, od.ProductID into g
select new {g.CustomerID, g.ProductID, g.Count()};


Of course, this doesn't even come close to compiling.  Here's the right way to use multiple columns in a group by: use a tuple!



var results = from od in d.Order_Details
group od by new {od.Order.CustomerID, od.ProductID} into orders
select new { orders.Key.CustomerID, orders.Key.ProductID, NumberOfOrders = orders.Count() };



Once you start getting the gestalt of LINQ, you'll find yourself creating tuples all over the place.  Consider this query expression to retrieve the total sales of each product in each territory:



var territorySales = from p in d.Products
join od in d.Order_Details on p.ProductID equals od.ProductID
join o in d.Orders on od.OrderID equals o.OrderID
join e in d.Employees on o.EmployeeID equals e.EmployeeID
join et in d.EmployeeTerritories on e.EmployeeID equals et.EmployeeID
join t in d.Territories on et.TerritoryID equals t.TerritoryID
where !p.Discontinued
group new { od.ProductID, p.ProductName, t.TerritoryID, t.TerritoryDescription, od.Quantity }
by new { od.ProductID, t.TerritoryID, p.ProductName, t.TerritoryDescription } into sales
orderby sales.Key.TerritoryDescription descending, sales.Key.ProductName descending
select new
{ Product = sales.Key.ProductName.Trim(), Territory = sales.Key.TerritoryDescription.Trim(), TotalSold = sales.Sum(s => s.Quantity) };



The interesting part of that expression is that I created a tuple in my group...by to "select" the data to pass on to the next expression.



What if what we really wanted were the top ten best-selling products in each territory?  Well, there's no "top" LINQ query expression keyword.  The standard query operators include a couple of methods that look interesting: Take(int) and TakeWhile(predicate).  Unfortunately, TakeWhile is among the standard query operators that are not supported in LINQ to SQL.  Why?  Because you couldn't write equivalent SQL, I imagine.  And, while Take(int) is supported, it's not immediately useful in a situation like this where you want to apply it to subsets of your results (see the sketch below).  Therefore, a more procedural approach seems warranted.  I'll investigate this further in my next post on the topic.
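
For what it's worth, Take translates cleanly; it just trims the wrong thing for this problem:

// Take(10) becomes a TOP-style query, but it is applied to the whole ordered
// result set, not to each territory's slice of it: the first ten rows overall,
// not the ten best sellers per territory.
var firstTenRows = territorySales.Take(10).ToList();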



It is interesting to note the situation that arises with certain standard LINQ query operators not being supported by the various flavors of LINQ.  Because the standard query operators are implemented as extension methods on IQueryable<T>, every one of them compiles against every LINQ provider, including the operators a given provider cannot support; the unsupported ones surface as a NotSupportedException at run time, when the provider tries to translate them.  The System.Linq.Queryable static class is where the standard query operators are defined, and LINQ to SQL classes like Table<T> implement IQueryable<T>, as do all classes that participate in LINQ query expressions.
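
A quick illustration using the TakeWhile example from above (the exact message varies, but the timing is the interesting part):

// This compiles without complaint, because Queryable.TakeWhile is defined for
// any IQueryable<T>...
var unsupported = d.Order_Details.TakeWhile(od => od.Quantity > 10);
try
{
    // ...but the provider only discovers it cannot translate the operator when
    // the query actually executes.
    var rows = unsupported.ToList();
}
catch (NotSupportedException ex)
{
    Console.WriteLine(ex.Message);
}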



Despite using the same syntax, the LINQ providers will each have their own significant learning curve due to variations in the operators they support and their own quirks. Next time we'll try to implement a top(x) query in LINQ to SQL.

Wednesday, December 5, 2007

Using arrays in SQL?

This is an excellent method of using arrays of integers in SQL queries. The basic idea is to convert your integer array to a byte array (the semantic equivalent of varbinary(MAX)), then to write a CLR table-valued function to convert the varbinary(MAX) to a table of integer values.
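
The client-side half of the trick is just a straight byte copy.  A minimal sketch; dbo.ToIntTable is a made-up name for the CLR table-valued function, which would walk the bytes four at a time on the server and yield one int per row:

// Pack the int array into a byte array so it can travel as a single
// varbinary(MAX) parameter.
int[] ids = { 3, 17, 42 };
byte[] packed = new byte[ids.Length * sizeof(int)];
Buffer.BlockCopy(ids, 0, packed, 0, packed.Length);

// Hypothetical usage on the SQL side:
// SELECT p.* FROM Products p WHERE p.ProductID IN (SELECT Value FROM dbo.ToIntTable(@ids))
var idsParameter = new SqlParameter("@ids", SqlDbType.VarBinary, -1) { Value = packed };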


That's just cool.


I picked this up in a comment to Ayende Rahien's post about performance issues with his technique for working around the 2100 item limit imposed by SQL Server for an IN clause in T-SQL. This could show up in a number of situations where a linked server query wouldn't work. Say, for example, you were calling an API for the most recent searches on XYZ.com and wanted to match those keywords against recent searches on your site. You'd have, basically, an array of search terms that you wanted to find in your database. There are a number of solutions to this, but from what I've seen the best performance dynamics (assuming a somewhat large dataset) are given by creating some sort of temporary table and running your query against that.


The method of using integer arrays above is the most imaginative I've seen, though other alternatives exist, such as BULK INSERTing the data, running your query, and then rolling back the transaction, as suggested in another of the comments.