Friday, December 21, 2007

SQL Optimization: Substring Joins

The DBA at my current client has some mad T-SQL skills.  He took a naïve join implementation I had written and improved its performance by three orders of magnitude.

My initial query looked something like this:

SELECT A.*, B.* FROM A JOIN B ON A.Message LIKE '%' + B.Identifier + '%'

Obviously this isn't the best situation in the first place. We don't have a clean relation between these two tables, instead we've got to pick out an identifier (GUID/uniqueidentifier) from a larger text field. Notwithstanding some obvious ETL or trigger possibilities, the statement above seems the simplest solution. The only problem is it is slooow.

The DBA had a great optimization which was essentially to tease out the identifier from the field as you would if you were writing a trigger to create a join column. Putting this in your query allows you to join on this derived column in your code. The performance implications are well above what I would have guessed them to be, operating–as I am–from naïveté. Here's the generalized solution based on our example above:

SELECT B.*, J.* FROM B JOIN
(
  SELECT A2.*,
  CAST(SUBSTRING(A2.Message, A2.GuidStart, A2.GuidEnd - A2.GuidStart) AS uniqueidentifier) AS JoinGuid
  FROM
  (
    SELECT A1.*,
    CHARINDEX('TokenAfterGuid', A1.Message, A1.Start) as GuidEnd
    FROM
    (
      SELECT A.*,
      CHARINDEX('TokenBeforeGuid:', A.Message) + LEN('TokenBeforeGuid:') + 1 AS GuidStart
      FROM A
    ) AS A1
  ) AS A2
) ON B.Guid = J.JoinGuid

Of course, you would expand the wildcard (*) references. This is really a great technique considering the performance ramifications. Certainly in our case, where the query was for a view, this was a wonderful improvement.  Obviously, the best option from a performance stand point would be to tease out the uniqueidentifier in an INSERT/UPDATE trigger, create an index on the new column, and join the tables on that; however, in situations where you don't have the option of doing ETL or triggers this can be useful.