SQL Optimization: Substring Joins
The DBA at my current client has some mad T-SQL skills. He took a naïve join implementation I had written and improved its performance by three orders of magnitude.
My initial query looked something like this:
SELECT A.*, B.* FROM A JOIN B ON A.Message LIKE '%' + B.Identifier + '%'
Obviously this isn't the best situation in the first place. We don't have a clean relation between these two tables, instead we've got to pick out an identifier (GUID/uniqueidentifier) from a larger text field. Notwithstanding some obvious ETL or trigger possibilities, the statement above seems the simplest solution. The only problem is it is slooow.
The DBA had a great optimization which was essentially to tease out the identifier from the field as you would if you were writing a trigger to create a join column. Putting this in your query allows you to join on this derived column in your code. The performance implications are well above what I would have guessed them to be, operating–as I am–from naïveté. Here's the generalized solution based on our example above:
SELECT B.*, J.* FROM B JOIN
(
SELECT A2.*,
CAST(SUBSTRING(A2.Message, A2.GuidStart, A2.GuidEnd - A2.GuidStart) AS uniqueidentifier) AS JoinGuid
FROM
(
SELECT A1.*,
CHARINDEX('TokenAfterGuid', A1.Message, A1.Start) as GuidEnd
FROM
(
SELECT A.*,
CHARINDEX('TokenBeforeGuid:', A.Message) + LEN('TokenBeforeGuid:') + 1 AS GuidStart
FROM A
) AS A1
) AS A2
) ON B.Guid = J.JoinGuid
Of course, you would expand the wildcard (*) references. This is really a great technique considering the performance ramifications. Certainly in our case, where the query was for a view, this was a wonderful improvement. Obviously, the best option from a performance stand point would be to tease out the uniqueidentifier in an INSERT/UPDATE trigger, create an index on the new column, and join the tables on that; however, in situations where you don't have the option of doing ETL or triggers this can be useful.