Tuesday, April 8, 2008

Bulk Deletes in Linq to Sql

UPDATE: Terry Aney independently came up with a similar solution to this problem as I did. Terry however has a better implementation in that his supports compound primary keys. He's also got similar support for batch updates in there. I recommend that you read about his approach here but then read his follow-up and download a later version of his code here. Thanks Terry! A couple of days ago I was working on an application that uses Linq for its data access layer and I was trying to figure out the Linq way of doing a bulk delete. To illustrate the type of operation I was attempting let's imagine a simple HR database. This database has two tables: Employees and EmployeeTypes. A relationship between the tables exists such that each Employee has one and only one type while each EmployeeType has 0, 1 or many Employees. This is accomplished with the following schema: Employees ======== EmployeeID int(primary key, identity) EmployeeName varchar EmployeeTypeID int (foreign key to EmployeeTypes) EmployeeTypes ============ EmployeeTypeID int (referenced by Employees) TypeDesc varchar Let's say that in my imaginary organization we had been experimenting with using offshore developers to decrease our development costs. Now let's imagine that (just like in most attempts at off shoring development) the higher ups have realized that the code coming from across the pond is total crap and that you really do get what you pay for. Therefore the powers that be have decided to can the whole offshore team. Now we need to update the database to get rid of all the offshore employees in the database. (In reality we probably wouldn't really want to delete these rows but this is just an example.) In Sql this would be a simple thing to accomplish. We could simply execute the following query:
DELETE FROM Employees
WHERE EmployeeTypeID IN
(SELECT EmployeeTypeID
FROM EmployeeTypes
WHERE TypeDesc = 'Offshore')
Much to my surprise there is no clear analog to this query in Linq. I thought that this would work:
var toDelete = from e in db.Employees
   where e.EmployeeType.TypeDesc == "Offshore"
   select e;

db.Employees.DeleteAllOnSubmit(toDelete);
db.SubmitChanges();
In fact it does work but when I ran the Sql Profiler I noticed something quite troubling. When Linq executed this code it was first running the select defined by my toDelete var and retrieving the results. It was then going through each Employee returned by the query and issuing a SEPARATE delete call for each. To me this seemed needlessly inefficient so I started Googling for a solution. I came up with two hits that attempted to solve this problem. The first was from Chris Rock (no not that Chris Rock.) He wrote a good post about a solution to the problem that a guy named Benjamin Eidelman wrote. You can see this solution at Chris's blog, RocksThoughts. But there were some problems with this solution. Firstly the code he posted did not build as posted. I traced down the problem to the fact that the capture groups in the regex he was using to parse the query were not named properly. I'm guessing that since .NET regexes put their capture group names between less than/greater than symbols his blogging app is stripping out the group names thinking that they are HTML. Fair enough. I've contacted him and hopefully he'll get the fix I supplied posted. A much bigger problem for me though was that when I tried to use his code to accomplish the above defined task I got an error that my query type wasn't supported. I tried to use his code like so:
db.DeleteAll(db.Employees.Where(e => e.EmployeeType.TypeDesc == "Offshore"));
This is definitely what I wanted but in examining the code from Chris's blog I quickly determined that it would only work if the Where clause was very simple and limited to checking values on the object being deleted. Joins are not supported by this code at all as far as I can see. And in fact while looking at the underlying code I realized that it relies heavily on string manipulation to achieve its results. And that certainly doesn't seem very Linq-y to me. Linq gives us the facility to pass around a query as a logical construct so it seemed to me that we should be operating on that construct rather than its interpreted string representation. So from Chris's blog I then followed a link to a Lambda expression-based solution to the same problem. The solution was from Jeffrey Zhao's blog. Here I thought I would find what I was looking for. A method that examined an IQueryable construct and parsed it into an efficient Sql DELETE statement. And yet upon further examination I found that this wasn't really what I was looking for either. First of all, rather than taking an IQueryable as an argument it takes a predicate defined as Expression>. While I think this cuts down on the readability of the consuming code I figured no big deal and so I tried my delete query with this library like so:
db.Employees.Delete(e => e.EmployeeType.TypeDesc == "Offshore");
This built but when I ran it it the generated query did not join in the EmployeeType table. Instead the query was looking for a field called TypeDesc on the Employees table. So of course it failed. I came to the conclusion that while this method was probably somewhat better than Mr. Eidelman's string manipulations it still didn't do what I wanted. If you look on the message boards, etc. you will see people asking this question and the answer is usually either to use one of the two above methods or else to write a custom stored procedure to accomplish this type of delete. But having to write a stored proc for every different type of bulk delete I might need to perform just rubbed me the wrong way. I wondered if it would be possible for me to come up with a solution more to my liking. After some thought I came up with something that so far I'm liking quite a bit. It takes its inspiration from the solutions explored above, combining the legibility of defining the records to delete in a simple Linq query with the relative type safety of using the expression tree to generate the where clause. Now it doesn't work in all situations.The main one I see right now is that it doesn't support tables with compound primary keys. But other than that I think it's pretty nice. The usage is pretty straightforward:
var toDelete = from e in db.Employees
        where e.EmployeeType.TypeDesc == "Offshore"
        select e.EmployeeID;

db.Employees.BulkDelete(toDelete);


Note that the query is selecting a list of the ids of the records to be deleted. This is key because basically what the code does is wrap the select query defined in Linq inside a delete query based on the primary key of the table being deleted from. So the query that is actually executed on the server from the above is:
delete from Employees
where EmployeeID in
(select EmployeeID
from Employees t0
inner join EmployeeTypes t1 on t0.EmployeeTypeID = t1.EmployeeTypeID
where t1.TypeDesc = 'Offshore')
Unfortunately I haven't yet been able to come up with a way to make the compiler ensure that the enumerable type returned by the user-supplied query matches the type of the primary key. That's just up to the developer right now. I'd love to hear if anyone's got any ideas beyond just making it only work with int. It should I think at the very least also support string and guid keys. But I think that the Sql Optimizer should be able to execute this sort of query quickly and efficiently. If anyone sees any problems please let me know! Here is the code for the extension method that actually does the work:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data.Linq;
using System.Data.Linq.Mapping;
using System.Data;

namespace LinqExtensions
{
public static class TableExtensions
{
    public static int BulkDelete<TEntity>(this Table<TEntity> table, IQueryable idsToDelete)
       where TEntity : class
    {
        MetaTable metaTable = table.Context.Mapping.GetTable(typeof(TEntity));
        string tableName = metaTable.TableName;
        if (metaTable.RowType.IdentityMembers.Count != 1)
        {
            var keyFields = from im in metaTable.RowType.IdentityMembers
                                 select im.MappedName;

            if (keyFields == null || keyFields.Count() < 1 )
            {
                keyFields = new List<string>();
            }

            throw new ApplicationException(string.Format("Error in TableExtensions.BulkDelete<TEntity>: Table {0} has a compound key.  BulkDelete can only operate on tables with a single key field.  (key fields found: {1}", tableName, string.Join(",", keyFields.ToArray())));
        }
        string primaryKey = metaTable.RowType.IdentityMembers[0].MappedName;

        System.Data.Common.DbCommand origCmd = table.Context.GetCommand(idsToDelete);
        string toDelete = origCmd.CommandText;
        string sql = string.Format("delete from {0} where {1} in ({2})", tableName, primaryKey, toDelete);

        origCmd.CommandText = sql;

        bool openedConn = false;

        if (origCmd.Connection.State != ConnectionState.Open)
        {
            openedConn = true;
            origCmd.Connection.Open();
        }

        try
        {
            return origCmd.ExecuteNonQuery();
        }
        finally
        {
            if (openedConn)
            {
                origCmd.Connection.Close();
            }
        }

    }
}
}

Post in the comments and let me know how it works for you!

6 comments:

Terry Aney said...

Hi. Nice article. I've approached the same problem and think I may have a solution you are looking for.

I actually created an implementation for everything Jeffrey Zhao was envisioning I think. Check it out here:

http://www.aneyfamily.com/terryandann/post/2008/04/Batch-Updates-and-Deletes-with-LINQ-to-SQL.aspx

Benjamin Eidelman said...

Hi!,

I found your article througth Chris Rock blog, thank you for the reference (I'm Mr Eidelman :)

Your solution is great!, composite keys aren't a very usual problem to me, but strong type checking on the ids would be important, because a nice thing about Linq is type errors detected at compilation time.
But that might be acheived checking:

idsToDelete.ElementType == metaTable.RowType.IdentityMembers[0].Type

(I haven't tested it)

Best Regards!

Terry Aney said...

On more side comment. Your implementation for delete was almost identical to mine. To get around your compound key problem and type checking, don't create a DELETE ?? WHERE key IN ( )

Instead create a DELETE statement and use a INNER JOIN on the primary key(s). That way you can have multiple keys and also doesn't force your consumer to 'return the key', they just return the entire entity (simliar to how DeleteAllOnSubmit() works).

Again, you can see more details if you need from my post/code.

steve kain said...

Terry, your article looks interesting. I've downloaded your code but haven't yet had a chance to really play with it. When I do I'll post my thoughts here.

steve kain said...

Ben,

Glad you liked the post. I agree that compound keys are not all that common but in the interest of completeness it would be nice to support them I think.

If for some reason Terry's solution doesn't work out for my needs I will certainly look into your recommendation for type checking on the identity. Thanks.

steve kain said...

Terry, finally managed to find the time to take a look at your code and it is indeed similar to mine but better. I've posted some text at the top of my original post letting people know that they are likely better off using your implementation than mine. I certainly will be doing so.

Thanks!