Tuesday, April 8, 2008

Bulk Deletes in Linq to Sql

UPDATE: Terry Aney independently came up with a similar solution to this problem as I did. Terry however has a better implementation in that his supports compound primary keys. He's also got similar support for batch updates in there. I recommend that you read about his approach here but then read his follow-up and download a later version of his code here. Thanks Terry! A couple of days ago I was working on an application that uses Linq for its data access layer and I was trying to figure out the Linq way of doing a bulk delete. To illustrate the type of operation I was attempting let's imagine a simple HR database. This database has two tables: Employees and EmployeeTypes. A relationship between the tables exists such that each Employee has one and only one type while each EmployeeType has 0, 1 or many Employees. This is accomplished with the following schema: Employees ======== EmployeeID int(primary key, identity) EmployeeName varchar EmployeeTypeID int (foreign key to EmployeeTypes) EmployeeTypes ============ EmployeeTypeID int (referenced by Employees) TypeDesc varchar Let's say that in my imaginary organization we had been experimenting with using offshore developers to decrease our development costs. Now let's imagine that (just like in most attempts at off shoring development) the higher ups have realized that the code coming from across the pond is total crap and that you really do get what you pay for. Therefore the powers that be have decided to can the whole offshore team. Now we need to update the database to get rid of all the offshore employees in the database. (In reality we probably wouldn't really want to delete these rows but this is just an example.) In Sql this would be a simple thing to accomplish. We could simply execute the following query:
DELETE FROM Employees
WHERE EmployeeTypeID IN
(SELECT EmployeeTypeID
FROM EmployeeTypes
WHERE TypeDesc = 'Offshore')
Much to my surprise there is no clear analog to this query in Linq. I thought that this would work:
var toDelete = from e in db.Employees
   where e.EmployeeType.TypeDesc == "Offshore"
   select e;

db.Employees.DeleteAllOnSubmit(toDelete);
db.SubmitChanges();
In fact it does work but when I ran the Sql Profiler I noticed something quite troubling. When Linq executed this code it was first running the select defined by my toDelete var and retrieving the results. It was then going through each Employee returned by the query and issuing a SEPARATE delete call for each. To me this seemed needlessly inefficient so I started Googling for a solution. I came up with two hits that attempted to solve this problem. The first was from Chris Rock (no not that Chris Rock.) He wrote a good post about a solution to the problem that a guy named Benjamin Eidelman wrote. You can see this solution at Chris's blog, RocksThoughts. But there were some problems with this solution. Firstly the code he posted did not build as posted. I traced down the problem to the fact that the capture groups in the regex he was using to parse the query were not named properly. I'm guessing that since .NET regexes put their capture group names between less than/greater than symbols his blogging app is stripping out the group names thinking that they are HTML. Fair enough. I've contacted him and hopefully he'll get the fix I supplied posted. A much bigger problem for me though was that when I tried to use his code to accomplish the above defined task I got an error that my query type wasn't supported. I tried to use his code like so:
db.DeleteAll(db.Employees.Where(e => e.EmployeeType.TypeDesc == "Offshore"));
This is definitely what I wanted but in examining the code from Chris's blog I quickly determined that it would only work if the Where clause was very simple and limited to checking values on the object being deleted. Joins are not supported by this code at all as far as I can see. And in fact while looking at the underlying code I realized that it relies heavily on string manipulation to achieve its results. And that certainly doesn't seem very Linq-y to me. Linq gives us the facility to pass around a query as a logical construct so it seemed to me that we should be operating on that construct rather than its interpreted string representation. So from Chris's blog I then followed a link to a Lambda expression-based solution to the same problem. The solution was from Jeffrey Zhao's blog. Here I thought I would find what I was looking for. A method that examined an IQueryable construct and parsed it into an efficient Sql DELETE statement. And yet upon further examination I found that this wasn't really what I was looking for either. First of all, rather than taking an IQueryable as an argument it takes a predicate defined as Expression>. While I think this cuts down on the readability of the consuming code I figured no big deal and so I tried my delete query with this library like so:
db.Employees.Delete(e => e.EmployeeType.TypeDesc == "Offshore");
This built but when I ran it it the generated query did not join in the EmployeeType table. Instead the query was looking for a field called TypeDesc on the Employees table. So of course it failed. I came to the conclusion that while this method was probably somewhat better than Mr. Eidelman's string manipulations it still didn't do what I wanted. If you look on the message boards, etc. you will see people asking this question and the answer is usually either to use one of the two above methods or else to write a custom stored procedure to accomplish this type of delete. But having to write a stored proc for every different type of bulk delete I might need to perform just rubbed me the wrong way. I wondered if it would be possible for me to come up with a solution more to my liking. After some thought I came up with something that so far I'm liking quite a bit. It takes its inspiration from the solutions explored above, combining the legibility of defining the records to delete in a simple Linq query with the relative type safety of using the expression tree to generate the where clause. Now it doesn't work in all situations.The main one I see right now is that it doesn't support tables with compound primary keys. But other than that I think it's pretty nice. The usage is pretty straightforward:
var toDelete = from e in db.Employees
        where e.EmployeeType.TypeDesc == "Offshore"
        select e.EmployeeID;

db.Employees.BulkDelete(toDelete);


Note that the query is selecting a list of the ids of the records to be deleted. This is key because basically what the code does is wrap the select query defined in Linq inside a delete query based on the primary key of the table being deleted from. So the query that is actually executed on the server from the above is:
delete from Employees
where EmployeeID in
(select EmployeeID
from Employees t0
inner join EmployeeTypes t1 on t0.EmployeeTypeID = t1.EmployeeTypeID
where t1.TypeDesc = 'Offshore')
Unfortunately I haven't yet been able to come up with a way to make the compiler ensure that the enumerable type returned by the user-supplied query matches the type of the primary key. That's just up to the developer right now. I'd love to hear if anyone's got any ideas beyond just making it only work with int. It should I think at the very least also support string and guid keys. But I think that the Sql Optimizer should be able to execute this sort of query quickly and efficiently. If anyone sees any problems please let me know! Here is the code for the extension method that actually does the work:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data.Linq;
using System.Data.Linq.Mapping;
using System.Data;

namespace LinqExtensions
{
public static class TableExtensions
{
    public static int BulkDelete<TEntity>(this Table<TEntity> table, IQueryable idsToDelete)
       where TEntity : class
    {
        MetaTable metaTable = table.Context.Mapping.GetTable(typeof(TEntity));
        string tableName = metaTable.TableName;
        if (metaTable.RowType.IdentityMembers.Count != 1)
        {
            var keyFields = from im in metaTable.RowType.IdentityMembers
                                 select im.MappedName;

            if (keyFields == null || keyFields.Count() < 1 )
            {
                keyFields = new List<string>();
            }

            throw new ApplicationException(string.Format("Error in TableExtensions.BulkDelete<TEntity>: Table {0} has a compound key.  BulkDelete can only operate on tables with a single key field.  (key fields found: {1}", tableName, string.Join(",", keyFields.ToArray())));
        }
        string primaryKey = metaTable.RowType.IdentityMembers[0].MappedName;

        System.Data.Common.DbCommand origCmd = table.Context.GetCommand(idsToDelete);
        string toDelete = origCmd.CommandText;
        string sql = string.Format("delete from {0} where {1} in ({2})", tableName, primaryKey, toDelete);

        origCmd.CommandText = sql;

        bool openedConn = false;

        if (origCmd.Connection.State != ConnectionState.Open)
        {
            openedConn = true;
            origCmd.Connection.Open();
        }

        try
        {
            return origCmd.ExecuteNonQuery();
        }
        finally
        {
            if (openedConn)
            {
                origCmd.Connection.Close();
            }
        }

    }
}
}

Post in the comments and let me know how it works for you!

Thursday, April 3, 2008

The REAL Current Directory in .NET

I've got this little open source C# app I wrote called PowerFlag. PowerFlag is a response to the unending plethora of spammers on Craigslist. Now, I love Craigslist but the spammers really get on my nerves. So I wrote this app to automatically flag as spam posts which contain keywords that the user defines. The app worked pretty great except for the fact that when I tried to run it through the Windows Scheduler it seemed to run fine judging by the logs but it never seemed to find any posts to flag. For instance if I ran the app by double-clicking it and clicked the Flag! button in the app it would find maybe 10 or so spam posts and flag them. This would be evident via a text box in the app and the log file that is created on every run. If I right-clicked the scheduled task and chose 'Run' though the log file would be created but it would say that no posts were found to flag. I lived with it this way for a couple of months before today I finally got around to figuring out what the root of the problem was and how to fix it. My problem was that the app uses two settings files and the definitions for the filepaths for those two files were stored in string constants which concatenated together System.Environment.CurrentDirectory with the file names of the files. By adding more logging I was able to see that when the app was run via the task scheduler the value of System.Environment.CurrentDirectory was not the directory the app executable was in but instead was for some reason c:\windows\system32. Upon further investigation I found different settings files for the app in this location as well. These settings files contained the initial settings that the program uses on first startup and did not contain all of the keywords I had defined using the app itself. The fix for this problem of System.Environment.CurrentDirectory not always being the same as the current directory of the executing app is to instead use this: System.IO.Path.GetDirectoryName(System.Reflection.Assembly._ GetExecutingAssembly().Location) This is pretty much guaranteed to always give you the path to the folder containing the current executable. Once I made this change my log files overflowed with flagged posts from the scheduled app. Voila!