Saturday, January 11, 2014

Using a User-Defined Table Type as an Alternative to SQL Bulk Insert in SQL Server


Recently I had to import a relatively large .CSV file into SQL Server from C# service application code. This had to be done without using an ETL tool like SSIS. The .CSV files arrived inside a zip file uploaded by a user of an ASP.NET MVC web application.

Some sample rows of the data, including the header row, from the .CSV file are shown below

Id, Some Description field, Another field, Yet another field, A Guid field
1046348264289,"Comma, in this field",no comma here,,48B6B3FE-C2E1-4B01-AB4E-90FBC98E1F81
1046348264289,"Comma, in this field",no comma here,,58B6B3FE-C2E1-4B01-AB4E-90FBC98E1F82
1046348264277,"Comma, in this field",no comma here,,48B6B3FE-C2E1-4B01-AB4E-90FBC98E1F81

The requirement was to capture only the first and last fields of every row. A single file could contain thousands of rows, and the zip file could contain many such files.

I initially started by parsing the uploaded file in C# and capturing only the required fields in a data structure that I would later turn into one big concatenated string, sent down to a stored procedure for further parsing and insertion into the target table. The parsing routine and the data structure used to capture the data are below.

        public static Dictionary<string, List<string>> ParseCsv(Stream stream)
        {
            var mapDictionary = new Dictionary<string, List<string>>();
            using (var parser = new TextFieldParser(stream))
            {
                stream.Position = 0;
                parser.Delimiters = new string[] { "," };

                parser.ReadLine(); //read first line to advance and ignore the header fields

                while (!parser.EndOfData)
                {
                    string[] fields = parser.ReadFields();

                    if (fields == null) continue;

                    List<string> guids;
                    if (!mapDictionary.TryGetValue(fields.FirstOrDefault(), out guids))
                    {
                        guids = new List<string>();
                        mapDictionary.Add(fields.FirstOrDefault(), guids);
                    }
                    guids.Add(fields.LastOrDefault());
                }
            }

            return mapDictionary;
        }


I would later use the routine below to build the string sent down to the stored procedure

        private void SaveMappings()
        {
            var mapped = new StringBuilder();
            foreach (var pair in _mapDictionary)
            {
                var guids = String.Join(",", pair.Value);
                mapped.Append(String.Format("{0},{1}|", pair.Key, guids));
//1046348264289,48B6B3FE-C2E1-4B01-AB4E-90FBC98E1F81,58B6B3FE-C2E1-4B01-AB4E-90FBC98E1F82|
            }
            //process all the mappings in one call to the backend
            _repository.RunStoredProcedure(mapped.ToString());
        }

However, this turned out to be overly complex: the stored procedure had to pick the data back apart using a recursive string-split table-valued function. Not very performant, and error prone to say the least.
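For reference, the recursive string-split table-valued function looked roughly like the sketch below (the function and column names here are hypothetical, not the ones I actually used). One of the pitfalls is visible in the shape of the code itself: a recursive CTE is capped at 100 recursion levels by default, and the OPTION (MAXRECURSION) hint cannot be applied inside a function, so every caller has to remember to supply it, which is part of what made this approach fragile.

```sql
-- Hypothetical sketch of a recursive string-split table-valued function.
-- Splits @input into rows on @delimiter using a recursive CTE.
CREATE FUNCTION [dbo].[SplitString]
( @input NVARCHAR(MAX), @delimiter NCHAR(1) )
RETURNS TABLE
AS
RETURN
WITH Pieces (start, stop) AS
(
    -- CHARINDEX over NVARCHAR(MAX) returns bigint, so the anchor
    -- column must be cast to match the recursive member
    SELECT CAST(1 AS BIGINT), CHARINDEX(@delimiter, @input)
    UNION ALL
    SELECT stop + 1, CHARINDEX(@delimiter, @input, stop + 1)
    FROM Pieces
    WHERE stop > 0
)
SELECT SUBSTRING(@input,
                 start,
                 CASE WHEN stop > 0 THEN stop - start ELSE LEN(@input) END) AS value
FROM Pieces
GO
```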

The good news is that from SQL Server 2008 onward, I could simply use a user-defined table type and send the data down to the back end as a table-valued parameter, which behaves like a read-only temporary table. No more complex string parsing in the stored procedure!


The user-defined table type method:

First, simply define the table structure you need in SQL like so

CREATE TYPE [dbo].[ItemTopicMap] AS TABLE(
        [ItemIdentifier] [bigint] NOT NULL,
        [TopicId] [uniqueidentifier] NOT NULL
)
GO
Next, simplify the earlier C# .CSV parsing routine as shown below. Note the use of the ADO.NET DataTable.

        public static DataTable ParseCsv(Stream stream)
        {
            var dt = new DataTable();
            //column types match the ItemTopicMap table type: bigint and uniqueidentifier
            dt.Columns.Add("ItemIdentifier", typeof(long));
            dt.Columns.Add("TopicId", typeof(Guid));
            using (var parser = new TextFieldParser(stream))
            {
                stream.Position = 0;
                parser.Delimiters = new string[] { "," };
                parser.ReadLine(); //read first line to advance and ignore the header fields
                while (!parser.EndOfData)
                {
                    string[] fields = parser.ReadFields();
                    if (fields == null) continue;
                    long itemIdentifier;
                    if (fields.FirstOrDefault().Length > 10 &&
                        Int64.TryParse(fields.FirstOrDefault(), out itemIdentifier) &&
                        IsValidGuid(fields.LastOrDefault()))
                    {
                        dt.Rows.Add(itemIdentifier, fields.LastOrDefault());
                    }
                    else
                    {
                        //log line out to bad file with filename and linenumber
                    }
                }
            }
            return dt;
        }
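The IsValidGuid helper referenced in the routine above is not shown in the original; a minimal stand-in using Guid.TryParse (available from .NET 4.0) might look like this. The class name CsvValidation is an assumption for illustration.

```csharp
using System;

public static class CsvValidation
{
    //Minimal stand-in for the IsValidGuid helper used by ParseCsv:
    //true only when the input parses as a GUID in any recognized format
    public static bool IsValidGuid(string input)
    {
        Guid parsed;
        return !String.IsNullOrWhiteSpace(input) && Guid.TryParse(input, out parsed);
    }
}
```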

Then, inside the repository's execute-stored-procedure method, do something like this

  public void RunStoredProcedure(DataTable dataTable)
  {
      var conStr = ConfigurationManager.ConnectionStrings["DefaultConn"].ConnectionString;
      using (var connection = new SqlConnection(conStr))
      using (var proc = new SqlCommand("InsertItemTopics", connection))
      {
          connection.Open();
          proc.CommandType = CommandType.StoredProcedure;
          //pass the DataTable down as a table-valued parameter
          var param = proc.Parameters.AddWithValue("@ItemTopicMap", dataTable);
          param.SqlDbType = SqlDbType.Structured;
          param.TypeName = "dbo.ItemTopicMap";
          proc.CommandTimeout = 180;
          proc.ExecuteNonQuery();
      }
  }

And voila, no more parsing of a huge string on the back end!

Note that the table type defined here acts as a kind of read-only temporary table on the back end: a stored procedure can read from it to do any further validation and ultimately insert or upsert the data into the actual destination table.

For completeness, a simplified version of the InsertItemTopics stored procedure is shown below:

CREATE PROCEDURE [dbo].[InsertItemTopics]
( @ItemTopicMap ItemTopicMap READONLY )
AS
BEGIN
    SET NOCOUNT ON

    TRUNCATE TABLE ItemTopicMapDestination

    INSERT INTO ItemTopicMapDestination (ItemIdentifier, TopicId)
    SELECT ItemIdentifier, TopicId FROM @ItemTopicMap
END
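The truncate-and-insert above replaces the destination table wholesale. Where an upsert is needed instead, the same table-valued parameter can feed a MERGE statement (also available from SQL Server 2008 onward). A sketch, reusing the table and column names from the examples above (the procedure name UpsertItemTopics is hypothetical):

```sql
-- Hypothetical upsert variant: insert only the (ItemIdentifier, TopicId)
-- pairs from the table-valued parameter that are not already present
CREATE PROCEDURE [dbo].[UpsertItemTopics]
( @ItemTopicMap ItemTopicMap READONLY )
AS
BEGIN
    SET NOCOUNT ON;

    MERGE ItemTopicMapDestination AS target
    USING @ItemTopicMap AS source
        ON  target.ItemIdentifier = source.ItemIdentifier
        AND target.TopicId = source.TopicId
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (ItemIdentifier, TopicId)
        VALUES (source.ItemIdentifier, source.TopicId);
END
GO
```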

