SSIS: Flat File Source to SQL without Duplicate Rows

I have a fairly large flat file (CSV) that I am trying to import into a SQL Server table using an SSIS package. There is nothing special about it; it's a plain import. The problem is that more than 50% of the lines are duplicates.

E.g. Data:

Item Number | Item Name    | Update Date
ITEM-01     | First Item   | 1-Jan-2013
ITEM-01     | First Item   | 5-Jan-2013
ITEM-24     | Another Item | 12-Mar-2012
ITEM-24     | Another Item | 13-Mar-2012
ITEM-24     | Another Item | 14-Mar-2012

I need to build my master item table from this data. As you can see, the rows are duplicated because of the Update Date. The file is guaranteed to always be sorted by Item Number, so all I need to do is check: <strong>if next item number = previous item number then do NOT import this line</strong>.

I tried the Sort transformation with duplicate removal enabled, but it re-sorts every line, which is pointless because the lines are already sorted. It also takes forever on this many lines.

So is there any other way?

Answer1:

There are a couple of approaches you can take to do this.

1. Aggregate Transformation

Group by Item Number and Item Name, then perform an aggregate operation on Update Date. Based on the logic you described (keep the first row for each item), the <strong>Minimum</strong> operation should work, assuming each item's rows appear in ascending date order as in your sample. To use <strong>Minimum</strong>, you'll first need to convert the Update Date column to a date, because <strong>Minimum</strong> can't be applied to a string column. That conversion can be done in a <strong>Data Conversion Transformation</strong>. Below is what the data flow would look like:

<img src="https://i.stack.imgur.com/QY093.png" alt="enter image description here">
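To make the grouping logic concrete, here is roughly what the Aggregate Transformation computes, written as a T-SQL query over the sample rows from the question. This is only an illustration of the logic, not something SSIS generates. The column names `ItemNumber`, `ItemName`, and `UpdateDate` and the alias `src` are assumptions chosen for the example.

```sql
-- Sample rows from the question, with Update Date already converted to a date.
-- Column names and the "src" alias are illustrative assumptions.
SELECT ItemNumber,
       ItemName,
       MIN(UpdateDate) AS UpdateDate   -- keep the earliest date per item
FROM (VALUES
    ('ITEM-01', 'First Item',   CAST('2013-01-01' AS date)),
    ('ITEM-01', 'First Item',   CAST('2013-01-05' AS date)),
    ('ITEM-24', 'Another Item', CAST('2012-03-12' AS date)),
    ('ITEM-24', 'Another Item', CAST('2012-03-13' AS date)),
    ('ITEM-24', 'Another Item', CAST('2012-03-14' AS date))
) AS src (ItemNumber, ItemName, UpdateDate)
GROUP BY ItemNumber, ItemName
ORDER BY ItemNumber;
```

For the sample data this returns one row per item: ITEM-01 with 2013-01-01 and ITEM-24 with 2012-03-12.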

2. Script Component Transformation

Essentially, you could implement the logic you mentioned above:

<strong>if next item number = previous item number then do NOT import this line</strong>

First, you must configure the Script Component appropriately (the steps below assume that you don't rename the default input and output names):

<ol> <li>Select <strong>Transformation</strong> as the Script Component type</li> <li>

Add the Script Component after the Flat File Source in your Data Flow:

<img src="https://i.stack.imgur.com/jJo2e.png" alt="enter image description here">

</li> <li>Double Click the Script Component to open the <strong>Script Transformation Editor</strong>.</li> <li>

Under <strong>Input Columns</strong>, select all columns:

<img src="https://i.stack.imgur.com/syoeH.png" alt="enter image description here">

</li> <li>

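Under <strong>Inputs and Outputs</strong>, select <strong>Output 0</strong>, and set the <strong>SynchronousInputID</strong> property to <strong>None</strong>. This makes the output asynchronous, so the script decides which rows to emit: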

<img src="https://i.stack.imgur.com/iEaYb.png" alt="enter image description here">

</li> <li>

Now manually add columns to <strong>Output 0</strong> to match the columns in <strong>Input 0</strong> (don't forget to set the data types):

<img src="https://i.stack.imgur.com/QDGhL.png" alt="enter image description here">

</li> <li>Finally, edit the script. There will be a method named Input0_ProcessInputRow. Modify it as shown below and add a private field named previousItemNumber:</li> </ol>
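    // Item number of the previously processed row; empty before the first row.
    private string previousItemNumber = string.Empty;

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // The input is sorted by item number, so a row is a duplicate
        // exactly when its item number matches the previous row's.
        if (!Row.ItemNumber.Equals(previousItemNumber))
        {
            // First row for this item: copy it to the asynchronous output.
            Output0Buffer.AddRow();
            Output0Buffer.ItemName = Row.ItemName;
            Output0Buffer.ItemNumber = Row.ItemNumber;
            Output0Buffer.UpdateDate = Row.UpdateDate;
        }

        // Remember this row's item number for the next comparison.
        previousItemNumber = Row.ItemNumber;
    }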

    

Answer2:

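If performance matters most to you, I'd suggest loading the entire text file into a staging table on SQL Server and deduplicating there. Note that a plain SELECT DISTINCT * will not collapse rows that differ only in Update Date; group by Item Number and Item Name and take MIN(Update Date), or select only the distinct Item Number and Item Name columns.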
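A minimal sketch of the staging approach, assuming the file has already been bulk-loaded into a staging table. The table names `dbo.ItemStaging` and `dbo.MasterItem` and the column names are hypothetical, and `UpdateDate` is assumed to be stored as a `date` column in staging (if it is loaded as text, it would need converting first).

```sql
-- Hypothetical tables: dbo.ItemStaging holds the raw file rows,
-- dbo.MasterItem is the deduplicated target.
WITH Ranked AS (
    SELECT ItemNumber,
           ItemName,
           UpdateDate,
           -- Number each item's rows from earliest to latest update date.
           ROW_NUMBER() OVER (PARTITION BY ItemNumber
                              ORDER BY UpdateDate) AS rn
    FROM dbo.ItemStaging
)
INSERT INTO dbo.MasterItem (ItemNumber, ItemName, UpdateDate)
SELECT ItemNumber, ItemName, UpdateDate
FROM Ranked
WHERE rn = 1;   -- keep only the earliest row per item
```

Using `ROW_NUMBER()` rather than `GROUP BY` keeps whole rows intact, which may be easier to extend if the real file has more columns than the three shown in the question.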
