I'm looking for some information on how to increase the performance of reading large text files for use in an ETL process. I'm not really looking for a detailed "solution", just a pointer in the right direction so I can find the information I need.
My problem is that while I'm an OK coder, I'm self-taught, so I'm not well versed in the terminology of things I haven't worked on yet, and that puts me at a big disadvantage when searching for options.
We're currently using a proprietary program, essentially a scripting language that gets converted into C++ and compiled into an executable. Its only redeeming factor is its ability to cycle through a 19 GB file and populate 250+ fields very quickly. Where it bogs down is the transformations, which, due to their scripted nature, tend to be inefficient, cumbersome, and difficult to maintain.
I'm able to parse the text file using .NET (VB or C#), but I can't come close to the efficiency of the C++ executable for reading the file: literally hours versus minutes.
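For reference, this is a minimal sketch of roughly what my .NET read loop looks like right now (the path, delimiter, and buffer size are just placeholders for illustration, not the real extract layout):

    // Simplified version of my current reading approach in C#.
    using System;
    using System.IO;

    class ExtractReader
    {
        static void Main()
        {
            const string path = @"C:\data\extract.txt";  // placeholder path
            const char delimiter = '|';                   // assumed delimiter
            long rowCount = 0;

            // 1 MB buffer instead of the small default, to reduce I/O calls.
            using var stream = new FileStream(
                path, FileMode.Open, FileAccess.Read, FileShare.Read,
                bufferSize: 1 << 20, FileOptions.SequentialScan);
            using var reader = new StreamReader(stream);

            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Splitting each line into 250+ strings is where I suspect
                // most of the time (and garbage-collection pressure) goes.
                string[] fields = line.Split(delimiter);
                rowCount++;
            }

            Console.WriteLine($"Read {rowCount:N0} rows.");
        }
    }

Even with a larger buffer and sequential-scan hints it's nowhere near the C++ speed, so I assume I'm missing something more fundamental about how the file should be read or the fields parsed.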
I also considered using the data warehouse, which loads the same extract I'm reading, but for some reason the data warehouse team created one large table containing millions of rows and all 250+ fields. Since I need about 150 of those fields for the transformations, the queries are excessively slow.
Thanks in advance,
Frank