POSTED : January 13, 2017
BY : Jim Egan
Categories: Automation & Operations,Business Optimization
Sometimes ETL repositories are like cockroaches. They’re all over the place, nobody knows how they got here and nobody knows how to get rid of them.
An overabundance of repositories poses a few problems:
It’s the last item in the list that is the most important. So why do you need multiple repositories? If your ETL code is for completely different topics you might think you need different repositories. But is that true? I don’t think so. A job server could care less what sources and targets your jobs are using. If I have hundreds of jobs does using multiple repositories make managing those jobs easier? No, in fact it makes it far more difficult.
Let’s say your task today is to improve performance for the ETL. If the jobs are broken out into ten different production repositories you now have to gather runtime statistics from ten different locations to determine the five worst running Dataflows. Sure, I can write a query unioning the data in AL_STATISTICS from all ten repositories. But it’s a lot easier to gather that information from a single repository.
One justification given for using multiple repositories was that an object needed to be coded differently for various situations. If that’s the case then it probably doesn’t make sense to try and use the same name for the object or you need to figure out a way to code the object so that it handles these situations at run time. One example of this is business logic that differs for U.S. and Canadian customers. There are ways to handle this:
Circling back to the space issue, I’ve seen repositories that take up far more database and file space than you would expect. Jobs that abnormally fail don’t get cleaned up. The job’s history remains forever unless someone manually cleans it up. At one client site we eliminated over 20 million rows in AL_STATISTICS and 2000 log files of jobs that were still in the history from three years ago. Just because your job history is set to 30 days doesn’t mean it cleans up everything. I would rather have to do this maintenance on a single repository than on ten repositories.
In my next post I’ll address how to consolidate multiple repositories into one.
Tags: Cloud Engineering, Development, ETL Repositories