Integration of a Hadoop system into the existing classical Business Intelligence (BI) landscape
Brief description
In the context of the project to modernize the business intelligence / data warehouse landscape, a concept for data management, storage, and evaluation based on a Hadoop system is required. Among other things, the concept must address how a Hadoop system can best be integrated into the customer's existing BI landscape.
Supplement
The existing BI infrastructure, consisting of a Teradata Enterprise Data Warehouse with ETL (Extract, Transform, Load) pipelines implemented in Oracle Data Integrator, is to be extended by a Hadoop system. This requires concepts for data management, storage, and evaluation that differ from the classical approaches. Specifically, the following must be evaluated and selected: suitable Oracle Data Integrator knowledge modules for data management; a suitable data format (Avro, CSV, JSON, ORC, Parquet) and compression codec (Snappy, zlib) for data storage; and a suitable SQL engine (Hive on Tez, Presto, Spark) for data evaluation.
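To illustrate the storage and evaluation dimensions, the following minimal PySpark sketch writes sample data as Snappy-compressed Parquet and queries it through the Spark SQL engine. It assumes a running Spark session with access to HDFS; the paths and the "sales" data set are hypothetical, and a real evaluation would run one such job per format/codec/engine combination and compare file sizes and query runtimes.

    from pyspark.sql import SparkSession

    # Hypothetical session; in the target landscape this would run on the
    # Hadoop cluster, e.g. via spark-submit on YARN.
    spark = (SparkSession.builder
             .appName("format-evaluation-sketch")
             .getOrCreate())

    # Ingest raw staging data (hypothetical path and schema).
    df = spark.read.csv("hdfs:///staging/sales.csv",
                        header=True, inferSchema=True)

    # Data storage: columnar format (Parquet) with Snappy compression.
    (df.write
       .option("compression", "snappy")
       .mode("overwrite")
       .parquet("hdfs:///warehouse/sales_parquet"))

    # Data evaluation: expose the files as a view and query them with Spark SQL.
    spark.read.parquet("hdfs:///warehouse/sales_parquet") \
         .createOrReplaceTempView("sales")
    spark.sql("SELECT region, SUM(amount) AS revenue "
              "FROM sales GROUP BY region").show()

The same view could equally be queried from Hive on Tez or Presto, which is precisely the trade-off the concept is meant to examine.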
Subject description
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures. (see http://hadoop.apache.org/)
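To make the "simple programming models" concrete, the following word-count sketch uses Hadoop Streaming, which lets plain Python scripts act as mapper and reducer; the input path and data are hypothetical. Hadoop schedules the mapper on the nodes holding the input blocks and transparently re-runs failed tasks, which is the application-level fault handling described above.

    #!/usr/bin/env python3
    # mapper.py -- emit (word, 1) for every word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- input arrives sorted by key, so counts for the same
    # word are adjacent and can be summed in a single pass.
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, _, value = line.rstrip("\n").partition("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

A job of this kind would be submitted roughly as follows (jar location and HDFS paths are assumptions that depend on the installation):

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -files mapper.py,reducer.py \
        -input /data/books -output /data/wordcount \
        -mapper mapper.py -reducer reducer.py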