What is Apache Impala?

Impala is an MPP (Massively Parallel Processing) SQL query engine for processing huge volumes of data stored in a Hadoop cluster. It is open source software written in C++ and Java. It provides high performance and low latency compared to other SQL engines for Hadoop.

Impala runs on a number of systems in the Hadoop cluster. Unlike traditional storage systems, Impala is decoupled from its storage engine. It has three main components: the Impala daemon (impalad), the Impala Statestore, and the Impala metadata or metastore.

Apache Impala uses an architecture with the following components:

The Impala daemon (also known as impalad) runs on each node where Impala is installed. It accepts queries from various interfaces, such as the Impala shell and the Hue browser, and processes them.

The Impala Statestore is responsible for checking the health of each Impala daemon and relaying that health information to the other daemons at frequent intervals. The Statestore can run on the same node as an Impala daemon or on any other node in the cluster.
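The check-and-relay behavior described above can be pictured with a small toy model. This is only a sketch for illustration, not Impala's actual implementation: the `Daemon` and `Statestore` classes, the `check_and_broadcast` method, and the node names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Daemon:
    """Toy stand-in for an Impala daemon (hypothetical, for illustration only)."""
    name: str
    alive: bool = True
    # This daemon's view of healthy peers, as last relayed by the statestore.
    known_healthy: set = field(default_factory=set)


class Statestore:
    """Hypothetical health tracker that checks daemons and relays the results."""

    def __init__(self, daemons):
        self.daemons = list(daemons)

    def check_and_broadcast(self):
        # Step 1: check the health of every registered daemon.
        healthy = {d.name for d in self.daemons if d.alive}
        # Step 2: relay the full health picture to each daemon that is still up.
        for d in self.daemons:
            if d.alive:
                d.known_healthy = set(healthy)
        return healthy


daemons = [Daemon("node1"), Daemon("node2"), Daemon("node3")]
store = Statestore(daemons)
store.check_and_broadcast()

daemons[1].alive = False  # simulate node2 going down
healthy_now = store.check_and_broadcast()
```

After the second round, the surviving daemons no longer list `node2` as healthy, so in this model they would stop handing it work.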

Impala Catalog Service is another Impala component that propagates metadata changes from Impala SQL commands to all Impala daemons in the cluster.

Important details such as table definitions and column information are stored in a centralized database known as a metastore. The metastore uses a traditional MySQL or PostgreSQL database to store these table definitions.

Each Impala node caches all of the metadata locally. When dealing with an extremely large amount of data or many partitions, fetching table-specific metadata can take a significant amount of time, so a locally stored metadata cache makes that information available almost instantly.
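The benefit of a per-node metadata cache can be sketched with a toy model. Everything here is an assumption made for illustration: the `Metastore` and `NodeCache` classes, the `invalidate` method, and the `sales` table are hypothetical and do not reflect Impala's internal code.

```python
class Metastore:
    """Toy central metadata store that counts how often it is queried."""

    def __init__(self, tables):
        self.tables = dict(tables)
        self.lookups = 0

    def fetch(self, table):
        # Each call stands for a slow round trip to the central database.
        self.lookups += 1
        return self.tables[table]


class NodeCache:
    """Hypothetical per-node cache placed in front of the metastore."""

    def __init__(self, metastore):
        self.metastore = metastore
        self._cache = {}

    def get(self, table):
        # Only go to the metastore when the table is not cached locally.
        if table not in self._cache:
            self._cache[table] = self.metastore.fetch(table)
        return self._cache[table]

    def invalidate(self, table):
        # Drop the local copy so the next read picks up a metadata change.
        self._cache.pop(table, None)


metastore = Metastore({"sales": ["id", "amount"]})
node = NodeCache(metastore)

node.get("sales")
node.get("sales")  # served from the local cache, no second round trip
lookups_after_two_reads = metastore.lookups

metastore.tables["sales"] = ["id", "amount", "region"]  # metadata changes
node.invalidate("sales")
refreshed = node.get("sales")
```

The second read never touches the central store, which is the saving the cache provides. The `invalidate` step stands in for the propagation of metadata changes to each node, so that cached definitions do not go stale.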

Why Monitor Apache Impala?

Impala's strength lies in its ability to process data-heavy workloads in Hadoop clusters. This SQL engine is therefore commonplace in large, mission-critical environments that need to process large volumes of data rapidly. In such environments, even a slight delay in query processing can degrade the performance of dependent applications and adversely impact the user experience with those applications. To ensure peak query performance and a consistently good user experience, the Apache Impala server should be continuously monitored, and deficiencies in query performance should be promptly detected and resolved. For this purpose, eG Enterprise offers a specialized Apache Impala server monitoring model.

By closely monitoring the target Apache Impala server, administrators can be proactively alerted to issues in the overall performance and critical operations of this query engine, identify serious problems, and resolve them before any query processing job is affected.