Traditional application performance monitoring (APM) models have relied upon the intelligence of the administrator.
Awareness of the underlying deployment followed a mental model of causality, with a limited number of parameters affecting the outcome. The administrator knew that a request for action generated by a system or a person would correspond to an associated transaction, often within a known multi-tiered model.
When an alert occurred, such as a JVM memory warning or a decline in queries per second (QPS) on a high-throughput system, the action path would start with the admin, pass through the support teams, and end at the developers. The larger the system, the more instrumentation was required to monitor its numerous parameters. The resolution could be simple or complex, depending on the actual event.
The role of such administration has changed with the arrival of distributed systems, particularly Hadoop-based Big Data deployments.
The following are the key elements of a data lake:
- A data lake is a shared multi-tenant resource for the entire organization
- The same data lake compute infrastructure powers various use cases such as Business Intelligence, Data Science, ETLs, and Data Warehousing
- Data lakes are spread over several networked machines, and those servers are not always located together
- The more distributed the computation, the more expensive the integration, because of data movement (referred to as shuffle)
- Uniformity of hardware is a great step but is never a guarantee
The new sources of data, therefore, are varied in nature. Examples include JMX logs from compute nodes, service logs of access engines such as Spark and Hive, underlying system logs generated by Data and Compute nodes, and logs generated at end-user systems.
The distributed nature of a job brings together the power and the complexity of multiple JVMs. That complexity could mean the failure of one of the JVMs, the failure of a disk on a particular Data Node that holds data for several critical queries, or a stressed network.
The root cause of a problem can lie hidden in plain sight, or it can surface only at the last leaf node of the execution of a specific job. None of these problems can be solved easily by plotting more metrics on a Grafana dashboard unless an administrator has complete awareness of every component of the system.
Unfortunately, the explosion of options in every area of work (storage, access, and ingestion technologies) brings up new, unseen challenges every day.
We, at Acceldata, believe that we need to cut through such complexity of Big Data and derive actionable insights to allow uninterrupted operations. Acceldata’s approach is unique and has three basic principles:
- Source and stream signals from all layers, filter noise and retain data for historical trending and analysis
- Identify insights and patterns over operational data by applying heuristics and machine learning algorithms
- Enable administrators visually by displaying extensible insights instead of logs
Given the criticality of the business processes that run on Hadoop-based systems today, it is essential that performance monitoring and operational tools enable the Big Data team.
They can deliver better results when the operational metrics are treated not as data points, but as converted insights which take into account the current and past performance of the systems they are responsible for.
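As a rough illustration of the difference between a data point and an insight, the sketch below compares a current reading against its own history and returns a short statement instead of a bare number. It is not Acceldata's method: the function name `to_insight`, the z-score heuristic, and the threshold of 3 standard deviations are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def to_insight(metric, history, current, threshold=3.0):
    """Describe a reading relative to past values (hypothetical helper).

    Uses a simple z-score heuristic; the threshold is an assumed default.
    """
    if len(history) < 2:
        # A spread cannot be estimated from fewer than two past values.
        return f"{metric}: not enough history to judge {current}"

    baseline = mean(history)
    spread = stdev(history)

    if spread == 0:
        # Perfectly flat history: any change at all counts as a deviation.
        score = 0.0 if current == baseline else float("inf")
    else:
        score = abs(current - baseline) / spread

    if score >= threshold:
        direction = "above" if current > baseline else "below"
        return f"{metric}: {current} is well {direction} the usual {baseline:.1f}"
    return f"{metric}: {current} is within the usual range"

# Example: QPS samples from earlier intervals, then a sudden drop.
qps_history = [1000, 1010, 990, 1005, 995]
print(to_insight("qps", qps_history, 600))
print(to_insight("qps", qps_history, 1002))
```

A real system would use longer windows, seasonality, and correlation across layers; the point here is only that the output is stated relative to past behavior rather than as an isolated value.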
APM for distributed systems has to be native.
We’re rethinking it.