90% Reduction in System Downtime With Observability

Aokumo helped a FinTech company reduce incidents by 70% and system downtime by 90% while improving recovery time.

SUMMARY

The Client

The client is a leading FinTech company based in Sydney that provides capital markets services and facilitates mission-critical transactions.

The Challenge

The client’s monitoring and logging systems could not give them a holistic view of their massive IT infrastructure, which led to frequent downtime, delayed maintenance, and SLA implications.

The Impact

Aokumo implemented a cloud-native monitoring, logging, and alerting system to help the client resolve issues faster and improve their digital operations.

Industry: Financial Services


8X faster incident response time

70% reduction in incidents

90% reduction in system downtime

70% faster recovery

Use case

The Summary

The client processes thousands of financial transactions daily and needs to complete them without delay and within their SLA. However, because of legacy monitoring, logging, and alerting systems, they faced frequent incidents, downtime, and business loss.

They wanted to replace their legacy monitoring and logging system with cloud-native technologies to ensure system availability and business continuity. They also needed real-time alerting so they could act proactively and recover faster from incidents.

Aokumo implemented modern monitoring and logging technologies and helped the client improve its SLA, system stability, and resiliency.

Before

The existing monitoring and logging tools could not capture all relevant data points, making it hard to identify problems and establish the system's current state.

Monitoring was handled by a third-party vendor, which was costly.

Recovery took longer because the legacy tools could not provide complete visibility.

The lack of a holistic view of the system and infrastructure, coupled with unreliable and delayed alerting, affected the business SLA.

After

We implemented Prometheus to capture multi-dimensional data, with real-time visualizations for effective monitoring.

We implemented the ELK stack using AWS-managed Elasticsearch for interactive log analytics with real-time monitoring.

We integrated Jaeger to trace and monitor transactions, and Kibana to visualize Elasticsearch data.

We enabled real-time alerting for errors and exceptions, with a configurable escalation flow across the system.
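A configurable escalation flow like the one described above can be pictured as an ordered list of contacts, each with a delay after which an unacknowledged alert reaches them. The sketch below is a minimal Python illustration under that assumption. The `EscalationStep` class, the `POLICY` values, and the contact names are all hypothetical and are not taken from the client's setup.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EscalationStep:
    """One level of a hypothetical escalation chain."""
    contact: str        # who is notified at this level
    after_minutes: int  # minutes an alert may stay unacknowledged before this level fires


# Hypothetical policy: the values are illustrative only.
POLICY: List[EscalationStep] = [
    EscalationStep("on-call-engineer", 0),
    EscalationStep("team-lead", 15),
    EscalationStep("engineering-manager", 30),
]


def contacts_to_notify(policy: List[EscalationStep], minutes_unacknowledged: int) -> List[str]:
    """Return every contact whose threshold has been reached, in escalation order."""
    ordered = sorted(policy, key=lambda step: step.after_minutes)
    return [step.contact for step in ordered if minutes_unacknowledged >= step.after_minutes]


if __name__ == "__main__":
    for minutes in (0, 20, 45):
        print(minutes, contacts_to_notify(POLICY, minutes))
```

Keeping the policy as plain data means the thresholds and contacts can be changed without touching the alerting logic, which is one way to read "configurable" here.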

The Outcome

8X faster incident response time

Log analytics, maximum coverage, and real-time alerting significantly reduced the incident response time.

70% reduction in incidents

Proactive monitoring and remediation reduced unplanned events and incidents by more than 70%.

90% reduction in system downtime

Real-time, proactive alerting significantly reduced the risk of downtime.

70% faster recovery

Automation and comprehensive incident reports reduced debugging and bug-fixing time.

Tools & Technologies

Aokumo used several Amazon services alongside open-source tools

Amazon Elasticsearch Service

- A fully managed service that makes it easy to deploy, operate, and scale Elasticsearch with zero downtime.

Amazon CloudWatch

- An AWS service designed to help users monitor the performance and health of their AWS resources and applications.

Amazon S3

- A highly scalable, fast, and durable object storage service for any type of data, accessible from anywhere over the Internet through the AWS Management Console and the S3 API.

ELK Stack

- A set of open-source technologies for collecting, searching, analyzing, and visualizing large volumes of data generated by diverse sources.

Prometheus

- An open-source monitoring and alerting solution for microservices and containers that provides flexible queries and real-time notifications.

Grafana

- An open-source dashboard and visualization tool that lets users connect to many data sources, query the data, and display it in customizable charts for easy analysis.

Jaeger

- Open-source software for tracing transactions between distributed services, used to monitor and troubleshoot complex microservices environments.
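As an illustration of the "multi-dimensional data" that Prometheus captures: each sample is identified by a metric name plus a set of key-value labels, so one metric can be sliced by service, status, and so on. The sketch below mimics that model using only the Python standard library and prints samples in the Prometheus text format. It is not the official client library, and the metric name `transactions_total` and its labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Tuple

# A label set is stored as a sorted tuple of (name, value) pairs so it can be a dict key.
LabelSet = Tuple[Tuple[str, str], ...]


class LabeledCounter:
    """Toy counter keyed by label sets (illustrative, not the Prometheus client API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._values: Dict[LabelSet, float] = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        """Increase the sample identified by the given labels."""
        key = tuple(sorted(labels.items()))
        self._values[key] += amount

    def render(self) -> str:
        """Render all samples as text, one line per distinct label set."""
        lines = [f"# TYPE {self.name} counter"]
        for key in sorted(self._values):
            label_text = ",".join(f'{k}="{v}"' for k, v in key)
            lines.append(f"{self.name}{{{label_text}}} {self._values[key]:g}")
        return "\n".join(lines)


if __name__ == "__main__":
    counter = LabeledCounter("transactions_total")  # hypothetical metric name
    counter.inc(service="payments", status="ok")
    counter.inc(service="payments", status="ok")
    counter.inc(service="payments", status="error")
    print(counter.render())
```

Because every label combination is its own time series, a query can aggregate across one dimension (for example, all statuses for one service) without changing how the data is recorded.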