90% Reduction in System Downtime With Observability

Aokumo helped a FinTech company reduce incidents by 70% and system downtime by 90% while improving recovery time.

Project Brief

The Client

The client is a leading FinTech company based in Sydney that provides capital markets services and facilitates mission-critical transactions.

The Problem

The client’s monitoring and logging systems could not give them a holistic view of their large IT infrastructure, which led to frequent downtime, delayed maintenance, and SLA implications.

The Result

Aokumo implemented a cloud-native monitoring, logging, and alerting system to help the client resolve issues faster and improve their digital operations.

Industry

Financial Services

Featured Services

Technology Stack

ELK, AWS Elasticsearch, Prometheus, PagerDuty

Solutions

Implemented Prometheus for comprehensive infrastructure monitoring.

Implemented ELK stack using AWS Managed Elasticsearch for log management & analytics.

Integrated PagerDuty, a cloud-based incident alerting service, to ensure proactive real-time alerting and improved response time for mission-critical events.

Enabled real-time visualization of system state to identify issues and remediate them proactively, before they affect the business.

Impacts

8X faster incident response time

70% reduction in incidents

90% reduction in system downtime

70% faster recovery

The Need

The client processes thousands of financial transactions daily and needs to complete them without delays and in line with its SLA. However, its legacy monitoring, logging, and alerting systems led to frequent incidents, downtime, and business loss.

The client wanted to replace its legacy monitoring and logging system with cloud-native technologies to ensure system availability and business continuity. It also needed real-time alerting to take proactive action and recover faster from incidents.

Aokumo implemented modern monitoring and logging technologies and helped the client improve its SLA, system stability, and resiliency.

The Challenges

The existing monitoring and logging tools could not capture all relevant data points, making it hard to identify problems and establish the system's current state.

Monitoring was handled by a third-party vendor, which was costly.

Recovery took longer because the legacy tools could not provide complete visibility.

The lack of a holistic view of systems and infrastructure, coupled with unreliable and delayed alerting, impacted the business SLA.

The Solutions

We implemented Prometheus to capture multi-dimensional metrics, with real-time visualizations for effective monitoring.

We implemented the ELK stack using AWS-managed Elasticsearch for interactive log analytics with real-time monitoring.

We integrated Jaeger for tracing and monitoring transactions, and Kibana for visualizing Elasticsearch data.

We enabled real-time alerting for errors and exceptions with configurable escalation flow across the system.
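A configurable escalation flow, as mentioned above, can be pictured with a small sketch. This is an illustration only, not the client's actual configuration and not PagerDuty's API: the `EscalationStep` type, the `targets_to_notify` function, the role names, and the timings are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EscalationStep:
    """One level of a hypothetical escalation policy."""
    notify: str         # who is paged at this level (illustrative name)
    after_minutes: int  # minutes an alert may stay unacknowledged before this level is paged


# Illustrative policy only; the levels and timings are assumptions.
POLICY: List[EscalationStep] = [
    EscalationStep(notify="on-call-engineer", after_minutes=0),
    EscalationStep(notify="team-lead", after_minutes=15),
    EscalationStep(notify="engineering-manager", after_minutes=30),
]


def targets_to_notify(policy: List[EscalationStep],
                      minutes_unacknowledged: int,
                      acknowledged: bool = False) -> List[str]:
    """Return everyone who should have been paged so far for one alert."""
    if acknowledged:
        # Acknowledging the alert stops the escalation.
        return []
    return [step.notify for step in policy
            if minutes_unacknowledged >= step.after_minutes]


if __name__ == "__main__":
    for minutes in (0, 20, 45):
        print(minutes, targets_to_notify(POLICY, minutes))
```

Keeping the policy as plain data means the levels and delays can be changed without touching the logic, which is the sense in which such a flow is "configurable".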

Tools & Technologies

Aokumo leveraged several Amazon services and open-source tools

Amazon Elasticsearch Service

- A fully managed service that makes it easy to deploy, operate, and scale Elasticsearch clusters with zero downtime.

Amazon CloudWatch

- An AWS service designed to help users monitor the performance and health of their AWS resources and applications.

Amazon S3

- A highly scalable, fast, and durable object storage service for any type of data, accessible from anywhere over the Internet through the AWS Management Console and the S3 API.

ELK Stack

- A package of open source technologies for collecting, searching, analyzing, and visualizing large data volumes generated by diverse data sources.

Prometheus

- An open-source monitoring and alerting solution for microservices and containers that provides flexible queries and real-time notifications.

Grafana

- An open-source dashboard and visualization tool that connects to many data sources, queries their data, and displays it on customizable charts for easy analysis.

Jaeger

- Open-source software for tracing transactions between distributed services, used to monitor and troubleshoot complex microservices environments.
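To make the "multi-dimensional data" that Prometheus captures more concrete: each metric is identified by a name plus a set of key-value labels, and every label combination is its own time series. The sketch below imitates that idea in plain Python and prints samples in the Prometheus text exposition style. It is not the Prometheus client library; the `LabeledCounter` class, the `transactions_total` metric, and the label names are assumptions made for illustration.

```python
from collections import Counter


class LabeledCounter:
    """Toy counter keyed by label sets, mimicking Prometheus-style dimensions."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._values: Counter = Counter()

    def inc(self, amount: float = 1, **labels: str) -> None:
        # Sort the labels so the same label set always maps to the same series.
        key = tuple(sorted(labels.items()))
        self._values[key] += amount

    def total(self, **match: str) -> float:
        """Sum all series whose labels include every key/value in `match`."""
        return sum(
            value for key, value in self._values.items()
            if all(dict(key).get(k) == v for k, v in match.items())
        )

    def render(self) -> str:
        """Render one sample line per series, e.g. name{a="x",b="y"} 3."""
        lines = []
        for key, value in sorted(self._values.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in key)
            lines.append(f"{self.name}{{{label_text}}} {value}")
        return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical metric and labels, chosen only for this example.
    counter = LabeledCounter("transactions_total")
    counter.inc(service="payments", status="ok")
    counter.inc(service="payments", status="ok")
    counter.inc(service="payments", status="error")
    counter.inc(service="settlement", status="ok")
    print(counter.render())
    print("errors in payments:", counter.total(service="payments", status="error"))
```

Because each label is a separate dimension, the same data can be sliced by service, by status, or by both, without defining a new metric for each view.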

The Impacts

8X faster incident response time

Log analytics, maximum coverage, and real-time alerting significantly reduced the incident response time. 

70% reduction in incidents

Proactive monitoring and remediation reduced unplanned events and incidents by more than 70%.

90% reduction in system downtime

Real-time, proactive alerting significantly reduced the risk of downtime.

70% faster recovery

Automation and comprehensive incident reports reduced debugging and bug-fixing time.