Site Reliability Engineering at Veson Nautical

The Veson IMOS Platform (VIP) is a multi-tenant SaaS application that serves the needs of thousands of maritime industry participants around the world. Every day, our clients depend upon our software to plan voyages, optimize operations, and move cargoes to their destinations.

As the platform that propels maritime commerce, it is imperative that VIP is highly available, highly reliable and highly secure at all times. Issues with mission-critical systems can easily have real-world consequences, and at Veson our engineers understand the great responsibility that we carry for our clients and for the broader industry.

The Veson Site Reliability Engineering (SRE) team manages the availability, resiliency, and security of our production systems. However, SRE is more than just a team name; SRE is a discipline that epitomizes the desire to build reliable systems, coupled with the right attitudes and skillsets.

In this blog post, I will share some of the key factors that have allowed our SRE team to be successful at what they do. These concepts are by no means unique to Veson; however, they have certainly contributed to our team’s high-performance culture and ability to deliver.

All infrastructure will fail. Plan accordingly

Over a long enough time horizon, all infrastructure will inevitably fail. Accepting this fact allows our SRE team to put appropriate thought and planning into the construction of redundant systems. We leverage multiple availability zones (AZs), load balancers, and backup systems to limit the impact of a failure before it occurs.

Monitor all the things

Running on top of AWS yields enormous advantages for our SRE team. One of the greatest is the abundance of AWS CloudWatch metrics, which help us to understand how the infrastructure is performing in real time. Aside from standard metrics, we also make extensive use of custom CloudWatch metrics and Lambda functions to govern various subsystems.
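
To make the idea of a custom metric concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions rather than a description of our production code: the namespace, metric name and dimension are hypothetical, and the client is assumed to be a boto3 CloudWatch client (for example, boto3.client("cloudwatch")), whose put_metric_data call accepts a namespace and a list of data points.

```python
from datetime import datetime, timezone

# Hypothetical namespace, chosen for illustration only.
NAMESPACE = "Example/Subsystem"


def build_metric_datum(name, value, unit="Count", dimensions=None):
    """Build one entry for the MetricData list of a PutMetricData call."""
    return {
        "MetricName": name,
        # Dimensions are name/value pairs that identify what was measured.
        "Dimensions": [
            {"Name": key, "Value": val}
            for key, val in sorted((dimensions or {}).items())
        ],
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


def publish_metric(client, name, value, unit="Count", dimensions=None):
    """Send a single custom metric through the supplied CloudWatch client."""
    datum = build_metric_datum(name, value, unit, dimensions)
    client.put_metric_data(Namespace=NAMESPACE, MetricData=[datum])
    return datum
```

Passing the client in as an argument keeps the function easy to exercise with a stand-in object, so the metric payload can be checked without touching a real AWS account.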

Our SRE team members are adept at writing code. As a result, we make heavy use of serverless Lambda functions integrated with Slack, Splunk and Atlassian OpsGenie to augment our monitoring capabilities.
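
As a rough sketch of this pattern, the Python function below forwards a CloudWatch alarm to a Slack channel. It assumes the alarm reaches Lambda through an SNS topic and that Slack is reached through an incoming webhook; the environment variable name SLACK_WEBHOOK_URL and the message layout are hypothetical choices made for the example, not a description of our actual integrations.

```python
import json
import os
import urllib.request


def format_alarm(event):
    """Turn an SNS-delivered CloudWatch alarm event into Slack message text."""
    lines = []
    for record in event.get("Records", []):
        # SNS delivers the alarm details as a JSON string in the Message field.
        alarm = json.loads(record["Sns"]["Message"])
        lines.append(
            f"{alarm['NewStateValue']}: {alarm['AlarmName']} "
            f"({alarm['NewStateReason']})"
        )
    return "\n".join(lines)


def post_to_slack(text):
    """POST the text to a Slack incoming webhook read from the environment."""
    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],  # hypothetical variable name
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


def handler(event, context, send=post_to_slack):
    """Lambda entry point: forward any alarm notifications to Slack."""
    text = format_alarm(event)
    if text:
        send(text)
    return {"forwarded": bool(text)}
```

The send parameter defaults to the real webhook call, so Lambda can invoke handler(event, context) as usual, while a test can substitute a simple recorder and verify the message without any network access.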

Runbooks for everything

A runbook describes the standard operating procedure that should be followed for a particular subsystem in a particular scenario. Our SRE team has assembled a comprehensive library of runbooks, not just to train new hires, but also to serve as the agreed-upon protocol to follow when the need arises. Runbooks also help us to eliminate knowledge silos and to build bench depth within the team.

Infrastructure is code

The way in which companies provision infrastructure has changed dramatically over the last decade. Gone are the days of spinning up EC2 instances via the console and hand-crafting a software installation. Veson’s SRE team has invested heavily in infrastructure as code (IaC) via Terraform, which affords extremely precise change management capabilities. Managing changes via IaC also ties in perfectly with Veson’s ongoing attestation for SOC 2 Type 2 security and availability trust principles.

Autonomy, alignment and continuous improvement

The Veson SRE team understands the mission. Team members operate with a high degree of autonomy, often inventing new ways to monitor things or refactoring existing processes to make them more robust. The team is committed to continuous improvement every day. The cumulative effect is that the team’s work is proactive, not reactive, and our key metrics continue to trend positively.

At Veson Nautical, the principles of Site Reliability Engineering are central to our ongoing success. This is how we operate complex systems at scale, yielding the levels of reliability, availability and security that our clients deserve.