Introduction

Server SLA Policy

The Server SLA Policy is intended to describe the procedures and guidelines by which the Infrastructure Systems Team maintains the server infrastructure necessary to sustain SPU's Enterprise Applications & Services.

Architecture Overview

The Core Services Team manages over 270 servers; with the exception of hardware necessary to run Safety & Security's "VidNet" service, all of these servers are virtualized. We use VMware ESXi as our preferred hypervisor. We use a "template" process in VMware to standardize our server builds and to automate the creation of newly requested servers. We currently offer Windows and Linux server builds.

Security Patches

All SPU server builds are configured to install OS security patches automatically. Linux machines check for and install patches nightly; Windows machines check for and install patches nightly when possible, but no less frequently than weekly (during our Wednesday morning downtime).
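
As an illustration of the cadence described above, the following is a minimal sketch of a compliance check, not the team's actual tooling. The function name, the `MAX_PATCH_AGE` table, and the six-hour grace period are all hypothetical; only the nightly (Linux) and weekly-at-worst (Windows) intervals come from the policy text.

```python
from datetime import datetime, timedelta

# Assumption: a small grace period so a run that starts slightly late is not flagged.
GRACE = timedelta(hours=6)

# Longest allowed gap between OS patch runs, taken from the policy text:
# Linux patches nightly; Windows patches at least weekly.
MAX_PATCH_AGE = {
    "linux": timedelta(days=1),
    "windows": timedelta(days=7),
}


def is_patch_compliant(os_family: str, last_patched: datetime, now: datetime) -> bool:
    """Return True if the last patch run falls within the allowed window."""
    limit = MAX_PATCH_AGE[os_family.lower()] + GRACE
    return now - last_patched <= limit
```

A check like this could be run against each server's last recorded patch time to flag machines that have fallen behind the stated schedule.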

Application patches are kept up to date as the respective vendors notify us of their availability.

Backups

All SPU servers run daily backups; please see our Backup and Recovery Policy (/wiki/spaces/ARC/pages/36331288) for more detail.

Monitoring

We use PRTG to monitor over 2,000 data points across our ~270 servers. These metrics include criteria such as:

Network availability (Ping)
Disk space / usage trends
CPU Load
Memory Usage
Website Availability
Custom SQL Queries

We use this data to establish baselines for what is deemed "normal" behavior. Alerting is configured so that when a metric reports data outside the norm, the branch of CIS responsible for the affected server or service is notified for further investigation and remediation.
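
The baseline-and-alert idea can be sketched roughly as follows. This is an illustrative assumption about how "outside the norm" might be defined (a reading more than a few standard deviations from the historical mean); it does not represent PRTG's API or the thresholds actually configured. The function names and the `tolerance` parameter are hypothetical.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize historical readings as (average, standard deviation)."""
    return mean(samples), stdev(samples)


def is_outside_norm(value, baseline, tolerance=3.0):
    """Return True when a reading deviates from the average by more than
    `tolerance` standard deviations (an assumed definition of 'abnormal')."""
    average, spread = baseline
    return abs(value - average) > tolerance * spread


# Example: CPU load percentages collected during normal operation.
cpu_baseline = build_baseline([40, 42, 38, 41, 39])
needs_alert = is_outside_norm(95, cpu_baseline)
```

In practice each metric type (disk usage, memory, ping latency) would likely need its own history and tolerance, since what counts as normal differs between them.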

Log Files

Server log files are aggregated and copied off the servers directly to a centralized platform. We currently do not have any active monitoring or alerting on this data; the intent is to maintain a repository we can consult when a situation has developed and we need historical data to determine what happened. We aim for a few months of retention, but how far back we can look is ultimately dictated by the volume of incoming data: the oldest data is purged from the system to make way for new data.
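
The capacity-driven retention described above can be sketched as a store that evicts its oldest entries once a size limit is exceeded. This is a simplified model for illustration only; the `LogStore` class, its byte-based capacity, and its method names are hypothetical and do not describe the centralized platform actually in use.

```python
from collections import deque


class LogStore:
    """Keeps log entries up to a fixed capacity, purging the oldest first."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = deque()  # oldest entry sits at the left end

    def add(self, timestamp, message):
        size = len(message.encode("utf-8"))
        self._entries.append((timestamp, message, size))
        self.used_bytes += size
        # Purge the oldest entries until the store fits within capacity again.
        while self.used_bytes > self.capacity_bytes and self._entries:
            _, _, old_size = self._entries.popleft()
            self.used_bytes -= old_size

    def oldest_timestamp(self):
        """Return the timestamp of the oldest retained entry, or None if empty."""
        return self._entries[0][0] if self._entries else None
```

Because eviction depends on size rather than age, a burst of incoming data shortens the retention window, which matches the behavior the policy describes.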

Effective Date: February 1, 2017
