Introduction



Server SLA Maintenance Policy

The Server SLA Maintenance Policy is intended to describe the procedures and guidelines by which the Infrastructure Systems Team maintains the server infrastructure necessary to sustain SPU's Enterprise Applications & Services.




Effective Date: February 1, 2017

Last Review/Update: May 2021


Architecture Overview

The Infrastructure Systems Team manages over 270 virtual servers, the campus perimeter firewall, and several dedicated physical servers and storage platforms (VidNet, DFS, Faith&Co).  With the exception of the hardware necessary to run Safety & Security's "VidNet" service, all of our servers are virtualized, with VMware ESXi as our preferred hypervisor.  We heavily use server virtualization along with machine image "templates" to standardize our server builds, to automate the creation of newly requested servers, and to dynamically manage compute and storage resources to best serve the SPU community.  We currently offer the following Windows and Linux server builds:

  • Windows Server 2012 R2
  • CentOS 6
  • CentOS 7

We are currently "beta testing" our CentOS 7 build; it will replace our CentOS 6 build in our virtual environment within the next few months.
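As an illustration of the template-driven build process described above, the sketch below maps a requested OS to a machine-image template. The template names and the function itself are illustrative assumptions, not our actual tooling; the real cloning happens inside VMware.

```python
# Minimal sketch of template-driven server provisioning.  Template
# names here are hypothetical placeholders, not real inventory.
SERVER_TEMPLATES = {
    "windows-2012r2": "tpl-windows-server-2012r2",
    "centos-6": "tpl-centos-6",
    "centos-7": "tpl-centos-7-beta",  # beta build, replacing CentOS 6
}

def select_template(requested_os: str) -> str:
    """Return the machine-image template for a requested server build."""
    try:
        return SERVER_TEMPLATES[requested_os.lower()]
    except KeyError:
        raise ValueError(
            f"No standard build for {requested_os!r}; request a review.")
```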

System Reviews and Updates

For systems and services that cannot be interrupted during the normal school year, IST updates these machines during the Christmas and Summer breaks. During these windows, IST reviews each system and applies all necessary cumulative firmware, OS, and application patches and updates. In addition, we conduct ad hoc assessments of needed system maintenance as recommended by the system vendor or by industry advisories, as noted below:

Security Patches

All SPU server builds are configured to install OS security patches automatically.  Linux machines check for and install patches nightly; Windows machines check for and install patches nightly when possible, but no less frequently than weekly (during our Wednesday morning downtime). Application patches are kept up to date weekly on a staggered schedule defined by machine group policy. Perimeter systems are updated automatically as patches are pushed from our firewall vendor.
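The Windows cadence above can be sketched as a small scheduling helper: given the current time, find the next Wednesday-morning downtime window in which a pending patch is guaranteed to land. The 04:00 start time is an illustrative assumption; the policy only specifies "Wednesday morning downtime".

```python
from datetime import datetime, timedelta

DOWNTIME_WEEKDAY = 2  # Monday=0, so Wednesday=2
DOWNTIME_HOUR = 4     # assumed start of the maintenance window

def next_patch_window(now: datetime) -> datetime:
    """Return the start of the next Wednesday-morning downtime window."""
    days_ahead = (DOWNTIME_WEEKDAY - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=DOWNTIME_HOUR, minute=0, second=0, microsecond=0)
    if candidate <= now:  # this week's window has already passed
        candidate += timedelta(days=7)
    return candidate
```

A patch detected on a Monday therefore waits at most two days; one detected during Wednesday's window rolls to the following week.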

Application and Firmware Patches

Application and firmware patches are reviewed as we are notified of their availability by the respective vendors. Our general process for application and firmware patches involves:

  • Immediate installation of high-level (zero-day) security patches that are recommended and verified by the vendor;
  • Application of feature/functionality patches and step releases as needed/recommended, but not necessarily immediately.

Unless there are extenuating circumstances, our goal is to keep systems on the latest major versions of software and firmware, with discretionary application of point/step releases between major revisions. In most instances, major updates will be scheduled during the twice-annual lift rather than risk bringing systems down during times of peak utilization. Lower-risk step upgrades will be considered on a case-by-case basis.
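The triage rules above can be summarized as a small decision function: zero-day security patches go in immediately, major versions wait for the twice-annual lift, and point/step releases are discretionary. The category names are illustrative, not a formal taxonomy we maintain.

```python
# Sketch of the patch-triage logic described in the policy text.
def triage_patch(kind: str, severity: str = "normal") -> str:
    """Map a vendor patch announcement to a scheduling decision."""
    if kind == "security" and severity == "zero-day":
        return "install immediately"
    if kind == "major":
        return "schedule for Christmas/Summer maintenance lift"
    if kind in ("point", "step"):
        return "apply at discretion (case-by-case review)"
    return "review against vendor recommendation"
```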

Backups

All SPU servers run daily backups; please see our Backup and Recovery Policy for more detail.

Monitoring

We use PRTG to monitor over 2,000 data points across our fleet of ~270 servers.  These metrics include criteria such as:

  • Network availability (Ping)
  • Disk space / usage trends
  • CPU Load
  • Memory Usage
  • Website Availability
  • Custom SQL Queries

We use this data to establish baselines for what is deemed "normal" behavior; we then configure alerting so that when metrics report data outside the norm, the branches of CIS responsible for the particular server/service are notified for further investigation and remediation.
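A minimal sketch of this baseline-then-alert approach is below: establish a "normal" range from historical samples, then flag readings that fall outside it. The 3-sigma threshold is an illustrative assumption; PRTG's actual thresholds are configured per sensor.

```python
import statistics

def out_of_norm(history: list, reading: float, sigmas: float = 3.0) -> bool:
    """Return True if `reading` deviates from the historical baseline
    by more than `sigmas` standard deviations."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:  # flat baseline: any change is out of norm
        return reading != mean
    return abs(reading - mean) > sigmas * stdev
```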

Log Files

Server log files are aggregated and copied off the servers directly to a centralized platform.  We currently do not have any active monitoring or alerting on this data; the intent is to have a repository we can go back to in the event that a situation develops and we need historical data to figure out what happened.  We aim for a few months of retention, but how far back we can look is ultimately dictated by the volume of incoming data: the oldest data is purged from the system to make way for the new.

This process and architecture is currently under review.
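The size-capped, oldest-first purge described above can be sketched as follows; the capacity number and file names in the test are illustrative, not actual retention settings.

```python
from collections import OrderedDict

def purge_oldest(files: OrderedDict, capacity: int) -> list:
    """Drop the oldest entries (insertion order) until the total size
    fits within `capacity`; return the names of the purged files."""
    purged = []
    while files and sum(files.values()) > capacity:
        name, _ = files.popitem(last=False)  # oldest entry first
        purged.append(name)
    return purged
```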