GOALS AND OBJECTIVES
To ensure uninterrupted operations, prevent downtime caused by emergencies, and eliminate the associated financial losses.
To build a high availability IT infrastructure which ensures fault tolerance of all services, including production services.
To develop a standard solution for the protection of critical systems and document the required SLAs.
To ensure continuous monitoring of the availability of all IT system components.
- A distributed virtualized systems complex based on two data centers
- The solution includes: virtual farms, database clusters, storage networks and backup systems from various vendors (EMC, IBM and VMware)
- A system for analytical monitoring
The new high availability IT platform is based on two data centers and is a distributed virtualized computing system that operates in active-active mode during normal functioning. Multilevel protection of IT services (cluster architecture for storage systems and database servers) has been worked out in detail, in order to ensure the platform’s smooth operation.
All data is mirrored between sites, and virtual machines can be quickly moved from one location to another. The largest sites (several terabytes) are replicated to additional storage, which significantly reduces the service recovery time in the event of an emergency, given that there is no need to restore from a backup.
Service systems are separated by a firewall for security reasons. Each system is backed up at an individually configured frequency that depends on how critical the particular service is.
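The idea of tying backup frequency to service criticality can be illustrated with a minimal sketch. The tier names and intervals below are hypothetical, since the case study does not state the actual schedule; the functions `next_backup_due` and `is_backup_overdue` are likewise illustrative names only.

```python
from datetime import datetime, timedelta

# Hypothetical criticality tiers and backup intervals.
# The case study does not give the real values used in the project.
BACKUP_INTERVALS = {
    "critical": timedelta(hours=1),
    "important": timedelta(hours=6),
    "standard": timedelta(hours=24),
}


def next_backup_due(last_backup: datetime, criticality: str) -> datetime:
    """Return the time at which the next backup of a service is due."""
    return last_backup + BACKUP_INTERVALS[criticality]


def is_backup_overdue(last_backup: datetime, criticality: str, now: datetime) -> bool:
    """Return True if the service should already have been backed up again."""
    return now >= next_backup_due(last_backup, criticality)
```

With such a mapping, a more critical service is simply assigned a shorter interval, so it is backed up more often than a less critical one.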
A Disaster Recovery Plan, with a detailed description of methods and steps for eliminating failures in the event of force majeure situations (determining the required team of specialists, their tasks, etc.), has been developed and documented.
Continuous saving of changes to disk arrays protects databases from logical errors and makes it possible to restore a system to its exact state at a specific point in time before an incident. An analytical monitoring system (covering storage systems, virtual machines and network infrastructure) tracks the operation of the IT infrastructure in real time.
The detailed systems operation and restoration instructions (with predictable SLA parameters provided for each system) make it possible to optimize the work and interaction of specialists and ensure efficient control mechanisms.
In total, 16 typical emergencies were tested (failure of virtual infrastructure at one of the sites, complete or partial database destruction, loss of SAN configuration, etc.), as were the recovery actions for each situation.
Jet Infosystem specialists have achieved an exceptionally short recovery time for Volkswagen's primary production IT services: in the case of sporadic failures, recovery time does not exceed 40 minutes. Potential data loss in the event of damage has also been minimized (the Recovery Point Objective is almost zero).
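As an illustration of how drill results might be checked against these targets, here is a minimal sketch. The 40-minute RTO comes from the text; the RPO tolerance of 5 seconds is an assumption standing in for "almost zero", and `DrillResult`, `meets_targets` and `failed_scenarios` are hypothetical names, not part of the project.

```python
from dataclasses import dataclass
from datetime import timedelta

RTO_TARGET = timedelta(minutes=40)    # stated in the case study
RPO_TOLERANCE = timedelta(seconds=5)  # assumption: stands in for "almost zero"


@dataclass
class DrillResult:
    scenario: str                 # name of the tested emergency
    recovery_time: timedelta      # time until the service was back in operation
    data_loss_window: timedelta   # span of changes lost after recovery


def meets_targets(result: DrillResult) -> bool:
    """Return True if a drill stayed within both the RTO and the RPO tolerance."""
    return (
        result.recovery_time <= RTO_TARGET
        and result.data_loss_window <= RPO_TOLERANCE
    )


def failed_scenarios(results: list[DrillResult]) -> list[str]:
    """Return the names of scenarios that missed the RTO or the RPO tolerance."""
    return [r.scenario for r in results if not meets_targets(r)]
```

Running `failed_scenarios` over the results of all tested emergencies would give a short list of the scenarios whose recovery procedure still needs work.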
The technologies applied and the platform service standards developed as part of the project allow for extension to meet the needs of any production line, even if very large-scale.
21 hours per day
Time during which the production conveyor is running
40 minutes or less
Recovery of operation of critical IT services (RTO) in the event of an accident
16 typical emergency situations
Tested during trial systems runs
The volume of data loss (RPO) in the event of a disaster approaches zero