When you pay a few million $£€ for a supercomputer, you expect great service. Here are a couple of service snapshots:
An article about the introduction of the on-site module tester for X-MP boards.
From Volume 6 No 1 – Jan/Feb 1986 of Cray Interface
The on-site testing and repair procedures are described in depth in an article by Charlie Clark.
Customer satisfaction survey response
Most years Cray would survey its customer base to measure satisfaction with Cray products and services. This document from 1988 shows the response Cray made to the survey feedback: CRAY-Response 1988.
As systems became smaller and more reliable, and as pressure on service costs grew, the technical support Cray provided changed from an engineer and a software analyst per system to a more blended model. More documents and services were delivered online, with remote access for diagnostics via a customer-controlled modem being standard across the range.
Extract from 1996 CUG paper 262_264 by Colin Campbell
Cray Service in Europe
The Cray Service organisation in Europe is made up of three major regions:
• Europe Central: Germany, Austria, Italy. Manager: Juergen Hochlenert, Munich, Germany
• Europe South: France, Belgium, Spain, Switzerland. Manager: Louis Ancian, Paris, France
• Europe North: UK, Ireland, Norway, Sweden, Finland, Denmark, The Netherlands, Poland, Russia, Czech Republic. Manager: Colin Campbell, Bracknell, England
For the record, Europe South staff also maintain systems in Morocco, while service in South Africa is managed from Europe North.
The distribution of maintained Cray systems in each European country is as follows. For the purposes of this table, large systems are defined as C90, Y-MP, X-MP and T3D systems, while small systems include YMP-EL, J90 and CS6400 systems.
Colin went on to describe the setting up of a technical call centre and spares distribution arrangements.
Weekly Critical systems call
In mid ’97 Cray & SGI had four major system types out in the field (T3E, J90, T90 and O2000). As part of the customer service effort, sites with escalated problem reports were discussed on a weekly critical systems and watch site call. One report from July shows the following number of sites on the list:
System   Critical   Watch
T3E      6          4
J90      5          7
T90      8          5
O2000    3          0
Issues discussed on the call include system hangs, data corruption and overall system reliability. The FCO (field change order) status of systems is carefully tracked, along with the SPRs (software problem reports) listed by the sites. One entry describes a cluster of J90s.
Since Installation in May 1996, S/N 9532, 9533, and 9534 had numerous problems. Since they were escalated on 11-NOV-96, the systems continued to have problems. This cluster of systems has a fairly complex configuration, using SFS Shared file system.
In February 1996, S/N 9554 was added to the cluster, which now contains 4 fully configured J932 systems, using SFS, and 3 Essential Communications HiPPI switches. Problems are being tracked in separate meetings, using separate reports. Additional information can be obtained from Don, responsible SPS manager for this site.
Fully upgraded cluster completed acceptance at the end of April.
The total J90 cluster has been very stable since May. The remaining root cause analysis and identification of remaining problems needs to be done in Eagan/CF using S/N 9551.
A J932 system, S/N 9551, was installed in CF during the week of 19-MAY to assist with resolving three separate problems seen at KAPL.
The site has been running well since the beginning of May. Progress is finally being made on the DOlO problem. Two occurrences have been seen on S/N 9551 in CF.
–UPDATE– 09-JUN-97, 14-JUL-97
Little progress has been made in identifying the cause of the DOlO data miscompare problem. This is the most serious problem to be solved, and the one which keeps this site on the CRITICAL list.
15) Site problems are being tracked and reviewed with the customer once a week, using a separately maintained problem list maintained by Don, SPS responsible manager.
20) SPS/SWDIV/EDG/ to continue in-house testing on S/N 9551 to address outstanding issues from KAPL.
23-JUN: Two instances of data-miscompare have been seen on S/N 9551. Investigation continues by SWDIV.
–CUSTOMER SATISFACTION– 09-JUN-97
Customer satisfaction is increasing as stability of the cluster increases...
System S/N 9532 (J91) is fully FCOed on 05-JUN
System S/N 9533 (J92) is fully FCOed on 12-MAY
System S/N 9534 (J93) is fully FCOed on 20-MAY
System S/N 9554 (J94) is fully FCOed on 14-JUL
FIELD ESTIMATE – What it will take to downgrade/remove site from WSR:
Resolutions to the DOlO problem, Essential Switch microcode problem, and the VHT HiPPI diagnostic failure are required for moving system to WATCH.
Large complex systems are what Cray did. The service teams of hardware and software engineers, often dedicated to small groups of sites, would work directly with customers to resolve tricky technical problems.
Remote support for some systems was provided via modem (this was the ’90s) and by using Smartie. More information can be found in these documents.
Service Overview from Seattle era: