Cray Customer Service – Memories (1979 to 2014) by Charles Clark

Cray Customer Service – Memories (1979 to 2014) a personal memoir by Charles “Charlie” Clark

I have seen lots of articles on Cray Supercomputers, their design, development, uses and installations, but I haven’t seen much documented on the web about the Cray Engineers and Analysts who supported these systems in the field, so I thought I would try to document some of the key “facts” (that I remember) of Cray Customer support.

I also list some key milestones and events that occurred in Service during my 36 years (35 years 10 months to be precise) working at Cray. Below is a picture of me on my first trip to Chippewa Falls in 1979. Directly behind me is the Engineering building and the light-colored building beside it is the Development building.

Illustration 1: Cray Research Facilities, Chippewa Falls, WI

Article Contents

 
  1. Main Index
  2. Cray 1 Hardware System and Support was unique
  3. All the parts of a Cray 1 Installation were supported by the Cray Engineers
  4. Cray 1 Module
  5. Timing is critical
  6. The Cray Wire Mat
  7. Hardware Problem Resolution
  8. Planned Maintenance
  9. On-Site Repair
  10. 60Hz MG required for all sites outside USA
  11. Wire repair
  12. Module Tester
  13. Module Repair
  14. On-site Spares
  15. Site Planning
  16. Communication and Technical Support
  17. Setting up support in new countries
  18. Cray (UK) Engineering Manager (1984)
  19. The System in India
  20. The XMP (Multi Processor) system
  21. Cray Software (Briefly)
  22. The Cray 2 System
  23. On-site repair of the Cray 2 module
  24. Return Equipment Management
  25. Cray Systems and Hardware Customer Support 1988 – Onward
  26. Cray Europe – Director of Customer Service (1990)
  27. Region 4 (1992)
  28. Hardware Product Support (HPS) (1995)
  29. Cray T90
  30. Silicon Graphics
  31. Tera Computer
  32. Cray X1
  33. Red Storm and the XT range of systems
  34. OctigaBay
  35. XC30
  36. My Retirement

Cray 1 Hardware System and Support was unique

The Cray 1 system was revolutionary in its design, and it was easily “The fastest computer in the world” at the time. Seymour’s design focused on speed and functionality at the expense of resiliency and redundancy, so to make up for these “deficiencies” each Cray 1 system was supplied with about 2.4 Engineers and an Analyst.

A Cray 1 customer decided on a Hardware support model that suited the criticality of their operation and their budget. Cray offered on-site coverage of 7×24, 7×8, 5×16 or 5×8, and the remainder of the cover was “on call” with a two-hour dedicated response time. Most customers took the 5×8 on-site and the rest on call. ECMWF, which was my first site in 1979, opted for 5×16 on-site with the rest on call, so Cray provided three Engineers to support this.

Cray also provided, at no cost to the customer, one on-site Analyst who would support the Operating system, work with the users to maximize their use and very importantly analyze the Cray memory system dumps on a failure, to assist the Cray Engineer with troubleshooting the system.

Cray S/N 1 was famously designed with only parity on its 64-bit word memory, so it suffered from memory failures quite frequently (S/N 1 had half a million words of memory and it suffered a parity error every 8 hours or so, if I remember correctly).

However, Cray quickly corrected this problem by adding another 8 modules to each memory column (and six inches to the height of the system) to accommodate Single Error Correction Double Error Detection (SECDED) for all future Cray 1 systems. This is the reason that there is no S/N 2 Cray 1 system. The cold bars were already fabricated for S/N 2, but they could only support 64 modules and did not provide the 72 slots required to include the SECDED modules. So, S/N 2 was never built.
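The jump from 64 to 72 per column matches the textbook arithmetic for SECDED on a 64-bit word: 7 Hamming check bits plus 1 overall parity bit. The sketch below is a generic extended Hamming code written for illustration only. The bit layout and the function names are my assumptions and are not taken from the actual Cray 1 memory design.

```python
def hamming_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_encode(word, data_bits=64):
    """Return a list of bits: index 0 is the overall parity bit,
    power-of-two indexes hold Hamming check bits, the rest hold data."""
    r = hamming_check_bits(data_bits)
    n = data_bits + r
    code = [0] * (n + 1)
    d = 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):              # not a power of two: data position
            code[pos] = (word >> d) & 1
            d += 1
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p:                  # position belongs to parity group i
                parity ^= code[pos]
        code[p] = parity
    code[0] = sum(code[1:]) % 2          # overall parity enables double-error detection
    return code


def secded_decode(code):
    """Return (word, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome < len(code):
        code[syndrome] ^= 1              # single error: the syndrome is its position
        status = "corrected"
    else:
        status = "uncorrectable"         # e.g. a double error: detected, not fixed
    word, d = 0, 0
    for pos in range(1, len(code)):
        if pos & (pos - 1):
            word |= code[pos] << d
            d += 1
    return word, status
```

Encoding any 64-bit word with `secded_encode` gives a 72-bit list. Flipping one bit and calling `secded_decode` returns the original word with status `corrected`, while flipping two bits is reported as `uncorrectable`.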

Nevertheless, none of the Functional Units or Registers had any parity checking, so diagnostics were frequently run by the Engineers to certify the data integrity of the system. In fact, the system was usually taken for Planned Maintenance (PM) every week for 4 hours or so, to facilitate these checks and fit any Field Change Orders (FCOs) that may have become necessary to improve the functionality or resiliency of the system.

All the parts of a Cray 1 Installation were supported by the Cray Engineers

A typical Cray 1 installation consisted of many parts:

  •  The Cray 1 system itself
  •  The Cooling System
  •  400 Hz Motor Generators
  •  Power Distribution Unit
  •  The System Disk Subsystem
  •  Operator and Maintenance Workstations
  •  Front End Interface

The Cray 1 computer itself

The iconic 270-degree cylindrical shaped system with 12 columns of modules and its power supply seats. The four middle columns contained the CPU and the four columns on either side of the CPU held the memory modules. It is a well-known fact that the Cray 1 obtained this shape to keep the lengths of the wires to/from memory as short as practical. These twisted wire pairs formed part of the timing circuit, so each one had to be cut to the exact specified length.

Illustration 2: Cray 1, S/N 1 at its final customer installation ~1989
 

The Cooling System

The Cray 1 and the Disk Control Unit were Freon cooled. The Freon was compressed and pumped around the system by the Refrigeration Control Unit (RCU), usually located in the basement of the building. The RCU was cooled by customer supplied chilled water. The Cray Engineer had to maintain the RCU but not the chilled water unit.

Illustration 3: Refrigeration Control Unit (RCU)

 

400 Hz Motor Generators

The power for the Cray was generated by Motor Generator units (MGs), usually located in the basement of the computer center. These fed 208V 400Hz power into the Cray Power Distribution Unit (PDU) in the computer room. Many customers also provided a standby generator and battery backup for the MGs (in case the grid power failed) but these were generally not supported by the Cray Engineers.

Illustration 4: 400 Hz Motor Generators

Power Distribution Unit

The PDU cabinet contained all the variacs that were manually adjusted to provide the power to each of the power supplies (in the seats) of the Cray. These in turn provided the -2 Volts and -5.2 Volts to the bus bars of each column, and hence to each module, to drive the ECL logic components. The PDU also housed a “Scanner” which monitored the Cray 1 columns for voltage and temperature and alerted the humans in the computer room if there was a problem. This was the forerunner of the increasingly sophisticated Warning and Control Systems (WACS) that were designed into all later Cray machines.

Illustration 5: Power Distribution Unit (PDU)

The System Disk Subsystem

The system disk storage was provided by Control Data fixed disks (DD19s), a chain of which was driven by a Cray designed Disk Control Unit (DCU). This controller was a stand-alone, Freon cooled cabinet that contained one to four disk controllers made up of Cray style modules.

Illustration 6: Two Disk Controller cabinets on S/N 9 at ECMWF


Illustration 7: A row of DD19 disk units on S/N 9 at ECMWF

Operator and Maintenance Workstations

The Cray system was a number cruncher, and it didn’t speak directly to slow humans; it only “spoke” to other computers. There was an Operator Work Station (OWS) and a Maintenance Work Station (MWS) provided to enable control and maintenance respectively for the Cray. These systems were Data General Eclipse minicomputers with a 25Mb disk (removable disk pack), a tape drive, a card reader, and a printer.

 
Illustration 8: Maintenance Work Station (MWS)

Front End Interface

The Cray required a “Front end” computer such as a CDC Cyber 170 or IBM 370 system to do the pre-processing of the customer workload. Cray designed and provided a Front-End Interface (FEI) to facilitate this. The FEI was a little box that contained two air cooled, Cray 1 style, channel adapter modules. This box converted the Cray Channel protocol to the host system protocol (e.g., Block Multiplex for an IBM front end).

Illustration 9: Front End Interface (FEI) being installed

The Cray 1 Module

The Cray 1 module consisted of two logic boards which were mounted on either side of a solid copper cold plate. The heat from the logic components was transferred to the copper plate (Gold washers under the board at each of the 20 connecting nuts assisted in this transfer). The module was clamped into the aluminum cold bar column which then transferred the heat into the Freon and so to the Refrigeration Control Unit (RCU). Illustration 10 below shows one of the Instruction Buffer modules.

  • The green strip on RHS held the Test Points which allowed the Engineer to scope some signals while the module was in the machine.
  • The Zig Zag foil runs provide signal delays to the circuits (6” = 1 n Sec).
  • The green strip on the LHS held the pins (actually sockets – see Illustration 12 below) that connected to the back plane.
Illustration 10: Cray 1 module showing one board mounted on the copper cold plate

 Illustration 11 below shows the same Instruction Buffer module close up (Looking at the test point strip end)

  • Each module had a two-letter identifier and a unique Serial number etched into the cold plate. This is an HR module S/N 1631.
  • The small black “dots” are packages containing 2 x 60 Ohm resistors. Each signal is terminated with 60 Ohms.
  • The white chips are the basic component of a Cray 1 CPU … each contains two 5/4-input NAND gates.
Illustration 11: Close up of Cray 1 module showing “test point” edge

Illustration 12 below shows a close up of the edge connector. You will notice that the module “pins” are sockets. They fit over pins that are in the back plane once the module is inserted in the column. Both the “sockets” and the pins are gold plated and the resulting connection made was extremely reliable.

Illustration 12: End view of Cray 1 module showing the two board connectors


Timing is critical

Delay on modules was achieved as necessary by adding foil runs (6” of foil = 1 n Sec). Wires between CPU modules or from CPU to memory were cut to the exact length required. They too are part of the circuit. The site Engineer had a library of “Wire Tabs” that detailed every wire’s source, destination and length.
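The rule of thumb quoted above (6” of foil = 1 n Sec) can be turned into a quick calculation. This is only a sketch of that arithmetic. The constant and function names are mine, and the source gives the figure for foil runs only, so it should not be applied to the twisted wire pairs.

```python
# Rule of thumb from the text: 6 inches of foil run adds about 1 ns of delay.
FOIL_INCHES_PER_NS = 6.0


def foil_delay_ns(length_inches):
    """Approximate delay (ns) added by a foil run of the given length."""
    return length_inches / FOIL_INCHES_PER_NS


def foil_length_inches(delay_ns):
    """Approximate foil length (inches) needed to add the given delay."""
    return delay_ns * FOIL_INCHES_PER_NS
```

For example, `foil_length_inches(2.5)` gives 15.0 inches and `foil_delay_ns(3.0)` gives 0.5 ns.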

The Cray Wire Mat

The Cray wire mat consisted of blue and white twisted wire pairs that connected the back-plane connectors together. The wires were cut to the exact length required for each interconnection since, as mentioned already, the time spent on the wire formed part of the logic timing. When showing visitors around the Cray 1 I usually joked that it was easy to find any bad wire in the wire mat because we knew it was either a blue wire or a white wire! … Some visitors laughed!!! I also mentioned that there were about 21 miles of backplane wiring (although later I saw a museum plaque in Chippewa which said 67 miles). Whichever is correct, it is a lot of wire. The wire mat at the center is about one and a half feet thick.

Illustration 13: Wire mat of Cray 1 S, S/N 53 at KSEPL

Hardware Problem Resolution

Some hardware problems caused the Cray system to crash, and a system dump was automatically taken. This dump was analyzed by the Cray on-site analyst who would then try to point the Engineer towards the failing part of the system.

Cray provided online and offline diagnostics that when run by the Cray on-site engineers could help to pinpoint the failure. If the failure was still not defined the engineer would create a small loop of instructions and then use an oscilloscope to walk through the suspect functional unit, register etc. to identify the failing module.

Each logic board (two per module) had a set of test points brought out to a test point strip along the outer edge of the board to facilitate checking some of the key logic points on the board without removing the module.

Illustration 14: Scoping test points on a Cray 1 module

The Cray 1 had 12 Functional Units (FUs), each made up of several Cray modules, so it wasn’t enough to know the problem was “in the Floating Point Add FU”; you had to troubleshoot to the bad module before replacing it with the on-site spare. If I remember correctly, the Floating-Point Reciprocal Approximation FU (which performed the Divide function) was the FU made up of the most modules … it consisted of over 100 modules. There was a “bubble chart” in plastic sheets provided which could be filled in as the test points on this functional unit were scoped. The chart would indicate the failing bit and hence the module responsible.

The faulty module would be removed, and an on-site spare put in its place. This was recorded in the system “Swap Log” book which was used to track all module replacement. Good practice demanded that once the faulty module was repaired it would be “homed” in its original slot, at the next PM.

Another unique aspect of tracing the Cray 1 logic was there were no logic diagrams like most computers of the time. The logic was defined with Boolean Algebra terms. The library of books containing the Boolean for Cray S/N 1 was all handwritten (With later Cray 1s the Boolean was typed). I was told that this was Seymour Cray’s handwriting, but I am not sure if that was an “Urban Myth” or not. However, the Boolean was written on the squared graph paper that Seymour was known to prefer.

At any rate, I made a photocopy of one of the 16 pages that defined the Instruction Buffer (HR) module. This HR module was made redundant by an FCO that redesigned the Instruction Buffer and created the HX module that we FCOed into S/N 1 in the field.

Illustration 15: Photocopy of a hand written page of Boolean from Cray 1, S/N 1

Planned Maintenance (PM)

Diagnostics were frequently run by the Engineers to certify the data integrity of the system. In fact, the system was usually taken for Planned Maintenance (PM) every week for 4 hours or so, to facilitate these checks and fit any Field Change Orders (FCOs) that may have become necessary to improve the functionality or resiliency of the system.

During PM the Cray modules may be “shocked” to try and bring out any marginal or intermittent solder joint, component or connection. This technique was the same as the one used in System Test and Checkout (STCO) when the system was being brought up and checked out in Chippewa Falls.

The shocking (or vibrating) of the modules on the early Cray 1s was accomplished by running a wooden medical tongue depressor up and down the cold plates of the modules in the column under test (The Customer was amused at first when they saw the Cray engineer carry out this “shocking” procedure, but he was happy when any intermittent problems were identified and fixed). This wooden tool was later replaced with a 4′ x 1′ piece of hard glass fiber (which didn’t leave splinters of wood on the Cray’s seats!!) and eventually Cray provided a modified electric etching tool which was placed against one module at a time to vibrate it and better target the suspect area.

 

On-Site Repair

A 60Hz MG was required for all sites outside USA

The Maintenance Work Station (MWS), the Operator Work Station (OWS) and the Ampex terminals all ran on 60Hz power. In addition, all the on-site tools were supplied by Cray Inc and shipped from Chippewa Falls with the system. This means that the Oscilloscopes, Microscope, Soldering irons, Solder suckers, Heat gun, Module tester etc. were designed for the USA power grid (60Hz 110V). The result was that in almost every country outside the USA where we installed a Cray system (most of the rest of the world uses 50Hz 240V) we had to supply a 60Hz Motor Generator to provide the required power for all the support equipment and tools.

Wire repair

Changing a module in the Cray 1 is a skilled operation. If the module is not inserted carefully and straight, then it is possible to “crush a pin” in the wire mat connector.

The on-site Engineer must drill out the crushed pin and epoxy in a new pin which is pre-attached to a twisted wire pair. The joint of the original wire must be found in the wire mat and that joint un-soldered. The new wires are then cut to length and stripped, and a “solder sleeve” (a small plastic tube with a ball of solder in the center) is fitted over the two ends of the wires that are to be connected. By carefully using a heat gun the solder ball is melted and the connection is made.

If more than one pin needed repair or if a connector got cracked and needed replacing, then that is a job for the experts. La crème de la crème of the wiring girls in Chippewa Falls (the dedicated women who patiently build these wire mats during manufacture) were formed into a Special Work Assembly Team (SWAT) and they were ready, 7 x 24, to jump on a plane, go to site and fix a wire mat anywhere in the world.

Module Tester

When the suspect module is removed from the system it is placed in the module tester (located in the Cray Engineer’s office on-site). The Engineer can now run either a static or a dynamic test on the module. The static tester provided fixed signals to the module connector and the dynamic tester sequenced test patterns at system speed (12.5 n Secs). Now the Engineer can scope all the logic on both boards of the module to isolate the failing component.

Illustration 16: Cray 1 on-site module tester

Module Repair

Once the failing component was identified the module was moved to the repair bench. The repair work was done under a microscope. The microscope and all the necessary tools (Soldering Iron, solder sucker etc.) and new spare components were kept on site.

Illustration 17: Site Engineer's module repair work bench

On-site Spares

There were 113 module types in a Cray 1 and Cray provided at least one of each module type as an on-site spare. There were also spares for the Disk Control Unit (DCU) and the Front-End Interface (FEI), both of which were built with similar Cray 1 style modules. In addition to all the spare modules, the on-site spares included component parts for these modules, which were used during the module repair process. The cost of providing such a large on-site set of spare parts was considerable. Spares depreciation made up about 70% of Cray’s cost of providing service on each Cray 1 system.

Site Planning

Getting a customer to prepare his computer room is the job of the Cray Site Planning Engineer. This was initially all done from Chippewa Falls (Mechanical Engineering) but as the UK grew its customer base and we also seemed to thrive on moving systems from customer to customer, it very quickly became necessary to hire a Site Planning Engineer in the UK Customer Service group. This specialist would visit the new customer’s site and establish the location of the system and all its parts from the Cray 1 itself to the RCU and the MG. He would provide the customer with electrical wiring diagrams, plumbing layouts (for the Freon piping) and a floor cut out template for the Cray and IOS. The floor tiles had to be precisely cut to ensure the Freon piping flanges and power supply cables had access to the mainframe connections.

Illustration 18: Checking the floor cut outs for the Cray 1S at KSEPL

The site planner also had to determine the access route for the large, heavy Cray mainframe. Sometimes there was a freight elevator that could take the load and sometimes a crane had to be used to swing the system in through a “hole” on an upper floor that had been created by removing a window. Oh, and sometimes the computer room was conveniently located on the ground floor.

I mentioned the practice of Cray UK lending a customer a used system while his own system was being built (in the late ’70s and early ’80s the demand for a real Supercomputer meant there was a waiting list … and customers were willing to wait for a Cray). The most frequent system traveler was the Cray 1, S/N 1 itself, which had one installation in the USA and then moved 5 more times (different customers) in the UK. I am still amazed that there was no Cray 1 prototype; they just designed and built S/N 1. It then ran for 13 years (1976 to 1989) before retiring to a museum in Chippewa Falls, WI.

Communication and Technical Support

Each site engineer communicated with Technical Support (TS) in Chippewa. They reported weekly statistics on their system so that TS could provide a detailed picture of all the installed systems’ key reliability statistics (MTTI, MTBF and MTTR) to Engineering. This was done by Telex from the Cray UK office.
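For readers unfamiliar with these statistics, here is a minimal sketch of how such weekly figures could be worked out from a simple incident log. The expansions and formulas are the usual textbook ones and are my assumptions: MTTI (mean time to interrupt) counts every interrupt, while MTBF (mean time between failures) and MTTR (mean time to repair) count hardware failures only. The function name and log layout are hypothetical and are not the format the site engineers actually reported.

```python
def weekly_stats(period_hours, interrupts):
    """Compute MTTI, MTBF and MTTR (all in hours) for one reporting period.

    interrupts: list of (is_hardware_failure, downtime_hours) tuples.
    Assumed definitions: uptime is the period minus all downtime;
    MTTI = uptime / all interrupts; MTBF = uptime / hardware failures;
    MTTR = hardware downtime / hardware failures.
    """
    downtime = sum(hours for _, hours in interrupts)
    uptime = period_hours - downtime
    hw = [hours for is_hw, hours in interrupts if is_hw]
    return {
        "MTTI": uptime / len(interrupts) if interrupts else None,
        "MTBF": uptime / len(hw) if hw else None,
        "MTTR": sum(hw) / len(hw) if hw else None,
    }


# A made-up week (168 hours): two hardware failures and one software interrupt.
example = weekly_stats(168.0, [(True, 2.0), (False, 0.5), (True, 1.5)])
```

With this made-up week the uptime is 164 hours, so the MTBF comes out at 82 hours and the MTTR at 1.75 hours.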

The site engineer also wrote a descriptive monthly report to share his experience, suggestions and improvements in support with the other installed sites around the world. This was sent by “snail mail” to TS who compiled them together and mailed the bundle back out to all the Cray site engineers (Where was email when we needed it?).

Finally in the early 80s Cray started a remote support tool to better assist in a system down situation where the on-site engineer needed some help from TS. This was a modem-based tool but of limited use because, unfortunately, many of these early Cray customers would not allow external access to their system (This was true for Special System Government customers, but it was also true for commercial customers who were concerned about protecting their intellectual property).

Setting up support in new countries

Cray UK Customer Service started to install and support Cray systems outside the UK, and the first such system was S/N 13 at the Max Planck Institute for Plasma Physics (IPP) at Garching near Munich in Germany. This was in September 1979. I was assigned to help with the install and to support it for 6 weeks while the German site engineers got up to speed. I was lucky that the acceptance of the system was celebrated at Oktoberfest, in the Spaten Brau tent. All the Cray executives at the time, including Seymour Cray, were in attendance.

This support format was followed from then on. When Cray UK sold a system in another country, Cray Inc set up a subsidiary in that country and Cray UK supplied temporary customer support staff while the local engineers and analysts were recruited and trained. These country nationals then worked for the new Cray subsidiary.

I was also involved with the next installation outside the UK. This time it was KSEPL (Royal Dutch Shell Exploration and Production) in Rijswijk, the Netherlands. We installed a Cray 1S (S/N 53). This was a Cray 1 with an attached I/O Subsystem (IOS S/N 19). I moved to the Netherlands with my family for a year (1983/84) to “temporarily” support this system.

Cray (UK) Engineering Manager – 1984

On my return from the Netherlands in April 1984 some of the Cray 1s in the UK had already been replaced with XMPs. These multi-processor systems still used the Cray 1 style boards and Freon cooling, but the modules had a double cold plate and four boards per module. There was a saying popular with Cray Engineers that went “If you can’t Fix it, Teach it. If you can’t Teach it, Manage it.” Well, I skipped the “teach bit” and became the Field Engineering Manager of the UK’s Southern Region. So I was never trained to actually fix the XMP, Cray 2 or any of the follow-on products.

In September 1984 my boss took a position in Cray Inc (Chippewa Falls Manufacturing) and I got his job as the Cray (UK) Engineering Manager. When I started in this position the UK was split into three Customer Engineering regions: South, Central and International. At this time there were twelve systems in the UK and two in the associated overseas subsidiaries, Sweden and the Netherlands. The UK Salesmen were on a mission to sell Crays, all types of Crays, anywhere, so in the next 6 years my team and I scrambled to plan, install and support customers all over. The customer base grew to 17 systems in the UK and some 8 customers in other countries (Sweden, Norway, Finland, the Netherlands, Abu Dhabi, Saudi Arabia, Australia and India).

The System in India

Speaking of India, this was an interesting installation. Supercomputers were of course of strategic importance to the USA, so each system Cray installed abroad had to obtain export license approval. These export licenses often had provisions attached: only certain groups were permitted to use the system, data security must be protected, or even physical security must be seen to be carefully controlled.

The customer in India, the Medium Range Weather Forecast bureau (NCMRWF), ordered a Cray XMP system in 1988 and was required, by the export license, to provide physical security to its Cray system. In addition to the usual computer room electronic access set up, the customer decided to pay for a platoon of Army soldiers to guard their computer center and its grounds.

As a Brit, I did not have any exposure to guns, so I just had to ask a couple of the guards for a photograph. I am the one in the middle, without a gun!!

Illustration 19: Security detail for NCMRWF’s XMP System

India was very proud that they were to get a Cray supercomputer and the local newspapers in Delhi had frequent articles about the Cray in the weeks running up to the delivery. The customer site was one of those where the computer room was on the second floor and the elevator was not rated to take the weight of the XMP, so the company that we engaged to transport the system in India built a metal platform (“bucket”) that was used to crane the system through the second-floor window opening.

This company did not miss the opportunity to do some advertising by painting a large Cray logo on the side of the platform they built.

Illustration 20: The Cray “lift platform” decorated by the proud moving company in India

The XMP (Multi Processor) system

The XMP and the Cray 2 systems continued with the Cray 1 Service support model, i.e. a complete set of on-site spares, on-site module repair and on-site engineers. I want to note that the XMP site module tester was a much prettier animal when compared to the original Cray 1 one (Illustration # 16 above) although they performed basically the same functions for their respective modules.

Illustration 21: On-site XMP Module tester

Cray Software (Briefly)

During the time I was Cray (UK) Engineering Manager, known by Cray Inc as a “Field Engineering Manager” or FEM, there was a separately managed group in Service consisting of site analysts and other software support. The manager of this team was known as a “Regional Analyst Manager” or RAM. To me the latter TLA sounded a lot more dynamic than ours!!

When Cray Research shipped their first Cray 1 system in 1976 the company “didn’t do software”, so the first customer developed the operating system for themselves and called it the Cray Time Sharing System (CTSS). Cray Research quickly realised that they had to provide a Cray Operating System (COS) and other software tools like Cray Fortran, Cray Compilers etc., so Cray Software Development was born. I mention this because COS was used on most/all Cray systems up to about 1985, when Cray Inc took the courageous decision to move from providing our proprietary OS to supplying the Unix based UNICOS operating system which they had developed.

What this meant to our Software Customer Service colleagues was that in the second half of the 1980s and the early 90s the site Analysts worked with their customers to migrate all users from COS to UNICOS. This was a huge effort that, even with the help of the “Guest Operating System” Cray provided for multi-processor systems, took several years to complete on all our installed systems.

The Cray 2 System

In 1986 the first Cray 2 to be installed in the UK was purchased by The Atomic Energy Authority (AEA), Harwell (S/N 2008). The Cray 2 system cooling was completely different from the preceding Cray 1 and XMP systems. It used liquid immersion cooling. The modules and power supplies were immersed in a tank of fluid. The coolant was Fluorinert which is an inert, electrically non-conductive liquid. Obviously though it was a good conductor of heat.

The picture below is of the Cray 2 installation of S/N 2017 at KISTI in Seoul, South Korea. Although South Korea wasn’t formally part of Cray (UK) customer service, I interviewed and hired the first two Korean site Engineers, and the photo shows that it was Cray (UK) engineers and analysts who installed the system and helped to support it in the early days.

Illustration 22: Cray team gather round the Cray 2 during the KISTI (S/N 2017) Installation

On-site repair of the Cray 2 module

The Cray 2 module was a densely packed stack of 8 boards that formed a sort of “brick”.

Illustration 23: Cray 2 module showing one board
Illustration 24: Cray 2 module, end view, showing edge connectors and 7 of the 8 boards of a module

(For those with sharp eyes you will note there are only 7 boards in this picture …one had been removed to use in a display).

The Cray 2 site engineers had a module tester that used pulsed power so the module under test (which was obviously not immersed in Fluorinert) did not overheat while it was being scoped. I very much admired the engineers’ skill and dexterity that allowed them to scope signals on a certain chip in the middle of this stack of boards to find the failing component. The stack was then split to give soldering access to the correct board and the faulty chip was replaced.

Return Equipment Management

To conform with US export license requirements, any de-installed system which was not donated to a museum had to be destroyed. At first this was done in Chippewa Falls, but with all the system swaps that were occurring in the UK region we decided to appoint a Return Equipment Coordinator in UK Customer Service. His task was to appoint the firm(s) that did the destruction and recycling of these systems and to ensure that the activity was successfully completed. Recycling was quite lucrative because of the amount of copper in the cold plates, aluminum in the cold bars, gold in the connector pins (and the gold washers between the boards and the cold plate) and of course the metals in the chips themselves.

Illustration 25: A man from a recycling company removing a cold bar from a scrapped Cray 1 system
Illustration 26: A wooden box half full of destroyed Cray 1 modules ready to be recycled

Cray Systems and Hardware Customer Support 1988 – Onward

Starting with the Cray YMP on the vector Mainframe line the hardware service model changed and, although we kept system spares on site or in a nearby clustered location depending on the customer’s service contract, we no longer did on-site repair. The failing module was returned to Chippewa Falls Manufacturing for repair.

The various smaller air-cooled systems J90, YMP-EL, CS6400 etc. also had spares held “locally” and all module repair was done “centrally” in the USA.

Cray Europe – Director of Customer Service – 1990

In 1990 Cray Inc decided to combine all the countries in “Europe” into one Sales and Service unit under a VP of Europe. I was made the Director of European Customer Service.

Creating a European Region for Service had two main objectives. The first was combining the Hardware and Software (Engineers and Analysts) into one team under one country service manager for the first time. The second was the opportunity to better share resources across Europe, such as Technical Support, Product Specialists, Logistics, Training and even REM. According to Cray, “Europe” was defined as all the countries from Sweden to Spain and from the UK to Poland, but it also included the Middle East, India, South Africa and Australia. I had three Customer Service Managers reporting to me, located in the UK, France and Germany, who were in turn responsible for providing Customer Service to their own country, and each had some of the other countries in “Europe”. In addition, I assembled a team of experts that were now shared across Europe. These experts were selected from the existing Technical Support personnel previously dedicated to just one country. The team comprised Technical Support for all mainframe systems, Product Specialists (CRS and YMP-EL), Software Specialists (OS and Networking), a Logistics manager and a Remote Support coordinator.

Installations continued apace. When I set up Europe Customer Service, we had 74 systems/customers to look after and one year later there were 86. Our Service team worked well together, and a more efficient Customer Support was being achieved (I am somewhat biased!?). However, the Country Sales teams did not like having to go through the European “middleman” to get to Cray Inc corporate so after only two years the European organization was dissolved.

Region 4 – 1992

In 1992 the combined Hardware and Software Customer Service model continued, but the Service managers in the UK, France and Germany reverted to reporting directly to their respective Country Managers.

All the other countries were combined into one region, imaginatively called “Region 4”. I became the Service Manager of Region 4, reporting to the Region 4 Sales Manager. I also retained some of the technical specialists from “Cray Europe” reporting to me (ELS support, CRS coordinator, Software Product Support and REM coordinator) who were still used as a pan-European resource.

In the first year of operation the region grew from 8 countries to 12 by installing systems in the Czech Republic, South Africa, Denmark and Poland. Cray created subsidiaries in South Africa and Poland, and I was able to recruit local engineers in those countries. However, with the improvements in Remote Support, the Online System Monitor (SMARTE), the degradability of our multi-processor systems and a much-enhanced MTTI with our smaller systems we were able to use a remote support model of Service to four of our new customers (FzuAV in the Czech Republic, U of Copenhagen in Denmark, U of Tarragona in Spain and ENEA in Italy).

At year end 1993 there were 24 Mainline systems and 19 Entry Level systems in Region 4. The team was challenged with system swaps as existing customers upgraded to newer model systems. These included the first YMP M90 system in the region, in Switzerland, and a C94, S/N 1, in Italy. With all the new system types (YMP, YMP-EL, YMP M90, C90, T3D and CS6400) being introduced to the region, training was a big commitment, but the site staff rose to the occasion. There were, in addition, many upgrades of CPUs, Memory, Disks and Networking that were all achieved by the region staff with the minimum of customer disruption.

Finally, before moving on to the next phase of my Cray adventure, I was involved with the planning of the first Cray system in Russia (at Rosgidromet, the Russian weather bureau). There was an extra challenge in planning this installation because we first had to select a company to build out the facility before we could plan for the actual Cray installation. We chose an Italian company to do this work. The interesting part for me was that this was the first time I had ever seen a translator do simultaneous translation. When we were meeting with the customer, we spoke English and he spoke Russian, and the interpreter spoke in the opposite language at the same time as we spoke; there was no waiting until we finished a thought. This was very impressive, although obviously we had no way of telling how accurate the translation was. However, somehow we got the planning done to all parties’ satisfaction.

Clive England sent me a couple of pictures from the successful install of the YMP (S/N 1040) in Moscow in 1996, which happened after I left the UK. He mentioned that he went to Chippewa Falls to do the software setup in STCO and then he had a couple of trips to Moscow to do the software setup and later install the YMP-EL that they also bought. The illustrations below show his site badge and a “special” tee shirt created for the occasion.

SN1040 Site badge
SN1040 – Site tee shirt created after a close call with a bus bar by one of the installers.

Hardware Product Support (HPS) – 1995

In 1995 I got the opportunity to go to the USA and manage the Hardware Product Support (HPS) group in Chippewa Falls. I moved there in February and my wife and family joined me at the end of the English school term in June. The HPS group was already a fully functioning product support group. These engineers were the highest level of escalated support in Customer Service, and they interfaced directly with the design engineers and Manufacturing on Cray hardware problems. They provided 7 * 24 support every day of the year including US holidays (obviously other parts of the world do not celebrate USA holidays!!). In the 1995 timeframe, we had pagers to facilitate call out and it was a few years before pagers displaying the caller’s number were available. Needless to say, much later cell phones were a game changer for on call, although the cell phone coverage was, at first, somewhat limited in rural Wisconsin (you had to test to see if your favorite fishing spot had cell phone service, otherwise you just stayed home!!).

When software fails, the problem is always reported in a Software Problem Report (SPR), but with hardware it is not so simple. A hardware failure could just be a “one off” failing component which is simply replaced. However, when a failing trend was detected, a Hardware Problem Report (HPR) would be raised and HPS would check around the installed base for similar failures. If found, these would be added to the HPR and help to emphasize and prioritize the issue with Engineering. Once Engineering isolated and fixed the problem, an Engineering Change Order (ECO) was written to fix any systems still being built in Manufacturing and a Field Change Order (FCO) would be written against the relevant (usually identified by HPS) install base in the field.

Cray Customer Service Engineers located on site or in a cluster of sites were now supporting a plethora of Cray systems at any one time. It is worth noting that it would be several years after the last system of a particular product shipped to the field before all the systems of that product were retired. For example, at the end of 1996 there were 944 Cray systems in the worldwide customer base, consisting of 9 different Cray products. Those systems ranged from XMP (still 14 installed), Cray 2, YMP, YMP-EL, C90, T3D, T90 and J90 to the most recent T3E (already ~40 systems installed). What this meant was that on-site engineers had slowly become “generalists” rather than single-product experts, since they covered so many more system types. As a result, more hardware problems were escalated to the Field Technical Support (FTS) engineers and also on to the Product Specialists of HPS.

Technical Support (TS – combined HPS and SPS) held a weekly telephone call with the CS support team(s) of any customer site(s) that was on the Critical Site List (CSL). This call enabled the site to request special attention from TS, Chippewa Engineering or Software Development to an unresolved ongoing “Critical” customer system issue. This list also included “Watch” sites and the goal here was to address the Watch site issues before the customer became Critical.

Cray T90

Cray shipped the first immersion cooled (four processor) T90 to a Canadian customer just before I arrived in the USA and this system seemed to be exceptionally reliable. However, Cray then shipped a T916 to Japan followed by 13 more T90 systems in 1995 and things started to go wrong. There were several logic issues but worst of all were the module plugging issues. To cope with these, I had to increase the number of T90 hardware engineers that we had in HPS, and they were all kept extremely busy, for the next few years, flying around the world when needed to help fix these T90 customer systems.

One significant design feature of the T90 was the interconnect. It used Zero Insertion Force (EZIF) T-Rail connectors (no wire mat). There was a heat-controlled spring in each connector that opened when a small voltage was applied to it. Modules were inserted at 90 degrees to each other. Illustration # 27 below shows five slots (white edges) for modules; the module at 90 degrees to these was inserted in the gap along the top of the T-Rail.

Each connector had some 400 signal contacts. In Illustration # 28 below, if you look closely at the “solid” orange color, you may be able to see the individual foil traces. Although this connector functioned extremely well, it was the granularity of these foils together with any debris floating around in the system that caused the plugging issues mentioned above.

Illustration 27: T90 T-Bar showing slots for modules at 90 Degrees to each other.
Illustration 28: T90 T-Bar showing individual foil traces (look very closely at the “solid” orange color)

I don’t plan on “airing all our dirty laundry” in this document on the various problems and issues we tackled over the years. Suffice it to say, the HPS group was always kept busy with Cray product problems, Disk problems, System upgrades and of course escalated calls from the Field Technical Support engineers across the world.

Silicon Graphics

In 1996 Silicon Graphics (SGI) purchased Cray. This did not actually have too much impact on HPS because SGI had their own Technical Support team in Mountain View, California supporting the legacy SGI products and they still needed us (HPS) in Chippewa Falls to support escalated Cray product problems. The bigger disruption took place in the Customer Service Field as SGI products, processes and procedures tried to merge with the Cray ones. It was clear that things were not working out, so SGI created a separate Cray business unit and shortly after that sold it off.

Tera Computer

In 2000 Tera Computer of Seattle purchased the Cray business unit from SGI. Tera renamed the combined company Cray Inc. Tera had designed and built a system implementing Multi-Threaded Architecture (MTA). They had only installed one MTA 1 system at a customer location (SDSC in San Diego) in 1998. This system was supported directly by the design engineers in Seattle; Tera had no Customer Service. Tera was working on the improved version of their water-cooled product, which was called the MTA 2.

Once we left SGI, as well as managing the HPS team, I took on the job of managing the Service Planning group. The latter group worked with Engineers, Site Planning, Documentation, HPS and SPS (Software Product Support) teams to develop a Service Plan describing how to support a product in the Field. Our first challenge was to create a Service plan for the second-generation MTA (MTA 2). This plan was put to the test when the first MTA 2 system was shipped to ENRI in Japan in 2001.

Cray X1

In 2003 the X1 was launched. This system combined the architectures of the T90, T3E and SV1 systems, providing, in the captured liquid cooled version, a scalable system of up to 4,096 processors. It came in air cooled (AC) and liquid cooled (LC) versions, and even the LC system was very much easier to install than previous large scale Cray systems.

Red Storm and the XT range of systems

In 2004 “Red Storm” (a joint development by Cray and Sandia National Labs) was installed. It was upgraded in the field several times and was the basis of Cray’s successful XT range (XT3, XT4, XT5 and XT6). This XT product did not use a Cray designed processor but rather used AMD Opteron processors. What this meant to HPS was that we now had to work with AMD on any processor issues, and obviously there were some. The packaging and the very successful Seastar (and Seastar 2) interconnect used by these products were, however, designed by Cray Engineering.

OctigaBay

In 2004 Cray purchased OctigaBay Systems Corp. of Vancouver, BC. They had developed an entry level, 19-inch rack mounted system using AMD Opteron processors. This system was marketed by Cray as the XD1. My Service Planning group developed a Service plan for the XD1 and I also managed the Customer Service Product Support team (Hardware and Software) located in Vancouver.

XC30

By 2014 Cray was delivering XC30 systems, a very successful product line whose systems contained thousands of processors made by AMD, NVIDIA and/or Intel, using the new Cray designed Aries interconnect. This is a long way from the single processor Cray 1 of the 1970s and 80s, but Cray has provided and continues to provide Supercomputers that consistently ‘blow the competition away’. In addition to great products, Cray always has had and continues to provide quality Customer Service to support their installations. I hope these ramblings give you some idea how Cray Service has evolved to meet our customers’ grand expectations.

My Retirement

It was in December 2014 that I retired from Cray Inc and literally went “out to pasture” by moving to the land I bought in the foothills of the Blue Ridge Mountains in North Carolina.

Illustration 29: Happy Retirement 2015

Document created October, 2021

I hope you enjoyed reading my ramblings and I would be happy to receive feedback that adds to this description or corrects any “facts” that I have remembered incorrectly. You can contact me, Charles (Charlie) Clark, by email: clarkcm1949@gmail.com

Copyright/Disclaimer

This document cannot be reproduced in whole or in part without the written consent of the author. All the images presented here were taken by me or for me. “Cray” and other words and logos used in this document are trademarks which belong to Cray Inc. and others. Some of the ideas described in this document are subject to patent law or are registered inventions. Description here does not place these ideas and techniques in the public domain.
