Cray Research and Cray computers FAQ Part 2
- Tales from the crypto and other bar stories
- The immovable object
- Cray-1 Lands at Boeing
- CPU hangs
- A good start
- Tales from the crypto
- A change in the weather
- It would not start and EL pictures
- Cray-2 Cooling towers
- NetworkCS Powers Down (last) CRAY-2
- Cray-3 memories by Steve Gombosi
- Regular Crashes
- Lost in the post
- The wrong kind of help turns a crisis into a disaster
- The world according to Clarkson .. comments about the UKMet T3E
- Do more, Fork less
- Trademark Disclaimer and copyright notice
This Cray supercomputer FAQ is split into sections: Part 1 describes Cray supercomputer families, Part 2 is titled “Tales from the crypto and other bar stories”, Part 3 is “FAQ kind of items”, Part 4 is titled “Buying a previously owned machine” and Part 5 is “Cray machine specifications”. Send corrections, contributions and replacement paragraphs to CrayFaq0220@SpikyNorman.net. Please see copyright and other notes at the end of each document. Note: Part 3 was the only part posted to newsgroups. The FAQ was previously hosted at www.SpikyNorman.net
Tales from the crypto and other bar stories
Stories about particular sites, machines or situations involving Cray hardware are solicited. No names or site details are required, so take them with a pinch of sodium chloride.
The immovable object
The YMP4-D was a large chassis comprising a central square box for the CPUs and memory, and two XMP-style wings for the IOSs. In an early 4-CPU machine one of the wings was, unknown to the customer, just an empty shell. Much fun was derived from this by a site engineer who one day swapped round the lightly patterned raised floor tile under the corner of the empty wing. He maintained a straight face the whole day while the customer analyst glanced repeatedly at the out-of-sequence tile but never said a word. The next day the tile was restored to its original orientation.
Cray-1 Lands at Boeing – by Ed Barnard
This tale relates to Cray-1 serial number 20, purchased by Boeing Computer Services in 1980. The arrival, as with all Cray supercomputers at the time, was rated a big deal, with sundry Boeing officials on hand to see the mainframe unloaded from the truck.
Remember that this is Boeing. Legend had it that the crane operators for the 747 operation had to be able to lower the main fuselage onto an egg, touching the egg without breaking it. Forklift operators undoubtedly had similar standards.
Meanwhile, then, we have the Cray mainframe at the back of the truck on a pallet. Off in the distance we hear a forklift roaring towards us. He came around a corner, heaving into view, saw all the people standing around in suits, and immediately began inching towards us at a sedate and suitably responsible pace.
He pulled up to the back of the truck and carefully took on the multi-ton, ten million dollar pallet. He lowered it to near the ground, reversed direction, and began heading to the prepared building entrance.
Oops! Between him and that door was a covered walkway. The top of the forklift could not pass underneath the walkway roof!
We stood around for a while longer, and this time heard two forklifts in the distance – and both were approaching at a responsibly sedate pace. The big forklift lowered the pallet to the pavement, backed off, and the two new forklifts faced each other, the pallet between them. They lifted the pallet together, proceeded underneath the covered walkway, one driving forward and one driving backwards.
It was obvious to all present that fumbling the pallet would not be acceptable, and it was also clear that these were expert forklift operators. Still, they were both glad to set that pallet back down beyond the walkway! Balancing that pallet between them made for some nervous minutes.
By this time the big forklift had travelled the long way around the building. He picked up the pallet, ran it over to its destination, and we wheeled the computer into the building on its “come along.”
This building had been built long before the days of raised floors. The Cray machine room, therefore, had a ramp leading up to the piece of wall left open. That is, once the mainframe was in place, the wall was to be patched up. Pushing the heavy mainframe up the ramp was tricky by itself, but as we got to the top…
Oops! The opening was high enough to fit the CRAY-1 through if it were level, but tilted on that ramp, it would not fit through the door! So back down the ramp we went with our ten million dollar mainframe, to await developments.
Boeing Aircraft Company is a union shop, so there were undoubtedly complications relating to getting the right person to make the wall opening an inch higher. However, with all those people in suits in attendance, things happened rather quickly.
The original installation of CRAY-1 s/n 20 was at the Renton plant, where the 737s were manufactured. Boeing Computer Services later had its own set of buildings in Bellevue.
A note in CFD review Oct 2003 and Puget Sound Business journal mention that Boeing has now bought a Cray X1. Hope that one was an easier install.
CPU hangs
An EL customer called up with the problem “The process accounting did not run last night. Why?” I logged on to the system using the remote support modem and started looking around. The system was behaving normally, but the cron.log showed a large number of cron jobs bounced because the cron queue had exceeded its maximum of 25 outstanding jobs. This would be why the accounting had not run. So what was filling the cron queue? It was clear from the ps command output that olcrit, the on-line CPU health check process, was the problem. Run hourly from cron to check the system, the diagnostics would normally be in and gone in a second or two. Here were 25 sets of diagnostics waiting to run. These CPU diagnostics are attached, one to each CPU, using the ded(icate) command. Each diagnostic program would start, wait until its nominated CPU exchanged to the kernel, then jump in and run some CPU functional checks for a second or so, then exit. Here we had a whole bunch of diagnostics waiting to get into CPU-2, which was just not exchanging into kernel space from user programs. It was possible to monitor the contents of the program counter in the CPUs by running one of the diagnostic programs on the EL IOS, and from this it was clear that the CPU was hung on a gather/scatter instruction, one of the two states (test/set being the other) that early ELs would occasionally get tripped up on. As a result of this and CPU hangs at a couple of other sites, all the PC-04 chips on all the EL CPU boards were changed.
The J90 also had a problem with occasional CPU hangs but this was solved by software. Once the problem instruction sequence had been identified, a program was written that would scan program binary files and change the problem instruction sequence by adding a no-operation command at the critical point. This scanner was later integrated into the loader sequence of the compiler and the problem and solution disappeared from user view.
A good start
During my first month at Cray I was asked to work at the weekend to help rearrange some of the equipment on the machine room floor of the UK data centre. At that time the data centre was home to two active XMPs and a YMP-2d so the intention was to avoid any interruption to service. Part of my task was to pull some disk cables from under the floor so that the drives could be moved.
Sometime after I extracted the last disconnected cable the power-down siren screamed and the centre went deathly quiet. The cooling plant room indicator board glowed red and a cheerful security guard came in to report that there was “a big puddle” in the underground car park below the centre. It turned out that the accidental 1/8 turn of the drain cock on the pipe nested between two larger air handling tubes was enough to start emptying the vital primary water cooling circuit. All the machines had tripped out instantly on the cooling circuit failure and stayed down for 5 hours while the roof tank slowly filled from the top-up hose pipe. Good thing I was employed as a software analyst.
Tales from the crypto
** This section not available for public distribution. **
A change in the weather
A C90 is a solid, dependable and, at 16 GFlop, a reasonably powerful machine well suited to the day-to-day grind of production weather forecasting. By contrast, in 1997 anyway, the T3e was a tricky beast. Because of the relative newness and particular nature of its hardware, OS, IOS and programming environment the T3e was most widely used in research and academic environments where ultimate performance requirements are more important than total availability. It was a bold step then for one site to replace their C90 with a 686 CPU T3e. Prior to the C90 the site had had two YMP 8 CPU systems (one green, one blue) for production forecasting and long term climate research. It is not difficult to see that the site had dual requirements for extreme stability and extreme performance.
Being a government organisation the contract was king, and that document specified a complex multi-staged acceptance process with reliability, performance and feature milestones.
To be honest the new system was having a hard time meeting any of the required metrics, especially uptime. This was not helped by the fact that the site would pound relentlessly on any newly released OS feature and try to squeeze every last drop of performance from the system, whilst being reluctant to change programming and system methods that were quite plainly inefficient. By the autumn of 1997 we were getting our arses kicked. Even though the machine was running in parallel with the C90, it was clear that the new box was struggling. It was running upwards of 40 to 50 million application CPU seconds a day but was too slow on the complex shell scripts and kept getting behind on the file transfer timetable.
One November Friday afternoon, at the end of what had not been a particularly happy week, all hell broke loose: both the C90 and T3e went down at the same time. Smoke alarms had gone off in the plant room and the site emergency shutdown procedure had been invoked. The engineers soon reported that the power supply conditioning motor-generator set had run a bearing dry and this would put the C90 out of service until a replacement was installed. A C90 MG set weighs about 6 tonnes and is about the size of a railway freight container box car so replacement is non-trivial. There was a spare in the country but it would take all weekend to install.
The T3e was up to bat. After a moderate scaling back of the workload and some rapid switching on the front end machines to discard the non-existent C90 output and use the T3e forecasts the system was in production. By 19:00 that evening the first T3e forecasts went out to the TV stations, and other customers; the production cycle was recovered. The beast must have known this was its big chance to prove itself. It didn’t put a foot wrong all weekend.
The feeling amongst the Cray staff on the following Tuesday when the service cut back over to the C90 was one of both relief and pride. The box had shown that it could run to schedule. It may not be completely ready now but we could see that the T3e would one day replace the C90.
Many improvements were made to system and application software over the following months but possibly the most intriguing was the express message queues feature. Each processor has a small micro-kernel that can service some system calls directly, but any system call involving IO or a centrally managed service has to be passed to the operating system PE (CPU) that runs the required server. These OSPEs have to receive and service calls from every other CPU in the system and can get swamped in some circumstances. The idea of express message queues was to carry the “nice” value of the calling process into the kernel service area. Requests coming from processes running with low “nice” values would be put on separate service queues from the other messages and would be serviced first. Unix nice values are normally used to arrange priority application access to CPUs in SMP systems, but on an MPP every program already has multiple CPUs, so by this method the administrator could designate processes that had priority access to shared operating system services and IO paths.
It was not until the following spring that the C90 was finally shipped out and the T3e took over the production forecasts. It was not very smooth at first but just before Easter 1998 another crucial change occurred.
Once the system began to stabilise it became apparent that the longer the system was up the slower it went. The slowdown was most noticeable with shell scripts: these did not run fast at the best of times, but after a few days' uptime the same scripts would run 30..80% slower. At the request of the site analyst some extra time accumulators and counters were added into the kernel to track system call times more accurately. It was immediately noticed that the rate of slowdown was worse during prime shift, when all the users ran compilations and interactive jobs, but hardly changed when just the batch load ran overnight and at weekends. Once provided with the performance graphs and description, one of the kernel developers found the bug, a missing freecred() at the end of the restart() system call, that had been causing us so much pain. The new fixed archive (kernel) went live just before a bank holiday weekend so it was not for a few days that the fix could be confirmed. It was like a veil had lifted. The system genuinely felt snappier at the command line and all the scripts just seemed to fly along. A whole bunch of those spooky “ghost in the machine” problems just went away at the same time and we all knew that life would be sunnier now.
Figure: Average open system call time (vertical axis) against time since boot (horizontal axis). Before the credential fix the call time climbs steadily with uptime; after the credential fix it stays flat.
The system, now 868 CPUs, delivers over 60,000,000 CPU seconds a day and continues to kick serious meteorological butt on a daily basis.
A recent (May 99) press release announced the purchase of a second T3E by this site.
It would not start and EL pictures
Follow this link to Mike Isler's page describing a problem getting an EL started. This page has lots of detailed photos of the EL system involved. It is thought that the problem was to do with the disk parameter file passed to Unicos at boot time. The machine involved was sold before the booting problem was resolved. A post to comp.unix.cray can sometimes help in situations like this.
Cray-2 Cooling Towers
The cooling tower waterfall device is one of the most distinctive features of the Cray-2. The final form shown above, and as seen on the sales brochure photo, was not the initial form. Originally the shape was to be 6 plexiglass towers.
Mike reports :
Regarding the Cray-2 cooling tower change, I thought they went from
the cylinders to the square waterfall for several reasons.
First was the weight of the nert caused outward pressure on the
plexi-glass cylinders. They were changing shape and they crazed and
became opaque from fatigue stresses, but there was never any danger
of breaking. The water fall with 1 inch thick plexi-glass, like we use
in bank windows, suffered no deflection from the weight.
Also marketing and customers wanted something more dramatic, and
they wanted to have it embossed with the Cray logo prominently.
So they came up with the waterfall which I don't think became
back-lit until the T90 HEU.
This is what I remember, but you should also check with Jeff.
He worked on the first one at Livermore Labs, and also brought it up
through the prototype stage at the old Riverside Building.
Jeff reports :
Mike is correct about why the decision was made to go
from tube to tank reservoir. The tubes bulged ( visibly )
about a third of the way up from the bottom. One could see that
the crazing was more pronounced in that area from the flexing
of the acrylic. The crazing looked like what you would find on
very old porcelain. The waterfall version used, I believe,
1 inch Lexan mounted on top of an aluminium tank with a Lexan
window to see the level of the Fluorinert in said tank. The
gas exchange apparatus and the water fall was not illuminated
on the Cray-2.
However, the customer used a slide projector
at one time to see how an image might look projected on the
flowing Fluorinert. The image was of salmon jumping in a river.
When one was in the right spot, the salmon looked great. However,
outside of this very small viewing window the image was a big blur.
NetworkCS Powers Down (last) CRAY-2
A press release lifted from: http://www.networkcs.com/ncsi/news/cray2.html
On Thursday, February 11 1999, Network Computing Services, Inc. (NetworkCS) powered off its CRAY-2 system for the last time. This particular system is actually NetworkCS' third CRAY-2. The first such system, a single-processor, 16 MWord prototype machine (Serial Number: Q2) was installed in 3Q85. SNQ2 eventually became the module check-out machine for Seymour Cray's CRAY-3 project in Colorado Springs, CO. The second system, a four-processor, 256 MWord system (Serial Number: 2003) was installed at the end of 1985. In 1989 this system was returned to Cray Research and was eventually installed at the Massachusetts Institute of Technology (MIT).
The present CRAY-2 system is a four-processor vector supercomputer (Serial Number: 2021), with a 4.1 nanosecond clock period, a 2 GByte/s I/O backplane, and is capable of 1.9 gigaflops at peak performance. It has 512 MWords of Common Memory accessible by all four processors, and 16 KWords of high-speed Local Memory dedicated to each processor. This large memory version of the CRAY-2 was the first of its kind and one of only three that were built.
The CRAY-2 was installed for service in December 1988. Its 10+ year life cycle makes it NetworkCS' longest-running production system. It also proved to have been one of the most reliable high-performance computing systems ever.
This passing of an era brings mixed emotions. John Sell, President of NetworkCS remarked, "The CRAY-2 has been the most interesting and fun machine we have owned. We can emphatically say, perhaps with some sadness, 'they don't build them like that anymore'". This system was the last operational CRAY-2 in existence.
Cray-3 memories by Steve Gombosi
From a comp.unix.cray posting
Graywolf (“S5”) was installed at NCAR. Like all NCAR supercomputers, until fairly recently, it was named after a Colorado locale.
This was the only Cray-3 shipment. Installed in May 1993, the machine was a 4-processor, 128 Megaword system.
Two problems in the Cray-3 system were uncovered as a result of running NCAR’s production climate codes (particularly MM5): a problem with the “D” module causing intermittent problems with parallel codes, and an error in the implementation of the square root approximation algorithm which caused incorrect results for certain data patterns (kinda like the Pentium divide bug 😉 ). These were rectified and replacement CPU modules were installed, although I can’t remember the date.
The machine ran NCAR production until CCC folded in March, 1995. Since NCAR never paid for it, at some point we reduced the CPU count to 2 and let the machine run essentially unattended. I’m not too sure when that happened, although it marked the end of my regular commuting between Colorado Springs and Boulder.
There were a total of 7 Cray-3 “tanks” constructed. S1-S4 were single “octant” tanks (the smallest that could be constructed) which accommodated up to a 2 processor/128MW configuration. S5 and S6 were two-octant tanks. S7 was a four-octant tank which we used as a software development and benchmarking platform. S6 was chiefly used for system testing.
S1-S3 were diverted to Cray-4 testing once the Cray-4 project built up steam. S4 was diverted to the quite possibly suicidal Cray-3/SSS project after S7 became available (S4 was previously our software development machine).
For those of you who have Cray-3 posters lying around (by the way, I took all the photos on that poster as well as the Cray-3 and Cray-4 brochures and all the annual reports except the first two):
1) The big photo is of S5
2) Seymour is leaning on S5 (and you have no idea how hard it was to get him to hold still that long while wearing a suit…or to talk him into that particular pose)
3) The two “cooling system” photos are S6
4) The hand holding the module is mine 😉
Cray-3 modules were 4x4x0.25 inches in size. Each module consisted of a multi-layer “sandwich” of PC boards (69 electrical layers), with 2 layers of 16 1×1 inch stacks. The stacks were the circuit boards containing the actual circuits (GaAs for logic, SRAM for memory modules). There were 16 bare GaAs chips mounted to each side of a logic stack. I think there were 12 bare SRAM chips on each side of a memory stack (the logic chips were square, the memory chips were rectangular).
Regular Crashes
The following is an approximate description of an event that took place in the late ’70s:
There was the time that an early Cray-1 serial number was running pre-ship reliability tests and was crashing consistently at roughly the same time in the pre-dawn morning hours. After much head scratching, somebody realized that the (newly hired) third shift custodial engineer was coming through on his regular pass through the checkout bays at about the time the failures happened. He was questioned when he came in the next night to find out if he might have been moving or operating any of the system machinery in the bay during his rounds. He denied doing anything of the sort. Later that night however he was noticed plugging his Hoover into one of the 60Hz utility outlets conveniently located on the base of a Cray-1 power supply ‘bench’ in an adjacent checkout bay. The outlets were handy for plugging in test equipment during checkout activities but an ECO (engineering change order) was issued soon afterward removing them.
Lost in the post
We are all used to things going missing in the post but misplacing a supercomputer takes some doing as this source reports.
" A Cray YMP-EL was lost in shipment at Chicago O'Hare Airport's Air Cargo facilities. It took over 5 weeks for the system to resurface. Seems that the shipping box was misplaced and it took that long for someone to finally ask, "Say, what's in that box over there?" ... Needless to say, we had a little fun sending a note to a friend that a request for a UNICOS System Admin class had been requested as an on-site presentation in Pyongyang."
The wrong kind of help turns a crisis into a disaster
Having a fire in your $30 million Cray C90 computer is bad enough, but to see that crisis turned into an irretrievable catastrophe takes just the wrong person with the wrong fire extinguisher, as this message relates.
551 NOUS40 KWBC 302235 FOS/NOAAPORT NOTICE NO. 1543 SEPTEMBER 30... 1999 ATTENTION FAMILY OF SERVICES SUBSCRIBERS NOAAPORT USERS NOTE THIS MESSAGE WAS SENT AS A SPECIAL NCEP DISCUSSION ENTRAL OPERATIONS/NCEP/NWS/ WASHINGTON DC ALL FIELD PERSONNEL...BELOW IS THE LATEST INFORMATION AVAILABLE ON THE OUTAGE OF THE CRAY C-90 SUPERCOMPUTER AT SUITLAND... MD. THIS INFORMATION WAS SENT TO YOUR REGIONAL DIRECTORS LAST EVENING... AND THEY WERE BRIEFED THIS MORNING. THE EXACT SCHEDULE FOR THE BACKUP... MODEL SUITE IS STILL BEING EVALUATED BY NCEP AND OSO. AS OF 00Z THE...MODEL SUITE ACCOMPANYING THIS MESSAGE WILL GO INTO EFFECT. ADDITIONAL... OPTIONS REGARDING MODEL RUNS AND BACKUP POSSIBILITIES FROM OTHER... ORGANIZATIONS CONTINUE TO BE INVESTIGATED. I WILL KEEP YOU POSTED AS THE MODEL SUITE IS MODIFIED. WE ALSO PLAN TO POST UPDATED INFORMATION ON THE INTERNET BOTH ON THE NCEP AND OSO HOME PAGES. THE NCEP HOME PAGE ADDRESS IS HTTP //WWW.NCEP.NOAA.GOV/ THE OSO HOME PAGE IS HTTP //WWW.NWS.NOAA.GOV/OSO/NOTICES/NOTICES.SHTML INTERNET ADDRESSES ALL LOWER CASE. WE APPRECIATE YOUR PATIENCE DURING THIS CRITICAL PERIOD. LOUIS W. UCCELLINI/DIRECTOR/NCEP FACT SHEET ON NATIONAL WEATHER SERVICES CRAY C90 SUPERCOMPUTER FIRE AT SUITLAND CAMPUS ON MONDAY 9/27/99 AT 4 00 P.M. A FIRE OCCURRED INSIDE THE CRAY C90 SUPERCOMPUTER IN FEDERAL OFFICE BUILDING 4 IN SUITLAND...MD. THE CRAY C90 IS THE CENTRAL NOAA NATIONAL WEATHER SERVICE /NWS/ COMPUTER THAT GENERATES NUMERICAL WEATHER FORECAST MODELS FOR THE NATIONAL CENTERS FOR ENVIRONMENTAL PREDICTION /NCEP/ IN CAMP SPRINGS MD. THE PRINCE GEORGES COUNTY FIRE DEPARTMENT RESPONDED AND EXTINGUISHED THE FIRE BY USING DRY CHEMICALS. TWO OF THE POWER SUPPLY UNITS INSIDE OF THE CRAY C90 WERE DAMAGED. THE DRY CHEMICALS USED BY THE FIRE DEPARTMENT CONTAMINATED OTHER COMPONENTS IN THE COMPUTER AS WELL. 
AFTER AN ASSESSMENT OF THE CRAY C90 COMPUTER...SILICON GRAPHICS INCORPORATED...THE COMPUTER CONTRACTOR...HAS DETERMINED THAT THE COMPUTER HAS BEEN SIGNIFICANTLY DAMAGED. SILICON GRAPHICS DOES NOT BELIEVE THE COMPUTER CAN BE REPAIRED. THE CRAY C90 IS INOPERATIVE. NCEP CONTINUES TO INVESTIGATE OPTIONS FOR RESTORING CRAY C90 SERVICE. IN COLLABORATION WITH NOAA S HIGH PERFORMANCE COMPUTING CENTER /HPCC/... NCEP IS GETTING A SECOND OPINION THROUGH AN EXPERT AT NIST ON THE EXTENT OF DAMAGE CAUSED BY THE FIRE. ALL CRITICAL OPERATIONS AT NCEP CONTINUE TO BE SUPPORTED. THE NWS FORECASTERS CONTINUE TO PERFORM THEIR DUTIES UTILIZING NUMERICAL MODELS RUN ON OTHER NWS COMPUTERS AND ACCESSED FROM OTHER NATIONAL CENTERS. NCEP HAS IMPLEMENTED PRE-ARRANGED BACKUP SUPPORT PROCEDURES USING AIR FORCE... NAVY... AND FSL FORECAST PRODUCTS. SOME MODELS WHICH USUALLY RUN FOUR TIMES DAILY ARE NOW RUNNING TWICE DAILY AND ONE MODEL THAT IS USUALLY RUN HOURLY IS BEING RUN EVERY THREE HOURS... WITH HOURLY RUNS TO BE RESUMED BY FRIDAY OCTOBER 1ST. WE ARE ABLE TO MAKE DO WITH THESE RUNS IN ADDITION TO USING THE OTHER MODEL INFORMATION AVAILABLE FROM OTHER NATIONAL CENTERS. THE NWS HAS TAKEN ALL STEPS NECESSARY TO MINIMIZE POTENTIAL IMPACTS TO NWS FORECASTERS AND THOSE OUTSIDE OF THE NWS WHO RELY ON OUR PRODUCTS. END NNNN
This story is also related on The Register.
This site subsequently became an IBM SP/2 shop.
The world according to Clarkson .. comments about the UK Met T3E
One section of the humorous book “The world according to Clarkson” by popular broadcaster Jeremy Clarkson relates, under the title “Red sky at night, Michael Fish’s Satellite is on Fire”, his discussion with a contact at the UK Meteorological Office. Written in July 2003, the thrust of the discussion is how accurate weather forecasts are these days. Obtaining an honourable mention is the Cray supercomputer that is capable of “eleventy billion calculations a second.” The whole book is a really fun read and highly recommended. Read more about the technical side of the mentioned machine here.
“Do more, Fork less”
This was the title of a talk given to a community of T3e developers as part of a system performance improvement review. The T3e was a great machine but one of its weaknesses was running shell scripts real slow. In traditional shared memory computers the frequently used commands such as ps, awk and sh are held in memory and ‘shared’ between the processes that need them. On a T3e the shell commands had to be distributed around the command nodes when needed. Starting a new process also involved backward and forward communication from the command node to the system node that owned the process table. Scripts that include a lot of sub-shells are still expensive on most Unix systems, but the distributed internal architecture of the T3e made them very resource hungry and slow. Some modern shells have mitigated this generic issue by using shell builtins. Unicos/Mk at least had the advantage of using ksh as the default shell and the site had the common sense to ban the abomination that is csh.
A detailed study of the production workload showed the controlling scripts were actually taking longer than the heavy lifting computations. The scripts in use had been developed over the years to run on many different environments, whereas the program codes had been fine tuned to the exact architecture in use and fully hand optimized. In some cases a 20 minute forecast run used only 5..6 minutes of real 32-way CPU work, the rest of the time being spent in the data file transfers and controlling shell scripts. When asked “How long does it take a Cray to add two numbers?” the answer would be either “almost no time at all” or “up to 50ms”: it depends on how you do it. Using vector arithmetic you can get an answer per clock cycle; that averages to a very low number. Using a shell construct such as c=`expr $a + $b` the sum takes ages.
A detailed examination of the command accounting data soon showed the culprits. The main delays were scripts that used shell scripted arithmetic or the basename and dirname commands, made long command pipelines, or called lots of subscripts. Try these shell fragments on your local Unix system and see the time output; even on a modern Linux or OSX a fork/exec sequence can take 0.01s of real time.
Problem shell construct | Traditional approach | Mimic using shell builtins |
Shell arithmetic | a=2; b=4; time c=`expr $a + $b` | a=2; b=4; time let c=$a+$b |
basename | pn=/bin/fred/bill ; time fn=$(basename $pn) ; echo $fn | pn=/bin/fred/bill ; time fn=${pn##*/} ; echo $fn |
dirname | pn=/bin/fred/bill ; time fn=$(dirname $pn) ; echo $fn | pn=/bin/fred/bill ; time fn=${pn%/?*} ; echo $fn |
Long pipelines | cmda | cmdb | cmdd | cmde | Use awk, sed or Perl to collapse the pipeline into fewer commands. |
Long complex shell scripts | | Convert to Perl and run the whole control script in fewer steps. Perl is just as portable as ksh if used carefully. |
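The single-shot timings in the table are small, so the difference is easier to see when the same operation is repeated. The following is a minimal sketch, not taken from the talk: the function names `count_with_expr` and `count_with_builtin` and the iteration count of 200 are illustrative assumptions, and POSIX `$(( ))` arithmetic is used in place of the ksh `let` shown in the table.

```shell
# Illustrative only: function names and the iteration count are assumptions.

# Count up to $1 using expr: each iteration forks a subshell and execs expr.
count_with_expr() {
    i=0
    while [ "$i" -lt "$1" ]; do
        i=`expr $i + 1`
    done
    echo "$i"
}

# Same loop using shell arithmetic expansion: no new process per iteration.
count_with_builtin() {
    i=0
    while [ "$i" -lt "$1" ]; do
        i=$((i + 1))
    done
    echo "$i"
}

r1=$(count_with_expr 200)
r2=$(count_with_builtin 200)
echo "expr result: $r1, builtin result: $r2"
```

In ksh or bash, prefixing each of the two calls with the `time` keyword should show the gap between the forking and non-forking versions; the actual figures will depend on the system.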
The talk was reasonably well received. Follow up work included making an inherited shell procedure (see ksh builtin typeset) to transparently replace dirname & basename commands in the production scripts. Once implemented some of the production shell scripts had reduced run times from 20 down to 8 minutes including the heavy lifting. This also benefited the whole machine by putting less pressure on the processors handling the process table, the root filesystem nodes and causing less inter-processor communications.
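The production functions themselves are not reproduced here, so the following is only a sketch of what shell-function stand-ins for basename and dirname might look like. It assumes plain paths: trailing slashes and the optional suffix argument of basename are not handled. The step that makes the functions inherited by called scripts (the text points to the ksh typeset builtin) is not shown.

```shell
# Hypothetical builtin-only stand-ins; they shadow the external commands
# of the same name for the rest of the script. Plain paths only.

basename() {
    # Strip everything up to and including the last slash.
    printf '%s\n' "${1##*/}"
}

dirname() {
    case $1 in
        /*/*|[!/]*/*) printf '%s\n' "${1%/*}" ;;  # drop the last component
        /*)           printf '/\n' ;;             # file directly under root
        *)            printf '.\n' ;;             # no slash: current directory
    esac
}

pn=/bin/fred/bill
fn=$(basename $pn)
dn=$(dirname $pn)
echo "$dn $fn"
```

Existing lines such as fn=$(basename $pn) keep working unchanged, which is what makes the replacement transparent. The functions avoid the exec of an external binary; whether the command substitution itself still forks a subshell depends on the shell in use.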
Another shell optimisation was to parallelise the ftp file transfers for the forecast runs. Originally the 10 files for a forecast were transferred sequentially, the script continuing only when each and all had arrived. The optimisation was to kick off all the transfers at once, using & to background the commands and a shell wait to continue only when all the transfers had completed. While each of the transfers took longer, the overall time was much less.
Figure: Improving overall file transfer times. Two timelines, one for sequential file transfers and one for parallel file transfers; the parallel timeline reaches its end point T sooner.
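The background-and-wait pattern described for the file transfers can be sketched as follows. This is an illustration under stated assumptions, not the site's script: `fetch_one` is a hypothetical stand-in that sleeps for a second and creates an empty file where a real script would run its ftp transfer, and the file names are made up.

```shell
# fetch_one is a hypothetical stand-in for a single ftp transfer.
fetch_one() {
    sleep 1
    : > "$1"
}

workdir=$(mktemp -d)

# Start all ten transfers in the background at once.
for n in 1 2 3 4 5 6 7 8 9 10; do
    fetch_one "$workdir/file$n" &
done

# Continue only when every background transfer has finished.
wait

# Count the arrived files using builtins only.
arrived=0
for f in "$workdir"/file*; do
    [ -e "$f" ] && arrived=$((arrived + 1))
done
echo "files arrived: $arrived"

rm -rf "$workdir"
```

A bare wait does not report whether any individual transfer failed. A production script would need to check each one, for example by recording each background process ID from $! and waiting on them one at a time.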
The move to Perl took a bit longer, but Perl eventually became the site standard for scripting.
Trademark Disclaimer and copyright notice
Thank you for taking time to read these important notes.
Cray, Cray Research, CRI, XMP, YMP, C90, T90, J90, T3d, T3e, Unicos, plus other words and logos used in this document are trademarks which belong to Cray Inc. and others. There is nothing generic about a Cray supercomputer.
Some of the ideas described in this document are subject to patent law or are registered inventions. Description here does not place these ideas and techniques in the public domain.
I wrote this document with input from a number of sources, but I get both the credit and the blame for its content. I am happy to read your polite correction notes and may even integrate them with the text so please indicate if you require an acknowledgement.
Personal use of this document is free but if you wish to redistribute this document, in whole or in part, for commercial gain, you must obtain permission from and acknowledge the author.
June 2021 V1.0.6 Lightly dusted on move to WordPress
Copyright (c) 1999 by "Fred Gannett", all rights reserved. This FAQ may be posted to any appropriate USENET newsgroup, on-line service, web site, or BBS as long as it is posted in its entirety and includes this copyright statement. This FAQ may be distributed as class material on diskette or CD-ROM as long as there is no charge (except to cover materials). This FAQ may not be distributed for financial gain except to the author. This FAQ may not be included in commercial collections or compilations without express permission from the author.