Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!seismo!gatech!hubcap!lls
From: lls@mimsy.UUCP (Lauren L. Smith)
Newsgroups: comp.arch,comp.hypercube
Subject: Hypercube Survey response
Message-ID: <271@hubcap.UUCP>
Date: Wed, 8-Jul-87 11:29:13 EDT
Article-I.D.: hubcap.271
Posted: Wed Jul 8 11:29:13 1987
Date-Received: Sat, 11-Jul-87 06:28:17 EDT
Sender: fpst@hubcap.UUCP
Lines: 277
Keywords: Commercial hypercubes, parallel machines
Approved: hypercube@hubcap.clemson.edu
Xref: mnetor comp.arch:1582 comp.hypercube:28

These are the responses I received to my request for information on the
hypercube machines available commercially.

------------------------------------------------------------------

From: wunder@hpcea.HP.COM (Walter Underwood)
Date: 20 Jun 87 00:38:22 GMT
Organization: HP Corporate Engineering - Palo Alto, CA

When looking at hypercube-type systems, don't forget Meiko in Bristol,
England.  Their machine is based on the Transputer, and has been shipping
for a little while.

---------------------------------------------------------------------

Date: Wed, 17 Jun 87 15:36:29 PDT
From: seismo!gatech!tektronix!ogcvax!pase (Douglas M. Pase)
Organization: Oregon Graduate Center, Beaverton, OR

Of the four you mentioned, here are my experiences:

1) Intel's iPSC

I am most familiar with this system.  I have written several programs of
various sizes for this machine, and I am currently working on a moderately
large language implementation for it.  It uses the Intel 80286/287 processor
pair and connects 16/32/64/128 machines together using Ethernet chips.  Each
node has 512K bytes of memory, but that can be expanded to 4 1/2 M bytes (I
think that's right) by removing alternate processor boards (cutting the
number of nodes per chassis in half).  Each processor does about 30K flops.
The message size is limited to 16K bytes or less.  Transfer time varies from
about .0025 sec for an empty message to about .035 sec for a 16K-byte
message, both traveling 5 hops, and from about .0001 sec to about .030 sec
for the same messages going only one hop.

Software development is encumbered somewhat by the different memory models
the compiler(s) and architecture must support (small, medium, and several
"large" memory models), but for me that has been more of an annoyance than a
restriction.  My only real problem has been figuring out which set of flags
to use when I compile my programs.  They're all documented; I've just been
lazy about looking them up.

Their communication utilities are reasonable, but the nomenclature they use
is a bit strange and misleading -- a channel is not really a communication
channel so much as it is a "FILE" descriptor, and a process ID is not really
an ID of a process so much as an identifier of the channel descriptor.  This
took time to get used to, but once I was used to it, no problem.  Messages
are segregated by what they call a type, a process ID, and a node, which
together identify the intended recipient; the node always specifies the
hardware node ID, which is both expected and reasonable.

In summary, I like the machine, but I could suggest some important
improvements in both the hardware and the software.  I've also heard
comments that it's a hard machine to use, too slow, etc.  To those I
respond: the distributed-memory multiprocessor model is, by nature, the most
difficult model to use.  I see nothing in the Intel design which makes it
more difficult than any other machine, and I see some "conveniences" which
do simplify my tasks.
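[As a rough illustration of the (type, process ID, node) addressing just
described, here is a minimal C sketch.  The routine and constant names below
are invented for the sketch -- they are NOT the actual iPSC library calls --
and only show the shape of the interface being discussed.]

    /* Hypothetical declarations, standing in for the real node library. */
    extern int  open_channel(int my_pid);   /* returns a "FILE"-like channel descriptor */
    extern void send_typed(int chan, int type, int dest_node, int dest_pid,
                           char *buf, int len);
    extern int  recv_typed(int chan, int type, char *buf, int maxlen,
                           int *src_node, int *src_pid);

    #define WORK_MSG 10                      /* an application-chosen message type */

    void example(void)
    {
        char buf[1024];
        int  src_node, src_pid;
        int  chan = open_channel(0);

        /* A message is addressed by destination node + process ID and
           tagged with a type; the node is always the hardware node ID.  */
        send_typed(chan, WORK_MSG, 3, 0, buf, sizeof(buf));

        /* Receiving selects on the type; the sender's node and process
           ID come back as outputs.                                      */
        recv_typed(chan, WORK_MSG, buf, sizeof(buf), &src_node, &src_pid);
    }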
Some good hardware speedups are in the works, too.  It's a good platform for
distributed processing research.

2) NCUBE

I don't know much about NCUBE, except that it uses 680x0 technology, which I
personally prefer over the 80x86 line.  I have also heard it has only 128K
bytes of memory per node, which would be inadequate for my purposes -- I'm
having a hard time getting by with 512K.  I've heard favorable rumors about
its price tag, but that's it -- just rumors.

3) FPS T-Series

The T-Series is a hefty machine family.  The price tag is steep even at 2 or
4 processors, but you get some good array-processor technology along with
it.  Unless your inclination is towards weather or hypersonic aerodynamic
simulation, I personally would stay away from this beast.  It is sort of
like having a micro-Cray at each node.  I have heard stories (I worked at
FPS, and as such was entertained by some of the "war stories", most of which
appeared in local newspapers) about hardware and software "anomalies" which
could make the original design quite painful to use.  But in all fairness,
the new management at FPS is going to great lengths to correct the problems.
I also personally know most of the software team assigned to the T-Series,
and I have a great deal of confidence in them.

----------------------------------------------------------------------

Date: Thu, 18 Jun 87 13:00:07 CDT
From: grunwald@m.cs.uiuc.edu (Dirk Grunwald)

Hullo,

From: fosterm@ogc.edu
Subject: iPSC performance measurements

The following report contains performance measurements we have made on the
Intel hypercube (iPSC) for the 2.1 Release and the Beta 3.0 Release.
Ditroff source for the report and sources for the test programs are
available on request.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

          Comparative Performance of Two Intel iPSC
                  Node Operating Systems

                        Mark Foster
              Computer Science and Engineering
                  Oregon Graduate Center
                   Beaverton, OR 97006

                 Revision 1.2 of 86/11/07

A program has been constructed to measure bandwidth and latency of
node-to-node communications in the Intel iPSC.  The purpose of the
measurements was to establish a comparison between the iPSC Release 2.1
Node Operating System ("k286") and the iPSC Beta 3.0 Node Operating System
("nx").  This report compares the performance of the two systems and shows
the improvements realized in the Beta 3.0 release.

1. Test Environment.

The tests were run on a D-5 (32-node) system.  The entire suite of tests
was run under both the Release 2.1 node operating system and the Beta 3.0
node operating system.

2. Test Algorithm.

The test involves measurements of communication between exactly two nodes.
One node is designated as a base node and another as a relay node.  The
base node sends a message of a particular size to the relay node; the relay
node collects the entire message, then sends it back to the base node.  The
base node uses its system clock to measure the elapsed time between the
sending of the message and receipt of the returned message.  This
send-relay-receive loop is repeated 2500 times to create a composite sum of
elapsed time.  Initial synchronization of the base and relay nodes is
ensured to prevent timing measurement of any period during which the relay
is not yet ready to begin the test.

3. Results.

Three main trials were run for each type of communication: (i) adjacent
neighbor, (ii) one-hop neighbor, and (iii) two-hop neighbor.  For each
trial, the message size was varied from 0 bytes to 8K bytes.
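[A rough C sketch of the send-relay-receive loop described in section 2.
The send_to, recv_from, and node_clock_ms routines are placeholders for
this sketch, not the actual iPSC node calls; the iteration count and
maximum message size follow the parameters given in the report.]

    #define ITERATIONS 2500
    #define MAX_MSG    8192

    /* Placeholder declarations for the node message and clock routines. */
    extern void send_to(int node, char *buf, int len);
    extern void recv_from(int node, char *buf, int maxlen);
    extern long node_clock_ms(void);

    /* Base node: time ITERATIONS round trips of a given message size and
       return the composite elapsed time in milliseconds.                 */
    long time_round_trips(int relay_node, int msg_len)
    {
        static char buf[MAX_MSG];
        long start;
        int  i;

        start = node_clock_ms();
        for (i = 0; i < ITERATIONS; i++) {
            send_to(relay_node, buf, msg_len);    /* send the test message    */
            recv_from(relay_node, buf, MAX_MSG);  /* wait for the echoed copy */
        }
        return node_clock_ms() - start;
    }

    /* Relay node: collect each message in full, then send it straight back. */
    void relay_loop(int base_node, int msg_len)
    {
        static char buf[MAX_MSG];
        int i;

        for (i = 0; i < ITERATIONS; i++) {
            recv_from(base_node, buf, MAX_MSG);
            send_to(base_node, buf, msg_len);
        }
    }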
The bandwidth statistics were calculated with the formulas:

    k286bpmsec = total_bytes / k286msec
    nxbpmsec   = total_bytes / nxmsec

where k286msec and nxmsec are the average elapsed times for a given test,
in milliseconds, and where

    total_bytes      = total_messages * aggregate_length
    total_messages   = 2500 * 2
    aggregate_length = (test parameter: User Message Size) + overhead

total_messages reflects the number of messages passed between the two
nodes.  In calculating the statistics, an additional 20 bytes per 1K packet
was added to account for per-packet overhead.

    Nodes   User Message   k286msec   nxmsec       k286           nx      speedup
            Size (bytes)                       (bytes/msec)  (bytes/msec)  ratio
    -----   ------------   --------   -------  ------------  ------------ -------
    0  1          0           20309     11044        4.92          9.05     1.84
    0  1         10           21611      7888        6.94         19.02     2.74
    0  1        500           25414     11634      102.31        223.48     2.18
    0  1       1024           29842     16035      174.92        325.54     1.86
    0  1       4096           83822     64265      249.10        324.90     1.30
    0  1       8192          158768    128011      263.03        326.22     1.24
    0  3          0           29896      9577        3.34         10.44     3.12
    0  3         10           31757     10335        4.72         14.51     3.07
    0  3        500           38430     17211       67.66        151.07     2.23
    0  3       1024           45016     23993      115.96        217.56     1.88
    0  3       4096          265514     82553       78.64        252.93     3.22
    0  3       8192          418403    161532       99.81        258.52     2.59
    0  7          0           40341     12607        2.48          7.93     3.20
    0  7         10           42222     13450        3.55         11.15     3.14
    0  7        500           51816     22809       50.18        113.99     2.27
    0  7       1024           61278     32757       85.19        159.36     1.87
    0  7       4095          367382     96894       56.82        215.44     3.79
    0  7       8192          612427    185931       68.19        224.60     3.29

3.1. Anomalies.

Two anomalies were noted in this examination.  Of particular note are the
bandwidth values for n-hop (n > 0) message sizes greater than 1024 bytes
under the k286 kernel: the data rate actually decreases for message sizes
of 4K and 8K.  This problem appears to have been corrected in the Beta 3.0
kernel.

Another, perhaps less significant, anomaly was detected when sending
zero-size messages to an adjacent node.  We found that the time taken to
transmit empty messages typically increased by 40 percent over one-byte
messages in the Beta 3.0 kernel.  This problem occurs only for
communication between adjacent nodes, and only for messages of length zero
(timings for message lengths greater than zero are consistent with the
characteristic performance curve).

3.2. Summary.

We found that the message-passing performance of the Beta 3.0 system was
improved by a maximum of almost 3.8 times over the 2.1 system.  On average,
the performance increased 2.5 times.  Our maximum measured communication
bandwidth of the Beta 3.0 system, then, is slightly more than 1/3 megabyte
per second.  The effective minimum node-to-node latency is approximately
1.5 milliseconds.  Our measurements were found to be equivalent to
measurements made by Intel Scientific Computers (iSC) for the Beta 3.0
release.  iSC reports that the 3.0 production version, to be released in
November, has 1/2 megabyte per second bandwidth and .893 milliseconds
null-message latency.
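[For reference, the bandwidth formulas above reduce to a few lines of
arithmetic.  The C below simply restates the report's formulas (2500 * 2
messages, 20 bytes of overhead per 1K packet); it is not code from the test
suite.  The values in main() are the adjacent-node 8192-byte row of the
table, and reproduce its 263.03 and 326.22 bytes/msec figures.]

    #include <stdio.h>

    /* bytes/msec = total_bytes / elapsed_msec, where every message carries
       20 bytes of overhead per 1K packet.  A zero-byte message is charged
       one packet of overhead, which matches the 0-byte rows of the table. */
    double bytes_per_msec(long user_size, long elapsed_msec)
    {
        long packets   = (user_size + 1023) / 1024;
        long aggregate = user_size + 20 * (packets > 0 ? packets : 1);
        long total     = 2500L * 2L * aggregate;  /* total_messages * aggregate_length */

        return (double) total / (double) elapsed_msec;
    }

    int main(void)
    {
        printf("k286: %.2f bytes/msec\n", bytes_per_msec(8192L, 158768L));
        printf("nx:   %.2f bytes/msec\n", bytes_per_msec(8192L, 128011L));
        return 0;
    }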