"Although I usually occupy my time with disputes involving dogs, shopping cart collisions and lawnmower decibel levels I am willing to review a case of semantic technicalities. Then again, Siamak Tavallaei would probably offer a more technical, sedated, and accurate reply (if he/she is roaming around these parts … as in TNP’s August 9, ’22, “CXL Borgs IBM’s OpenCAPI”).I forwarded a link to this thread to Judge Judy and this was her response: A kind of Kung-Fu psychology for HPC gastronomy, I think, whereby the contents of the flying plates are not as important as their layouts, to maintain balance, and prevent indigestion in participating clients (opposite to fast-food cafeteria HPC). The trick (I think) will be to develop software that takes advantage of this restricted zonal coherency, to maintain the “psychologic” (a term borrowed from Gilbert Strang) illusion of a macroscopically NUMA system at the many-node level. As far as I understand it, coherency in both cases is snoop-message oriented but acts over limited ranges of memory addresses (blocks, regions) to avoid traffic jams caused by snoop congestion. Somehow though, I doubt that published CXL standards rely on the future results of such active research … so, to Paul’s query (from CXL whitepapers), CXL 1.x and 2.0 (PCIe 5.0) consider a single host processor that uses snoop messages to manage coherency of data cached (if any) in attached device(s) (CXL.cache protocol).ĬXL 3.0 (PCIe 6.0) is more interesting as it introduces: 1) enhanced coherency with “active” attached devices (GPUs, FPGAs), and 2) memory sharing among multiple hosts. Nice plugarooni, HuMo, of a fellow Frenchy, one of that new generation of folks with most sizeable cerebral cortices (much better than GPT-7), born on the very same day that Yann LeCun was at a Paris cafe, with a baguette and accordion, thinking of the textual organization of chapters in his upcoming thesis on horror backpropagation. 
Artificial intelligence has taken the datacenter by storm, and it is forcing companies to rethink the balance between compute, storage, and networking. Or more precisely, it has thrown the balance of these three, as the datacenter has come to know it, completely out of whack. It is as if, all of a sudden, all demand curves have gone hyper-exponential.

We wanted to get a sense of how AI is driving network architectures, and had a chat about this with Noam Mizrahi, corporate chief technology officer at chip maker Marvell. Mizrahi got his start as a verification engineer at Marvell and, excepting a one year stint at Intel in 2013 working on product definition and strategy for future CPUs, has spent his entire career as a chip designer at Marvell, starting with CPU interfaces on various PowerPC and MIPS controllers, eventually becoming an architect for the controller line and then the chief architect for its ArmadaXP Arm-based system on chip designs. Mizrahi was named a Technology Fellow in 2017 and a Senior Fellow and CTO for the entire company in 2020, literally as the coronavirus pandemic was shutting the world down.

To give a sense of the scale of what we are talking about, the GPT 4 generative AI platform was trained by Microsoft and OpenAI on a cluster of 10,000 Nvidia “Ampere” A100 GPUs and 2,500 CPUs, and the word on the street is that GPT 5 will be trained on a cluster of 25,000 “Hopper” H100 GPUs, with probably 3,125 CPUs as their host processors and with the GPUs offering on the order of 3X more compute at FP16 precision and 6X more if you cut the resolution of the data down to FP8 precision. That is a factor of 15X effective performance increase between GPT 4 and GPT 5.

While Nvidia uses high speed NVLink ports on the GPUs and NVSwitch memory switch chips to tightly couple eight Ampere or Hopper GPUs together on HGX system boards, and has even created a leaf/spine NVSwitch network that can cross connect up to 256 GPUs into a single system image, scaling up that GPU memory interconnect by two orders of magnitude is not yet practical. And, we assume, the scale needs are going to be even larger as the GPT parameters and token counts all keep growing to better train the large language model. The physical size of current and future GPU clusters and their low latency demands means figuring out how to do optical interconnects.

So, will Marvell try to create something like the “Apollo” optical switches that are at the heart of the TPUv4 clusters made by Google? Does it have other means to do something not quite so dramatic and still yield the kinds of results that will be needed for AI training? And how does the need for disaggregated and composable infrastructure fit into this as a possible side benefit of a shift to optical switching and interconnects? And where does the CXL protocol fit into all of this?
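For readers who want to check the 15X figure quoted for the GPT 4 to GPT 5 jump, here is a minimal back-of-envelope sketch in Python. It uses only the numbers given in the article (GPU counts, CPU counts, the rough 3X and 6X per-GPU speedups, the 256-GPU NVSwitch domain). The function and variable names are ours, and the simple "GPU count times per-GPU speedup" model is an assumption that ignores interconnect and utilization effects.

```python
# Back-of-envelope check of the cluster figures quoted in the article.
# All inputs come from the article; the multiplicative model is an assumption.

GPT4_GPUS = 10_000          # Nvidia A100 GPUs reportedly used for GPT 4
GPT4_CPUS = 2_500
GPT5_GPUS = 25_000          # Nvidia H100 GPUs rumored for GPT 5
GPT5_CPUS = 3_125
H100_VS_A100_FP16 = 3.0     # "on the order of 3X" per-GPU compute at FP16
H100_VS_A100_FP8 = 6.0      # 6X if precision is cut to FP8
NVSWITCH_MAX_GPUS = 256     # largest single system image mentioned


def effective_scaleup(old_gpus, new_gpus, per_gpu_speedup):
    """Cluster-level speedup = growth in GPU count * per-GPU speedup."""
    return (new_gpus / old_gpus) * per_gpu_speedup


def gpus_per_cpu(gpus, cpus):
    """Ratio of GPUs to host CPUs in a cluster."""
    return gpus / cpus


fp16_scaleup = effective_scaleup(GPT4_GPUS, GPT5_GPUS, H100_VS_A100_FP16)
fp8_scaleup = effective_scaleup(GPT4_GPUS, GPT5_GPUS, H100_VS_A100_FP8)

# "Two orders of magnitude" beyond the 256-GPU NVSwitch domain.
scaled_domain = NVSWITCH_MAX_GPUS * 100

print(f"FP16 scale-up: {fp16_scaleup}x")
print(f"FP8 scale-up:  {fp8_scaleup}x")
print(f"GPUs per CPU (GPT 4): {gpus_per_cpu(GPT4_GPUS, GPT4_CPUS)}")
print(f"GPUs per CPU (GPT 5): {gpus_per_cpu(GPT5_GPUS, GPT5_CPUS)}")
print(f"256 GPUs x 100 = {scaled_domain} GPUs")
```

The FP8 case reproduces the quoted factor (2.5X more GPUs times 6X per GPU), and multiplying the 256-GPU NVSwitch domain by 100 lands at roughly the size of the rumored 25,000-GPU cluster, which is why a two-orders-of-magnitude scale-up of the memory interconnect is the relevant comparison.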