AI Data-Center HW Architecture

Instructor: Gil Bloch, NVIDIA Israel
Teaching Assistant: TBD
Lectures: 13 hours, 3 days
Academic Points: 1
Exam: TBD
Course Fees: $1,700 (see membership options)

For registration click here

May 19 – 21, 2024

Course Content:

Artificial intelligence, and specifically deep neural networks, has become the single most interesting computing application. A growing percentage of the world’s compute power is expected to be dedicated to training and inference of neural networks for many tasks.

Training large neural network models, such as Large Language Models (LLMs), requires specialized systems; standard datacenters cannot train such models efficiently.

This course aims to cover multiple aspects of designing and building high-performance, large-scale datacenters (supercomputers) for current and future neural network training. We will cover accelerated computing and the role of GPUs and specialized CPUs in future AI systems, as well as the importance of high-performance interconnects.

The course consists of a series of lectures. Several lectures are based on published papers, and others cover recent research performed at NVIDIA.

The course is intended for several categories of participants. Novice participants can learn about design tradeoffs and directions, while participants with high-performance networking experience can update their knowledge and bring it up to the state of the art.

Topics:

This course will cover advanced topics in supercomputer system architecture, focusing on the interconnect between compute engines, including interconnect hierarchies, communication algorithms, and in-network computing.

Prerequisites:

Computer Architecture (046267 or 236267) and Networks and Internet (044334)

Schedule:

Day 1 (19/05/2024)

1.1. Introduction to supercomputing systems – 9:30-10:45

Coffee break – 10:45-11:15

1.2. Convergence of HPC and Cloud – 11:15-12:30

Lunch break – 12:30-13:30

1.3. Distributed AI training techniques – 13:30-14:45

Coffee break – 14:45-15:15

1.4. Distributed AI training techniques – 15:15-16:30

Day 2 (20/05/2024)

2.1. Challenges in modern distributed AI training (data reduction) – 9:30-10:45

Coffee break – 10:45-11:15

2.2. Challenges in modern distributed AI training (data reduction/all-to-all) – 11:15-12:30

Lunch break – 12:30-13:30

2.3. In-network computing (data reduction) – 13:30-14:45

Coffee break – 14:45-15:15

2.4. In-network computing (programmability) – 15:15-16:30

Day 3 (21/05/2024)

3.1. System topology considerations (NUMA, PCI, NVLink, Network) – 9:30-10:45

Coffee break – 10:45-11:15

3.2. Routing and congestion control – 11:15-12:30

Lunch break – 12:30-13:30

3.3. Fault tolerance – 13:30-14:45

Coffee break – 14:45-15:15

3.4. AI factories vs. AI in the cloud – 15:15-16:30

Bio:

Gil Bloch is an HPC and AI specialist with broad experience in fast interconnect technologies for clusters, datacenters, and cloud computing. His current responsibilities include co-design and in-network computing for HPC and machine learning. Gil teaches Fast Networks and RDMA Programming at the Hebrew University of Jerusalem (HUJI) and at Ben-Gurion University of the Negev (BGU).

Before working on in-network computing, Gil held multiple engineering and architecture positions, including ASIC design and architecture of network adapters and switches, RDMA offload ASICs, and open-source networking software for high-performance computing. Gil is an author or co-author of multiple patents in the area of computer networks and network adapters. He holds a BSc in Electrical Engineering from the Technion – Israel Institute of Technology.

Please leave your details here
