Deprecated programming methodologies slow down your system, please upgrade!

Romain Jouin kindly asked me to give a talk to his students at Leonard de Vinci University about my opinion, as an IT practicioner, on the evolution of computer architecture.

To avoid paraphrasing experts, I took inspiration from ACM queue and I decided to organize my talk around scientific articles that enlighten me. You may download the related PowerPoint. I hope that the PPT quotes the various authors are precise enough. Should any of them feel offended, he can contact me and I'll amend the PPT accordingly.

The main goal of this opinionated speech was to demonstrate that, today, every programmer must understand parallelism and develop, as much as possible, natively parallelizable code. Sequential programming is over. If this idea is rather familiar - or not - to junior programmers, it's groundbreaking for older ones, who are familiar with a more or less sequential programming paradigm like me. I'm not sure that I convinced anyone. Maybe my points looked rather pedantic and I should have given an hands-on talk, supported by real-world examples. Bet let's move ahead.

My speech revolved around three points:

  • first, facts that show the limits of hidden parallelism, ie parallelism that the programmer is not aware of, that leads to consider that parallelism is provided by the system by design
  • second, work done or in progress to get around these limits
  • and third, new questions that parallelism and distributed systems raise

The diversity of parallelism

One CPU and many programs to execute : concurrency

In the 90s, people talked about multi-tasks operating systems. Unix systems implemented preemptive multitasking while Windows 3.x implemented cooperative multitasking (see this Wikipedia article for instance).

More recently, Smartphone OS implement other forms of multi-tasking. You will find very interesting information about Android multitasking here for instance.

Multi-tasking covers the scenarios where many memory-independent non-mutually-blocking tasks are running, that each task needs only a portion of the CPU, and that the system can switch between tasks.

One program and many CPUs to execute : parallelism

When running computationally-heavy tasks, refining instruction sets to speed up things, increasing processor speeds, delegating tasks to peripheral chips, and distributing the load on many CPU are quite intuitive strategy. These strategies were or are still used :

One program and many data nodes to process : (? no name)

Sometimes, data is to voluminous to be stored on a single compute node. Rather than transferring data to a single processing node (limited local cache and limited data transfer rate also known as von Neumann bottleneck), program is distributed over data nodes. "Big data" architectures and HPC/MPI architecture implement this kind of mechanism.

The limits of hidden parallelism

In february 2019, following their Turing Lecture, John L. Hennessy and David A. Patterson published an article, A New Golden Age for Computer Architecture.

They enumerate the challenges brought to the computer industry by the following major trends:

  • the end of single CPU performance increase (Moore's law)
  • the limits of CPU power consumption and cooling (Dennard scaling) that hinders the continuous of recent chips at full speed, depending on the thermal dissipation architecture
  • the limits of instruction level parallelism and speculative execution (that, apart from being not so efficient, poses security issues)

They suggested a few potential solutions too:

  • Domain specific architecture (sometimes named heterogeneous computing)
  • Programming languages that induce some level of parallelism
  • New hardware development cycles

In a nutshell, parallelism isn't possible if the programmer / designer is not parallelism-educated. Hardware abstractions, low-level optimization that more or less worked when the steady growth of CPU speed masked their relative ineffectiveness, are not relevant anymore. There are huge sources of optimization, but they suppose to rewrite software for parallelism and heterogeneous computing.

Work done or in progress

In my speech, I emphasized three domains where huge progress has been made:

  • Domain specific architectures
  • Theoretical foundations
  • Programming languages and frameworks

Domain specific architectures, versatility vs performance

Outside the HPC domain, graphic cards manufacturers (Nvidia, AMD, Intel, ...) were the foremost manufacturers of DSA with their graphic cards / GPU and the associated software libraries (for instance CUDA, ...).

Then come processing units designed to perform well on Deep Neural Network (DNN), such as Google TPU (Tensorflow Processing Unit). See this article for instance. Many startups are trying to build the next computer best-suited for deep neural networks.

IBM Watson, or even efforts to build Quantum computers, are a step further toward proprietary, closed and cloud-only architectures.

And finally, libraries and tools, like Xilink Sdaccell, have been released to abstract the development of FPGA custom solution.

Many improvements have been made to X86/IA64 architecture and their chipsets too:

  • RAM capacity and speed
  • SSD and SATA buses
  • NVMe
  • PCIe buses
  • 10Gbe network
  • ...

In the short term, portability and standardization are at stake. Finding a way through the myriad of solutions is rather disconcerting... wait and see or make trials with parsimony.

Surprisingly, or not so surprisingly if we consider that network is more mature than machine learning, on the other end of the spectrum, network manufacturers are migrating to software defined and linux-based architectures. In the long term, we can anticipate that DNN will consolidate with:

  • a few high level frameworks like Tensorflow or Pytorch
  • drivers that plug heterogeneous hardware into these frameworks
  • a few low-level standardized interfaces to hardware (GPU, FPGA, xxPU)

Theoretical foundations

Since Leslie Lamport's article about time-keeping in distributed systems, the theoretical work of distributed systems and parallel programming have flourished. Algorithms have been designed to solve the various problems faced by distributed software designer. The reference book "Distributed Algorithms", by Nancy A. Lynch, details many of them. Distribution is a real shift from the traditional conception of computing embodied by Turing machine and its linear strip of tape.

Programming languages and framework

Big data architectures and distributed databases, for example, implement distributed algorithms or lays upon theoretical works without their users knowing the implication of it. Apache Spark uses a theoretical model, Directed Acyclic Graphs, to split and schedule tasks. For instance, Raft, a distributed library in Go, implements a consensus algorithm. And Tensorflow computing model lays on graph theory too.

Programming languages offers a more mixed landscape. Some examples are very interesting:

Some authors think that C, still broadly used, isn't relevant for system programming because C abstraction is artificial and doesn't reflect the real architecture of a modern machine. Surprisingly, Python doesn't offer multi-thread by default. See this article to know more about this fact. But multi-threading is questionable, as stated in this blog post. And new languages or functionalities added to existing languages try to go around multi-threading. Channels in go, streams in Java or mechanisms, or new mechanisms included in functional programming languages (Scala, Javascript...) promote the design of non-sequential code.

Future questions

There are many areas where parallel, concurrent or distributed systems may be improved:

These systems are not deterministic. Testing, debugging, understanding or reproducing failure conditions are still a big hurdle. New tools are required to be efficient.

Big distributed systems are unstoppable. Everything, from backup to upgrade, must be designed so that administrators run these tasks without stopping the systems, which supposes that, at some point, the system may be in an intermediate state, partially backup, but still running, or in a transient versions. Distributed snapshots algorithms such as Chandy-Lamport algorithm make you aware of the local nature of time. Time is an illusion...

Monolithic applications were complex to change because you sometimes have to break the monolith. But micro-service, distributed applications are complex to version, unless you find a way to keep the coherence of the whole system.

And finally, universities, MOOC, mainly offer traditional cursus about sequential algorithms, complexity, without material about parallel programming and distributed systems. How will we train the next generation of developers to both traditional algorithms and distributed ones ? Will they need a background in distributed systems and parallel programming if abstractions are implemented to hide it ?

A a conclusion, I hope that the opinions reflected above will change, as I dig deeper into the domain, and thanks to the reactions of readers, and that I'll be able to clarify many points raised in this post in future posts.