Computing infrastructures


This page contains a brief reworking of the Computing Infrastructures lectures I attended at Politecnico di Milano.

I usually write this kind of summary at the end of the study process for each subject, in order to reshape the concepts and be sure I have fully understood them. I hope it can be helpful to future students.

1. COMPUTING INFRASTRUCTURE, WHAT IS IT?

In short, it is a system that provides hardware and software useful for the execution of other services.

There are many systems belonging to this category, each of which has characteristics that respond to a particular type of need (computing power, price, size, availability, architecture, …).

It is said that, taken together, the various types of computing infrastructures form a Computing Continuum, in the sense that it is as if they formed a continuous spectrum of characteristics from which it is possible to “pick” the component that suits our needs.

We will analyze the most important subcategories below.

IOT devices

Let us imagine we want to manage the on/off logic of the air conditioner in our room. The rule is very simple: if the temperature is above a value X measured by a sensor, then we turn on the cooling system.

When dealing with similar cases a lot of computing power is not necessary. On the contrary, it is often useful to use very simple devices that consume very little, so that they can be battery-powered or, in general, make the system much more efficient. Size also benefits, since industrial devices can become really tiny.

These components are precisely IOT (Internet Of Things) devices. A famous example everyone knows is the Arduino UNO, but there are many others.

Embedded Computers

There are use cases in which the power available on an IOT device is not enough: we need a real “mini computer” that allows parallelism, maybe even graphics acceleration, … . These are the cases where embedded computers are used: more powerful devices, but also larger in size and more power-hungry than IOT devices.

It is possible to buy ready-made devices (such as the famous Raspberry PI), but companies often design their own embedded component, so they can include all and only the components necessary to deliver their service, reducing size and costs.

There is an entire course at Poli about this kind of device, but this is not the core of the present summary.

Maximum computational power: Data Centers

Are you planning to run a cloud service used by millions of users? Or maybe you want to train a neural network with billions of parameters? In that case you need a huge amount of computing power: you need a data center.

Data centers are buildings containing hundreds or thousands of servers, which must be placed in an orderly way (aisles, racks), powered (power supply system with possible backups), cooled (air or liquid cooling system), and managed in terms of computing resource partitioning. This last aspect is usually handled through two main tools, similar but different:

  • Virtual Machines: VMs are like “containers” that allow installing entire operating systems inside them, to which virtual resources generated by “splitting” the real physical resources of the machine are provided.
  • Containers: similar to VMs, containers do not run entire OSs, but instead allow installing applications and the libraries needed to run them. This makes containers lighter and faster than VMs, but less flexible.

DCs and the SaaS, PaaS, IaaS, … approaches have given service providers access to cheaper and more stable solutions and, consequently, have made it possible to lower prices for end users. These advantages are not free, though: the price paid is the need for a stable, high-bandwidth internet connection, a higher environmental impact (for several reasons, one being that software often runs on machines whose architecture is not optimized for the specific workload, so processing is far less efficient than when the provider chooses the server type itself), an increased risk from a privacy standpoint, and increased latency.

All these negative aspects make DCs an excellent or a terrible solution depending on the use context. For example, if we were building a self-driving car, choosing to run the model that decides when to steer or brake on a data center would be truly terrible: we need a decision latency much lower than what a DC can provide. The best option would probably be to run the model locally, at least for crucial decisions.

Let us imagine we are in the following context: we are building a system consisting of:

  • a huge number of IOT sensors that record raw data related to road safety (e.g., temperature, humidity, …) installed over a large territory.
  • a cloud system that receives these raw data and generates immediate decisions (e.g., which road is safer to take at the moment) and long-term decisions (e.g., accident percentage given certain territory characteristics).
  • autonomous vehicles that receive the immediate decisions computed by the cloud system and act accordingly.

So how should we build our cloud system in this case? A first idea could be to “simply” use DCs, but that would create two problems:

  • sending raw data directly to DCs and processing there what we could do locally is economically disadvantageous, because every bit we send, process, and store is also something we pay for.
  • latency might be excessive for our use case depending on the DC location.

A solution to these problems is to implement the concept of Fog/Edge Computing (two concepts that often overlap; I will explain the differences soon). The idea is to place, between the IOT sensors and the DCs, “intermediate-power” components (e.g., embedded devices, or even actual computers with Intel/AMD chips, …) located closer to the sensors. These nodes perform the first processing of the raw data and compute the immediate decisions to send to the cars; they then forward the pre-processed data (therefore much smaller in size) to the Data Center, which can proceed with long-term processing (possibly even computationally heavier) and storage.
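To make the idea concrete, here is a minimal Python sketch of what an edge node could do (all names, fields and thresholds are made up for illustration): aggregate a batch of raw readings locally, take the “immediate” decision on the spot, and forward only a compact summary to the data center.

```python
# Hypothetical edge-node logic: field names, thresholds and the two send_*
# callbacks are placeholders, not a real API.

def summarize(readings):
    """Condense a batch of raw sensor readings into a small summary record."""
    temps = [r["temperature"] for r in readings]
    hums = [r["humidity"] for r in readings]
    return {
        "n_samples": len(readings),
        "avg_temp": sum(temps) / len(temps),
        "avg_humidity": sum(hums) / len(hums),
    }

def immediate_decision(summary):
    """Low-latency decision computed at the edge and sent straight to the vehicles."""
    risky = summary["avg_temp"] < 2.0 and summary["avg_humidity"] > 80.0
    return {"road_safe": not risky}

def process_batch(readings, send_to_vehicles, send_to_datacenter):
    summary = summarize(readings)
    send_to_vehicles(immediate_decision(summary))  # stays close to the sensors
    send_to_datacenter(summary)                    # a few bytes instead of the raw stream
```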

The changes brought by implementing this paradigm can be observed from two points of view:

  • Data perspective: data processing happens near (geographically speaking) the IOT sensors that generate them. This is what is called Edge Computing.
  • Network perspective: data are pre-processed by a node present in the same LAN where the IOT sensors are, which will then send the result of processing to DCs/vehicles. This is what is called Fog Computing.

You can see how the difference is so small that the two concepts often overlap.

The evolution of datacenters

Let us try to create a chronological map of the evolution of digital services and the underlying infrastructures.

1. On-Device services

Initially, the only way to distribute digital services was to create applications that ran entirely on clients. This approach complicates update management and hardware optimization: with every new update it is necessary to convince users to update their app version and, moreover, one cannot control the hardware on which the user runs the service, so the correct functioning of the software cannot be tested in every circumstance.

On the other hand, this approach simplifies privacy, security, and response times for applications where these characteristics are critical, so it is still used in certain contexts.

2. Cloud services: Datacenter

With the improvement of global network infrastructure it became possible to move the computational load from the client to separate “computing centers” (servers), leaving only the app front-end and any time-critical functionalities on the user device. This approach brings advantages and, obviously, disadvantages that make it more or less suitable for different contexts.

First of all, now the provider has full control over the hardware on which the back-end runs, so they can perform software optimizations, test the service on the specific hardware, update software by fixing bugs or introducing new features in a completely transparent way for the user, who will not have to install anything on their device besides the front-end. Other resulting advantages are the possibility to provide services whose requirements exceed what single clients could support, moving processing remotely with reduced consumption and temperature on the user side (great for battery-powered devices like laptops or smartphones).

The disadvantages mirror those of the on-device solution: higher minimum response times, dependence on a broadband internet network, possible privacy and security problems.

The servers needed to deliver services from multiple providers are often placed in the same datacenter, to make management more efficient in terms of cooling, availability, reliability (we will analyze these concepts later), and costs. Often, a datacenter contains hundreds of servers, arranged in redundant buildings and powered by diversified and redundant energy sources as well. The reason is that datacenters are considered “critical infrastructures”, that is, infrastructures whose operating rate (we will specify this concept later) must exceed 99.99%.

3. Services that grow: Warehouse-Scale Computers (WSC)

Among the services that moved to the cloud, some have grown enormously in terms of number of users and computing requirements, to the point of “occupying” entire datacenters for the delivery of a single service. This is how WSCs were born: datacenters made of homogeneous hardware, owned by a single organization, inside which a few dozen applications run whose goal is the delivery of a single service. Unlike traditional datacenters, where countless installed applications are independent and do not communicate with each other, in Warehouse-Scale Computers applications cooperate to achieve a common goal.

Examples include Google services, Amazon, Dropbox, … .

The spread of these structures has also been encouraged by the rise of services that by their nature need to be installed on WSCs, such as search and indexing services or training generative neural models.

4. Virtualization: WSC as datacenter

The spread of virtualization technologies (which we will talk about later) has made it possible to create WSCs whose delivered service is similar to that offered by a classic datacenter: hosting services (SaaS, PaaS, IaaS, ..) are offered through many homogeneous machines “hidden” by a “virtualization layer”. What the user sees is only the possibility to “buy” computing power proportional to the payment offered.

This is the case of AWS or Microsoft Azure.

Geographic organization of datacenters

The requirements of large companies delivering digital services are not satisfied by a single datacenter: many are needed. But how do we distribute them in a coherent and sensible way?

The most commonly used approach is the one explained below, based on a hierarchical subdivision of the territory.

I) Geographic zones

The world is divided into geographic zones, that is, political regions in which different rules and laws apply, especially regarding data management and user privacy. Ideally, a provider should be “physically” present in each of the geographic zones in which it intends to deliver its service. Note that geographic zones may not be very distant geographically, but they are from a legal standpoint.

II) Computing regions

Within each geographic zone there are two or more computing regions. A computing region is an area where the maximum round-trip latency is 2ms. It is also the minimum granularity level visible to the user.

Usually, datacenter operators place their computing centers in areas with diversified risk, meaning that, when choosing the installation territory for two datacenters, it is unlikely they will both be placed in the same area sensitive to earthquakes and tsunamis; maybe one datacenter will be placed in that area if considered strategic, but another will be installed far away, in a low-risk place. In this way, in case of disaster, one of the two DCs will continue to function and deliver the service.

Synchronization between the two datacenters will not be perfect due to the large distance, but for disaster recovery imperfect synchronization is more than acceptable.

III) Availability zones

Within the same computing region it is possible to place multiple datacenters, each of which will be an availability zone. This further subdivision brings several advantages:

  • the possibility to run latency-critical operations;
  • greater resistance to faults (increased availability and reliability within the computing region). Moreover, if there are at least 3 availability zones, it is possible to recover data in case of a fault or service interruption in one of the datacenters: if there were only two DCs and at some point there was a data discrepancy due to a fault, it would not be possible to understand which version of the data is correct. With 3 datacenters it is possible instead to use quorum, that is, we consider as correct the version shared by the two intact datacenters, making it possible to identify the single faulty datacenter.
  • consequent possibility to run availability/reliability-critical applications.

Overview of possible server types

It is interesting to note how the basic scheme of the entire datacenter and that of a single server are very similar. Both consist of:

  • Processing component -> in the case of the whole DC, these are one or more servers; in the case of a server, these are one or more CPUs (with multiple cores) and, possibly, one or more GPUs/TPUs.
  • Fast memory component -> in the case of the DC, these are high-speed disks (e.g., SSDs) that make data quickly accessible. In the case of a server, this is RAM.
  • Long-term storage component -> in the case of DCs these are HDDs; in the case of a server these are SSDs or HDDs (or possibly SAN, we will see later what that is).
  • Networking -> the DC has an ultra-broadband connection that connects it to the rest of the world. The single server also has an individual network connection.
  • Power supply system -> the DC is provided with a complex power system. The single server also needs power.
  • Cooling -> heat dissipation is an essential aspect of both the datacenter and individual servers. We will see later that different solutions can be used, with approaches at different granularities (cooling across the whole DC or for individual aisles/racks/servers).
  • Failure recovering -> the cooling or power system could malfunction: it is necessary to install recovery systems that prevent malfunctions (faults) from turning into actual errors (failures). This also happens at the level of the single server.

In servers, these fundamental elements can be assembled in different ways depending on the use case and system requirements. We will analyze the most common solutions in this paragraph.

Tower Server

The components of this type of server are assembled inside a tower similar to the “home” computers we are used to: a vertical case containing motherboard, processor, power supply, cooling, … . This approach brings advantages in terms of modularity and temperatures: the tower case allows components to be housed comfortably, leaving a lot of free space. The server is therefore easily upgradable with new components and, given the low density of components inside it, easily coolable. A further pro is the price, being the cheapest solution.

On the other hand, precisely the unused space that enables the advantages above means that the area occupied by a single server is much larger than that required by servers built with other approaches (which we will explain soon). The extra floor space the DC has to buy ends up costing so much that it cancels out the economic advantage of this solution's lower unit price.

Legitimate question: ok, they occupy more space. But if I build a multi-floor datacenter, expanding vertically, I do not have to buy new land to install more datacenters. Why is this approach not used?

Sad answer: the problem is that in a datacenter the density of IT components is so high that the structure must sustain several times the weight that a normal building is required to support. It would be necessary to use specific construction techniques with a cost increase beyond that of simply buying more land. For this reason, datacenters are often built as single-floor buildings (a dammuso).

In general, Tower Servers are an excellent solution for small companies that need at most about ten servers, but become inefficient in terms of occupied area when numbers grow.

Rack Server

Rack servers solve the problems of the tower alternative, inevitably introducing others.

The general concept is to introduce racks in the datacenter, that is, shelves on which to stack servers with much smaller dimensions than a typical tower. Specifically, the dimensions are:

  • width = 19 inches ≈ 48.26 cm,
  • height = 1, 2, 4 … U, where one U (unit) corresponds to 1.75 inches = 4.445 cm (damn imperial measurement system).

In this way the density of computing power increases greatly, drastically reducing the required space. Usually, each rack hosts at its top components that are not exactly servers, but support them (e.g., router, power, …). All servers in a rack share these components, minimizing the number of cables and simplifying management. The components installed in this area are called TOR <-> Top Of Rack (e.g., TOR Switch).

Ok, but what happens if we increase component density? Temperatures increase too. This has consequences both on the single server (if generated heat increases, processor frequency will decrease to try to reduce temperature), and inter-server consequences (if servers A and B are placed one above the other, and on server A a very intensive application is running that drastically increases temperatures, heat will also be transferred to server B, which will lower its clock frequency as a containment measure. In this way, even if the two servers run completely separate and independent applications, there is an undesired indirect interaction).

For this reason, advanced cooling systems are needed to dissipate all the heat generated in racks, with a consequent cost increase for hardware to purchase and consumed electrical energy.

Server management also becomes more complex: with many components in a small space, when a problem occurs it becomes harder to identify its cause and, given the reduced modularity of rack servers, eventual repair becomes more complex.

Blade Server

Blades are the latest server form factor available. The structure of a blade server is as follows:

  • The “outermost component” is the rack, identical to that used for rack servers and compliant with the same standard dimensions specified by the IEEE.
  • Inside the rack, the chassis that will contain the blade servers is inserted. The chassis complies with rack standard dimensions, with a width of 48.26cm and a height of multiple U, with U = 4.445cm. This container integrates switching and power for the blade servers and also a possible liquid cooling system.
  • The chassis described has vertical slots into which the blade servers are inserted. These are even more “miniaturized” servers: compute density is extremely high, with consequent temperature increase.

In practice, blade servers relate to rack servers as rack servers relate to tower servers: there is further simplification of cable management and increased performance per occupied area at the expense of higher temperatures and prices.

Datacenter management is also simplified here because the chassis containing the blades makes monitoring and maintenance easier, reducing the “hard part” to managing the chassis interface rather than the individual servers inside.

A new downside in the blade solution is loss of standardization: although the chassis complies with IEEE standards, the blade servers inside it do not have a standardized format. Blade servers use proprietary standards, which creates a “dependency” on the provider initially chosen.

2. THE MAIN COMPONENTS OF A SERVER/DATACENTER

Incredible but true: in the following paragraphs I will “go deeper” into the main elements that make up a server.

Data processing

In servers, data are fetched from memory, modified, processed, overwritten, … . Different components are used for this purpose, but they all share a “similar” basic structure, which can be schematized (stripping it down) as follows:

  • ALU (Arithmetic Logic Unit) -> cores that execute computation operations.
  • Cache memory -> SRAM memories (Static RAM, different from the DRAM you can buy as a memory expansion for your computer) that keep the data being worked on very close to the cores. They are divided into L1, L2, L3, where L1 is the smallest and fastest, L3 the largest and slowest.
  • Control unit -> decides which instructions to execute and when, managing branches and memory accesses.

How these elements are distributed on a module determines its type and use case.

CPU

The main (and, until a few decades ago, only) computing element is the CPU (Central Processing Unit). It is a “general” computing component, designed for sequential computation or with irregular memory access. It is usually built around a very “bulky” control unit, which can reorder instructions so that they are executed on parallel cores efficiently without changing program behavior and without cache access interference by the ALUs. The ALUs themselves are very powerful and able to perform complex operations. The cache has a moderate size and allows keeping the essential data for computation “close” to the ALUs for faster access.

In general, this type of component performs best in single-thread or lightly parallel contexts, where the large control unit, the moderate cache, and the (few) powerful ALUs outperform other technologies (for reasons we will explain soon).

GPU

As the name suggests, Graphic Processing Units were initially created for graphics processing: images and video frames consist of many pixels that can be processed independently (in parallel) with simple operations. This is the key concept characterizing GPUs: instead of having a few powerful ALUs, they contain thousands of simple ALUs, less powerful than those in CPUs but far more numerous, to process pixels in parallel.

If initially this approach was used for graphics, today there are many applications for which such a component makes data processing much faster (training neural networks, scientific computation, …).

But if GPUs are so powerful, why do we not use them for everything, completely replacing CPUs? Let us see.

Graphic Processing Units were born from a problem: according to Moore’s “law”, the number of transistors physically placeable inside a chip doubles every 18 months; however, demand for graphics power for modern applications (gaming, rendering, but above all AI) grows much faster. This made it necessary to find a way to drastically increase performance for certain types of tasks without having a proportional increase in the number of transistors in chips.

So the GPU approach and parallelism were adopted: we cannot place thousands of powerful ALUs like those in CPUs in a single chip together with an equally large cache and control unit, but we can “shrink” cache and control unit, inserting thousands of very simple ALUs, good at executing only certain kinds of critical instructions (especially matrix operations).

This makes GPUs not suitable for heavy sequential loads. Specifically, GPU ALUs are less powerful than CPU ALUs, so performance on a single (or few) cores is slower. Moreover, the GPU does not run the operating system. When one wants to compute using a GPU, one must go through the path

# Data read
DRAM -> CPU_cache -> GPU_VRAM

# Data save
GPU_VRAM -> CPU_cache -> DRAM

The CPU_cache -> GPU_VRAM step (and vice versa) is a bottleneck that introduces a fixed, unavoidable overhead. To understand the impact of this latency, imagine that transfer time between processor cache and GPU VRAM is 30 seconds. There are two cases:

  • if we use the GPU for sequential computations and not GPU-optimized workloads, where each “work cycle” (the work the GPU performs on the data once they are in its VRAM) lasts 1 minute, the overhead weighs heavily on performance.
  • if instead we run massive parallel computations, where a “work cycle” lasts 10 hours, the 30-second overhead becomes irrelevant compared to the advantage of parallel computation: the same job on a CPU could easily have taken on the order of 1000 times longer.

GPUs are therefore exceptional components for certain tasks and have made it possible to overcome Moore’s law obstacle for massive parallel computations, but they are not universally suitable.
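A back-of-the-envelope check of this reasoning, with the toy numbers used above (30 seconds of transfer, work cycles of 1 minute and 10 hours):

```python
# Toy model: the only thing that matters here is how big the fixed transfer
# overhead is relative to the length of one "work cycle" on the GPU.
transfer_overhead_s = 30.0  # illustrative DRAM <-> VRAM round trip

def overhead_fraction(work_cycle_s):
    """Share of the total elapsed time spent just moving data."""
    return transfer_overhead_s / (transfer_overhead_s + work_cycle_s)

print(f"1-minute cycle: {overhead_fraction(60.0):.0%} of the time is transfer")
print(f"10-hour cycle:  {overhead_fraction(10 * 3600.0):.2%} of the time is transfer")
```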

Multiple GPUs

The increase in performance requirements due to the growth of neural networks made it necessary to connect multiple GPUs so they can work in parallel as if they were a single component. To do so, (proprietary) protocols like NVLink, Nvidia’s multi-GPU interconnect, are used: one or more host CPUs are connected to several GPUs, which are in turn connected to each other through very high-bandwidth NVLink links. No further details were given because this is not the core of the course.

TPU

As said, GPUs were initially created for graphics and, although they are not as general-purpose as CPUs, they are still fairly general-use components (limited to parallel processing). But can we do better? Can we create a component specifically designed to train neural networks?

This is the idea behind TPUs (Tensor Processing Units), in which Google has:

  • limited the CPU_cache -> GPU_VRAM bottleneck problem by replacing VRAM with HBM (High Bandwidth Memory),
  • increased the amount of SRAM so that data can be stored closer to compute units,
  • replaced (at least partially) the “semi-general” ALUs of GPUs with MACs (Multiply-ACcumulate), even more specialized units (in efficiency and speed) for multiply-add computations, fundamental for neural network training.

These measures allow achieving a much higher throughput for the same area and energy spent.

FPGA

All very nice. But what if we wanted to program a chip without building it ourselves? Building a chip is very expensive: it must be designed and then manufactured by (few) chip producers, spending time and money. FPGAs (Field Programmable Gate Array) offer an alternative to this process: instead of fabricating a custom chip, one takes a reconfigurable device composed of CLBs (Configurable Logic Blocks) and a programmable interconnection network. The user describes the hardware with VHDL/Verilog, reconfiguring the chip.

Design flexibility is extreme: custom pipelines and specific numeric formats can be created. Moreover, in many real-time and streaming cases, FPGAs offer very low latency and good efficiency per watt compared to CPUs and, for well-structured loads, also compared to GPUs.

The price to pay is programming complexity: a non-trivial synthesis flow is required, compilation times are long, and hardware knowledge is needed. Moreover, usage is less flexible: if CPUs and GPUs are general-purpose devices, once an FPGA is programmed it can be used “only” for that task (although some families support partial reconfiguration, which allows changing parts of the design at runtime).

In general, this type of component is excellent for low-latency streaming, inference pipelines, or for building prototypes before a possible ASIC (Application Specific Integrated Circuit) version.

Data storage

Let us now talk about the human tendency to “accumulate stuff”.

Since I am a student, I am currently poor and the only possessions I have are a couple of keyboards I use for practice, the PC I am writing this article on, and a strong desire to do things. I need very little space to live (and believe me, I tested it given the size of rooms in Milan).

Probably in the future I will find a job, earn a bit more money and buy more things (hopefully not too many more, since I love living with a minimalist approach), and I will need a larger space.

There always comes, in a man’s life, the point where the things you have are too many and you need to make space, you need to keep only the truly important things. Sure, it would be nice to keep even the useless souvenir bought in Malta during the 1995 vacation with friends, but there is just no room.

Why am I making this nonsensical speech? Because it is exactly what is happening with the data we produce.

The trend of the ratio between data creation and storage

In the 80s and 90s most data were human-generated, and most of these data were stored on user devices. For this reason, the amount of generated data was much lower, and so was the demand for storage tools.

Today not only have user data increased, but they represent a tiny part of total generated data. Most created information is not human-generated, but recorded through sensors or produced through generative AI. This has drastically increased the need for space, which does not grow at the same rate as data growth, making it practically impossible to store everything. Companies are forced to filter: they store only the most important data or condense them into higher-level information, losing the possibility of performing further higher-granularity evaluations later or training neural models on raw data.

There is no solution to the problem at the moment, it is simply a bottleneck we have to live with.

Ok, but what do we store the generated data on? Let us analyze the main storage devices used in DCs.

Tapes

Tapes are the first form of “computer storage” used in datacenters. They consist of magnetic tapes wound on reels on which data can be written and read sequentially. Obviously, access is extremely slow: you must first scroll the tape to the correct point and then unwind the part containing the desired information.

Despite this drawback, tapes are still used as “cold storage”, that is, a long-term storage method that allows keeping data cold for very long periods. If stored properly, tapes are very durable and can last even several decades. Use cases can vary: data backups, long-term scientific data archiving, … .

HDD

Hard Disk Drives are more versatile storage devices, with “random access” (not sequential like tapes). They are composed of:

  • Disks: stacked magnetic platters rotating at a fixed speed indicated in RPM (Rotations Per Minute). Each disk is divided into tracks, which are circular rings on the disk surface, in turn divided into sectors or data blocks, the smallest memory unit that can be read/written. Each sector has a unique LBA (Logical Block Address) that identifies it. Usually, the operating system of the device connected to the disk groups multiple sectors into a larger structure called a cluster, whose size can vary from 1 to many sectors and which is therefore the minimum memory unit the OS can read/write. Warning: sector and especially cluster sizes must be chosen carefully, because the OS can allocate/free disk space only in multiples of the cluster size. This means that if a file has a size such that \((N-1) \cdot \text{cluster\_size} < \text{file\_size} < N \cdot \text{cluster\_size}\), an amount of space equal to \(N \cdot \text{cluster\_size} - \text{file\_size}\) is wasted in the last cluster, creating internal fragmentation: as more files are written, the wasted memory grows.
  • Arm, R&W head: the moving arm allows moving across the rotating disk and reading the desired sector through the head mounted on its tip.

Ok, we understand the general process: the disk rotates, the head moves across tracks, and a sector is read when the arm position and disk rotation make the head pass exactly over it. It is largely a mechanical process, and for this reason there are non-negligible latencies to consider in data access. Let us therefore build a simplified model that allows computing the average access time given main disk characteristics.

We can consider HDD access time as the sum of the following components:

  • Seek delay -> time the arm takes to move to the correct track. We consider:
\[t_{seek}^{avg} = t_{max}/3\]
  • Rotation delay -> time needed for the disk to bring the sector under the head. We consider it equal to:

\(t_{rotat.}^{avg} = \dfrac{60}{2 \cdot \text{RPM}} \text{ [seconds]}\)

  • Data transfer time -> actual read and transfer time, often expressed in \(Mb/s\). (Note: remember that the conversion factor between Mb, Kb, … is 1024).
  • Control overhead -> before accessing a sector, it is necessary to wait for the “preliminary operations” performed by the disk controller for read/write. We will often neglect this factor because it is dominated by other components.

Access time is dominated by the first two delays. However, there are cases where these delays are negligible and data reading is much faster: when reading is sequential (data are on contiguous sectors). For this reason, the concept of locality is introduced, that is, the percentage of data that are organized contiguously among those we want to access.

In general:

  • Access time without locality (random access) ->

\(t_{no\_loc} = t_{seek}^{avg} + t_{rotat.}^{avg} + t_{transf.} + t_{contr. ovh.}\)

  • Access time with full locality (sequential access) ->
\[t_{loc} = t_{transf.} + t_{contr. ovh.}\]

Usually, locality is neither 0% nor 100%, but an intermediate percentage. In that case, access time can be approximated as

\[t = t_{loc} * X\% + t_{no\_loc}*(1-X\%)\]
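Putting the pieces together, a small Python helper implementing the simplified model above (controller overhead neglected, as in the text; the disk parameters in the example are made up):

```python
def hdd_access_time_ms(rpm, seek_max_ms, block_kb, transfer_mb_s, locality=0.0):
    """Average time (ms) to access one block with the simplified model above.

    locality is the fraction of accesses that are sequential (0..1).
    """
    t_seek = seek_max_ms / 3.0                                  # average seek = max / 3
    t_rot = (60.0 / (2.0 * rpm)) * 1000.0                       # half a rotation, in ms
    t_transfer = block_kb / (transfer_mb_s * 1024.0) * 1000.0   # ms, with 1 MB = 1024 KB
    t_no_loc = t_seek + t_rot + t_transfer                      # random access
    t_loc = t_transfer                                          # fully sequential access
    return locality * t_loc + (1.0 - locality) * t_no_loc

# Example: 7200 RPM disk, 12 ms maximum seek, 4 KB blocks, 150 MB/s, 30% locality
print(f"{hdd_access_time_ms(7200, 12, 4, 150, locality=0.3):.2f} ms")
```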

This concept leads us to define another kind of fragmentation: External Fragmentation. Unlike internal fragmentation, this type does not waste memory, but causes a performance loss in reading and writing when locality is too low (because data are “fragmented” and many rotation + seek operations are needed to read all chunks on disk and reconstruct the full information).

The time computed above is called service time, and it is the time the disk takes to satisfy a read/write request (for simplicity we consider them of the same average duration). Often, however, the operating system must handle multiple storage access requests, so they are placed in a queue. The sum between the time needed for request R to start (its “turn” in the queue) and the time the disk takes to satisfy it (service time) is called response time.

There are five main approaches to manage W/R request queues, that is, disk scheduling (a small comparison sketch follows below):

  • FCFS (First Come First Served): basically a FIFO queue, where request order is not changed. Ok if we want a truly simple approach, but the worst in terms of access speed.
  • SSTF (Shortest Seek Time First): attempts to minimize access time by reordering the queue so that requests on sectors close to the current head position are executed first. This does reduce access time, but has a serious problem: if a request R concerns a sector far from the current head position and requests keep arriving for closer sectors, R will not be executed for a long time. To fix this, the following solutions were devised.
  • SCAN: the idea is simple: the head performs a fixed read movement, from the innermost track to the outermost, then reads back in the opposite direction. Requests are reordered to follow this movement. In this way, the starvation problem of the previous solution is eliminated, at the cost of slightly higher latency. However, this approach has a feature that is not always acceptable: the central part of the disk is served more often than the outer parts, making the queue not “fair”. To solve this, the next solution was devised.
  • C-SCAN: instead of reading both on the “outward” and “return” path, in C-SCAN the head reads from the innermost track to the outermost, then performs a very fast return without reading. The cycle restarts. In this way a “fair” queue is obtained where all sectors have the same probability of being served first, at the cost of further increased latency.
  • C-LOOK: a C-SCAN that wants to seem smarter. The only difference is that scanning does not start from track 0, but from the smallest track currently present in the request queue. The advantage is that return movement is shortened, potentially limiting latency. The negative aspect is that if during the head return movement a request is made on a sector lower than those previously present in the queue, it will be ignored until the next scan.

Disk scheduling can be implemented either at OS level or in the HDD controller.
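As a toy illustration of how much reordering matters, here is a minimal Python comparison of FCFS and SSTF on a queue of track requests (the numbers are the classic textbook example, not from the lectures):

```python
def total_movement(start, order):
    """Total head travel (in tracks) when requests are served in the given order."""
    pos, moved = start, 0
    for track in order:
        moved += abs(track - pos)
        pos = track
    return moved

def sstf(start, queue):
    """Serve, at each step, the pending request closest to the current head position."""
    pending, pos, order = list(queue), start, []
    while pending:
        nxt = min(pending, key=lambda t: abs(t - pos))
        pending.remove(nxt)
        order.append(nxt)
        pos = nxt
    return order

requests = [98, 183, 37, 122, 14, 124, 65, 67]
head = 53
print("FCFS:", total_movement(head, requests))              # 640 tracks
print("SSTF:", total_movement(head, sstf(head, requests)))  # 236 tracks
```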

An additional way to try to reduce HDD access times is caching, that is, inserting a fast memory (DRAM or SSD) in the device case. In that memory, the data most likely to be accessed by the user will be saved (rule of thumb: the user accesses 10% of the data 90% of the time), so the impression is that of having a very large memory at very high speed. Once again, new approach equals new problems to solve. Here the issue is that on write, the user receives confirmation when data are written in cache, but not necessarily when the cache -> disk transfer is also complete. If in this context HDD power is turned off, the device remains in an inconsistent state where cached data are lost (if DRAM is used). To avoid this, a backup battery is integrated, large enough to ensure that transfer from fast to stable memory completes even in case of external power loss. A side effect is increased device price.

SSD

Solid State Drives are designed as a high-performance alternative to HDDs. Unlike HDDs, their operation is not based on moving mechanical parts, but on NAND memory chips.

The “atom” of an SSD memory is the cell, the device that can actually maintain information over time. Each cell maintains a charge, and it is by reading this charge that information is extracted. There are different cell storage/read approaches (in chronological order):

  • Single Level Cell (SLC): the most expensive and performant method, and the first devised. Charge is quantized to only two levels, so each cell contains only 1 bit of information (0 or 1).
  • Multi Level Cell (MLC): to lower costs, charge is quantized to 4 levels, so 2 bits per cell can be stored. This reduces the number of cells needed for the same amount of data and thus lowers price, at the cost of a small overhead due to decoding.
  • Triple Level Cell (TLC): the number of quantization levels is increased to 8, so each cell can contain 3 bits of information (I know, the naming multi=2 and tri=3 is not the happiest ever devised). In this way costs are lowered further. The problem is that writes/reads must be very precise with charge thresholds: as quantization levels increase, the probability of read/write inaccuracy increases.

All three solutions are on the market today and the buyer chooses depending on the use case.

From a higher-level perspective, SSD memory is organized like this:

  • SSD blocks: the minimum amount of memory that can be erased. (Note: a page can be written only if it is in the “erased” state). Each block contains many pages.
  • SSD pages: pages are the minimum readable/writable memory unit. Each page is in one of three states: ERASED, VALID, INVALID.
Yes, in SSDs blocks contain pages and not the other way around.

The mismatch between minimum r/w amount and minimum erase amount creates a big problem: in theory SSD read and write speed would be much higher than HDD speed due to the absence of moving mechanical parts; however, what happens in reality (if the precautions explained later are not taken) is that after some writes, performance of the two devices converges. Why does this happen?

The issue is that, as said, we cannot write to a page until it is ERASED. Suppose we have a block of 5 pages, 4 VALID and one INVALID, and we want to write new data to the INVALID page (which contains data flagged as no longer useful): we cannot simply overwrite it as we would on an HDD (hardware limitation), but must first erase it. Since the block is the minimum erasable unit, the following sequence is required:

  • read and copy all block contents into some sort of cache;
  • remove the INVALID page from the cache;
  • erase the entire block so the target page becomes free;
  • rewrite into the block all 4 pages that were previously VALID plus the new one.

As is easy to see, all this creates a huge delay in completing requests.

Latency is not the only negative effect of this phenomenon (called write amplification): memory cells have a limited number of write cycles, after which they stop functioning properly (it is no longer possible to store new data, as if they became read-only). The write amplification problem greatly shortens SSD lifetime by consuming available write cycles in memory cells.
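A rough way to put a number on the effect, using the 5-page block from the example above (a sketch of the accounting, not of how a real controller works):

```python
def write_amplification(valid_pages, new_pages=1):
    """Physical page writes / logical page writes for one read-erase-rewrite cycle.

    To rewrite `new_pages` pages in a block that still holds `valid_pages`
    VALID pages, the controller must also copy those valid pages back.
    """
    physical_writes = valid_pages + new_pages
    return physical_writes / new_pages

# Block of 5 pages, 4 still VALID, 1 page of new data to write:
print(write_amplification(valid_pages=4))  # 5.0 -> five physical writes for one logical write
```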

Do you think the problems end here? Not at all. SSDs were created to be seen by the OS as if they were HDDs, but surprise: THEY ARE NOT HDDs :D

The write amplification problem is co-caused by the fact that the OS “talks” to the SSD using the “read sectors” and “write sectors” commands typical of HDDs (because in HDDs data are not actually erased, but flagged as invalid and then overwritten), while the “language a Solid State Disk understands” is made of “read”, “program”, and “erase” commands.

Moreover, the OS uses the SSD considering the sector subdivision, while the device is organized in blocks and pages. So an intermediate layer is needed to adapt the two interfaces: the Flash Translation Layer (FTL), integrated inside the SSD. This component handles the mapping between the logical addresses used by the OS and the real flash memory addresses, which can be done at different granularities:

  • Page level mapping: always an immediate correspondence, but the mapping table can reach about 1GB per TB of storage, which could be inconvenient (see the quick sizing calculation after this list).
  • Block based mapping: instead of mapping pages, we map blocks. This greatly reduces mapping table size, but adds write overhead (if even one page changes, an entire block must be handled).
  • Hybrid mapping: keep the block-based table in memory, but for the most frequently accessed data keep a subtable with page-level mapping.
  • Page mapping + caching: keep the page-level approach, but load into DRAM only the portion of the table currently used (or the portion related to the most frequently used data).
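To see where the “1GB per TB” figure for page-level mapping comes from, here is a quick back-of-the-envelope calculation (assuming 4KB flash pages and 4-byte table entries, which are plausible values rather than numbers from the lectures):

```python
# Rough sizing of a page-level FTL mapping table.
capacity_tb = 1
page_size_kb = 4       # assumed flash page size
entry_size_bytes = 4   # assumed size of one mapping entry

pages = capacity_tb * 1024**3 // page_size_kb          # KB of storage / KB per page
table_size_gb = pages * entry_size_bytes / 1024**3     # bytes -> GB (binary)

print(f"{pages:,} pages -> mapping table of about {table_size_gb:.2f} GB")
# 268,435,456 pages -> mapping table of about 1.00 GB
```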

Returning to the write amplification problem, to limit derived issues two additional components were introduced:

  • Garbage collector: a component that, through low-priority processes, “packs” VALID pages into the “minimum number of blocks” so that blocks with INVALID pages can be erased, allowing a future write request to be executed without overhead. Note that these operations are not free: they cost in terms of SSD life, because the number of writes on cells increases.

    Note: the garbage collector assumes it knows which pages contain deleted data. The problem is that old file systems did not truly delete data, but simply removed the metadata so that the data were “lost track of”; on HDDs, later overwriting was enough. The SATA TRIM command was introduced to let the OS explicitly tell the SSD which pages no longer contain useful data, so they can be marked as INVALID.

  • Wear levelling: as said, cell write cycles are limited. However, cells are not all used equally: some data are accessed and replaced much more frequently than others that are never deleted. Cells containing the first kind wear out much more than the second. Wear levelling simply swaps the two kinds of data: place frequently overwritten files in new cells, while “cold” data in already worn cells, leveling SSD usage and extending its life.

With all these problems, how long can an SSD last? It is hard to answer: it depends on usage type, technology, and external conditions. In general, it can last from a few months (in unrealistically bad cases where everything is continuously overwritten) up to decades.

RAID

How do we get storage with more capacity, performance, and/or reliability? This is exactly the purpose of RAID (Redundant Array of Inexpensive/Independent Disks).

The idea is to take multiple disks, put them “together” inside a “case” equipped with a controller that exposes outward the typical interface of a single storage device, obtaining (depending on RAID type):

  • more memory, given by the sum of the capacities of the single storages;
  • more redundancy (copying onto multiple disks at the same time) and thus more reliability;
  • more speed thanks to splitting files into chunks and writing them in parallel onto multiple disks.

In reality it is not possible to get the maximum of all these characteristics using a single RAID type; each variant has pros and cons that make it more suitable for one case and less for others.

Note: in the following evaluations we will use letter S to indicate sequential read/write speed of a single disk, R to indicate random write/read speed of a single disk, X to indicate capacity of a single disk.

RAID 0

This is the “maximum performance” variant. It consists in parallelizing N disks and splitting data into chunks, that is, small pieces to read/write in parallel (N at a time) onto disks.

What we obtain is sequential and random read/write speed equal to $S \cdot N$ and $R \cdot N$, but very low reliability, measurable through the Mean Time To Failure, that is, the average time before a system fault happens (permanent data loss), which for RAID 0 is

\[MTTF_{RAID_0}=MTTF_{single}/N\]

Total capacity is $X \cdot N$ (the maximum possible).

RAID 1

But we are anxious people and we do not like living with the fear of losing our data: we want them safe! To do that, the simplest (and effective, only in terms of reliability) way is RAID 1.

The concept is simple: we have N disks in parallel and every time we write something, we write it on all disks simultaneously. In this way we obtain N copies, so the Mean Time To Failure is

\[MTTF_{RAID1} = \dfrac{MTTF_{singleDisk}^2}{2 \cdot MTTR}\]

From here on, given N disks, we consider duplicating only once (only two copies). In this way we have N/2 disks containing copy_1 and the other N/2 disks with copy_2 (maybe in striping = RAID 0).

In this case, read and write speeds are:

  • Random read: if reads are random, we can parallelize them across all disks and obtain read speed equal to:
\[R_{RAID_1} = N*R_{single}\]

This is the ideal usage hypothesis for RAID 1;

  • Sequential read: if reads are sequential, we can use only one of the two copies (thus $N/2$ disks) so throughput is halved:

\(S_{RAID_1} = (N/2) * S_{single}\)

  • Random/sequential write: in both cases we must write the same data twice, so we have:
\[S_{RAID_1} = (N/2) * S_{single}\]

and

\[R_{RAID_1} = (N/2) * R_{single}\]

We can conclude that RAID 1 is an excellent solution in terms of reliability and in cases where there are few disk writes and many random reads, while it performs worse in other cases. Note that we lose half of available storage capacity.

RAID 0+1

By mixing the previous two solutions we can obtain a safe system that does not give up a performance boost. There are two possible combinations, and the first we analyze is RAID 0+1.

Consider N disks: we first duplicate memory (RAID 1), obtaining two copies made of N/2 disks, and on each we write data in striping (RAID 0). We therefore have two copies of N/2 disks with striping.

For example, with 6 disks we would obtain 2 copies (managed by a RAID 1 controller) of 3 disks each. Inside each copy all data are stored, organized as 3 striped disks (the 3 disks within one partition are managed by a RAID 0 controller).

The system can withstand the loss of 1 disk, because when this happens the system can still use the second striped copy. However, in that condition the system becomes a RAID 0 because the RAID 0 controller managing the problematic trio says “bye bye”.

\[MTTF_{RAID0+1} = \dfrac{MTTF_{singleDisk}^2}{N^2 \cdot G \cdot MTTR} \text{, where} \begin{cases} G=\text{number of RAID1 duplications;} \\ N= \text{number of stripes per duplication.} \end{cases}\]

RAID 1+0

The second combination is characterized by reversing the two techniques: first, data are striped across N/2 disks (RAID 0), then each stripe disk is duplicated (RAID 1).

In the case of a system composed of 6 disks, there would be 3 disk stripes (managed by a RAID 0 controller), each containing the duplication of a stripe disk (duplication managed by a RAID 1 controller for each stripe).

In this case the system can withstand up to N/2 disk failures! Consider the example above: if one disk in each of the 3 stripes broke, RAID 0 controllers would still provide data through the other disk. This is an obviously lucky case: if two failed disks belonged to the same stripe group, data would still be irreparably compromised.

Despite this, the ability to withstand more failures makes RAID 1+0 generally better than 0+1: performance and capacity are practically identical, but reliability is higher in the 1+0 configuration.

\[MTTF_{RAID1+0} = \dfrac{MTTF_{singleDisk}^2}{N \cdot G \cdot MTTR} \text{, where } \begin{cases} G=\text{number of RAID1 duplications;} \\ N= \text{number of stripes per duplication.} \end{cases}\]

RAID 4

Security is nice, but it would also be nice not to lose half our storage capacity to get it. What can we improve in our strategy?

RAID 4 is born exactly for this: given N disks, N-1 are used to store data (striping, similarly to RAID 0) and 1 to store parity. But what does parity mean?

In practice, an operation (XOR) is computed over all the “same-level chunks” of the N-1 data disks, and the result is stored on the Nth disk. For example, with 5 disks, if the first chunks of the 4 data disks are 1, 0, 0, 1, then for that chunk level the parity disk will contain \(1 \oplus 0 \oplus 0 \oplus 1 = 0\). This allows, in case of a disk malfunction, recovering the lost data by XOR-ing the parity with the surviving chunks. Suppose disk 3 (which held a 0) breaks; the missing chunk is then

\[0 \oplus (1 \oplus 0 \oplus 1) = 0\]
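The same reasoning in code: parity is the XOR of the data chunks, and a lost chunk is recovered by XOR-ing the parity with the surviving chunks (a sketch with single-bit “chunks”, as in the example above):

```python
from functools import reduce

def xor_all(values):
    return reduce(lambda a, b: a ^ b, values)

data = [1, 0, 0, 1]          # first chunk of each of the 4 data disks
parity = xor_all(data)       # stored on the parity disk -> 0

# Disk 3 (which held a 0) fails: rebuild it from the parity and the survivors.
surviving = [d for i, d in enumerate(data) if i != 2]
recovered = xor_all(surviving + [parity])
print(parity, recovered)     # 0 0 -> the reconstructed value matches the lost one
```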

Capacity becomes $(N-1) \cdot X$ and data survive the loss of a single disk. Performance:

  • Sequential/random read: excellent, equal to:
\[S_{RAID_4} = (N-1) * S_{single}\]

and

\(R_{RAID_4} = (N-1) * R_{single}\)

  • Sequential write: excellent performance here too. Since we write on “one chunk level at a time”, we can write in parallel on all the N-1 data disks and then store a single parity result on the parity disk. In practice:

\(S_{RAID_4} = (N-1) * S_{single}\)

  • Random write: here problems appear. If we write on random chunk levels of the data disks, the parity must be recomputed and updated on the parity disk for every single write. This means the random writes to the data disks can be parallelized, but each of them also triggers an access to the one parity disk, which therefore becomes a serious bottleneck. In practice:
\[R_{RAID_4} = R_{single}/2\]

The overall reliability is:

\[MTTF_{RAID4} = \dfrac{MTTF_{singleDisk}^2}{N \cdot (N - 1) \cdot MTTR}\]

RAID 5

The solution to the serious random write problem above is distributing parity bits across all disks. RAID 5 does this.

Thus, even in random write cases there is a natural distribution of write load across the disks. What we obtain is improved performance:

  • Sequential read/write: unchanged compared to RAID 4:
\[S_{RAID_5} = (N-1) * S_{single}\]

and

\(R_{RAID_5} = (N-1) * R_{single}\)

  • Random read: often random reads can be parallelized across all disks rather than N-1, since parity bits are distributed. This yields:

\(R_{RAID_5} = (N) * R_{single}\)

  • Random write: there is no longer the single-parity-disk bottleneck. However, each write requires four I/Os: 1) read the data to replace; 2) read the old parity; 3) write the new data; 4) write the updated parity (a small sketch of this update follows below). This means:
\[R_{RAID_5} = \dfrac{N}{4} * R_{single}\]

The overall reliability is:

\[MTTF_{RAID5} = \dfrac{MTTF_{singleDisk}^2}{N \cdot (N - 1) \cdot MTTR}\]
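The four I/Os of a small write in code form: the new parity can be computed from the old data and the old parity alone, without touching the other disks (a sketch with small integers standing in for disk blocks; XOR is the ^ operator):

```python
# Read-modify-write of one block in RAID 5: 2 reads + 2 writes = 4 I/Os.
old_data, old_parity = 0b10110010, 0b01101100   # toy 8-bit "blocks"
new_data = 0b00001111

# XOR-ing out the old data cancels its contribution to the parity,
# XOR-ing in the new data adds the new one: no other disk is involved.
new_parity = old_parity ^ old_data ^ new_data
print(f"{new_parity:08b}")
```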

RAID 6

Basically like RAID 5, but 2 “distributed parity disks” are used, so it can withstand the loss of two disks.

\[MTTF_{RAID6} = \dfrac{2 \cdot MTTF_{singleDisk}^3}{N \cdot (N - 1) \cdot (N - 2) \cdot MTTR^2}\]

“Mathematical” summary of reliability of the various RAID systems

Considering MTTF = MeanTimeToFailure and MTTR = MeanTimeToRepair, we have the following formulas:

  • $RAID_{0}$: $MTTF_{RAID0} = \dfrac{MTTF_{singleDisk}}{N}$
  • $RAID_{1}$: $MTTF_{RAID1} = \dfrac{MTTF_{singleDisk}^2}{2 \cdot MTTR}$
  • $RAID_{0+1}$: $MTTF_{RAID0+1} = \dfrac{MTTF_{singleDisk}^2}{N^2 \cdot G \cdot MTTR}$
  • $RAID_{1+0}$: $MTTF_{RAID1+0} = \dfrac{MTTF_{singleDisk}^2}{N \cdot MTTR}$
  • $RAID_{4}$: $MTTF_{RAID4} = \dfrac{MTTF_{singleDisk}^2}{N \cdot (N - 1) \cdot MTTR}$
  • $RAID_{5}$: $MTTF_{RAID5} = \dfrac{MTTF_{singleDisk}^2}{N \cdot (N - 1) \cdot MTTR}$
  • $RAID_{6}$: $MTTF_{RAID6} = \dfrac{2 \cdot MTTF_{singleDisk}^3}{N \cdot (N - 1) \cdot (N - 2) \cdot MTTR^2}$
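The formulas above translate directly into a few helper functions, useful for quick comparisons (a sketch that simply encodes the list above; MTTF and MTTR must be expressed in the same time unit):

```python
def mttf_raid0(mttf, n):            return mttf / n
def mttf_raid1(mttf, mttr):         return mttf**2 / (2 * mttr)
def mttf_raid01(mttf, mttr, n, g):  return mttf**2 / (n**2 * g * mttr)
def mttf_raid10(mttf, mttr, n):     return mttf**2 / (n * mttr)
def mttf_raid4(mttf, mttr, n):      return mttf**2 / (n * (n - 1) * mttr)
def mttf_raid5(mttf, mttr, n):      return mttf**2 / (n * (n - 1) * mttr)
def mttf_raid6(mttf, mttr, n):      return 2 * mttf**3 / (n * (n - 1) * (n - 2) * mttr**2)

# Example: 8 disks, single-disk MTTF = 100,000 h, MTTR = 10 h
mttf, mttr, n = 100_000, 10, 8
print(f"RAID 0: {mttf_raid0(mttf, n):,.0f} h")
print(f"RAID 5: {mttf_raid5(mttf, mttr, n):,.0f} h")
print(f"RAID 6: {mttf_raid6(mttf, mttr, n):,.0f} h")
```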

Server <-> storage connection

So far we talked about storage as an “isolated” object. Obviously, an HDD/SSD/RAID would be useless if not connected to a device that uses it. There are different ways to make this connection.

DAS (Direct Attached Storage)

This is the classic solution of directly “attaching” the storage device to the server. From the operating system, the file system will show storage as a disk that can be fully accessed. The advantage is ease of implementation and low latency, but file sharing between different servers is limited in this way (sure, we could copy everything to a USB stick and pass it to another server, but it would be inefficient and unrealistic for a DC).

NAS (Network Attached Storage)

A better solution for file sharing is to use a NAS, that is, one or more storage devices connected to the network and accessed through it. This sharing is “file based”, meaning the OS does not see accessible disks, but files shared through the network (the file system sits “behind” the network). It can be an economical and simple solution, but has a big drawback: in case of network congestion in the DC (maybe during a high-demand moment), file access could slow down a lot (since files travel on the same network as the ordinary IP traffic).

It is a suitable solution for small datacenters.

SAN (Storage Area Network)

Here, instead of connecting storage via Ethernet, we implement a dedicated storage network (typically over fiber optics), separate from the ordinary data network. In this case the file system sits “in front of” the network and the storage devices are seen as disks directly connected to the servers.

Data access is more robust than NAS and scalability is greater, as are implementation costs. In general it is more suitable for large datacenters.

Note: if a datacenter implements a NAS or a SAN, that does not mean there are no DAS as well. For example, one could implement a NAS and, if a server must operate on remote data, it could fetch them through the network, copy them into DAS once, and access them multiple times with lower latency and less dependence on the current network state.

Networking

We have talked a lot about networks, now it is time to analyze how a datacenter internal network is actually structured. But what is an IP packet? Is it physically a packet or not?

These and other doubts will be resolved in the following paragraph.

Fundamental concepts

First it is important to clarify the following concepts:

  • North<->South traffic: “vertical” network traffic, entering and leaving the datacenter, proportional to the number of requests and packets that come from outside and are sent outward by servers. This type of traffic was predominant in the first era of the internet, when servers were mostly “independent” machines.
  • East<->West traffic: “horizontal” traffic, due to communication between servers belonging to the same datacenter. Why should there be horizontal communication inside a DC? We already saw that with the rise of Warehouse-Scale Computing entire datacenters are dedicated to delivering a single service, and this means multiple servers delivering different subservices must communicate so the overall task is carried out. Note that for every external request, multiple communications between internal servers often occur, and today East-West traffic is much denser than North-South.
  • Bisection bandwidth: imagine we have N interconnected servers and we divide this set into two halves by “bisecting” the network. If we make the cut at the worst possible point (the bottleneck), the bandwidth available across that cut is the bisection bandwidth.
  • Oversubscription: suppose we have a rack containing multiple servers. As said earlier, at the top of racks there is usually an area for modules useful to servers, such as a router. All servers of the rack connect to that router. Consider downlink = connection from outside toward servers and uplink = connection from servers toward outside (toward other routers or the internet). Oversubscription means providing more bandwidth in downlink than in uplink. For example, if we have 20Gb total downlink and 10Gb uplink, oversubscription is 1:2 (uplink : downlink). It is useful to lower implementation costs and avoid wasting resources, but must be designed carefully because, in case of many outbound packets, the router would become an important bottleneck.

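To make the oversubscription idea concrete, here is a minimal Python sketch (port counts and bandwidths are made-up illustrative values, not taken from any real datacenter):

```python
# Minimal sketch: oversubscription ratio of a hypothetical TOR switch.
# All numbers are illustrative.

def oversubscription_ratio(down_ports, down_gb, up_ports, up_gb):
    """Total downlink bandwidth divided by total uplink bandwidth."""
    return (down_ports * down_gb) / (up_ports * up_gb)

# The example from the text: 20Gb of total downlink served by 10Gb of uplink -> 2:1
ratio = oversubscription_ratio(down_ports=2, down_gb=10, up_ports=1, up_gb=10)
print(f"oversubscription ratio = {ratio:.0f}:1")  # -> 2:1
```
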
With these concepts clarified, we can analyze the various DC connection architectures. In general, two approaches are used: router-centric solutions where routers are network hubs connecting nodes, and server-centric solutions where servers themselves act as routers, integrating multiple network interfaces to connect multiple nodes and perform packet forwarding in addition to “classic” computation processes.

We will focus mainly on the first type (though examples of other categories will also be mentioned “superficially”).

Three Tier Architecture

Router Centric Architecture

This is the classic architecture of old datacenters. It consists of 3 levels well adapted to the physical structure of DCs. Starting from the bottom and therefore from individual servers:

  • Access Level Switches: servers in a single rack are all connected to the same router, placed at the top of the rack and called TOR (Top Of Rack) router. This is the Access Level Switch. It has modest bandwidth, proportional to the relatively small number of devices to manage (single rack).
  • Aggregation Level Switches: the intermediate level is composed of a set of switches with higher capacity, each aggregating several Access Level Switches. Often, the physical position of aggregation switches is a dedicated rack placed at the end of the aisle. For this reason, people talk about EOR (End Of Row) switches.
  • Core Level Switches: the highest level consists of one or more Core Routers to which all aggregation switches are connected. Routers at this level have very high bandwidth and manage the connection between the outside of the DC and the servers inside it.

Everything looks very neat. There is only one problem: Three Tier is excellent for North-South traffic and therefore was dominant in the past, but is not equally performant when East-West traffic increases. Imagine sending a packet from Server_A to Server_B. In the worst case, it must go through:

Server_A -> access_switch_A -> aggregation_switch_A -> core switch -> aggregation_switch_B -> access_switch_B -> Server_B

It is easy to see that as horizontal traffic increases, pressure on core and aggregation switches becomes huge, and scaling the infrastructure to handle that increase without changing architecture is very expensive.

Note: redundant connections are often present between Aggregation Level Switches and between Core Level Switches to increase bandwidth and network reliability, at the cost of more complex packet routing management.

EOR Architecture

Router Centric Architecture

If desired, the Three Tier concept can be flattened into a Two Tier: the access level is removed and all servers of a single row are connected directly to the EOR (End Of Row) switch. This reduces the number of switches to manage (no switch in every rack), but attention must be paid to the load on end-of-row switches, which will carry much higher traffic, and to the cabling, which becomes longer since every server must reach the end of its row.

Spine-Leaf Architecture

Router Centric Architecture

This is one of the most used approaches in modern datacenters. It is an architecture borrowed from telephone switching networks (Clos networks), using only two levels:

  • Leaf Layer: what we previously called the access layer, that is, the TOR router.
  • Spine Layer: the higher layer, aggregating multiple leaf layers.

The difference compared to Three Tier is that in Spine-Leaf all Leafs are connected to all Spines.

Let us make an example. Leaf Layer: we have 1000 servers, each with 10Gb bandwidth. Suppose we have 100 racks with 10 servers each, so each TOR router manages 10 servers. To have no oversubscription (incoming bandwidth equals outgoing), we could choose a Leaf Switch with 10 downlink ports and 10 uplink ports, each 10Gb.

Spine Layer: given the network structure, we have 100 leafs each with total outgoing bandwidth \(10*10Gb=100Gb\). We want to choose our spine structure: total bandwidth to support is \(100*100Gb=10{,}000Gb\). We could handle it in different ways.

The classic method is to connect one uplink output of the leaf switch to a different spine switch. Let us see how.

  • Consider a single Leaf Switch. With 10 uplink ports and wanting to connect each port to a different spine, we need 10 spines.
  • We have 100 Leaf Switches, and on each spine one port from each leaf is connected, so each spine has 100 ports.
  • Bandwidth of a single spine port must match the connected leaf port bandwidth, which is 10Gb. With 100 ports on a spine, total bandwidth on a single spine is \(100_{\text{ports}}*10Gb = 1{,}000Gb\)
  • Total uplink bandwidth is \(10_{\text{spine switches}}*1{,}000Gb = 10{,}000Gb\)

An alternative method could be to consolidate spine switches: for example, use only 5 spines but with twice the ports each, or 5 spines with the same number of ports but with double the bandwidth per port. In short, any configuration that keeps the network free of oversubscription (a 1:1 ratio); a short sanity-check script of the original sizing follows below.

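The arithmetic of this sizing example can be sanity-checked with a few lines of Python (this is only a check of the numbers above, using the same illustrative values: 1000 servers, 10Gb each, 10 servers per rack):

```python
# Sanity check of the spine-leaf sizing example above (illustrative numbers).

servers = 1000
server_gb = 10                     # bandwidth per server, in Gb
servers_per_rack = 10              # one leaf (TOR) switch per rack

leaves = servers // servers_per_rack               # 100 leaf switches
leaf_uplink_gb = servers_per_rack * server_gb      # 100Gb of uplink per leaf (1:1, no oversubscription)

uplink_ports_per_leaf = servers_per_rack           # 10 uplink ports of 10Gb each
spines = uplink_ports_per_leaf                     # one spine per leaf uplink port -> 10 spines
ports_per_spine = leaves                           # one link from every leaf -> 100 ports per spine
spine_gb = ports_per_spine * server_gb             # 1,000Gb per spine

total_fabric_gb = spines * spine_gb                # 10,000Gb in total
assert total_fabric_gb == leaves * leaf_uplink_gb  # matches the total leaf uplink bandwidth

print(spines, ports_per_spine, spine_gb, total_fabric_gb)  # 10 100 1000 10000
```
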
Advantages of this architecture are a much higher East-West bandwidth (thanks to the many parallel paths between nodes) and low latency: in the worst case a packet crosses only three switches (leaf -> spine -> leaf). Moreover, it is possible to build the entire network infrastructure using a single router type, while in Three Tier the three levels used switches with very different capacities. This decreases initial costs and simplifies system management.

Pod based Architecture / Fat tree

Router Centric Architecture

Let us now make Leaf-Spine more scalable. Start with a Leaf-Spine network and group it into subgroups called Pods (Point Of Delivery): each Pod is composed of K Leafs (renamed Edges) and K Spines (renamed Aggregators).

For each pod we have:

  • K Edge switches, each with K downlink ports (\(K\) servers per edge, \(K*K\) in total) and K uplink ports, each connected to a different Aggregator.
  • K Aggregator switches, each with K downlink ports connected to the K Edges and K uplink ports, which we will see how to connect.

Pods are not directly connected to each other, and here we introduce a new level: Core Switches, which connect Pods together. We said Aggregators have \(K\) Uplink ports, and since a single Pod has \(K\) aggregators, total Uplink ports leaving the Pod are \(K*K=K^2\). Each of these ports is connected to a different Core Switch.

With \(P = 2K\) Pods, total uplink ports that the cores must handle is \(N_{\text{Pods Up. Tot.}}=K^2*2K=2K^3\).

If, for example, we use core switches with \(2K\) downlink ports, the number of required cores is \(N_{\text{Core Switches}}=\frac{2K^3}{2K}=K^2\).

Note that with \(K^2\) cores of \(2K\) ports each we can terminate all \(2K^3\) aggregator uplink ports; moreover, in a configuration with no oversubscription uplinks equal downlinks at every level, so \(N_{\text{Aggr. Dwnlnk}} = N_{\text{Edg Uplnk}} = N_{\text{Edg Dwnlnk}}=N_{\text{Srvrs}}\), all equal to \(2K^3\) (a counting sketch follows below).

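The counting above can be condensed into a tiny Python sketch, parameterized by K (using the notation of these notes: K edges and K aggregators per pod, \(2K\) pods, core switches with \(2K\) downlink ports):

```python
# Counting sketch for the pod-based (fat-tree-like) topology described above.
# K follows the notation of these notes, not the "classic" fat-tree k/2 formulation.

def fat_tree_counts(K):
    pods = 2 * K
    servers_per_pod = K * K                                # K edges x K downlink ports
    servers = pods * servers_per_pod                       # 2K^3
    edge_switches = pods * K
    aggregation_switches = pods * K
    uplinks_per_pod = K * K                                # K aggregators x K uplink ports
    core_switches = (pods * uplinks_per_pod) // (2 * K)    # 2K^3 uplinks / 2K ports per core = K^2
    return {
        "pods": pods,
        "servers": servers,
        "edge_switches": edge_switches,
        "aggregation_switches": aggregation_switches,
        "core_switches": core_switches,
    }

print(fat_tree_counts(K=4))
# {'pods': 8, 'servers': 128, 'edge_switches': 32, 'aggregation_switches': 32, 'core_switches': 16}
```
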
We can see how in this architecture there are two types of traffic:

  • Intra-pod traffic: if two servers are connected to the same pod, they communicate through multiple paths that traverse only 3 switches (basically exactly like Leaf-Spine):
    Server_A -> Access_A -> Aggregator -> Access_B -> Server_B
  • Extra-pod traffic: if the two servers are connected to different pods, the multiple paths connecting them are longer, traversing 5 switches:
    Server_A -> Access_A -> Aggregator_A -> Core -> Aggregator_B -> Access_B -> Server_B

In any case, latency is low and the structure is very modular: it is enough to increase the number of cores to introduce additional pods, with the difference that, unlike the classic Three Tier, here the network can be built with homogeneous routers and ultra-high-bandwidth cores are not required. Moreover, the East-West bisection bandwidth is much higher, making this approach very suitable for modern datacenters.

CamCube

Server Centric Architecture

The architectures treated so far were all router-centric. Let us now take a brief look at server-centric solutions and, later, hybrid ones.

We start with CamCube, an approach based on direct server-to-server connections, arranged in a toroidal topology (3D Torus Topology). This solution uses the concept of “locality”, meaning that if two servers are “close” in the toroidal lattice, the path connecting them is shorter.

This removes the router management problem, simplifying maintenance and lowering costs. At the same time, however, it introduces greater complexity in packet routing, which must be performed by the servers themselves, each of which therefore needs multiple NICs (Network Interface Cards).

DCell

Hybrid Architecture

This is a recursive network structure using both server-to-server connections and commodity routers. In practice:

  • \(DCell_0\): the base structure. It is composed of \(N<8\) servers connected via a commodity router.
  • \(DCell_k\): a higher-level structure composed of \(N-1\) \(DCell_{k-1}\) cells.

To increase network size, a DCell level is added. This makes the network easily scalable and very cheap, at the cost of generally longer paths and therefore higher latency.

BCube

Hybrid Architecture

This is also a recursive architecture. The fundamental block is \(BCube_0\), made of \(N\) servers connected to an N-port switch. Higher levels \(BCube_k\) are formed by combining \(N\) \(BCube_{k-1}\) and \(N^2\) switches with \(N\) ports.

Advantages are a good bisection bandwidth and the fact that performance degrades gracefully when faults occur. At the same time, cabling becomes very long and maintenance becomes harder.

MDCube

Hybrid Architecture

A variant useful to simplify the interconnection of Containers (we will talk about them in more detail later). Each container is mapped to an ID that associates it with a specific multidimensional tuple. Servers inside the same container are connected via (obviously) Intra-container connections. Connections between different containers are high-bandwidth and are made between servers whose tuple-id differs by only one bit (e.g., 010 <-> 011). These connections are called High-speed inter-container connections.

This forms a lattice similar to the previous case.

Here, the number of cables decreases a lot, simplifying DC maintenance.

3. SUPPORT MODULES FOR THE COMPUTE PART

When talking about datacenters, people often focus on servers, networks, and in general the “computing” part of the system. In reality, often less than half of DC space and consumed energy are used to power servers. So where do the rest of the resources go?

Supply systems

A fundamental DC module is power supply, for several reasons:

  • the overall system consumes a lot and needs an underlying power structure able to deliver the required energy amount.
  • system availability must be very high, so the power system must be extremely reliable.

For these reasons, supply systems include very robust primary power systems, associated with backup systems in case the former malfunction. In general, two complementary recovery systems are used:

  • Immediate backup systems: systems that activate immediately when a power interruption occurs. Their immediacy comes with the downside that they cannot power the DC for long periods, because the amount of storable energy is limited. Examples are the intuitive emergency batteries, or the more “exotic” Rotary UPS systems, that is, systems with a flywheel: a large rotating mass acting as a kinetic energy accumulator. In case of power interruption, the flywheel keeps rotating thanks to its inertia (conservation of angular momentum), powering the system for a short time (what a strange solution; a rough order-of-magnitude calculation follows after this list).
  • Medium-term backup systems: as said, the previous solutions power the DC for a short interval (minutes or hours), useful to start less immediate but longer-lasting backup systems, such as diesel generators (with little regard for environmental sustainability).
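
To get an order-of-magnitude feeling for the rotary UPS idea (the numbers here are purely illustrative, not from the course): consider a solid steel cylinder of mass \(m=1000\) kg and radius \(r=0.5\) m spinning at 3000 rpm (\(\omega\approx 314\) rad/s). Its moment of inertia is \(I=\frac{1}{2}mr^2=125\) kg·m², so the stored kinetic energy is \(E=\frac{1}{2}I\omega^2\approx 6\) MJ, about 1.7 kWh: enough to power a 200 kW load for roughly 30 seconds, which is the right order of magnitude to bridge the gap until the diesel generators start.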

“Global” cooling systems in a datacenter

Every DC element (servers, routers, supply systems) produces heat proportional to its energy density. This makes it necessary to use systems that extract heat and dissipate it outside (with little regard for global warming).

We will soon talk about the main techniques used, but first we must clarify a concept: racks are arranged in rows separated from each other, forming aisles. Aisles are not all the same: there are cold aisles, corresponding to the region where server fronts are accessible, and hot aisles, where rack rear interfaces face. In practice:

|| hot_aisle - R_server_F - cold_aisle - F_server_R - hot_aisle - R_server_F - cold_aisle - F_server_R - … ||

(where R stands for rear and F stands for front)

Moreover, DCs are often built on a raised floor, which allows easier cable routing and, as we will see, also plays a role in heat dissipation.

From the cooling technique standpoint, three are mainly used:

  • Open Loop Cooling: the simplest method, cooling the inside environment using outside air filtered to remove impurities. Cold air enters from side openings and the hot air produced by the components is channeled upward and expelled. It is simple and cheap, but suitable only for small datacenters without high energy density, because otherwise it is not sufficient to cool the components efficiently.
  • Dual Closed Loop Cooling: to improve cooling effectiveness, another component is added: the CRAC (Computer Room Air Conditioner). It is called “dual loop” because there are two cycles: in the first, air is taken from the hot aisles and routed to the CRAC, which cools it and pushes it through the raised floor back into the cold aisles to cool the servers; the second cycle is internal to the CRACs, which, being heat machines, use thermodynamic processes (expansion, compression, …) to cool a liquid flowing in internal pipes, and this liquid, in contact with the hot DC air, cools it down. Note that “closed loop” means the air never leaves the datacenter. This is useful for two reasons: it prevents impurities from entering the DC and it allows reducing the oxygen concentration, lowering the probability of fire.
  • Triple Closed Loop Cooling: a further improvement is possible by introducing Cooling Towers. These are large structures through which a coolant flows, exchanging heat with the outside through the tower’s large surface, cooling down and re-entering circulation. The liquid cooled by the cooling tower is put in thermal contact with the still-hot liquid inside the CRACs (the pipes of the two liquids are adjacent, but the liquids are not mixed), improving efficiency. The downside is a much higher initial infrastructure cost.

Cooling systems for individual racks/servers

Often “general” cooling is not enough to keep servers at peak capacity and it becomes necessary to implement systems directly on heat sources.

  • On-chip cooling: as energy density increases (think blade servers), applying dissipation directly on chips became common. Air solutions (fans) or liquid ones can be used; with the latter, maintenance must be very meticulous, because coolant leaks could severely damage the servers below.
  • Immersion cooling: a more exotic cooling system: servers are placed in racks filled with a specific liquid that does not interact with electronic components, making heat dissipation even more efficient. Implementation difficulty increases further.
  • In-rack cooling: a more conservative solution, where the cooling (fans or liquid) is placed at the back of the racks, extracting heat close enough to the source but not so close as to create serious direct problems for the servers in case of malfunction.

Modular Container-based Datacenters

We briefly mention a particular type of datacenter, contained inside standard containers (shipping containers). These systems integrate (in small scale) all the modules above, thus forming transportable DCs that can be moved “where needed”.

For example, if we wanted to reduce latency for a specific geographic region by moving frequently executed computations to a DC node closer to request sources, we could bring a container-based DC there.

Indicators of datacenter “quality”

As for any system, it is useful to define evaluation parameters for datacenters:

  • PUE (Power Usage Effectiveness): an indicator of how much of the energy entering the DC is actually used to power the “compute part”, computed as \(PUE=\frac{\text{Energy}_\text{tot}}{\text{Energy}_\text{compute}}\) (the ideal value is 1). Its inverse is the efficiency \(\mu = \frac{\text{Energy}_\text{compute}}{\text{Energy}_\text{tot}}\), which measures the fraction of DC energy actually used to power the “compute part” (a small numeric sketch follows after this list).
  • Datacenter Tiers: 4 categories (tiers) used to classify datacenters based on availability (we will go deeper later, but basically the percentage of time the system is working = \(\frac{\text{UPtime}}{(\text{UPtime}+\text{DOWNtime})}\)). The tiers are:
    • Tier 1: systems with a single IT chain and a single power and cooling chain. The single system is still highly reliable (\(\text{av}_{\text{Tier 1}}>99.671\%\)) but a single fault interrupts system operation.
    • Tier 2: all Tier 1 requirements plus redundant “general” power systems (emergency batteries, Rotary UPS, diesel generators, …). Expected availability is \(\text{av}_{\text{Tier 2}}>99.741\%\).
    • Tier 3: IT systems have dual power supply and it is possible to maintain any IT component without interrupting the service. \(\text{av}_{\text{Tier 3}}>99.982\%\)
    • Tier 4: maximum “reliability” level: dual supply is extended to the cooling system as well and the system becomes “fault tolerant”, that is, the datacenter does not interrupt a service because of a single fault (when a module breaks, the system automatically switches to the corresponding redundant module). \(\text{av}_{\text{Tier 4}}>99.995\%\)

Warning: do not confuse datacenter tiers with network tiers.
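
As a quick numeric illustration of these indicators (the energy figures are made up; the availability thresholds are the ones quoted above), here is a small Python sketch computing a PUE value and the yearly downtime each tier allows:

```python
# Illustrative sketch: PUE and the yearly downtime implied by the tier availability thresholds.

def pue(total_energy_kwh, compute_energy_kwh):
    """PUE = total facility energy / energy delivered to the 'compute part'; the ideal value is 1."""
    return total_energy_kwh / compute_energy_kwh

print(round(pue(total_energy_kwh=150_000, compute_energy_kwh=100_000), 2))  # 1.5 (made-up numbers)

tiers = {"Tier 1": 99.671, "Tier 2": 99.741, "Tier 3": 99.982, "Tier 4": 99.995}
hours_per_year = 24 * 365
for name, availability in tiers.items():
    downtime_h = (1 - availability / 100) * hours_per_year
    print(f"{name}: at most ~{downtime_h:.1f} hours of downtime per year")
# Tier 1 ~28.8 h, Tier 2 ~22.7 h, Tier 3 ~1.6 h, Tier 4 ~0.4 h
```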

4. DEPENDABILITY

Prerequisites

Let us first define some fundamental concepts, which we already had a taste of earlier:

  • MTTF (Mean Time To Failure): the average time interval before the first failure of a system occurs.
  • MTBF (Mean Time Between Failures): the average time interval between two consecutive system failures. For single components \(MTTF=MTBF\).
  • Failure rate (fault rate): \(\lambda = 1/\text{MTTF}\) (for single components, where MTTF = MTBF, this is the same as \(1/\text{MTBF}\)).

Dependability

Now we move to the main topic of the chapter: defining indicators to evaluate how much we “trust” a given system. In particular, we will use the concept of Dependability, consisting of:

  • Reliability: the probability that the system has no service interruptions up to time \(t\). It is associated with long-term reliability and is a key principle for systems that are hard to repair, such as space devices, where high reliability is expected because repair is (when possible) costly in time and resources. It can be calculated (only for systems with an exponentially distributed time to failure) as \(R(t)=e^{-\lambda t}=e^{-t/\text{MTTF}}\) (remember that for single components MTTF = MTBF); a small numeric sketch follows after this list.
  • Availability: the probability that the system works at time \(t\). It is extremely important for real-time systems, where components may be easily repairable (so reliability is useful but not fundamental) but it is crucial that when a request arrives the system responds promptly. It is computed as \(Av=\frac{\text{UPtime}}{\text{Uptime}+\text{DOWNtime}}\).
  • Maintainability: in various TDE exercises the term \(MTTR = \text{Mean Time To Repair}\) appears. It is the only reference we make to the repairability of components, because maintainability was not covered in depth in the course.
  • Safety (not covered).
  • Security (not covered).
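
Here is a minimal Python sketch of the single-component formulas above (the MTTF and MTTR values are made up):

```python
# Minimal sketch of the single-component dependability formulas (illustrative values).
import math

MTTF = 10_000.0   # mean time to failure, in hours (made-up value)
MTTR = 10.0       # mean time to repair, in hours (made-up value)

lam = 1 / MTTF    # failure rate lambda

def reliability(t_hours):
    """Probability of no failure up to time t, assuming an exponential distribution."""
    return math.exp(-lam * t_hours)

# Steady-state availability: uptime / (uptime + downtime) = MTTF / (MTTF + MTTR)
availability = MTTF / (MTTF + MTTR)

print(round(reliability(1_000), 4))  # ~0.9048: ~90% chance of surviving 1000 h with no failure
print(round(availability, 4))        # ~0.999
```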

Dependability of composite systems

Suppose we connect multiple components:

  • Series connection: given two systems A and B connected in series, knowing the reliability of single modules we can compute overall reliability as \(R_{\text{tot}} = R_A * R_B\). The same applies to availability: \(Av_{\text{tot}} = Av_A * Av_B\).
  • Parallel connection: given two systems A and B connected in parallel, if we know the reliability of single modules we can compute \(R_\text{tot}=1-(1-R_A)(1-R_B)\). Similarly, \(Av_\text{tot}=1-(1-Av_A)(1-Av_B)\).

Note that a system made only of series-connected components with exponential reliability is still exponential, with \(\lambda_{\text{tot}}=\sum_i \lambda_i\), so in that case one can directly write \(R_{\text{tot}}(t)=e^{-\lambda_{\text{tot}}t}\). This does not hold for parallel connections or for systems where series and parallels are mixed: there one must first compute the reliability of the individual parts as functions of time and then combine them, as in the sketch below.
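
A small Python sketch of the series/parallel composition rules (the reliability values are made up):

```python
# Series / parallel composition of reliabilities (or availabilities), illustrative values.
from functools import reduce

def series(*r):
    """All components must work: R_tot = product of the R_i."""
    return reduce(lambda a, b: a * b, r)

def parallel(*r):
    """At least one component must work: R_tot = 1 - product of (1 - R_i)."""
    return 1 - reduce(lambda a, b: a * b, (1 - x for x in r))

R_A, R_B, R_C = 0.9, 0.8, 0.95
print(series(R_A, R_B))    # 0.72
print(parallel(R_A, R_B))  # 0.98

# Mixed systems: compute the parts first, then combine.
# Example: (A in parallel with B) in series with C.
print(round(series(parallel(R_A, R_B), R_C), 4))  # 0.931
```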

Different types of Fault

Ok, we like having a reliable system we can tRuSt, but reliable with respect to what? “proBlems”.

Since this is not a degree of formality acceptable for definitions, in engineering we distinguish:

  • Fault: a software or hardware defect present in our system. It exists, but might never cause problems (or it might blow up our datacenter).
    A drone with a slightly crooked wing. It might still reach the planned point B.
  • Error: when the Fault generates a problem that manifests in system behavior. This does not yet mean the system has failed its task: it might recover somehow and still deliver the intended service.
    Due to the crooked wing, the drone follows a different path than expected, but through GPS corrections still lands at the planned point.
  • Failure: the chain of events ends with system failure, meaning it did not perform its task. Obviously the worst case.
    The drone crashes badly.

There are two approaches to handle such problems:

  • Avoidance: verify the design so well that we are sure “problems” will not occur.
  • Tolerance: expect that problems will occur, but introduce redundancy and error checking to make the system fault-resistant.

Usually, the adopted approach is a mix between the two extremes.

5. VIRTUALIZATION AND CLOUD COMPUTING

After talking about hardware, let us talk about how it is partitioned among processes. Most modern Cloud Computing services are based on Virtualization technology.

Virtualization

Virtualization is a technology that allows organizing physical (real) resources of a machine into virtual resources, then making them available to multiple virtual machines. The software running on each VM (Virtual Machine) will have the impression of being installed on dedicated servers, while in reality it is running on one of the VM instances installed on a common server.

This approach has multiple advantages:

  • Resource usage in a datacenter can be optimized thanks to more flexible management of resources.
  • Security among VM instances is ensured by a high level of isolation: VMs cannot directly access each other’s memory or resources and behave as if they were on different machines. This also means that an attack compromising one VM does not directly propagate to the others.
  • Software installation can be made independent of hardware by virtualizing the latter and providing the same interface regardless of underlying real hardware.

Ok, but how is it implemented more specifically?

Machine IT levels

Before diving into the main topic we must introduce Machine IT levels, that is, layers in which we can organize how a computing system works. From bottom to top:

  • 0. Digital Logic Level: the hardware level. Individual logic modules.
  • 1. Microarchitecture Level: the set of logic modules and how they are connected.
  • 2. Instruction Set Level: the level that provides the software interface to communicate with the hardware (ISA = Instruction Set Architecture). There are two parts of the ISA: the System ISA, with instructions accessible by the OS, hiding the complexity of hardware management, and the User ISA, containing the instructions accessible by the user (developer) to run their app.
  • 3. Operating System Level: as the name suggests, the OS level. Here the OS exposes to applications the ABI = Application Binary Interface, allowing them to use User ISA and system calls, that is, requests to use shared hardware resources.
  • 4. Assembly Language Level and 5. Problem-oriented Language Level.

Basic virtualization architecture

Virtualization system modules are:

  • Hypervisor or VM Monitor: manages the virtualization layer and provides the virtualized interface upward. Theoretically called Hypervisor if installed directly on hardware (level 2) and VM Monitor if hosted on an OS (level 3). In practice, the term Hypervisor is often used for both.
  • Virtual Machines: modules installed on the hypervisor. As we will see, they can be of various types.

System Virtual Machines

We can distinguish two macro-types of virtualization systems, each divided into many subcategories.

The first macro-type we analyze is System VM: here, the Hypervisor virtualizes the entire ISA, giving the OSs installed in the VMs the impression that they are running on a machine entirely dedicated to them (or, in the case of Emulation, on a machine with a compatible ISA even when the real server’s ISA is different). System VMs can be installed at different IT layers:

  • If we install virtualization directly on hardware (level 2 of IT layers) we get a bare metal system (type 1 hypervisor). Two approaches are possible: a Hypervisor with a Monolithic Kernel, containing all drivers for hardware communication and managing ISA calls to hardware, or a Microkernel system, where the kernel is much smaller and part of software-hardware interaction (including driver installation) is assigned to a Service VM running on the hypervisor, where the drivers needed for system operation can be installed. The two solutions have pros and cons: with a monolithic kernel we get better performance (path is VM -> HPV and not VM -> SrvVM -> HPV, so less latency) and better isolation between VMs (since the only common node is the hypervisor), while the microkernel offers more flexibility, because drivers can be installed in the Service VM without recompiling the whole kernel.
  • It is also possible to install virtualization on the OS, in which case it is OS Hosted (type 2 hypervisor). This provides greater flexibility with benefits similar to the microkernel approach (instead of installing drivers in the VM kernel, they are managed by the underlying OS) at the cost of lower performance. Note that even if the virtualization software runs on a host OS, it provides the VMs above it with the complete ISA interface of a compatible “hardware”.

Still within System VM, emulation can be approached in two different ways:

  • Full Virtualization: a complete and identical ISA interface is presented to hosted OSs, matching what they expect. This allows running unmodified OSs, at the cost of lower performance than the next option, because in this case a “full emulation” is needed.
  • Para Virtualization: the hypervisor presents an ISA interface that is similar but not identical to the underlying hardware. The OS knows it is installed on a VM and, instead of performing system calls, performs hypercalls, read by the hypervisor and converted into calls to the underlying ISA. This yields higher speed and efficiency, but only modified OSs can run.

Note: paravirtualization can be done both bare-metal and OS-hosted.

Process Virtual Machines

Here the hypervisor provides higher levels with the ABI (Application Binary Interface), placing itself at level 3 of IT layers. Virtualization software is also called Runtime Software. VMs in this category can run single processes (for example, the Java Virtual Machine).

Containers

Similarly to Process Virtual Machines, containers bring virtualization to the operating system level (3). They are preconfigured modules with all the libraries and variables needed to correctly run a given application. In this way container behavior becomes predictable regardless of the environment where it is installed, simplifying process management in DCs. They can be very useful for services provided as Platform As A Service (PaaS).

Cloud Computing

The technologies above made it possible to develop complex computing/storage/networking services delivered through the internet on-demand: Cloud Computing.

Below are the main Cloud Computing approaches (there are many more):

  • SaaS (Software As A Service): the provider delivers the full software through the cloud and the user only has to input data to use it (e.g., Gmail, GDrive, …).
  • PaaS (Platform As A Service): the cloud offers a simplified interface (API) so the user (developer) writes their application by targeting those APIs. System management is simplified since the OS and lower levels are managed by the service provider, but there is a risk of being locked in to a single provider environment because APIs are proprietary.
  • IaaS (Infrastructure As A Service): on-demand access to virtual resources on which to install “your own” operating system. The provider manages only the underlying VMs, leaving the user responsible for ensuring the rest works correctly. Similar services are DaaS (Data As A Service) and CaaS (Communication As A Service), providing storage and network infrastructures via the internet.

Different approaches are also possible from the infrastructure ownership standpoint:

  • Public Cloud: services publicly accessible through some form of payment (your data are also a form of payment).
  • Private Cloud: cloud infrastructure managed by a single organization, which is also the one using resources. Useful where privacy and data confidentiality are critical, and cheaper long-term for very large services.
  • Community Cloud: similar to private, but managed by multiple federated companies.
  • Hybrid solutions: any combination of the previous solutions (e.g., private cloud in everyday operation plus some public cloud when demand peaks cannot be handled with own resources).

6. OK, THAT IS ENOUGH

In the course we also went deeper into some models to study the performance of systems like datacenters or disks, but I suggest going into those topics through exercises rather than theory.

For this reason, I stop here.

Good luck for the exam! ☘️


Angelo Antona, 10 August - 5 September 2025