What is cache memory and what is it for? Cache architectures and operating principles

Cache memory (also called a buffer) is used in digital devices as a high-speed intermediate store. Cache memory can be found in devices such as processors, network cards, CD drives and many others.

The operating principle and architecture of the cache can vary greatly.

For example, a cache can serve as a simple intermediate buffer. The device processes data and transfers it to a high-speed buffer, from which the controller passes the data on to the interface. Such a cache is intended to prevent errors, to check data integrity in hardware, or to translate the device's signal into a form the interface understands without delays. This scheme is used, for example, in CD/DVD drives.

In another case, the cache stores frequently used code and data and thereby speeds up processing: the device does not need to recompute or look up the data again, which would take much longer than reading it from the cache. Here, the size and speed of the cache play a very important role.

This architecture is most often found in hard drives and central processing units (CPUs).

While a device is operating, special firmware or dispatcher programs may be loaded into the cache, which would run more slowly if executed from ROM (read-only memory).

Most modern devices use a mixed cache type, which serves both as an intermediate buffer and as storage for frequently used code.

There are several very important functions implemented for the cache of processors and video chips.

Linking execution units. Central processing units and video processors often use a fast cache shared between cores. If one core has processed some data and it is in the cache, and a command arrives for the same operation or for work on that data, the data will not be processed again but taken from the cache for further processing. The core is freed up to process other data. This significantly increases performance in repetitive but complex calculations, especially if the cache is large and fast.

A shared cache also lets the cores exchange data with it directly, bypassing slow RAM.

Instruction cache. There is either a shared, very fast L1 cache for instructions and other data, or a dedicated instruction cache. The more instructions a processor works with, the larger the instruction cache it needs. This reduces memory latency and allows the instruction unit to operate almost independently. If the instruction cache is too small, the instruction unit periodically sits idle, which slows down computation.

Other functions and features.

It is noteworthy that CPUs (central processing units) apply hardware error correction (ECC) to the cache, because even a small error there can corrupt all further processing of that data.

CPUs and GPUs have a cache hierarchy, which makes it possible to separate data for individual cores from shared data. Almost all data from the second-level cache is also copied to the third, shared level, though not always. The first cache level is the fastest, and each subsequent level is slower but larger.

For processors, three cache levels or fewer are considered normal. This balances speed, cache size and heat dissipation. In video processors it is hard to find more than two cache levels.

Cache size, performance impact and other characteristics.

Naturally, the larger the cache, the more data it can store and process, but there is a serious problem.

A big cache means a big transistor budget. In server processors (CPUs), the cache can consume up to 80% of the transistor budget. Firstly, this affects the final cost; secondly, power consumption and heat dissipation grow out of proportion to a performance gain of only a few percent.

One of the important factors that increases processor performance is the presence of cache memory, or rather its volume, access speed and distribution among levels.

For quite some time now, almost all processors have been equipped with this type of memory, which once again proves how useful it is. In this article we will talk about the structure, levels and practical purpose of cache memory, a very important processor characteristic.

What is cache memory and its structure

Cache memory is ultra-fast memory used by the processor to temporarily store data that is most frequently accessed. This is how we can briefly describe this type of memory.

Cache memory is built on flip-flops, which in turn consist of transistors. A group of transistors takes up far more space than the capacitors that make up RAM, which brings many production difficulties and limits the achievable capacity. That is why cache memory is very expensive while having a tiny capacity. But this structure also gives such memory its main advantage: speed. Since flip-flops need no refresh, and the propagation delay of the gates they are built from is small, a flip-flop switches from one state to another very quickly. This allows cache memory to operate at the same frequencies as modern processors.

Also, an important factor is the placement of the cache memory. It is located on the processor chip itself, which significantly reduces access time. Previously, cache memory of some levels was located outside the processor chip, on a special SRAM chip somewhere on the motherboard. Now, almost all processors have cache memory located on the processor chip.


What is processor cache used for?

As mentioned above, the main purpose of cache memory is to store data that is frequently used by the processor. The cache is a buffer into which data is loaded, and despite its small size in modern processors (about 4-16 MB), it gives a significant performance boost in any application.

To better understand the need for cache memory, let's imagine a computer's memory organized like an office. RAM is a cabinet of folders that the accountant periodically goes to in order to retrieve large blocks of data (that is, folders). The desk is the cache memory.

Some items are kept right on the accountant's desk because he refers to them several times an hour: phone numbers, sample documents and so on. Keeping this information on the desk speeds up access to it.

In the same way, data from the large blocks (folders) can be put on the desk for quick use, for example a single document. When the document is no longer needed, it goes back into the cabinet (RAM), clearing the desk (cache memory) for new documents that will be used in the near future.

Cache memory works the same way: data that is likely to be accessed again is loaded from RAM into the cache. Very often this also means loading the data that is most likely to be needed right after the current data; that is, the hardware makes assumptions about what will be used next. These are, in brief, the operating principles, though the actual logic is quite complex.
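
To make this idea concrete, here is a small C# sketch (the array size, the shuffle-based access order and all names are illustrative assumptions, not taken from the text above): it visits the same array elements twice, once in order and once in random order. The sequential pass is typically several times faster, because neighbouring data arrives together in cache lines and the prefetcher can guess what comes next.

using System;
using System.Diagnostics;

class LocalityDemo
{
    static void Main()
    {
        const int n = 16 * 1024 * 1024;          // 64 MB of ints, far larger than any cache
        int[] data = new int[n];

        // The same indices, in two different orders.
        int[] sequential = new int[n];
        int[] shuffled = new int[n];
        var rnd = new Random(42);
        for (int i = 0; i < n; i++) { sequential[i] = i; shuffled[i] = i; }
        for (int i = n - 1; i > 0; i--)          // Fisher-Yates shuffle
        {
            int j = rnd.Next(i + 1);
            (shuffled[i], shuffled[j]) = (shuffled[j], shuffled[i]);
        }

        long Walk(int[] order)                   // time one pass over data in the given order
        {
            var sw = Stopwatch.StartNew();
            long sum = 0;
            for (int i = 0; i < order.Length; i++) sum += data[order[i]];
            Console.WriteLine($"{sw.ElapsedMilliseconds} ms (checksum {sum})");
            return sum;
        }

        Walk(sequential);   // neighbours are fetched together and prefetched ahead of time
        Walk(shuffled);     // same amount of work, but most accesses miss the cache
    }
}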

Processor cache levels

Modern processors are equipped with a cache, which often consists of 2 or 3 levels. Of course, there are exceptions, but this is often the case.

In general, there can be the following levels: L1 (first level), L2 (second level), L3 (third level). Now a little more detail on each of them:

First level cache (L1) is the fastest cache level and works directly with the processor core. Thanks to this tight coupling, it has the shortest access time and operates at frequencies close to the processor's. It serves as a buffer between the processor and the second-level cache.

We will use the high-performance Intel Core i7-3770K as an example. This processor has 4 × 32 KB of L1 cache, i.e. 128 KB in total (32 KB per core).

Second level cache (L2) is larger than the first but, as a result, slower. It serves as a buffer between the L1 and L3 levels. In our Core i7-3770K example, the L2 cache size is 4 × 256 KB = 1 MB.

Level 3 cache (L3) is slower still than the previous two, but much faster than RAM. The L3 cache in the i7-3770K is 8 MB. Whereas the previous two levels are private to each core, this level is shared by the whole processor. 8 MB is quite solid but not extreme: in Extreme-series processors such as the i7-3960X it is 15 MB, and in some new Xeon processors more than 20 MB.

What is processor cache?

A cache is a part of memory that provides maximum access speed and speeds up computation speed. It stores the pieces of data that the processor requests most often, so that the processor does not need to constantly access system memory for them.

RAM, as you know, is one of the slowest parts of the computer in terms of data exchange. If the processor needs some data, it goes to RAM for it over the memory bus. Having received the request, RAM starts digging through its archives for the data the processor needs and, once found, sends it back to the processor along the same memory bus. This round trip has always been too long, so manufacturers decided to let the processor store data somewhere close by. The way the cache works is based on this simple idea.

Think of memory as a school library. A student comes up to the librarian for a book; she goes to the shelves, finds it, returns to the student, checks it out and moves on to the next student. At the end of the day she repeats the same operations as the books are returned. This is how a processor without a cache works.

Why does the processor need a cache?

Now imagine that the librarian is tired of constantly running back and forth for books that are in demand year after year, day after day. She gets a large cabinet at her desk where she keeps the most frequently requested books and textbooks. The rest, of course, remain on the same shelves as before, but these are always at hand. How much time this cabinet saves, both for her and for everyone else. This is the cache.

So, the cache can only store the most required data?

Yes, but it can do more. Having stored the frequently needed data, it can also assess the situation (with the processor's help) and request data that is about to be needed. A video rental customer who asks for the first "Die Hard" movie will most likely ask for the second one next, and there it is, already waiting! The same goes for the processor cache: when it fetches certain data from RAM, it also pulls in data from the neighbouring memory cells. Such chunks of data are called cache lines.

What is a two-level cache?

A modern processor has two cache levels, the first and the second, designated by the letter L from the English word Level. The first, L1, is faster but small in volume. The second, L2, is somewhat larger and slower, yet still faster than RAM. The first-level cache is divided into an instruction cache and a data cache: the instruction cache stores the instructions the processor needs for its calculations, while the data cache stores the values needed for the current calculation. The second-level cache is used to load data from the computer's RAM.

The way cache levels work can also be explained with the school library example. Having filled the cabinet she bought, the librarian realizes it is no longer big enough for all the books she constantly has to run across the hall for. The list of such books is settled, so she needs another cabinet just like it. She does not throw away the first one, that would be a waste, and simply buys a second. As the first one fills up, the librarian starts filling the second, which comes into play when a needed book no longer fits into the first. It is the same with cache levels, and as microprocessor technology develops, processor caches grow in size.

Will the cache continue to grow?

Hardly. The race for processor frequency did not last long either, and manufacturers found other ways to increase performance. The same applies to the cache: its volume and number of levels cannot be inflated endlessly. The cache must not turn into another stick of RAM with slow access, nor grow the processor to half the size of the motherboard. After all, fast data access costs energy and chip area. Cache misses (as opposed to cache hits), where the processor looks in the cache for data that is not there, also become more frequent as more data competes for it. The data in the cache is constantly replaced according to various algorithms in order to increase the probability of a cache hit.
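
As a rough illustration of what such a replacement algorithm does, here is a toy C# model of an LRU (least recently used) policy. It is a simplification for intuition only, not a description of how any particular processor's cache is built, and all names in it are invented for this sketch.

using System;
using System.Collections.Generic;

// A toy model of a cache with a fixed number of lines and LRU replacement.
class LruCacheModel
{
    private readonly int capacity;
    private readonly LinkedList<long> order = new LinkedList<long>();   // front = most recently used
    private readonly Dictionary<long, LinkedListNode<long>> lines =
        new Dictionary<long, LinkedListNode<long>>();

    public LruCacheModel(int capacity) { this.capacity = capacity; }

    // Returns true on a hit, false on a miss (evicting the least recently used line if full).
    public bool Access(long lineAddress)
    {
        if (lines.TryGetValue(lineAddress, out var node))
        {
            order.Remove(node);                // hit: move the line to the "most recently used" spot
            order.AddFirst(node);
            return true;
        }
        if (lines.Count == capacity)
        {
            long evicted = order.Last.Value;   // miss: evict the line unused for the longest time
            order.RemoveLast();
            lines.Remove(evicted);
        }
        lines[lineAddress] = order.AddFirst(lineAddress);
        return false;
    }
}

Feeding this model a stream of addresses shows the essence of the trade-off: a small working set that is touched repeatedly produces almost nothing but hits, while a scan over more lines than the capacity produces almost nothing but misses.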

Almost all developers know that the processor cache is a small but fast memory that stores data from recently visited memory areas - the definition is short and quite accurate. However, knowing the boring details about the cache mechanisms is necessary to understand the factors that affect code performance.

In this article we will look at a number of examples illustrating various features of caches and their impact on performance. The examples will be in C#; the choice of language and platform does not greatly affect the performance assessment and final conclusions. Naturally, within reasonable limits, if you choose a language in which reading a value from an array is equivalent to accessing a hash table, you will not get any interpretable results. Translator's notes are in italics.

Example 1: Memory Access and Performance

How much faster do you think the second loop is than the first?
int[] arr = new int[64 * 1024 * 1024]; // array size assumed: large enough not to fit in any cache

// first
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// second
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;


The first loop multiplies every value in the array by 3; the second multiplies only every sixteenth value. The second loop does only about 6% of the work of the first, yet on modern machines both run in roughly the same time: 80 ms and 78 ms respectively (on my machine).

The explanation is simple: memory access. The speed of these loops is determined primarily by the memory subsystem, not by integer multiplication. As we will see in the next example, the number of accesses to RAM is the same in both cases.
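
For reference, a minimal harness for reproducing this measurement might look as follows; the 64M-element array size and the use of Stopwatch are assumptions of this sketch, and the absolute numbers will differ from machine to machine.

using System;
using System.Diagnostics;

class Example1
{
    static void Main()
    {
        int[] arr = new int[64 * 1024 * 1024];   // assumed size: large enough not to fit in any cache

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < arr.Length; i++) arr[i] *= 3;        // first loop: every element
        Console.WriteLine($"step 1:  {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;    // second loop: every 16th element
        Console.WriteLine($"step 16: {sw.ElapsedMilliseconds} ms");
    }
}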

Example 2: Impact of Cache Lines

Let's dig deeper and try other step values, not just 1 and 16:
for (int i = 0; i < arr.Length; i += K /* step */) arr[i] *= 3;

Here are the running times of this loop for different step values ​​K:

Note that with step values from 1 to 16 the running time stays virtually unchanged. With values greater than 16 the running time drops by about half each time we double the step. This does not mean the loop magically starts running faster; it is simply that the number of iterations drops as well. The key point is the identical running time for steps from 1 to 16.

The reason is that modern processors access memory not one byte at a time but in small blocks called cache lines. A cache line is typically 64 bytes. When you read any value from memory, at least one whole cache line is brought into the cache, and subsequent accesses to any value in that line are very fast.

Because 16 int values ​​occupy 64 bytes, loops with steps from 1 to 16 access the same number of cache lines, or more precisely, all cache lines of the array. At step 32, access occurs to every second line, at step 64, to every fourth.

Understanding this is important for some optimization techniques. The number of cache line accesses depends on how the data is laid out in memory. For example, unaligned data may require two main-memory accesses instead of one, and as we saw above, that halves the speed.
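
A sketch of the step sweep described above; the particular set of K values, the array size and the output format are this sketch's assumptions.

using System;
using System.Diagnostics;

class Example2
{
    static void Main()
    {
        int[] arr = new int[64 * 1024 * 1024];   // assumed size, as in the first example

        foreach (int k in new[] { 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 })
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < arr.Length; i += k) arr[i] *= 3;
            Console.WriteLine($"K = {k,4}: {sw.ElapsedMilliseconds} ms");
        }
    }
}

Plotting these times against K reproduces the plateau up to K = 16 and the roughly twofold drops afterwards.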

Example 3: Level 1 and 2 cache sizes (L1 and L2)

Modern processors typically have two or three levels of caches, usually called L1, L2, and L3. To find out the cache sizes at each level, you can use the CoreInfo utility or the Windows API function GetLogicalProcessorInformation. Both also report the cache line size for each level.

On my machine, CoreInfo reports 32 KB L1 data caches, 32 KB L1 instruction caches, and 4 MB unified L2 caches. Each core has its own private L1 caches; each L2 cache is shared by a pair of cores:

Logical Processor to Cache Map:
*--- Data Cache          0, Level 1,   32 KB, Assoc   8, LineSize  64
*--- Instruction Cache   0, Level 1,   32 KB, Assoc   8, LineSize  64
-*-- Data Cache          1, Level 1,   32 KB, Assoc   8, LineSize  64
-*-- Instruction Cache   1, Level 1,   32 KB, Assoc   8, LineSize  64
**-- Unified Cache       0, Level 2,    4 MB, Assoc  16, LineSize  64
--*- Data Cache          2, Level 1,   32 KB, Assoc   8, LineSize  64
--*- Instruction Cache   2, Level 1,   32 KB, Assoc   8, LineSize  64
---* Data Cache          3, Level 1,   32 KB, Assoc   8, LineSize  64
---* Instruction Cache   3, Level 1,   32 KB, Assoc   8, LineSize  64
--** Unified Cache       1, Level 2,    4 MB, Assoc  16, LineSize  64
Let's check this information experimentally. To do this, let's go through our array, incrementing every 16th value - an easy way to change the data in each cache line. When we reach the end, we return to the beginning. Let's check different array sizes; we should see a drop in performance when the array no longer fits into caches of different levels.

The code is:

int steps = 64 * 1024 * 1024; // number of iterations
int lengthMod = arr.Length - 1; // arr.Length is a power of two

for (int i = 0; i < steps; i++)
{
    // x & lengthMod == x % arr.Length, because arr.Length is a power of two
    arr[(i * 16) & lengthMod]++;
}


Test results:

On my machine, there are noticeable drops in performance after 32 KB and 4 MB - these are the sizes of the L1 and L2 caches.
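
A sketch of the whole experiment; the range of array sizes and the wrapping loop are this sketch's additions, while the inner loop is the one shown above.

using System;
using System.Diagnostics;

class Example3
{
    static void Main()
    {
        // Array sizes from 16 KB to 64 MB, each a power of two so that & works as %.
        for (int length = 4 * 1024; length <= 16 * 1024 * 1024; length *= 2)
        {
            int[] arr = new int[length];
            int steps = 64 * 1024 * 1024;          // fixed number of iterations
            int lengthMod = arr.Length - 1;

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < steps; i++)
                arr[(i * 16) & lengthMod]++;       // touch a new cache line on every iteration
            sw.Stop();

            Console.WriteLine($"{length * sizeof(int) / 1024,6} KB: {sw.ElapsedMilliseconds} ms");
        }
    }
}

On the machine described above, the printed times should jump noticeably once the array exceeds 32 KB (out of L1) and again past 4 MB (out of L2).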

Example 4: Instruction Parallelism

Now let's look at something else. In your opinion, which of these two loops will execute faster?
int steps = 256 * 1024 * 1024;
int[] a = new int[2];

// first
for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }

// second
for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }


It turns out that the second loop runs almost twice as fast, at least on all the machines I tested. Why? Because the statements inside the loop bodies have different data dependencies. In the first loop the operations form a single chain of dependencies: each increment of a[0] must wait for the previous one to finish.

In the second loop the increments of a[0] and a[1] are independent of each other and can proceed in parallel.

The execution units of modern processors can perform a certain number of operations simultaneously, usually not a very large number. For example, the processor can access L1 data at two addresses in parallel, or execute two simple arithmetic instructions at once. In the first loop the processor cannot exploit these capabilities; in the second it can.
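
A minimal way to time both variants (the Stopwatch wrapper and the printed labels are this sketch's additions; absolute numbers will vary):

using System;
using System.Diagnostics;

class Example4
{
    static void Main()
    {
        int steps = 256 * 1024 * 1024;
        int[] a = new int[2];

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }   // one chain of dependent increments
        Console.WriteLine($"dependent:   {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }   // two independent chains
        Console.WriteLine($"independent: {sw.ElapsedMilliseconds} ms");
    }
}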

Example 5: Cache Associativity

One of the key questions that must be answered when designing a cache is whether data from a certain memory region can be stored in any cache cells or only in some of them. Three possible solutions:
  1. Direct-mapped cache. Each cache line of RAM can be stored in only one, predetermined cache cell. The simplest mapping is: line_index_in_memory % number_of_cache_cells. Two lines mapped to the same cell cannot be in the cache at the same time.
  2. N-way set-associative cache. Each line can be stored in any of N cache cells. For example, in a 16-way cache a line can be stored in any of the 16 cells that make up its group. Typically, lines whose indices share the same least significant bits share one group.
  3. Fully associative cache. Any line can be stored in any cache cell. In its behavior this solution is equivalent to a hash table.
Direct-mapped caches are prone to contention: when two lines compete for the same cell, they keep evicting each other from the cache, and efficiency drops sharply. Fully associative caches are free from this drawback but are complex and expensive to implement. Set-associative caches are the typical trade-off between implementation complexity and efficiency.

For example, on my machine the 4 MB L2 cache is 16-way set-associative. All of RAM is divided into sets of lines by the least significant bits of their line indices; the lines of each set compete for one group of 16 L2 cache cells.

Since the L2 cache has 65,536 cells (4·2^20 / 64) and each group consists of 16 cells, there are 4,096 groups in total. Thus, the lower 12 bits of a line's index determine which group that line belongs to (2^12 = 4,096). As a result, lines whose addresses differ by a multiple of 262,144 bytes (4,096 · 64) share the same group of 16 cells and compete for space in it.
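
In code, mapping an address to its cache group under these parameters comes down to a division and a remainder; the constants below are the ones derived above for this particular 4 MB, 16-way cache, and the class and method names are invented for this sketch.

static class CacheMapping
{
    // Parameters derived above: 64-byte lines, 4,096 groups, 16 cells per group (4 MB total).
    private const int LineSize = 64;
    private const int GroupCount = 4096;

    public static int GroupIndex(long address)
    {
        long lineIndex = address / LineSize;    // which cache line the address falls into
        return (int)(lineIndex % GroupCount);   // the lower 12 bits of the line index pick the group
    }
}

// Addresses 262,144 bytes apart map to the same group and compete for its 16 cells:
// CacheMapping.GroupIndex(0) == CacheMapping.GroupIndex(262144) == 0.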

For the effects of associativity to show up, we need to repeatedly access a large number of lines from the same group, for example using the following code:

public static long UpdateEveryKthByte(byte[] arr, int K)
{
    const int rep = 1024 * 1024; // number of iterations

    Stopwatch sw = Stopwatch.StartNew();

    int p = 0;
    for (int i = 0; i < rep; i++)
    {
        arr[p]++;

        p += K;
        if (p >= arr.Length) p = 0;
    }

    sw.Stop();
    return sw.ElapsedMilliseconds;
}


The method increments every Kth element of the array. When we reach the end, we start again. After quite a large number of iterations (2^20) we stop. I made runs for different array sizes and step values K. Results (blue means long running time, white means short):

Blue areas correspond to cases where, under constant data modification, the cache cannot hold all the required data at once. Bright blue means a running time of about 80 ms; almost white, about 10 ms.
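
For reference, a driver producing such a grid of measurements might look like the sketch below; it assumes it sits in the same class as UpdateEveryKthByte, and the specific ranges of array sizes and steps are illustrative.

using System;

class AssociativityDemo
{
    // UpdateEveryKthByte from the listing above is assumed to be defined in this class.

    static void Main()
    {
        foreach (int sizeMb in new[] { 1, 2, 4, 8, 16, 32 })   // array sizes to probe
        {
            byte[] arr = new byte[sizeMb * 1024 * 1024];
            for (int k = 64; k <= 576; k += 64)                // step values to probe
                Console.Write($"{UpdateEveryKthByte(arr, k),6}");
            Console.WriteLine($"   <- {sizeMb} MB array");
        }
    }
}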

Let's deal with the blue areas:

  1. Why do vertical lines appear? Vertical lines correspond to step values at which we access too many lines (more than 16) from one group. For these values, my machine's 16-way cache cannot hold all the required data.

    Some of the bad step values are powers of two: 256 and 512. For example, consider step 512 and an 8 MB array. With this step, the array contains 32 sections (8·2^20 / 262,144), which compete with one another for the cells of 512 cache groups (262,144 / 512). There are 32 sections, but only 16 cells per group in the cache, so there is not enough room for all of them.

    Other step values that are not powers of two are simply unlucky: they cause a large number of accesses to the same cache groups and likewise produce vertical blue lines in the figure. Lovers of number theory are invited to ponder why.

  2. Why do the vertical lines break off at the 4 MB boundary? With an array of 4 MB or less, the 16-way cache behaves like a fully associative one, that is, it can hold all the array's data without conflicts: no more than 16 sections fight for any one cache group (262,144 × 16 = 4·2^20 = 4 MB).
  3. Why is there a big blue triangle at the top left? Because with a small step and a large array, the cache is not able to fit all the necessary data. The degree of cache associativity plays a secondary role here; the limitation is related to the size of the L2 cache.

    For example, with an array size of 16 MB and a stride of 128, we access every 128th byte, thus modifying every second array cache line. To store every second line in the cache, you need 8 MB of cache, but on my machine I only have 4 MB.

    Even if the cache were fully associative, it would not allow 8 MB of data to be stored in it. Note that in the already discussed example with a stride of 512 and an array size of 8 MB, we only need 1 MB of cache to store all the necessary data, but this is impossible due to insufficient cache associativity.

  4. Why does the left side of the triangle gradually gain in intensity? The maximum intensity occurs at a step value of 64 bytes, which equals the cache line size. As we saw in the first and second examples, sequential accesses to the same line cost practically nothing: with a step of 16 bytes, say, we get four memory accesses for the price of one.

    Since the number of iterations is the same in our test for any step value, a cheaper step results in less running time.

The discovered effects persist at large parameter values:

Cache associativity is an interesting thing that can manifest itself under certain conditions. Unlike the other problems discussed in this article, it is not so serious. It's definitely not something that requires constant attention when writing programs.

Example 6: False Cache Sharing

On multi-core machines, you may encounter another problem - cache coherence. Processor cores have partially or completely separate caches. On my machine, the L1 caches are separate (as usual), and there are also two L2 caches shared by each pair of cores. The details may vary, but in general, modern multi-core processors have multi-level hierarchical caches. Moreover, the fastest, but also the smallest caches belong to individual cores.

When one core modifies a value in its cache, the other cores can no longer use the old value; it must be updated in their caches. Moreover, the entire cache line must be updated, since caches operate on data at the granularity of whole lines.

Let's demonstrate this problem with the following code:

private static int[] s_counter = new int[1024]; // array length assumed; it only needs to exceed the largest index used

private void UpdateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        s_counter[position] = s_counter[position] + 3;
    }
}


If on my four-core machine I call this method with parameters 0, 1, 2, 3 simultaneously from four threads, then the running time will be 4.3 seconds. But if I call the method with parameters 16, 32, 48, 64, then the running time will be only 0.28 seconds.

Why? In the first case, the four values being updated by the threads are very likely to sit in the same cache line. Each time one core increments its value, it invalidates the cache cells holding that line in the other cores, and they all have to load the line again. This cripples the caching mechanism and kills performance.
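
A sketch of how those two runs can be launched; using Task.Run and Task.WaitAll is this sketch's choice (the text above does not say which threading mechanism was used), and the field and method are repeated here so the fragment is self-contained.

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class FalseSharingDemo
{
    private static int[] s_counter = new int[1024];   // length assumed, as above

    private static void UpdateCounter(int position)
    {
        for (int j = 0; j < 100000000; j++)
            s_counter[position] = s_counter[position] + 3;
    }

    private static long Run(params int[] positions)
    {
        var sw = Stopwatch.StartNew();
        Task[] tasks = Array.ConvertAll(positions, p => Task.Run(() => UpdateCounter(p)));
        Task.WaitAll(tasks);
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        Console.WriteLine($"adjacent counters: {Run(0, 1, 2, 3)} ms");     // all four share one cache line
        Console.WriteLine($"spread counters:   {Run(16, 32, 48, 64)} ms"); // 64 bytes apart: one line each
    }
}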

Example 7: Hardware Complexity

Even now, when the principles of cache operation are no secret to you, the hardware will still give you surprises. Processors differ from each other in optimization methods, heuristics and other implementation subtleties.

The L1 cache of some processors can access two cells in parallel if they belong to different groups, but if they belong to the same group, only sequentially. As far as I know, some can even access different quarters of the same cell in parallel.

Processors may surprise you with clever optimizations. For example, the code from the previous example about false cache sharing does not work on my home computer as intended - in the simplest cases the processor can optimize the work and reduce negative effects. If you modify the code a little, everything falls into place.

Here's another example of weird hardware quirks:

private static int A, B, C, D, E, F, G;

private static void Weirdness()
{
    for (int i = 0; i < 200000000; i++)
    {
        <some code>
    }
}


If you substitute three different code fragments in place of <some code>, you can get the following results:

Incrementing fields A, B, C, D takes longer than incrementing fields A, C, E, G. What's even weirder is that incrementing just A and C takes longer than incrementing A, C, E and G. I don't know exactly why, but perhaps it has to do with memory banks (yes, banks — the Russian original puns on ordinary three-liter jars, not what you thought). If you have any thoughts on this, please share them in the comments.

On my machine, the above is not observed, however, sometimes there are abnormally bad results - most likely, the task scheduler makes its own “adjustments”.

The lesson from this example is that it is very difficult to fully predict hardware behavior. Yes, you can predict a lot, but you need to constantly confirm your predictions with measurements and testing.

Conclusion

I hope that everything discussed above has helped you understand the design of processor caches. Now you can put this knowledge into practice to optimize your code.

A cache is memory built into the processor into which the most frequently used data and instructions from RAM are written, which significantly speeds up operation.

L1 cache size (from 8 to 128 KB)
Level 1 cache size.
Level 1 cache is a block of high-speed memory located directly on the processor core.
Data extracted from RAM is copied into it.

Storing core instructions improves processor performance due to faster data processing speed (processing from the cache is faster than from RAM).

The capacity of the first level cache is small and amounts to kilobytes.
Typically, higher-end processor models have a larger L1 cache.
For multi-core models, the amount of L1 cache memory for one core is indicated.

L2 cache size (from 128 to 12288 KB)
Level 2 cache size.
L2 cache is a block of high-speed memory that performs the same functions as the L1 cache (see "L1 cache size"), but is slower and larger.

If you are choosing a processor for resource-intensive tasks, then a model with a large L2 cache will be preferable.
For multi-core processors, the total amount of second-level cache memory is indicated.

L3 cache size (from 0 to 16384 KB)
Level 3 cache size.
The integrated L3 cache, combined with a fast system bus, forms a high-speed data exchange channel with system memory.

As a rule, only CPUs for server solutions or special editions of “desktop” processors are equipped with third-level cache memory.

For example, processor lines such as Intel Pentium 4 Extreme Edition, Xeon DP, Itanium 2, Xeon MP and others have third-level cache memory.
