Temporal locality (locality in time): If an item is referenced, it will tend to be referenced again soon. Most programs contain loops, so instructions and data are likely to be accessed repeatedly, showing high amounts of temporal locality.
Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon. Since instructions are normally accessed sequentially, programs show high spatial locality. Accesses to data also exhibit a natural spatial locality. For example, sequential accesses to elements of an array or a record will naturally have high degrees of spatial locality.
Memory hierarchy: We take advantage of the principle of locality by implementing the memory of a computer as a memory hierarchy. A memory hierarchy consists of multiple levels of memory with different speeds and sizes. The faster memories are more expensive per bit than the slower memories and thus are smaller.
Block: Minimum unit of information that can be either present or not present in the two-level hierarchy. Usually we transfer an
entire block when we copy something between levels.
Hit: If the data requested by the processor appears in some block in the upper level, this is called a hit.
Miss: If the data is not found in the upper level, the request is called a miss. The lower level in the hierarchy is then accessed to retrieve the block containing the requested data.
Hit rate / ratio: The fraction of memory accesses found in the upper level; it is often used as a measure of the performance of the memory hierarchy.
Hit time: The time to access the upper level of the memory hierarchy, which includes the time needed to determine whether the access is a hit or a miss.
Miss penalty: The time to replace a block in the upper level with the corresponding block from the lower level, plus the time to deliver this block to the processor.
Cache: The name chosen to represent the level of the memory hierarchy between the processor and main memory in the first commercial computer to have this extra level.
Direct Mapped Cache: Each memory location is mapped directly to exactly one location in the cache.
Tag: Because each cache location can contain the contents of a number of different memory locations, how do we know whether the data in the cache corresponds to a requested word? That is, how do we know whether a requested word is in the cache or not? We answer this question by adding a set of tags to the cache. The tags contain the address information required to identify whether a word in the cache corresponds to the requested word.
Valid Bit: We need a way to recognize that a cache block does not have valid information. For instance, when a processor starts up, the cache does not have good data, and the tag fields will be meaningless. We need to know that the tag should be ignored for such entries. The most common method is to add a valid bit to indicate whether an entry contains a valid address.
Measuring Cache Performance:
CPU time can be divided into the clock cycles that the CPU spends executing the program and the clock cycles that the CPU spends waiting for the memory system. Normally, we assume that the costs of cache accesses that are hits are part of the normal CPU execution cycles. Thus:
CPU time = (CPU execution clock cycles +
Memory-stall clock cycles) × Clock cycle time
For simplicity, we may assume that the memory-stall clock cycles come primarily from cache misses. In real processors, the stalls generated by reads and writes can be quite complex, and accurate performance prediction usually requires very detailed simulations of the processor and memory system.
Memory-stall clock cycles = Read-stall cycles + Write-stall cycles
Placement of Blocks in Cache:
If the cache is
direct-mapped, a block can go in exactly one place in the cache. On the other extreme, there is
fully-associative caches where a block can go in any place. On the other hand, for the
set-associative cache, there are a fixed number of locations where each block can be placed. A set-associative cache with n locations for a block is called an n-way set-associative cache.
Example:
The location of a memory block whose address is 12 in a cache with eight blocks varies for directmapped, set-associative, and fully associative placement.
In direct-mapped placement, there is only one cache block where memory block 12 can be found, and that block is given by (12 modulo 8) = 4. In a two-way set-associative cache, there would be four sets, and memory block 12 must be in set (12 mod 4) = 0; the memory block could be in either element of the set. In a fully associative placement, the memory block for block address 12 can appear in any of the eight cache blocks.
For a set-associative cache, since the block may be placed in any element of the set, all the tags of all the elements of the set must be searched. In a fully associative cache, the block can go anywhere, and all tags of all the blocks in the cache must be searched.
NOTE: CLion is a new C/C++ IDE by JetBrains.
Choosing Which Block to Replace:
When a miss occurs in a direct-mapped cache, the requested block can go in exactly one position, and the block occupying that position must be replaced. In an associative cache, we have a choice of where to place the requested block, and hence a choice of which block to replace.
least recently used (LRU): A replacement scheme in which the block replaced is the one that has been unused for the longest time.
LRU replacement is implemented by keeping track of when each element in a set was used relative to the other elements in the set. For a two-way set-associative cache, tracking when the two elements were used can be implemented by keeping a single bit in each set and setting the bit to indicate an element whenever that element is referenced. As associativity increases, implementing LRU gets harder.
Question: What is "out-of-order processor"? It seems when a cache miss occurs, they dont wait but go on executing instructions.
autotuning: The performance challenge for algorithms is that the memory hierarchy varies between different implementations of the same architecture in cache size, associativity, block size, and number of caches. To cope with such variability, some recent numerical libraries parameterize their algorithms and then search the parameter space at runtime to find the best combination for a particular computer. This approach is called autotuning.
Comments