OOO Execution of Memory Operations


1 OOO Execution of Memory Operations

2 P6 Caches
Blocking caches severely hurt OOO:
- A cache miss prevents other cache requests (which could be hits) from being served, hurting one of the main gains of OOO: hiding cache misses
Both the L1 and L2 caches in the P6 are non-blocking:
- They initiate the actions necessary to return the missed data while responding to subsequent cached data requests
- They support up to 4 outstanding misses
Misses translate into outstanding requests on the P6 bus; the bus can support up to 8 outstanding requests
Subsequent requests for the same missed cache line are squashed; squashed requests are not counted in the number of outstanding requests
Once the engine has executed beyond the 4 outstanding requests, subsequent load requests are placed in the load buffer
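The miss-tracking and squashing behavior described above can be sketched in Python. This is a hypothetical model for illustration only; the names (`MissHandler`, `MAX_MISSES`) are ours, not Intel's, and real fill-buffer allocation is far more involved.

```python
# Hypothetical model of non-blocking-cache miss handling with squashing.
# MAX_MISSES is the 4-outstanding-miss limit described above.
MAX_MISSES = 4

class MissHandler:
    def __init__(self):
        self.outstanding = set()      # cache lines with a request in flight

    def on_miss(self, line_addr):
        if line_addr in self.outstanding:
            return "squashed"         # same missed line: not counted again
        if len(self.outstanding) >= MAX_MISSES:
            return "to load buffer"   # beyond 4 misses: park the load
        self.outstanding.add(line_addr)
        return "bus request"          # new outstanding request on the bus

m = MissHandler()
print(m.on_miss(0x100))   # bus request
print(m.on_miss(0x100))   # squashed (duplicate request for the same line)
```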

3 OOO Execution of Memory Operations
The RS operates based on register dependencies; it cannot detect memory dependencies:
movl %ebx, -4(%ebp)  # MEM[ebp-4] ← ebx
movl -4(%ebp), %eax  # eax ← MEM[ebp-4]
The RS dispatches memory uops when the data for the address calculation is ready, and the MOB and Address Generation Unit (AGU) are free
The AGU computes the linear address: Segment-Base + Base-Address + (Scale*Index) + Displacement
It sends the linear address to the MOB, to be stored in the Load Buffer or Store Buffer
The MOB resolves memory dependencies and enforces memory ordering
Some memory dependencies can be resolved statically:
store r1,a
load r2,b   → the load can advance before the store
Problem: some cannot:
store r1,[r3]   → the load must wait until r3 is known
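The static-vs-dynamic distinction above can be sketched as a toy Python check (our own simplified model, not hardware logic): a load may safely advance past older stores only when every older store address is known and different.

```python
# Minimal sketch (assumed model): a load may advance past older stores only
# when every older store has a known address that differs from the load's.
def load_may_advance(load_addr, older_store_addrs):
    """older_store_addrs: list of addresses; None = address not yet computed."""
    for st_addr in older_store_addrs:
        if st_addr is None:          # e.g. store r1,[r3] with r3 unknown
            return False             # load must wait
        if st_addr == load_addr:     # true dependence: must wait / forward
            return False
    return True

# store r1,a ; load r2,b -> addresses statically known and different
print(load_may_advance(0xB0, [0xA0]))   # True: load can advance
# store r1,[r3] with r3 unknown -> cannot be resolved statically
print(load_may_advance(0xB0, [None]))   # False: load must wait
```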

4 Load and Store Ordering
x86 has a small register set → it uses memory often
Preventing stores from passing stores/loads: 3%–5% perf. loss
- The P6 chooses not to allow stores to pass stores/loads
Preventing loads from passing loads/stores: big perf. loss
- The P6 allows loads to pass stores, and loads to pass loads
Stores are not executed OOO:
- Stores are never performed speculatively: there is no transparent way to undo them
- Stores are never re-ordered among themselves
- The Store Buffer dispatches a store only when the store has both its address and its data, and there are no older stores awaiting dispatch
- A store commits its write to memory (DCU) at retirement

5 Store Implemented as 2 Uops
A store is decoded as two independent uops:
- STA (store-address): calculates the address of the store
- STD (store-data): stores the data into the Store Data Buffer
The actual write to memory is done when the store retires
Separating STA & STD is important for memory OOO:
- It allows the STA to dispatch earlier, even before the data is known
- Address conflicts are resolved earlier → this opens the memory pipeline for other loads
- STA and STD can be issued to execution units in parallel
The STA is dispatched to the AGU when its sources (base+index) are ready
The STD is dispatched to the SDB when its source operand is available
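The STA/STD split can be illustrated with a toy uop model (our own representation, not the actual P6 encoding): the STA carries only the address sources, so it becomes dispatchable as soon as base and index are ready, independent of the data operand.

```python
# Toy model (assumed representation) of cracking a store such as
# `mov %eax, -4(%ebp)` into independent STA and STD uops.
def crack_store(base, index, scale, disp, data_src):
    sta = {"uop": "STA", "srcs": {base, index} - {None},
           "scale": scale, "disp": disp}          # address sources only
    std = {"uop": "STD", "srcs": {data_src}}      # data source only
    return sta, std

def dispatchable(uop, ready_regs):
    return uop["srcs"] <= ready_regs              # all sources ready?

sta, std = crack_store("ebp", None, 1, -4, "eax")
# %ebp is ready but %eax is not: the STA can dispatch early, the STD waits.
print(dispatchable(sta, {"ebp"}), dispatchable(std, {"ebp"}))  # True False
```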

6 Memory Order Buffer (MOB)
Store coloring:
- Each store is allocated in-order in the Store Buffer, and gets an SBID
- Each load is allocated in-order in the Load Buffer, and gets an LBID + the current SBID
A load is checked against all previous stores, i.e. stores with SBID ≤ the load's SBID
A load is blocked if:
- a relevant STA's address is unresolved
- an STA is to the same address, but its data is not ready
- resources are missing (DTLB miss, DCU miss)
The MOB writes the blocking info into the load buffer, and re-dispatches the load when a wake-up signal is received
If the load is not blocked → it is executed (bypassed)
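The store-coloring check can be sketched in Python (a hypothetical model we wrote for illustration; the block codes 0/1/2 follow the convention used in the exercise later in this deck):

```python
# Hypothetical model of the MOB blocking check using store coloring.
# Each load carries the SBID of the youngest older store; only stores
# with SBID <= that value are relevant to the check.
def check_load(load_addr, load_sbid, store_buffer):
    """store_buffer: {sbid: (addr or None, data_ready)} -> block code."""
    for sbid, (addr, data_ready) in store_buffer.items():
        if sbid > load_sbid:
            continue                  # younger store: not checked
        if addr is None:
            return 1                  # blocked: unresolved store address
        if addr == load_addr and not data_ready:
            return 2                  # blocked: same address, data not ready
    return 0                          # not blocked: load may execute

sb = {1: (0x100, True), 2: (None, False)}
print(check_load(0x200, 2, sb))   # 1: store 2's address is unresolved
print(check_load(0x200, 1, sb))   # 0: only store 1 is older, address differs
```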

7 MOB (Cont.)
If a load misses in the DCU:
- The DCU marks the write-back data as invalid, assigns a fill buffer to the load, and issues an L2 request
- When the critical chunk is returned, the load is woken up and re-dispatched
Store → load forwarding:
- An older STA with the same address as the load, whose data is ready → the load gets its data directly from the SB (no DCU access)
Memory disambiguation:
- The MOB predicts whether a load can proceed despite unknown STAs
- Predicted colliding → block the load if there is an unknown STA (as usual)
- Predicted non-colliding → execute even if there are unknown STAs
- In case of a wrong prediction, the entire pipeline is flushed when the load retires
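The store→load forwarding decision can be sketched as follows (our own simplified model; it assumes full-width, aligned accesses and ignores the partial-overlap cases a real MOB must handle):

```python
# Sketch (assumed simplification) of the store->load forwarding decision:
# an older store to the same address with its data ready supplies the load
# directly from the store buffer, with no DCU access.
def load_data(load_addr, older_stores, dcu):
    """older_stores: youngest-first list of (addr, data or None)."""
    for addr, data in older_stores:
        if addr == load_addr and data is not None:
            return data, "forwarded from SB"      # no DCU access
    return dcu.get(load_addr), "read from DCU"

dcu = {0x40: 99}
print(load_data(0x40, [(0x40, 7)], dcu))   # (7, 'forwarded from SB')
print(load_data(0x40, [(0x80, 5)], dcu))   # (99, 'read from DCU')
```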

8 Pipeline: Load: Allocate
[Pipeline diagram: IDQ, Alloc, Schedule, AGU, LB Write, DTLB, DCU, WB, MOB, Retire; structures: RS, ROB, LB]
- Allocate ROB/RS and MOB entries
- Assign a Store ID (SBID) to enable ordering
(Speaker notes) Now moving on to the pipelines. First is the life of an ADD, which will flow through the various pipelines. For reference, the pipestages in green are those that were added to the original Merom pipeline. The first part is fetching the bytes: 16B per cycle from the I$; figure out where the instructions lie in those 16B, then steer them and write them into the IQ, at a rate of at most 6 instructions per cycle.

9 Pipeline: Bypassed Load: EXE
[Pipeline diagram]
- The RS checks when the data used for address calculation is ready
- The AGU calculates the linear address: DS-Base + Base + (Scale*Index) + Disp.
- The load is written into the Load Buffer
- DTLB: virtual → physical translation + DCU set access
- The MOB checks blocking and forwarding
- DCU read / Store Data Buffer read (store → load forwarding)
- Write back the data / write the block code

10 Pipeline: Blocked Load Re-dispatch
[Pipeline diagram]
- The MOB determines which loads are ready, and schedules one
- The load arbitrates for the MEU
- DTLB: virtual → physical translation + DCU set access
- The MOB checks blocking/forwarding
- DCU way select / Store Data Buffer read
- Write back the data / write the block code

11 Pipeline: Load: Retire
[Pipeline diagram]
- Reclaim the ROB and LB entries
- Commit the results to the RRF

12 Pipeline: Store: Allocate
[Pipeline diagram]
- Allocate ROB/RS entries
- Allocate a Store Buffer entry

13 Pipeline: Store: STA EXE
[Pipeline diagram]
- The RS checks when the data used for address calculation is ready, and dispatches the STA to the AGU
- The AGU calculates the linear address, which is written to the Store Buffer
- DTLB: virtual → physical translation
- Load Buffer: memory-disambiguation verification
- The physical address is written to the Store Buffer

14 Pipeline: Store: STD EXE
[Pipeline diagram]
- The RS checks when the data for the STD is ready, and dispatches the STD
- The data is written to the Store Buffer

15 Pipeline: Senior Store Retirement
[Pipeline diagram]
- When the STA (and thus the STD) retires, the Store Buffer entry is marked as senior
- When the DCU is idle → the MOB dispatches the senior store:
  - the senior entry is read
  - the Store Buffer sends the data and physical address
  - the DCU writes the data
  - the SB entry is reclaimed

16 The Life of a Load…
[Diagram: Instruction Q, RAT, RS, ROB, EXE (ALU, AGU, dTLB), Retire, MOB, Store Buffer, Load Buffer, Data Cache; example: R3 ← MEM(R2+50)]
- Allocate 1 entry in the ROB, RS, and Load Buffer + rename in the RAT
- Dispatch the load's address calculation to the AGU when its source is ready; release the RS entry
- The AGU updates the address in the Load Buffer; the pipeline proceeds to the dTLB
- The Load Buffer checks for blocking conditions and dispatches the load to the DCU
- The DCU sends the result to the RS and updates the ROB with the load result
- The load retires like any other instruction (when all previous instructions have retired); the RAT is updated
- The LB and ROB entries are released

17 The Life of a Store…
[Diagram: RAT, Instruction Q, RS, ROB, EXE (ALU, AGU, dTLB), Retire, MOB, Store Buffer, Load Buffer, Data Cache; example: MEM(R2+50) ← R3]
- Allocate 1 entry in the ROB, 2 in the RS, and 1 in the Store Buffer
- Dispatch the store's address calculation to the AGU when its source is ready; release the RS entry
- The AGU updates the address in the Store Buffer → the Store Buffer is updated and the address is provided to dependent loads
- The store pipeline proceeds to the dTLB; the physical address is updated in the SB
- Dispatch the store data when the data is ready → the Store Buffer is updated and the data is provided to dependent loads
- The Store Buffer updates the ROB entry
- The store retires from the ROB like any other instruction (when all previous instructions have retired)
- After this, the store is marked as a senior store in the Store Buffer
- The Store Buffer initiates a DCU write; when the write is done, the SB reclaims the entry

18 Question
In this question we consider a processor with OOOE and speculative execution.
Given the following code:
1000 load R2,R1,30   ; R2=m[R1+30]
1004 store R2,20,R1  ; m[R2+20]=R1
1008 load R3,R1,100  ; R3=m[R1+100]
100C store R1,40,R3  ; m[R1+40]=R3
1010 add R1,R1,10    ; R1=R1+10
1014 blt R1,100,1000 ; if (R1<100) PC=1000
Assumptions:
- The branch is predicted taken at the start of execution
- Every address N in memory holds the value N, and R1=R2=R3=10
- For simplicity, assume the program addresses are physical, so no translation is needed
- The L1 data cache returns data within one clock cycle, but it is empty at the start of execution
- The L2 data cache returns data within 7 clock cycles, and it contains all the requested addresses already at the start of execution

19 Instruction allocation
- Four instructions can be allocated each cycle (and at least 4 instructions are ready for allocation)
- The ROB, MOB, and RS are large and do not fill up

20 Instruction execution
There are infinitely many execution units.
- An instruction can enter execution in the cycle after its allocation, provided all the data it needs is already ready
- An instruction waiting for an operand can enter execution in the cycle right after the operand becomes ready
- An ALU instruction takes one clock cycle to execute
- A branch instruction takes one cycle; if the prediction turns out to be wrong, a flush is performed in the next cycle (at time t+1), and the instructions from the correct path are allocated 5 cycles after the flush (at time t+6)

21 Instruction execution (cont.)
A load is dispatched to execution when the data for its address calculation is ready:
- In the first cycle, the address is computed
- In the second cycle, the following condition is checked: for every store older than the load, the store's address is known, and either the load's address differs from the store's address, or the two addresses are equal and the store's data is already known
- In the third cycle, if the check succeeds, the data is obtained from the L1 cache (on a hit), or directly from the MOB via store-to-load forwarding
- If the check succeeds but there is an L1 cache miss and no store-to-load forwarding, the data arrives in the tenth cycle, from the L2 cache
- If the check fails, the load is blocked. When the blocking condition clears, the load is dispatched to execution again, skipping the first cycle (it starts at the condition check)
A store is dispatched to execution when the data for its address calculation is ready:
- The address calculation takes one clock cycle, at the end of which the address is written to the MOB
- Independently, when the data to be written to memory is ready, it is written to the MOB in the following cycle
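The load-latency rules above can be restated as a small Python helper (our own, using the cycle numbering defined in the exercise: the address calculation happens in the cycle the load is dispatched):

```python
# Sketch of the exercise's load timing rules. t_exe is the cycle the load
# is dispatched (address calculation); the blocking check is the next cycle.
def load_data_ready(t_exe, l1_hit, forwarded):
    if forwarded or l1_hit:
        return t_exe + 2      # data in the 3rd cycle (L1 hit or MOB forward)
    return t_exe + 9          # L1 miss, no forwarding: data in the 10th cycle

# e.g. a load dispatched at T=2 that misses L1 gets its data at T=11
print(load_data_ready(2, l1_hit=False, forwarded=False))  # 11
print(load_data_ready(2, l1_hit=True, forwarded=False))   # 4
```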

22 Instruction commit
- An instruction can commit starting from the cycle after it finishes execution, provided the instruction before it has committed or is committing
- There is no limit on the number of instructions that commit in each cycle
- A store performs its write to the cache post-commit

23 Summary: 4-wide machine
- L1: 1 cycle; L2: 7 cycles; ALU, branch: 1 cycle; L1 empty / L2 always hits
- Mispredicted branch resolves at T → T+1: flush the pipeline → T+6: allocate on the good path
Load (cycles 1, 2, 3, …, 10):
- Cycle 1: address calculation
- Cycle 2: memory checks – all previous stores have a different addr., or the same addr. & data ready (after a block, the retry starts here)
- Cycle 3: L1 hit, or forwarding from the MOB
- Cycle 10: L2 hit (7 cycles)
Store:
- Address calculation, then MOB update
- When the store data is ready, the MOB is updated in the following cycle

24 Fill this table…
Column legend:
- R1, R2, R3: architectural register values after commit
- addr, data: address and data for LD & ST
- T alloc: allocation time (4 per cycle)
- src1/src2: Pi for a physical register, Ri for an architectural register; for a store, src1 = address, src2 = data
- Load block code: 0 = ready, 1 = addr blocking, 2 = data not ready
Columns: Pdst | instruction | R1 | R2 | R3 | addr | data | T alloc | src1 | src2 | Imm | T src1 ready | T src2 ready | T exe | Load block code | T data ready | T commit
Rows 1–11; rows 1–6 are: load R2=m[R1+30], store m[R2+20]=R1, load R3=m[R1+100], store m[R1+40]=R3, add R1=R1+10, blt if (R1<100); rows 7–11 repeat the loop body

25 Instructions for filling the table
- R1, R2, R3: the architectural register values after commit. Circle the value of the architectural register the instruction writes to. If the instruction does not reach commit, leave these fields empty.
- addr: the memory access address – for load and store instructions only.
- data: the memory value read or written – for load and store instructions only.
- T alloc: the time the instruction is allocated (four instructions per cycle, starting at T=1).
- src1, src2: the register numbers used as sources for the instruction: Pi for a physical register, Ri when the architectural register is read directly. For a store: src1 – the register used for the address calculation; src2 – the register holding the data.
- Imm: if the instruction has an immediate, its value.
- T src1 ready, T src2 ready: the time each source value becomes ready. If the src is ready at allocation, this equals the allocation time. If the instruction producing the src value finishes execution at time T, the src is ready at time T.

26 Instructions for filling the table (cont.)
- T exe: the time the instruction is dispatched to execution. If all of an instruction's srcs are ready at time T, the instruction can be dispatched at time T+1.
- Load block code (relevant only for loads): the load's blocking code. 0 – no blocking; 1 – blocked due to an unresolved store address; 2 – blocked waiting for store data. If the load is blocked more than once, list all the block codes.
- T data ready: for a store – the time the data to be written to memory is ready; for a load – the time the data is received (from the cache or directly from the MOB).
- T commit: the time the instruction commits.

27 [Answer table, step 1: the arch. register / addr / data columns are filled in; the values 10, 20, 30, 40, 50, 60, 110, 120, 130 appear across the rows of both iterations]

28 [Answer table, step 2 (readable entries): first load – addr 40, Imm 30, T exe 2, T data ready 11, T commit 12; first store – addr 60, src1 P0, Imm 20, STD: 2, STA: 12, T commit 13; second load – addr 110, Imm 100, T data ready 21, T commit 22; second store – addr 50, STA: 2, STD: 22, T commit 23. Annotations: L1 miss → L2 hit; the store's address calc P0+20 proceeds once R2 (P0) is known]

29 [Answer table, step 3: second-iteration rows begin to fill in (Pdst P4, P6; a load blocked with codes 1, 2 gets its data at T=24; STD: 4, STA: 24, T commit 25)]

30 [Answer table, step 4: remaining second-iteration rows filled in (Pdst P8, P10; STA: 4, STD: 26; T data ready 26; T commit 27)]

31 [Final answer table – identical to the previous slide]


