8/13/2019 MAFA07 19 Heiner
Fault Tolerant ICAP Controller for High-Reliable Internal Scrubbing
Jonathan Heiner, Nathan Collins, & Michael Wirthlin
This work was supported by Lockheed Martin under a grant from the University Projects program with collaboration by Tim Gallagher and Jon Wilson.
-
Brigham Young University
Outline
FPGA Scrubbing Overview
Internal Configuration Access Port (ICAP)
Internal ICAP Architecture
High Reliability Scrubber
Radiation Test & Results
Future Work & Summary
-
FPGA Fault Tolerant Strategy
FPGAs provide SEU mitigation through redundancy and scrubbing
Triple Modular Redundancy (TMR)
Triplicate module to introduce redundancy
Vote on outputs of triplicated module
Use greatest common result
Configuration Scrubbing
Read back frame data
Compare frame to original
Correct erroneous bits in frame
Write back frame to FPGA
[Figure: TMR with three copies of module A feeding a voter (V). Scrubbing with FPGA, memory, comparator, and modifier: original frame data ...010101010..., erroneous frame data ...010111010..., corrected frame data ...010101010...]
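The two techniques on this slide can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the authors' implementation: `vote`, `scrub_frame`, and the bit-list frame representation are hypothetical names chosen for the example, and the original frame is assumed to be available for comparison.

```python
from collections import Counter


def vote(a, b, c):
    """Return the majority value of three redundant module outputs."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three outputs disagree")
    return value


def scrub_frame(readback, golden):
    """Compare a read-back frame with the original frame.

    Returns (corrected_frame, flipped_positions).
    """
    flipped = [i for i, (r, g) in enumerate(zip(readback, golden)) if r != g]
    corrected = list(readback)
    for i in flipped:
        corrected[i] ^= 1  # correct the erroneous bit
    return corrected, flipped


golden = [0, 1, 0, 1, 0, 1, 0, 1, 0]   # original frame data
upset = [0, 1, 0, 1, 1, 1, 0, 1, 0]    # one configuration bit flipped
corrected, flipped = scrub_frame(upset, golden)
```

In a real device the corrected frame would then be written back to the FPGA; here the write-back step is represented only by the returned list.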
-
Continuous Time Reliability
[Figure: reliability equations and curves over time t for three cases: non-redundant, TMR without repair, and TMR with repair]
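The three cases compared on this slide have standard textbook models when each module is assumed to fail at a constant rate. The sketch below uses those textbook forms, which may differ in notation from the slide's own equations; the symbols `lam` (per-module failure rate) and `mu` (repair rate), the perfect-voter assumption, and the function names are assumptions made for this example.

```python
import math


def r_simplex(lam, t):
    """Reliability of a single (non-redundant) module: exp(-lam * t)."""
    return math.exp(-lam * t)


def r_tmr(lam, t):
    """Reliability of TMR without repair, assuming a perfect voter.

    The system works while at least 2 of 3 modules work: 3R^2 - 2R^3.
    """
    r = r_simplex(lam, t)
    return 3 * r**2 - 2 * r**3


def mttf_simplex(lam):
    """Mean time to failure of a single module."""
    return 1.0 / lam


def mttf_tmr(lam, mu=0.0):
    """MTTF of TMR from the three-state Markov model.

    mu is the repair rate; mu = 0 means no repair.
    """
    return 5.0 / (6.0 * lam) + mu / (6.0 * lam**2)
```

Without repair, TMR improves short-mission reliability but has a lower MTTF than a single module (5/6 of it). With repair, the MTTF grows with the ratio of repair rate to failure rate, which is the motivation for combining TMR with scrubbing.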
-
Configuration Scrubbing Example
[Figure: FPGA with a scrub column and a configuration upset marked x]
-
Configuration Scrubbing Example
[Figure: FPGA with the configuration upset (x) repaired]
-
Traditional Scrubbing
External Scrubbers
Blind Scrubbing
Read-back Scrubbing
External Components
Rad-Hard Memory
Configuration Controller
Dedicated IO
[Figure: rad-hard memory (Data, Address, OE) connected to a configuration controller, which drives the FPGA SelectMAP port (Data, Write, CS, Busy, Done, Prog, Init, CCLK)]
-
Internal Scrubbing
Described in XAPP714
Architecture based on ICAP configuration port
Internal Scrubbing Architecture
[Figure: internal scrubbing architecture with blocks PicoBlaze Processor, PicoBlaze BRAM, Control Logic, ICAP DMA, DMA BRAM, the FPGA primitives Frame ECC and ICAP, and Configuration Memory]
-
Internal Scrubbing Strategy
Internal Scrubbing Process
Perform readback of each frame via ICAP interface
Use FrameECC to detect errors
Correct errors based on FrameECC syndrome value
Write corrected frame back via ICAP interface
Advantages
No external memory, external controller, or external IO pins
Disadvantages
Additional circuit area required for scrubbing circuit
Reliability of scrubber
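The four-step process above can be sketched as a loop over frames. This is a toy model under stated assumptions, not the XAPP714 design: `ToyIcap`, `ecc_of`, and `scrub_pass` are hypothetical names, the check word is a simplified Hamming-style code rather than the real Frame ECC layout, at most one upset per frame is assumed, and the stored check words are assumed to be uncorrupted.

```python
from functools import reduce


def ecc_of(bits):
    """Toy check word: XOR of the 1-based positions of all set bits."""
    positions = (i + 1 for i, b in enumerate(bits) if b)
    return reduce(lambda acc, p: acc ^ p, positions, 0)


class ToyIcap:
    """Stand-in for the ICAP: frame readback and writeback by index."""

    def __init__(self, frames):
        self.frames = [list(f) for f in frames]

    def read_frame(self, addr):
        return list(self.frames[addr])

    def write_frame(self, addr, bits):
        self.frames[addr] = list(bits)


def scrub_pass(icap, ecc_words):
    """One pass over all frames; returns the (frame, bit) repairs made."""
    repairs = []
    for addr, stored in enumerate(ecc_words):
        bits = icap.read_frame(addr)          # 1. read back via ICAP
        syndrome = ecc_of(bits) ^ stored      # 2. detect with the ECC
        if syndrome == 0 or syndrome > len(bits):
            continue                          # clean, or not a single-bit error
        bits[syndrome - 1] ^= 1               # 3. correct using the syndrome
        icap.write_frame(addr, bits)          # 4. write back via ICAP
        repairs.append((addr, syndrome - 1))
    return repairs


frames = [[0, 1, 1, 0, 1, 0, 0, 1], [1, 0, 0, 0, 1, 1, 0, 0]]
ecc_words = [ecc_of(f) for f in frames]
icap = ToyIcap(frames)
icap.frames[1][5] ^= 1                        # inject one upset
repairs = scrub_pass(icap, ecc_words)
```

No golden copy of the configuration is needed in this scheme, which matches the listed advantage of requiring no external memory.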
-
Internal Configuration Access Port (ICAP)
Internal interface to configuration port
Active readback and re-configuration
Similar to SelectMAP (separate In/Out data buses)
Hard-wired Logic
Current application usage
Dynamic Partial Reconfiguration
Encryption
Fault Tolerance/Injection
[Figure: SelectMAP and ICAP interfaces; signals shown: CONFIG_MODE, CONFIG_START, RESET_FPGA, DATA, R/W, CE, CLK, DONE, BUSY]
-
Detailed Architecture Overview
Frame ECC
Hard-wired internal component
Performs SECDED algorithm on frame
Provides syndrome word and error bit values
Directly connected to read port of ICAP
[Figure: internal scrubbing architecture]
-
Detailed Architecture Overview
ICAP DMA
Provides ICAP with data every clock cycle
Stores ICAP output to DMA BRAM
Transmits BRAM content to control logic
[Figure: internal scrubbing architecture]
-
Detailed Architecture Overview
PicoBlaze Processor
8-bit programmable controller
Performs scrubbing logic
BRAM contains pre-compiled scrubbing program
Software used for ease of modifying logic
[Figure: internal scrubbing architecture]
-
Detailed Architecture Overview
Control Logic
Synchronizes data transfer between PicoBlaze and ICAP DMA
Maintains timing and data requirements
[Figure: internal scrubbing architecture]
-
Scrubber Program
Initializes devices
Walk: slow scan
Approx. 24 ms to 278 ms @ 100 MHz
Actual detection
Run: fast scan
Approx. 1.2 ms to 14.6 ms @ 100 MHz
Quick detection (is there an error somewhere?)
Patch: ignore SEU by modifying parity bits
Correction: correct SEU
[Figure: flowchart with steps Initialization and Synchronization, Initial Walk, Initial Run Scan, Perform Run Scan, Error Detected?, Perform Walk, Error Detected?, Perform Correction, Perform Patch]
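The split between a fast "run" scan and a slow "walk" scan can be illustrated with a small sketch. It assumes the simplest reading of the flowchart: run repeatedly, and walk and correct only when run reports an error. The list of per-frame status flags stands in for the Frame ECC result, all function names are hypothetical, and the patch branch is not modelled.

```python
def run_scan(frames_ok):
    """Fast scan: only answers 'is there an error somewhere?'."""
    return not all(frames_ok)


def walk_scan(frames_ok):
    """Slow scan: visits every frame and reports the ones in error."""
    return [addr for addr, ok in enumerate(frames_ok) if not ok]


def scrub_step(frames_ok):
    """One main-loop iteration: run, then walk and correct if needed."""
    if not run_scan(frames_ok):
        return []                 # common case: nothing to do
    bad = walk_scan(frames_ok)
    for addr in bad:
        frames_ok[addr] = True    # stands in for 'Perform Correction'
    return bad


state = [True, False, True]       # frame 1 holds an upset
first = scrub_step(state)
second = scrub_step(state)
```

The design choice is that the expensive walk is paid only after the cheap run has found something, so the common no-error case costs roughly one fast scan.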
-
High Reliability Scrubber
Internal scrubber is susceptible to configuration upsets
Logic used to implement scrubber may be affected by SEUs
Upsets within the scrubber logic may limit the ability of the scrubber to repair the fault
SEU mitigation technique needed to ensure reliable scrubbing
-
High Reliable ICAP Scrubber
[Figure: internal scrubbing architecture with each block marked by its mitigation: TMR, memory scrubbing, or no mitigation]
-
BRAM Scrubber
Specialized BRAM scrubber for PicoBlaze memory
Continuously reads and repairs upsets within the memory
[Figure: PicoBlaze processor (KCPSM3) and PicoBlaze program code (TEST0) held in three dual-port BRAMs (Port A, Port B), with voters and a BRAM scrubber made of an address counter and write-back logic]
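A minimal sketch of the idea follows, assuming three copies of the program memory, a bitwise voter, and a scrubber that steps an address counter and writes the voted word back. `TmrMemory` and its methods are hypothetical names, and the mapping of processor and scrubber onto separate BRAM ports is not modelled.

```python
def majority(a, b, c):
    """Bitwise majority vote of three words."""
    return (a & b) | (a & c) | (b & c)


class TmrMemory:
    """Three copies of a program memory with a background scrubber."""

    def __init__(self, program):
        self.copies = [list(program) for _ in range(3)]
        self.scrub_addr = 0                   # address counter

    def read(self, addr):
        """Processor-side read: voted output of the three copies."""
        return majority(*(c[addr] for c in self.copies))

    def scrub_next(self):
        """Vote one word, write it back to all copies, then advance."""
        addr = self.scrub_addr
        good = self.read(addr)
        for copy in self.copies:
            copy[addr] = good                 # write-back logic
        self.scrub_addr = (addr + 1) % len(self.copies[0])


mem = TmrMemory([0x3A5, 0x1FF, 0x000])
mem.copies[1][2] ^= 0x10                      # upset in one copy
voted_before = mem.read(2)                    # voter masks the upset
for _ in range(3):                            # one full scrub pass
    mem.scrub_next()
```

Voting alone masks a single upset; the write-back is what prevents upsets from accumulating in two copies of the same word over time.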
-
Scrubber Design Utilization (Virtex-4 LX-25)
Resource      Non TMR     TMR
Slices        736 (6%)    1308 (12%)
Flip Flops    680 (3%)    1082 (5%)
BRAM          2           6
-
Radiation Test
Determine the reliability of ICAP scrubber
Measure reliability of non-TMR scrubber
Measure reliability of TMR scrubber
Test limitations
Operated behind another test
Did not have control over beam flux
Had to reconfigure with beam on
[Figure: picture of test]
-
Radiation Test Design
ICAP controller
TMR design
Non TMR design
No other FPGA circuitry
FPGA mostly empty
Detect and repair upsets in all areas of FPGA
Unused logic
ICAP controller logic
[Figure: FPGA layout showing the empty logic fabric and the ICAP controller (both TMR and non TMR)]
-
Data Collection and Monitoring
UART
Transmit SEU data to PC
Provide status messages
Watchdog timers
Identify PicoBlaze failure
External PC
Log data
[Figure: internal scrubbing architecture extended with a UART and watchdog timers, connected to an external PC]
-
Configuration Upsets between Failure
[Figure: histogram for TMR and Non-TMR over bins 0-200, 201-400, 401-600, 601-800, 801-1000, 1000+; vertical axis 0 to 14]
TMR: 1682 SEUs between failure
Non-TMR: 309 SEUs between failure
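As a quick check, the ratio of these two figures gives the improvement factor quoted in the conclusions. The variable names are chosen for this example only.

```python
tmr_seus_between_failure = 1682
non_tmr_seus_between_failure = 309

improvement = tmr_seus_between_failure / non_tmr_seus_between_failure
print(f"{improvement:.1f}x")  # prints "5.4x"
```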
-
Multiple Bit Upsets
Frame ECC does not identify location of failure with multiple upsets within frame
Single Error Correction, Double Error Detection
Syndrome cannot locate failures
MBUs were detected but could not be corrected
MBUs accumulated during the test
Failures often occurred due to MBU accumulation
Presence of MBU significantly slowed down scrubbing
Performed configuration walk with MBU
1.7% of upsets were intra-frame MBUs
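The limitation can be seen in a toy SECDED model. The sketch below is not the Frame ECC layout: it assumes a simplified check word made of a position XOR plus an overall parity bit, treats the stored check word as uncorrupted, and uses hypothetical function names.

```python
from functools import reduce


def check_word(bits):
    """Return (XOR of 1-based positions of set bits, overall parity)."""
    positions = (i + 1 for i, b in enumerate(bits) if b)
    pos_xor = reduce(lambda acc, p: acc ^ p, positions, 0)
    return pos_xor, sum(bits) % 2


def classify(bits, stored):
    """Return ('ok', None), ('single', index) or ('multiple', None)."""
    pos_xor, parity = check_word(bits)
    syndrome = pos_xor ^ stored[0]
    parity_changed = parity != stored[1]
    if syndrome == 0 and not parity_changed:
        return ("ok", None)
    if parity_changed and 1 <= syndrome <= len(bits):
        return ("single", syndrome - 1)   # syndrome locates the bit
    return ("multiple", None)             # detected, but not locatable


frame = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0]
stored = check_word(frame)

one = list(frame)
one[4] ^= 1                               # single upset

two = list(frame)
two[2] ^= 1                               # two upsets in the same frame
two[9] ^= 1

single_result = classify(one, stored)
double_result = classify(two, stored)
```

With two upsets at positions 3 and 10, the syndrome is 3 XOR 10 = 9, which points at a bit that was never flipped. The unchanged overall parity is what reveals that the error is not a single-bit one, so the frame can be flagged but not repaired from the syndrome alone.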
-
Failure Modes
Single Point Failures (were not isolated during test)
UART I/O
ICAP
Frame ECC
Failure Modes (isolated during test)
Program crash
Invalid response from UART
Repeat FAR & syndrome values
Repeat FAR but different syndrome values
Repeat sets of FAR & syndrome values
FAR increments till end of FPGA row
Errors detected after test finished
Failed during reconfiguration
-
Conclusions
ICAP scrubber worked correctly as expected
Detected upsets within FPGA fabric during operation
Repaired SEUs within the device
Hi-Rel scrubber provided improved reliability
5.4x higher SEU to failure than non-TMR
1.4x higher MBU to failure than non-TMR
ICAP hi-rel scrubber reliability limited by MBUs
Cannot remove MBUs
Failure due to accumulation of MBUs
-
Future Work
MBU Detection & Correction
Investigate techniques for MBU correction
VHDL Scrubber
Increased speed & possibly smaller circuit
Dynamic Partial Reconfiguration
Future uses of ICAP
Dynamic Partial Reconfiguration (bitstream compression)
Low cost Fault Injection