
Approved for public release; distribution is unlimited.




Los Alamos National Laboratory, an affirmative action/equal opportunity employer, is operated by the Los Alamos National Security, LLC for the National Nuclear Security Administration of the U.S. Department of Energy under contract DE-AC52-06NA25396. By acceptance of this article, the publisher recognizes that the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or to allow others to do so, for U.S. Government purposes. Los Alamos National Laboratory requests that the publisher identify this article as work performed under the auspices of the U.S. Department of Energy. Los Alamos National Laboratory does not endorse the viewpoint of a publication or guarantee its technical correctness.

LA-UR-07-4998

# Domain Crossing Errors: Limitations on Single Device Triple-Modular Redundancy Circuits in Xilinx FPGAs

Heather Quinn\*, Keith Morgan, Paul Graham, Jim Krone, and Michael Caffrey, Los Alamos National Laboratory

Kevin Lundgreen, Brigham Young University



\*hquinn@lanl.gov

UNCLASSIFIED



### Outline

- Multiple-bit upsets and triple-modular redundancy
- Fault injection and accelerator test results
- Simple probability model
- Conclusions








## **Triple-Modular Redundancy and Xilinx SRAM FPGAs**

- Growing interest in using Xilinx FPGAs in space
  - Well-suited to signal processing applications
  - Reconfigurability can increase the usable lifetime of spacecraft
- Programming data stored in SRAM is vulnerable to single-event upsets (SEUs)
  - Unlike ASICs, both the circuit state and the implemented circuit are affected
  - For the Virtex-I: protecting circuits with triple-modular redundancy (TMR) and removing SEUs on device through partial reconfiguration is effective against single-bit SEUs
  - Other researchers (Sterpone) have determined analytically that TMR defeats were possible from even a single-bit SEU
- Effectiveness of TMR in the presence of multiple-bit upsets (MBUs) is unknown
  - Can affect multiple resources on FPGA or manifest as multiple independent errors
  - Breaks TMR assumption that only one error exists in a system at a time






### **Multiple-Bit Upsets Worsen Each Generation**



2V1000: Distribution of Event Sizes (100%) at 58.7 MeV-cm<sup>2</sup>/mg



### **Domain Crossing Errors**





Figure panels: Operating; Domain Crossing Error

- Domain crossing errors occur when two or more domains produce identical errors due to an SEU
  - Majority voter unable to detect two wrong input signals
  - With triplicated voters, error must exist in two or more voters to propagate
  - Non-identical errors will vote out
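
The voting behavior described above can be illustrated with a minimal sketch of a bitwise two-of-three majority voter. The 4-bit word `GOLDEN` and the helper `flip` are hypothetical names chosen for illustration; they do not come from the test circuits.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise two-of-three majority vote over three domain outputs."""
    return (a & b) | (a & c) | (b & c)


def flip(word: int, bit: int) -> int:
    """Return the word with one bit inverted, modeling an upset in one domain."""
    return word ^ (1 << bit)


GOLDEN = 0b1010  # hypothetical fault-free output, identical in all three domains

# Error in a single domain: the two good domains outvote it.
single = majority(flip(GOLDEN, 0), GOLDEN, GOLDEN)

# Errors in two domains on different bits: each bit still has two good copies.
non_identical = majority(flip(GOLDEN, 0), flip(GOLDEN, 2), GOLDEN)

# Identical errors in two domains: the voter sees a wrong majority (a DCE).
identical = majority(flip(GOLDEN, 0), flip(GOLDEN, 0), GOLDEN)

print(single == GOLDEN, non_identical == GOLDEN, identical == GOLDEN)
```

Only the last case produces a wrong voted output, which is why a single upset that corrupts two domains in the same way defeats the voter.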






## **Test Setup: 2V1000 Circuits and Test Fixture**

#### Eight test circuits

- Two implementations of TMR (frequent voting within DUT and single off-chip vote) each with triplicated data signals, control signals, and voters
- Feed forward and feedback circuits
- High device utilization

#### Fault Injection Tests

- Injected single-bit and MBU SEU patterns across the entire device
- Used 2V1000 accelerator data to guide MBU shapes
  - All 2-bit shapes, one 3-bit corner shape, 4-bit square shape
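
One way to enumerate such injection patterns is sketched below. It assumes, purely for illustration, that configuration bits can be addressed on a (row, column) grid and that "all 2-bit shapes" means every distinct pair of bits within a 2x2 neighborhood; neither the grid addressing nor the names `TWO_BIT`, `CORNER`, `SQUARE`, and `injections` come from the actual test fixture.

```python
from itertools import combinations, product

WINDOW = list(product(range(2), range(2)))  # the four cells of a 2x2 neighborhood


def normalize(cells):
    """Shift a set of cells so its bounding box starts at (0, 0)."""
    r0 = min(r for r, _ in cells)
    c0 = min(c for _, c in cells)
    return frozenset((r - r0, c - c0) for r, c in cells)


# Assumed reading of "all 2-bit shapes": horizontal, vertical, and both diagonals.
TWO_BIT = {normalize(pair) for pair in combinations(WINDOW, 2)}
CORNER = frozenset({(0, 0), (0, 1), (1, 0)})  # one 3-bit corner orientation
SQUARE = frozenset(WINDOW)                    # 4-bit square


def injections(shape, rows, cols):
    """Yield the bit coordinates to flip for every placement of a shape on the grid."""
    height = max(r for r, _ in shape) + 1
    width = max(c for _, c in shape) + 1
    for r, c in product(range(rows - height + 1), range(cols - width + 1)):
        yield sorted((r + dr, c + dc) for dr, dc in shape)


print(len(TWO_BIT), len(list(injections(SQUARE, 3, 3))))
```

In a real campaign the mapping from grid coordinates to bitstream frame and bit offsets would have to follow the device's physical layout, which this sketch does not model.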

#### Accelerator Test

- In July 2007 at Indiana University Cyclotron Facility
- 6.6x10<sup>11</sup> total fluence over approximately two hours of testing
- One test circuit








## **Domain Crossing Error Characteristics: Fault Injection Testing**

- All circuits exhibited MBU-induced DCEs, except one circuit's off-chip voting implementation
  - DCEs observed with even 2-bit MBUs
  - 1% of injected MBUs cause DCEs (averaged over frequent voting circuits)
    - Order of magnitude drop in DCEs for off-chip voting circuits
  - Wide range of DCE susceptibilities, from tens to tens of thousands of DCEs
    - Range is strongly design-dependent
    - Decreasing the amount of voting decreased DCEs
    - Design sensitivity and device utilization play a role

#### Nearly all DCEs (99%) occurred in the configurable logic block (CLB) region:

- 75% entirely in routing,
- 22% spanning routing and look up tables (LUTs), and
- 2% in LUTs involving two LUTs in different slices
- CLB routing network is a concern, since 95% of the CLB SEUs occurred in routing in static characterization accelerator testing






### **CLB Routing Network**

#### Every CLB routing switch has a CLB

- CLB has four slices
- Each slice has two LUTs, two user flip-flops, and mode information
- Each slice can have its own data and control signals

#### Every CLB routing switch is responsible for

- Switch-to-slice communication of data and control signals
- Switch-to-switch communication of data and control signals
- In frequent voting circuits, triplicated logic and voters from all three domains are often placed in one CLB
  - All of the data and control signals route through one switch
- In the future, we would like to study whether placing only one domain in a CLB, and not allowing route-through signals from other domains in the routing switch, could reduce the number of DCEs
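
A placement audit for the "one domain per CLB" idea could look like the following minimal sketch. The placement records (cell name, CLB coordinate, TMR domain) and the function name `shared_clbs` are hypothetical; a real flow would extract this information from the place-and-route tool's output, and the sketch does not cover route-through signals.

```python
from collections import defaultdict


def shared_clbs(placement):
    """Return the CLBs that host logic from more than one TMR domain.

    placement: iterable of (cell_name, clb_xy, domain) tuples (hypothetical format).
    """
    domains_by_clb = defaultdict(set)
    for _name, clb_xy, domain in placement:
        domains_by_clb[clb_xy].add(domain)
    return {clb: sorted(doms) for clb, doms in domains_by_clb.items() if len(doms) > 1}


# Hypothetical placement: CLB (3, 7) holds voters from all three domains.
placement = [
    ("voter_tr0", (3, 7), 0),
    ("voter_tr1", (3, 7), 1),
    ("voter_tr2", (3, 7), 2),
    ("adder_tr0", (4, 7), 0),
    ("adder_tr1", (5, 7), 1),
]

print(shared_clbs(placement))
```

Any CLB reported here shares one routing switch among several domains, which is the situation the fault injection results point to as fragile.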









### Domain Crossing Error Characteristics: Accelerator Testing

#### Observed 31 DCEs in two hours of testing

- 43% of observed DCEs are currently correlated to fault injection results
- DCE cross-section for the tested circuit is 6.6x10<sup>-11</sup> ± 3.8x10<sup>-13</sup> cm<sup>2</sup>/device
  - One order of magnitude smaller than the MBU cross-section for the 2V1000
  - In fault injection, the tested design had 1% of the device affected by DCEs
- In addition to DCEs, 19 single-event functional interrupts (SEFIs) were observed during the test
  - SEFI cross-section for the device is 4.1x10<sup>-11</sup> ± 4.9x10<sup>-13</sup> cm<sup>2</sup>/device
  - DCEs are on the same order of magnitude as SEFIs
- Same risk as a SEFI: a possibility, but a manageable problem
  - Design-dependency issues and mission-criticality
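
As a quick consistency check on the figures above, the sketch below compares the ratio of observed event counts with the ratio of the reported cross-sections. It assumes that both event types were counted over the same fluence, in which case the two ratios should agree; the variable names are illustrative only.

```python
import math

# Values quoted on this slide.
dce_events, sefi_events = 31, 19
dce_xs, sefi_xs = 6.6e-11, 4.1e-11  # cm^2/device

# With a common fluence, cross-section ratio == event-count ratio.
count_ratio = dce_events / sefi_events
xs_ratio = dce_xs / sefi_xs

# "Same order of magnitude": both cross-sections share a power-of-ten exponent.
same_decade = math.floor(math.log10(dce_xs)) == math.floor(math.log10(sefi_xs))

print(round(count_ratio, 2), round(xs_ratio, 2), same_decade)
```

Both ratios come out near 1.6, so the two reported cross-sections are mutually consistent with the observed counts under that assumption.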






### **Probability of Domain Crossing Error**

- Simple probability model described in paper
  - Based on the accelerator results to predict the rate of MBU occurrence
  - Based on the fault injection results
- For the 2V1000 accelerator results, the model gives a worst-case probability of 0.36% that a DCE will occur on the Virtex-II for these designs, at the highest tested LET of 58.7 MeV-cm<sup>2</sup>/mg
- Extending these results to the Virtex-5, using the 2V1000 fault injection results and the Virtex-5 accelerator data at an LET of 72.7 MeV-cm<sup>2</sup>/mg, gives a worst-case probability of 1.2%, not including the >4-bit MBUs
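
The model itself is described in the paper and is not reproduced on this slide. As a rough sketch of how the two data sources could be combined, the code below assumes a law-of-total-probability form: the accelerator data supplies the distribution of event sizes, and fault injection supplies the chance that an event of each size causes a DCE. Every number in the example is a placeholder, not a measured value, and the function name `dce_probability` is hypothetical.

```python
def dce_probability(event_size_dist, dce_given_size):
    """Sum P(size) * P(DCE | size) over event sizes.

    event_size_dist: {bits_upset: probability}, assumed to come from accelerator data.
    dce_given_size:  {bits_upset: P(DCE | size)}, assumed to come from fault injection.
    Sizes with no fault injection data contribute zero, so the result is a lower bound
    when large events are left out.
    """
    if abs(sum(event_size_dist.values()) - 1.0) > 1e-9:
        raise ValueError("event size distribution must sum to 1")
    return sum(p * dce_given_size.get(size, 0.0) for size, p in event_size_dist.items())


# Placeholder inputs for illustration only; key 5 stands for ">4-bit" events.
event_size_dist = {1: 0.90, 2: 0.06, 3: 0.02, 4: 0.01, 5: 0.01}
dce_given_size = {1: 0.0, 2: 0.01, 3: 0.02, 4: 0.03}

print(f"{dce_probability(event_size_dist, dce_given_size):.4%}")
```

Leaving the largest events out of `dce_given_size` mirrors the caveat above that the Virtex-5 estimate does not include the >4-bit MBUs.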

Probability of DCE in 2V1000 Device



Probability of DCE in 5VLX50 Device








### Conclusions

- DCEs were observed in both fault injection and accelerator testing of 2V1000 TMR circuits
  - Even small MBUs can cause TMR failures
  - Approximately 0.1-1% of entire device affected
  - CLB routing network fragile in TMR schemes
  - Cross-section similar to the SEFI cross-section for the device
  - Problem is manageable if designers are aware of potential DCE problems
  - For our test circuits, our model shows a worst-case DCE probability of 0.36%, but this is likely design-dependent

#### In the future, methods and techniques for mitigating DCEs are needed:

- Avoid placing more than one domain in a CLB
- Avoid routing signals from one domain through another domain's switch




