
© Copyright 2019 Xilinx

Hot Chips 31, Aug 20, 2019

Xilinx First 7nm Device: Versal AI Core (VC1902)

Sagheer Ahmad, Sridhar Subramanian, Vamsi Boppana, Shankar Lakka, Fu-Hing Ho, Tomai Knopp, Juanjo Noguera, Gaurav Singh, Ralph Wittig

@Xilinx Inc


Agenda

˃ Versal Overview
– What is Versal
– Versal series overview
– First Versal device

˃ Key Blocks & Features
– NOC, Memory, Interfaces, IOs, and SerDes
– PS/PMC, Security, Config and Debug
– Programmable Logic

˃ AI Engine
– Array, Core
– Compute, memory, and throughput
– Benchmarks and use-cases performance


Xilinx Device Categories

DEVICE CATEGORY    FEATURED PRODUCTS
FPGA               Spartan, Artix, Kintex, Virtex
SoC                Zynq-7000, Zynq UltraScale+ MPSoC, Zynq UltraScale+ RFSoC
ACAP               Versal

Versal = First ACAP device series

ACAP = Adaptive Compute Acceleration Platform


Versal Series Overview

[Figure: Versal series block diagram. Engine groups: Scalar Engines, Adaptable Engines, Intelligent Engines. Labeled blocks: Arm Dual-Core Cortex-A72 Application Processor; Arm Dual-Core Cortex-R5 Real-Time Processor; Platform Management Controller; Processing System; Programmable Logic; Block RAM; UltraRAM; Accelerator RAM; AI Engines; DSP Engines; Network On Chip; PCIe & CCIX (w/ DMA); DDR; HBM; Multirate Ethernet; 600G Cores; Direct RF; 28G, 58G, and 112G SerDes; MIPI; LVDS; GPIO.]

Compute Engines
– Scalar Processors in every device
– Enhanced Programmable Logic
– New AI and enhanced DSP Engines

NoC and Memory
– High BW Network-on-Chip
– Hardened [LP]DDR4/5, and HBM

High-Speed Interfaces
– PCIe & CCIX up to Gen5
– Ethernet MAC up to 600Gbps

SerDes and RF
– SerDes up to 112G PAM4
– Integrated ADC/DAC


First Versal AI Core (VC1902)

Process Technology: TSMC 7nm FF
# Transistors: 37B
On-die memory: 855Mb
# AI engine cores: 400
# IOs: 785
# SerDes: 44

Shipping to Early Customers

[Figure: VC1902 die floorplan. Labeled regions: AI Engines; PS & PMC; SerDes (two blocks); VNOC Column (four); DSP Column (four); HNoC (two); DDR MC, PHY & IOs; PCIe & CCIX; Ethernet and PCIe; PL.]


Versal NoC (Network-on-Chip)

Packetized High-speed NoC
– All of SoC building blocks & PL connected via NoC
– Packetized w/ VCs & end-to-end ECC protection

Highly Configurable & Scalable
– Configurable topology, ports, routing, and QoS
– Compiler to generate use-case specific routing, QoS, ...
– NoC extends for die-to-die connectivity

Clocking & Power Management
– Clock forwarding to minimize clock jitter & power
– Aggressive clock-gating & data bypass

Vertical NoC
– 2 physical channels, each w/ 8 VCs
– 7 NMU and 7 NSU per column
– >0.5Tbps of bidirectional bandwidth per column

Horizontal NoC
– 4 physical channels, each w/ 8 VCs
– 4 NSU ports per DDR Controller
– >1Tbps bidirectional bandwidth per row

Data movement efficiency critical for compute acceleration

[Figure: NoC (Conceptual). Labels: NMU (Ingress) and NSU (Egress), each with AW, AR, WR, B, RESP; REQ on both sides; High speed transport; One Switch.]
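A rough way to compare the vertical and horizontal NoC figures is to divide each quoted aggregate by its number of physical channels. The sketch below assumes the bandwidth is split evenly across channels, which the slide does not state; the variable names are illustrative only.

```python
# Lower-bound aggregates quoted on the slide, in Tbps (bidirectional).
VERTICAL_TBPS_PER_COLUMN = 0.5   # ">0.5Tbps" per column
HORIZONTAL_TBPS_PER_ROW = 1.0    # ">1Tbps" per row
VERTICAL_CHANNELS = 2            # physical channels per column
HORIZONTAL_CHANNELS = 4          # physical channels per row


def per_channel_gbps(aggregate_tbps: float, channels: int) -> float:
    """Bidirectional Gbps per physical channel, ASSUMING an even split."""
    return aggregate_tbps * 1000.0 / channels


vertical = per_channel_gbps(VERTICAL_TBPS_PER_COLUMN, VERTICAL_CHANNELS)
horizontal = per_channel_gbps(HORIZONTAL_TBPS_PER_ROW, HORIZONTAL_CHANNELS)

print(f"vertical:   >{vertical:.0f} Gbps per physical channel")
print(f"horizontal: >{horizontal:.0f} Gbps per physical channel")
```

Under that assumption both directions of the mesh work out to the same per-channel lower bound, which suggests (but does not prove) a common channel design.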



Memory Subsystem and IO

Unified Memory Subsystem
– Unified memory subsystem, but can be customized
– Transaction reordering & QoS for multiple traffic types

DDR Memory Controller
– 256b DDR w/ 4x 64b or 8x 32b channels
– Optimized for 64b or 32b memory channels
– 32b granularity more efficient for some use-cases
– DDR4 up to 3200 and LPDDR4 up to 4266 Mbps

Parallel IOs
– 644x high performance XPIOs for DDR, MIPI, …
– 137x high density multiprotocol IOs for up to 3.3V

[Figure labels: HD IOs (two blocks); MIOs; DDR & XPIOs.]
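The DDR figures above translate into a peak bandwidth by multiplying bus width by per-pin data rate. A minimal sketch, assuming ideal peak transfer with no refresh or protocol overhead (the function name is illustrative):

```python
def peak_bandwidth_gb_s(bus_width_bits: int, rate_mbps_per_pin: float) -> float:
    """Peak bandwidth in decimal GB/s for a bus at the given per-pin rate."""
    bits_per_second = bus_width_bits * rate_mbps_per_pin * 1e6
    return bits_per_second / 8 / 1e9


ddr4_256b = peak_bandwidth_gb_s(256, 3200)     # full 256b interface, DDR4-3200
lpddr4_256b = peak_bandwidth_gb_s(256, 4266)   # full 256b interface, LPDDR4-4266
ddr4_per_64b = peak_bandwidth_gb_s(64, 3200)   # one of 4x 64b channels
ddr4_per_32b = peak_bandwidth_gb_s(32, 3200)   # one of 8x 32b channels

print(f"DDR4-3200, 256b:   {ddr4_256b:.1f} GB/s")
print(f"LPDDR4-4266, 256b: {lpddr4_256b:.1f} GB/s")
print(f"DDR4-3200 per 64b channel: {ddr4_per_64b:.1f} GB/s")
print(f"DDR4-3200 per 32b channel: {ddr4_per_32b:.1f} GB/s")
```

The DDR4 result (about 102.4 GB/s) is consistent with the 102 GB/s DRAM figure quoted on the AI Engine memory hierarchy slide.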



PCIe, Ethernet, and SerDes

4x 100G Multi-rate Ethernet
– Multi-rate (100/50/40/25/10Gbps) Ethernet
– MACs with RS-FEC. 1588 support.

6x PCIe Gen4
– Up to Gen4 x16 with End-point & Root-port
– Smart storage or IO-Hub accelerator

SerDes
– 44x 32Gbps multi protocol transceivers
– Supports 100+ protocol/rate combinations

[Figure labels: SerDes; PCIe; GbEs.]
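The interface counts on this slide can be turned into aggregate raw rates. The sketch below uses the slide's lane and port counts; the PCIe Gen4 per-lane rate (16 GT/s) and 128b/130b encoding come from the PCIe specification rather than the slides, and the result ignores packet overhead.

```python
SERDES_LANES = 44
SERDES_GBPS_PER_LANE = 32
ETHERNET_PORTS = 4
ETHERNET_GBPS_PER_PORT = 100

# Assumptions taken from the PCIe specification, not from the slides.
PCIE_GEN4_GT_PER_S = 16.0     # raw transfers per second per lane
PCIE_ENCODING = 128 / 130     # 128b/130b line encoding efficiency
PCIE_LANES = 16               # "Gen4 x16"

serdes_total_gbps = SERDES_LANES * SERDES_GBPS_PER_LANE
ethernet_total_gbps = ETHERNET_PORTS * ETHERNET_GBPS_PER_PORT
pcie_x16_gb_s = PCIE_GEN4_GT_PER_S * PCIE_LANES * PCIE_ENCODING / 8

print(f"SerDes raw aggregate:   {serdes_total_gbps} Gbps per direction")
print(f"Ethernet MAC aggregate: {ethernet_total_gbps} Gbps")
print(f"PCIe Gen4 x16:          {pcie_x16_gb_s:.1f} GB/s per direction")
```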



Versal CCIX

CCIX and PCIe (CPM)
– 2nd generation of CCIX coherent accelerator link

Coherent Home-node and L2 cache
– Home-node for coherent peer processing
– L2 for caching capability for PL accelerator kernels

CCIX ESM (Extended Speed Mode)
– Supports PCIe Gen4 x16 for CCIX & PCIe
– Supports up to CCIX 25Gbps 2x8

[Figure: CPM block diagram. Labels: PCIe w/ CCIX (two instances), each with GTs, XPIPE, Physical Layer, Link Layer, PCIe T Layer, and CCIX T Layer; DMA & Bridge; AXI4; AXIS (two); CCIX to CHI Bridge (two); Cache Coherent Mesh; L2 Cache (two); CHI (two); ATC (two); CPM Interconnect; PS/NoC; Clock, Reset, Debug; PL with two User Kernels, each with a Local Cache.]

Coherent Load/Store Memory Semantics


Versal Processor System (PS)

PS in all Versal devices
– 3rd generation of PS integration
– First generation with PS in all devices
– Host for embedded, control for acceleration

Dual-core A72 APU
– 2x Cores with 1MB L2 Cache with ECC
– Coherency and virtualization support

Dual core R5 RPU w/ lockstep
– 2x Cores with 256KB TCM & 256KB OCM
– ASIL-C(D) capable functional safety

[Figure label: Processor System.]


Versal PMC and Security

PMC (Platform Mgmt Controller)

[Figure: PMC block diagram. Labels: Triple Redundant MicroBlaze; Boot ROM; RAM; User; Boot/Flash Interfaces; AES-GCM; SHA3-384; RSA/ECDSA; TRNG; Decryption; Authentication; Key-loading; AES-key; Device; Efuse; BBRAM; PUF; Voltage & Temperature Monitor; Internal Clock Generator; connections to PL, PS, and NoC, DDR, ME, ...]

Platform Management Ctrl (PMC)
– Gateway for Boot/Config, Security, Power mgmt, …
– Dual-core triple redundant MicroBlaze subsystem
– Crypto accelerator engines (RSA, ECDSA, AES, SHA)

Security & Monitors
– Hardware RoT with authentication and encryption
– Key storage & management including PUF support
– Distributed Voltage & Temp monitors

50Gbps Configuration Interface
– Typical PL kernel configuration in sub 10msec
– 8x faster PL configuration time per config-bit

10Gbps Debug & Trace Interface
– New HSDP (High-Speed Debug Port) serial interface
– 100x faster than JTAG for debug & trace
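To put the configuration and debug rates in perspective, the sketch below relates link rate, image size, and transfer time. It assumes the links run at full line rate with no protocol overhead, which the slide does not claim; the 100 Mb image size is a hypothetical value chosen for illustration.

```python
CONFIG_LINK_GBPS = 50.0   # configuration interface rate from the slide
HSDP_LINK_GBPS = 10.0     # debug & trace interface rate from the slide
JTAG_SPEEDUP = 100        # "100x faster than JTAG"


def transfer_time_ms(size_megabits: float, link_gbps: float) -> float:
    """Time in ms to move size_megabits over an ideal link of link_gbps."""
    return size_megabits * 1e6 / (link_gbps * 1e9) * 1e3


def megabits_in_window(window_ms: float, link_gbps: float) -> float:
    """Megabits an ideal link of link_gbps can move within window_ms."""
    return link_gbps * 1e9 * window_ms * 1e-3 / 1e6


# Largest image that fits the "sub 10msec" window at full line rate.
max_image_megabits = megabits_in_window(10.0, CONFIG_LINK_GBPS)

# Hypothetical 100 Mb partial image (illustrative size, not from the slides).
example_time_ms = transfer_time_ms(100.0, CONFIG_LINK_GBPS)

# JTAG rate implied by the quoted 100x ratio.
implied_jtag_mbps = HSDP_LINK_GBPS * 1000.0 / JTAG_SPEEDUP

print(f"max image in 10 ms: {max_image_megabits:.0f} Mb")
print(f"100 Mb image:       {example_time_ms:.1f} ms")
print(f"implied JTAG rate:  {implied_jtag_mbps:.0f} Mbps")
```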



Versal Programmable Logic (PL)

158Mb of URAM & BRAM
– Distributed URAM and BRAM columns
– Customizable memory hierarchy
– 50% lower power than previous gen

900K LUTs (2M LC) and 1.8M Flops
– 4x Larger CLB (8 LUTs → 32 LUTs, 16 Flip-Flops → 64 Flip-Flops)
– Increased local routing (lower global routing)
– Imux registers for pipeline & time borrowing

[Figure labels: Versal CLB; New CLB Interconnect.]
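The LUT, flip-flop, and CLB figures on this slide can be cross-checked against each other. A small sketch, treating the rounded "900K LUTs" as exact (an assumption, since the slide gives only the rounded value):

```python
TOTAL_LUTS = 900_000        # "900K LUTs", treated as exact for this check
LUTS_PER_CLB = 32           # Versal CLB
FLOPS_PER_CLB = 64          # Versal CLB
PREV_LUTS_PER_CLB = 8       # previous-generation CLB

clb_count = TOTAL_LUTS // LUTS_PER_CLB
total_flops = clb_count * FLOPS_PER_CLB
flops_per_lut = FLOPS_PER_CLB // LUTS_PER_CLB
clb_size_ratio = LUTS_PER_CLB // PREV_LUTS_PER_CLB

print(f"CLBs:          {clb_count}")
print(f"flip-flops:    {total_flops}")
print(f"flops per LUT: {flops_per_lut}")
print(f"CLB size vs previous generation: {clb_size_ratio}x")
```

The derived flip-flop total agrees with the "1.8M Flops" on the slide, and the CLB size ratio matches the stated 4x.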



Versal Programmable Logic DSP

1968x DSP58
– Distributed DSP columns
– 2.8TFLOP/s FP32 Peak
– 11.8TOP/s (INT8) Peak

Versal DSP58
– FP32/16 floating point
– INT8/16/24 and CMPLX18 fixed point

[Figure: floorplan with six groups of DSP Columns.]
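Dividing the peak figures by the DSP58 count gives the throughput each block must sustain. The sketch below deliberately assumes no clock frequency or operations-per-cycle value, since the slide states neither; it reports only the product of the two.

```python
DSP58_COUNT = 1968
INT8_PEAK_OPS_PER_S = 11.8e12     # "11.8TOP/s (INT8) Peak"
FP32_PEAK_FLOPS_PER_S = 2.8e12    # "2.8TFLOP/s FP32 Peak"

# Per-block throughput = (ops per cycle) x (clock); neither factor is
# given on the slide, so only their product is derived here.
int8_gops_per_dsp = INT8_PEAK_OPS_PER_S / DSP58_COUNT / 1e9
fp32_gflops_per_dsp = FP32_PEAK_FLOPS_PER_S / DSP58_COUNT / 1e9
int8_to_fp32_ratio = INT8_PEAK_OPS_PER_S / FP32_PEAK_FLOPS_PER_S

print(f"INT8 per DSP58: {int8_gops_per_dsp:.2f} GOP/s")
print(f"FP32 per DSP58: {fp32_gflops_per_dsp:.2f} GFLOP/s")
print(f"INT8 : FP32 peak ratio: {int8_to_fp32_ratio:.2f}")
```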



Versal Silicon

Status is shown in two groups on the slide: Engines, and Interfaces & IP.

100G Multi-Rate MAC
• Passed internal loopback without GTY at full speed

32Gb/s Transceivers
• Passed backplane PRBS31
• 7.16ps total & 180fs random jitter

PCIe & CCIX
• Passed Gen3 compliance
• Clean link at Gen4 x4

DDR Memory
• DDR4 running at 3200Mb/s
• LPDDR4 running at 4266Mb/s

Programmable Logic
• PL state machines
• Data across NoC & AI Engines

Scalar Engines
• Boots 64-bit Linux
• A72, R5, PMC all running

AI Engines
• All 400 AI Engine Tiles functional

Network-on-Chip (NoC)
• Running error-free @ 3200 Mb/s
• Arbitration across engines


AI Engine: Array

400 AI Engine Tiles
– 133 TOPs (INT8) Peak

Non-blocking Interconnect Mesh
– 20Tbps row x-sectional bandwidth
– 10 32-bit channels per column and 8 per row

Distributed Memory Hierarchy
– 12.5MB distributed L1 memory
– Multi-bank local memory shared w/ neighboring tiles

Distributed DMA


AI Engine: Core

7+ Ops per cycle VLIW

Local, Shareable Memory
• 32KB Local, 128KB Addressable

32b Scalar RISC Processor
• 2 Scalar Ops / Stream Access

Vector Processor
• 512-bit SIMD Datapath
• 2 Vector Loads / 1 Mult / 1 Store
• vec128int8
• vec8fp32

[Figure: AI Engine core block diagram. Labels: Instruction Fetch & Decode Unit; Scalar Unit (Scalar Register File, Scalar ALU, Non-linear Functions); Vector Unit (Vector Register File, Fixed-Point Vector Unit, Floating-Point Vector Unit); AGU (three); Load Unit A; Load Unit B; Store Unit; Memory Interface; Stream Interface.]


Multi-Precision Support

AI Data Types: MACs / Cycle (per core)

Operand types    MACs / Cycle
8x8 Real         128
16x8 Real        64
16x16 Real       32
32x16 Real       16
32x32 Real       8
32x32 SPFP       8

Signal Processing Data Types: MACs / Cycle (per core)

Operand types            MACs / Cycle
16 Complex x 16 Real     16
16x16 Complex            8
32x16 Complex            4
32x32 Complex            2
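The per-core MAC rates connect to the array's quoted 133 TOPs (INT8) peak. The sketch below counts one MAC as two operations (a common convention, assumed here) and derives the clock frequency that the peak figure implies; the slides do not state the clock directly.

```python
TILES = 400                    # AI Engine tiles in the VC1902
OPS_PER_MAC = 2                # assumption: 1 MAC = 1 multiply + 1 add
PEAK_INT8_OPS_PER_S = 133e12   # "133 TOPs (INT8) Peak"

# MACs per cycle per core for real operand types (AI data types chart).
MACS_PER_CYCLE = {
    "8x8 Real": 128,
    "16x8 Real": 64,
    "16x16 Real": 32,
    "32x16 Real": 16,
    "32x32 Real": 8,
    "32x32 SPFP": 8,
}

# Clock implied by the INT8 peak; a derived value, not a quoted one.
int8_ops_per_cycle = TILES * MACS_PER_CYCLE["8x8 Real"] * OPS_PER_MAC
implied_clock_ghz = PEAK_INT8_OPS_PER_S / int8_ops_per_cycle / 1e9


def peak_tera_ops(macs_per_cycle: int, clock_ghz: float) -> float:
    """Array-wide peak in tera-ops/s for a given per-core MAC rate."""
    return TILES * macs_per_cycle * OPS_PER_MAC * clock_ghz * 1e9 / 1e12


print(f"implied clock: {implied_clock_ghz:.2f} GHz")
for name, macs in MACS_PER_CYCLE.items():
    print(f"{name:12s} {peak_tera_ops(macs, implied_clock_ghz):7.2f} tera-ops/s")
```

The peaks for the other precisions follow from scaling by the MAC ratio and are derived values under the same assumptions, not figures quoted in the presentation.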



AI Engine Memory Hierarchy

Level      Capacity                                  Bandwidth
DRAM       > 64 GByte                                102 GB/s
L2 SRAM    16.3 MByte                                1.6 TB/s
L1 SRAM    12.5 MByte (128 kByte 4 Core Cluster)     38 TB/s

Features shown across the slide builds:
– Flexible XBAR
– Adaptable L1 NOC with DMA
– Multicast / Broadcast
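The L1 capacity figures can be reproduced from the per-tile numbers given on the AI Engine core slide. A minimal sketch, assuming binary units (1 MB = 1024 KB) and an even split of the quoted L1 bandwidth across tiles; neither assumption is stated in the slides.

```python
TILES = 400                 # AI Engine tiles
LOCAL_KB_PER_TILE = 32      # "32KB Local"
CLUSTER_TILES = 4           # "4 Core Cluster"
L1_BANDWIDTH_B_PER_S = 38e12  # "38 TB/s"

total_l1_mb = TILES * LOCAL_KB_PER_TILE / 1024          # binary MB assumed
cluster_kb = CLUSTER_TILES * LOCAL_KB_PER_TILE          # addressable per cluster
per_tile_gb_s = L1_BANDWIDTH_B_PER_S / TILES / 1e9      # even split assumed

print(f"total L1:            {total_l1_mb} MB")
print(f"per 4-core cluster:  {cluster_kb} KB")
print(f"L1 bandwidth / tile: {per_tile_gb_s:.0f} GB/s")
```

Both capacities match the slide: 12.5 MB in total and 128 KB addressable per cluster.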



AI Engine: Compute Efficiency

Vector Processor Efficiency (Peak Kernel Theoretical Performance)

ML Convolutions: 95% (Block-based Matrix Multiplication, (32×64) × (64×32))
FFT: 80% (1024-pt FFT/iFFT)
DPD: 98% (Volterra-based forward-path DPD)

˃ Adaptable, non-blocking interconnect
– Flexible data movement architecture
– Avoids interconnect “bottlenecks”

˃ Adaptable memory hierarchy
– Local, distributed, shareable = extreme bandwidth
– No cache misses or data replication
– Extend to PL memory (BRAM, URAM)

˃ Distributed DMA for overlapping Compute and Comm.

[Figure: timeline showing Compute and Comm phases overlapped.]
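Why overlapping compute and communication matters can be seen with a simple double-buffering timing model. This is a generic sketch, not a description of how the AI Engine DMA is actually scheduled; the cycle counts are hypothetical.

```python
def serial_time(compute: int, comm: int, blocks: int) -> int:
    """Total time when each block is transferred, then computed, in turn."""
    return blocks * (compute + comm)


def overlapped_time(compute: int, comm: int, blocks: int) -> int:
    """Total time with double buffering.

    The first block must be transferred before any compute starts. After
    that, the transfer of block i+1 runs while block i is computed, so
    each steady-state step costs max(compute, comm). The final block has
    no transfer left to hide, so it costs only its compute time.
    """
    return comm + (blocks - 1) * max(compute, comm) + compute


# Hypothetical per-block costs in cycles (illustrative values only).
COMPUTE, COMM, BLOCKS = 10, 6, 100

t_serial = serial_time(COMPUTE, COMM, BLOCKS)
t_overlap = overlapped_time(COMPUTE, COMM, BLOCKS)
utilization = COMPUTE * BLOCKS / t_overlap

print(f"serial:     {t_serial} cycles")
print(f"overlapped: {t_overlap} cycles")
print(f"compute utilization with overlap: {utilization:.1%}")
```

When transfers are shorter than compute, as in this example, the model shows compute utilization approaching 100%, which is the regime the efficiency figures above depend on.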



AI Engine: Performance Benchmark

GoogleNet Inference Performance (sub 2ms latency), Images/Second

Alveo U250 xDNN (UltraScale+ series)    4087
Versal AI Core Series*                  29250

ResNet-50 Inference Performance, Images/Second

Alveo U250 xDNN (UltraScale+ series)    2043
Versal AI Core Series*                  11812

*Versal AI Core (VC1902) projected performance
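The speedup implied by the two charts is a simple ratio of the quoted throughputs. A short sketch using only the numbers on the slide (the Versal figures are projections, as the footnote says):

```python
# (Alveo U250 xDNN, Versal AI Core projected) in images/second.
RESULTS = {
    "GoogleNet": (4087, 29250),
    "ResNet-50": (2043, 11812),
}

speedups = {
    model: versal / alveo for model, (alveo, versal) in RESULTS.items()
}

for model, ratio in speedups.items():
    print(f"{model}: {ratio:.2f}x projected speedup")
```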



Accelerating AI Applications on Versal

Heterogeneous Acceleration from Data Center to the Edge

Scalar (Arm Dual-Core Cortex-A72, Arm Dual-Core Cortex-R5)
– Scalar, Sequential & Complex Compute

Adaptable
– Flexible Parallel Compute, Data manipulation
– Custom Memory Hierarchy: 463 x 32KB + 967 x 4KB of RAM

Intelligent (AI Engines)
– ML & Signal Processing
– Vector, Compute Intensive

NETWORK-ON-CHIP
– Any-to-Any Connectivity

Also labeled on the diagram: I/O; TB/s of Bandwidth PL-to-AI Engine; Deterministic Performance & Low Latency

Use cases: Video + AI; Genomics + AI; Risk Modeling + AI; Database + AI; Network IPS + AI; Storage + AI
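The custom memory hierarchy line sums two block sizes. The sketch below totals them, assuming binary units (1 MB = 1024 KB); it makes no claim about how this total relates to the 158Mb URAM and BRAM figure on the Programmable Logic slide, which may count bits differently.

```python
LARGE_BLOCKS = 463      # blocks of 32KB
LARGE_BLOCK_KB = 32
SMALL_BLOCKS = 967      # blocks of 4KB
SMALL_BLOCK_KB = 4

large_kb = LARGE_BLOCKS * LARGE_BLOCK_KB
small_kb = SMALL_BLOCKS * SMALL_BLOCK_KB
total_kb = large_kb + small_kb
total_mb = total_kb / 1024   # binary MB assumed

print(f"32KB blocks: {large_kb} KB")
print(f"4KB blocks:  {small_kb} KB")
print(f"total:       {total_kb} KB (about {total_mb:.1f} MB)")
```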



Accelerating 5G Wireless on Versal

5G Wireless Infrastructure

[Figure: 5G wireless infrastructure chain. Labels: Packet Processing and Wired Backhaul; Higher Layer Processing; Switching; Baseband Processing; Beam Forming & MMIO + Some Baseband Transforms; Digital Radio; ADC / DAC; Analogue Radio; Antenna Array.]

Digital Radio w/ ADC/DAC

DUC: Digital Up Converter
DPD: Digital Pre-Distortion
CPRI: Common Public Radio Interface

[Figure: digital radio blocks. Labels: CPRI; DUC; CFR; DPD; DPD Update; ADC/DAC.]

[Figure: digital front-end mapped onto Versal.
Partitions: Programmable Logic (PL); AI Engine Array; Processor System (PS): APU; connected through AIE I/F blocks.
Stage labels: Shaping; Up-sample; Heterodyne; Crest Factor Reduction; Digital Pre-distortion.
Blocks: LTE20 inputs; Channel Filter (four channels); HB1 x2; HB2 x2; HB3 x2; HB4 x2; HB5 5/4; Mixing with four DDS; PC-CFR (two stages, each with Peak Detect and Scale Find, and Delay); CABS; 9x9 DPD kernel; DPD Filter 1/4, 2/4, 3/4, 4/4; DPD LUTs; Coefficient to LUT Conversion; Coefficients; Gain; Memory; Active/Shadow; Frequency Domain Measurements; Power Spectrum Estimate; DPD Output.
Filter lengths: NC = 89; NHB1 = 23; NHB2 = 11; NHB3 = 43; NHB4 = 27; NHB5 = 41.
Sample rates: 30.72 MHz; 30.72 MHz; 61.44 MHz; 122.88 MHz; 491.52 MHz; 614.4 MHz.]
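The sample rates labeled in the digital front-end figure are consistent with a chain of rate-changing filters applied to a 30.72 MHz input. The sketch below is a reading of the figure labels, not a confirmed design: the stage order is assumed, as is the interpretation of the "2" next to each half-band filter as interpolate-by-2.

```python
from fractions import Fraction

BASE_RATE_MHZ = Fraction(3072, 100)  # 30.72 MHz input rate from the figure

# ASSUMED stage order and rate-change factors, inferred from figure labels.
STAGES = [
    ("Channel Filter", Fraction(1)),
    ("HB1", Fraction(2)),
    ("HB2", Fraction(2)),
    ("HB3", Fraction(2)),
    ("HB4", Fraction(2)),
    ("HB5", Fraction(5, 4)),
]


def output_rates(base, stages):
    """Return (stage name, output rate) pairs for a cascade of rate changes."""
    rates = []
    rate = base
    for name, factor in stages:
        rate = rate * factor
        rates.append((name, rate))
    return rates


rates = output_rates(BASE_RATE_MHZ, STAGES)
for name, rate in rates:
    print(f"{name:15s} -> {float(rate):7.2f} MHz")
```

Under these assumptions the chain reproduces every rate labeled in the figure (30.72, 61.44, 122.88, 491.52, and 614.4 MHz); the intermediate 245.76 MHz after HB3 is derived and does not appear among the extracted labels.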


Software Programmable

[Figure: software programming flow in three steps: 1 Design, 2 Compile, 3 Run. Labels: Frameworks; AI Library; Vision Library; 4G/5G/Radar Library; C/C++ (two entry points); AI Engine Compiler.]

Programming Abstraction Levels
– Architecture Overlay
– Data Flow w/ Xilinx libraries
– Kernel Program, Data Flow w/ user defined libraries


Unified Tool Chain for Programming

[Figure: unified tool flow. Labels: Application; Performance & Partitioning Constraints; Base Platform; Xilinx SDK: Eclipse GUI; User-Directed System Partitioner; ARM C Compiler; AI Engine Compiler; Vivado; HLS; RTL; IP; System-C Virtual Simulation Platform; QEMU; Core ISS; System-Level Performance Analysis (using core profiling); System-Level Debugger (using core debugger); Binaries & Bitstream; Targets; SDK; Versal Device; PS; AI Engine Array; PL.]


Summary

Versal is the first generation of ACAP device
– ACAP is a new class of device from Xilinx

Versal employs adaptable heterogeneous system architecture
– New SW programmable AI Engine for diverse compute acceleration workloads
– New High-bandwidth Network-on-Chip integrated w/ hardened DDR Subsystem
– Processor System in all Versal devices
– Re-architected Programmable Logic

Xilinx first 7nm device: Versal AI Core VC1902
– 133TOPs AI Engines, 12TOPs DSP Engines, and 900K LUTs
– 256b DDR4/LPDDR4, PCIe Gen4 & CCIX up to 25Gbps
– For more details, refer to: www.xilinx.com/versal


Adaptable. Intelligent.
