Xilinx First 7nm Device: Versal AI Core (VC1902)
TRANSCRIPT
© Copyright 2019 Xilinx
Hot Chips 31, Aug 20, 2019
Xilinx First 7nm Device: Versal AI Core (VC1902)
Sagheer Ahmad, Sridhar Subramanian, Vamsi Boppana, Shankar Lakka, Fu-Hing Ho, Tomai Knopp, Juanjo Noguera, Gaurav Singh, Ralph Wittig
@Xilinx Inc
Agenda
˃ Versal Overview
– What is Versal
– Versal series overview
– First Versal device
˃ Key Blocks & Features
– NOC, Memory, Interfaces, IOs, and SerDes
– PS/PMC, Security, Config and Debug
– Programmable Logic
˃ AI Engine
– Array, Core
– Compute, memory, and throughput
– Benchmarks and use-cases performance
Xilinx Device Categories
DEVICE CATEGORY: FEATURED PRODUCTS
FPGA: Spartan, Artix, Kintex, Virtex
SoC: Zynq-7000, Zynq UltraScale+ MPSoC, Zynq UltraScale+ RFSoC
ACAP: Versal
Versal = First ACAP device series
ACAP = Adaptive Compute Acceleration Platform
>> 3
Versal Series Overview
[Block diagram]
Scalar Engines: Arm Dual-Core Cortex-A72 Application Processor, Arm Dual-Core Cortex-R5 Real-Time Processor
Adaptable Engines: Programmable Logic
Intelligent Engines: AI Engines, DSP Engines
Memory: Block RAM, UltraRAM, Accelerator RAM
Transceivers: 28G, 58G, 112G
Interfaces and IO: PCIe & CCIX (w/DMA), DDR, HBM, Multirate Ethernet, 600G Cores, Direct RF, MIPI, LVDS, GPIO
Platform Management Controller
Processing System
Network On Chip
Compute Engines
– Scalar Processors in every device
– Enhanced Programmable Logic
– New AI and enhanced DSP Engines
NoC and Memory
– High BW Network-on-Chip
– Hardened [LP]DDR4/5, and HBM
High-Speed Interfaces
– PCIe & CCIX up to Gen5
– Ethernet MAC up to 600Gbps
SerDes and RF
– SerDes Up to 112G PAM4
– Integrated ADC/DAC
>> 4
First Versal AI Core (VC1902)
Process Technology: TSMC 7nm FF
# Transistors: 37B
On-die memory: 855Mb
# AI engine cores: 400
# IOs: 785
# SerDes: 44
Shipping to Early Customers
[Die floorplan: AI Engines; PS & PMC; SerDes; VNOC columns; DSP columns; HNoC; PL; DDR MC, PHY & IOs; PCIe & CCIX; Ethernet and PCIe]
>> 5
Highly Configurable & Scalable
Configurable topology, ports, routing, and QoS
Compiler to generate use-case specific routing, QoS, ...
NoC extends for die-to-die connectivity
Versal NoC (Network-on-Chip)
Vertical NoC
- 2 physical channels, each w/ 8 VCs
- 7 NMU and 7 NSU per column
- >0.5Tbps of bidirectional bandwidth per column
Horizontal NoC
- 4 physical channels, each w/ 8 VCs
- 4 NSU ports per DDR Controller
- >1Tbps bidirectional bandwidth per row
Packetized High-speed NoC
All of SoC building blocks & PL connected via NoC
Packetized w/ VCs & end-to-end ECC protection
Clocking & Power Management
Clock forwarding to minimize clock jitter & power
Aggressive clock-gating & data bypass
Data movement efficiency critical for compute acceleration
[Figure: NoC (Conceptual). Labels: NMU (ingress), NSU (egress), channels AW, AR, WR, B, RESP on each side, REQ, high-speed transport, one switch]
>> 6
Memory Subsystem and IO
Unified Memory Subsystem
Unified memory subsystem, but can be customized
Transaction reordering & QoS for multiple traffic types
Parallel IOs
644x high performance XPIOs for DDR, MIPI, …
137x high density multiprotocol IOs for up to 3.3V
[Floorplan highlight: HD IOs, MIOs]
256b DDR w/ 4x 64b or 8x 32b channels
Optimized for 64b or 32b memory channels
32b granularity more efficient for some use-cases
DDR4 up to 3200 and LPDDR4 up to 4266 Mbps
DDR Memory Controller
DDR & XPIOs
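As a rough cross-check of the DDR figures above, the peak transfer rate follows from bus width times per-pin data rate. The sketch below is only that arithmetic; the function name is hypothetical, and it assumes the quoted rates (3200 and 4266 Mbps) are per pin and that all 256 bits are active.

```python
def peak_bandwidth_gbytes(bus_width_bits: int, rate_mbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s (10^9 bytes/s) for a bus of the given width.

    Assumes every data pin toggles at the quoted per-pin rate; real,
    sustained bandwidth is lower because of refresh, turnarounds, etc.
    """
    bits_per_second = bus_width_bits * rate_mbps_per_pin * 1e6
    return bits_per_second / 8 / 1e9


# Full 256b interface at the two quoted rates.
ddr4_total = peak_bandwidth_gbytes(256, 3200)
lpddr4_total = peak_bandwidth_gbytes(256, 4266)

# Per-channel figures for the two channel splits named on the slide.
ddr4_per_64b_channel = peak_bandwidth_gbytes(64, 3200)
ddr4_per_32b_channel = peak_bandwidth_gbytes(32, 3200)

print(f"DDR4-3200, 256b:   {ddr4_total:.1f} GB/s")
print(f"LPDDR4-4266, 256b: {lpddr4_total:.1f} GB/s")
print(f"DDR4-3200 per 64b channel: {ddr4_per_64b_channel:.1f} GB/s")
print(f"DDR4-3200 per 32b channel: {ddr4_per_32b_channel:.1f} GB/s")
```

The DDR4 result (about 102 GB/s) is consistent with the 102 GB/s figure that appears later in the AI Engine memory hierarchy slide.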
>> 7
PCIe, Ethernet, and SerDes
4x 100G Multi-rate Ethernet
Multi-rate (100/50/40/25/10Gbps) Ethernet
MACs with RS-FEC. 1588 support.
6x PCIe Gen4
Up to Gen4 x16 with End-point & Root-port
Smart storage or IO-Hub accelerator
[Floorplan highlight: SerDes, PCIe, GbEs]
SerDes
44x 32Gbps multi protocol transceivers
Supports 100+ protocol/rate combinations
[Floorplan highlight: SerDes, PCIe]
>> 8
Versal CCIX
CCIX and PCIe (CPM)
– 2nd generation of CCIX coherent accelerator link
Coherent Home-node and L2 cache
– Home-node for coherent peer processing
– L2 for caching capability for PL accelerator kernels
CCIX ESM (Extended Speed Mode)
– Supports PCIe Gen4 x16 for CCIX & PCIe
– Supports up to CCIX 25Gbps 2x8
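To put the two link configurations above side by side, here is a minimal sketch of per-direction link bandwidth. It uses the standard PCIe Gen4 rate of 16 GT/s per lane with 128b/130b encoding; treating the 25Gbps ESM mode as using the same 128b/130b encoding is an assumption of this sketch, not something stated on the slide, and the function name is hypothetical.

```python
def link_bandwidth_gbytes(gt_per_s: float, lanes: int,
                          encoding_efficiency: float = 128 / 130) -> float:
    """Per-direction payload bandwidth in GB/s after line encoding.

    Protocol overhead (TLP headers, flow control) is not modelled.
    """
    return gt_per_s * lanes * encoding_efficiency / 8


# PCIe Gen4 x16: 16 GT/s per lane, 16 lanes.
pcie_gen4_x16 = link_bandwidth_gbytes(16, 16)

# CCIX ESM at 25 Gbps on two x8 links (16 lanes in total).
# Assumption: same 128b/130b encoding as PCIe Gen3 and later.
ccix_esm_2x8 = link_bandwidth_gbytes(25, 2 * 8)

print(f"PCIe Gen4 x16:    {pcie_gen4_x16:.1f} GB/s per direction")
print(f"CCIX ESM 25G 2x8: {ccix_esm_2x8:.1f} GB/s per direction")
```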
[Block diagram: CPM containing two PCIe w/ CCIX controllers (GTs, XPIPE, Physical Layer, Link Layer, PCIe T Layer, CCIX T Layer), DMA & Bridge, AXI4 and AXIS interfaces, CCIX to CHI Bridges, CPM Interconnect, Cache Coherent Mesh with L2 Cache, ATC; Clock, Reset, Debug; CHI links to User Kernels with Local Cache in the PL; connections to PL and PS/NoC]
Coherent Load/Store Memory Semantics
>> 9
Versal Processor System (PS)
PS in all Versal devices
3rd generation of PS integration
First generation with PS in all devices
Host for embedded, control for acceleration
Processor System
Dual-core A72 APU
2x Cores with 1MB L2 Cache with ECC
Coherency and virtualization support
Dual core R5 RPU w/ lockstep
2x Cores with 256KB TCM & 256KB OCM
ASIL-C(D) capable functional safety
>> 10
Versal PMC and Security
[Block diagram: PMC (Platform Mgmt Controller). Labels: Triple Redundant MicroBlaze; Boot ROM, RAM; AES-key sources (Device, Efuse, BBRAM, PUF); AES-GCM, SHA3-384, RSA/ECDSA, TRNG; Decryption, Authentication, Key-loading; User Boot/Flash Interfaces; Voltage & Temperature Monitor; Internal Clock Generator; connections to PL, PS, NoC, DDR, ME, ...]
Platform Management Ctrl (PMC)
Gateway for Boot/Config, Security, Power mgmt, …
Dual-core triple redundant MicroBlaze subsystem
Crypto accelerator engines (RSA, ECDSA, AES, SHA)
Security & Monitors
Hardware RoT with authentication and encryption
Key storage & management including PUF support
Distributed Voltage & Temp monitors
50Gbps Configuration Interface
Typical PL kernel configuration in sub 10msec
8x faster PL configuration time per config-bit
10Gbps Debug & Trace Interface
New HSDP (High-Speed Debug Port) serial interface.
100x faster than JTAG for debug & trace
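The configuration and debug figures above imply some simple bounds. The sketch below only multiplies out the quoted numbers; the function name is hypothetical, and it assumes the 50Gbps interface is fully utilised for the whole 10 ms, which gives an upper bound rather than a measured kernel size.

```python
def bits_transferred(rate_gbps: int, time_ms: int) -> int:
    """Bits moved at rate_gbps over time_ms (1 Gb/s for 1 ms = 10^6 bits)."""
    return rate_gbps * time_ms * 1_000_000


# Upper bound on configuration data loadable in 10 ms at 50 Gbps.
max_config_bits = bits_transferred(50, 10)
max_config_mbytes = max_config_bits / 8 / 1e6

# "100x faster than JTAG" at 10 Gbps implies this JTAG-class rate in Mbps.
implied_jtag_mbps = 10_000 / 100

print(f"Config data in 10 ms: {max_config_bits / 1e6:.0f} Mb ({max_config_mbytes} MB)")
print(f"Implied JTAG rate: {implied_jtag_mbps:.0f} Mbps")
```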
>> 11
Versal Programmable Logic (PL)
158Mb of URAM & BRAM
Distributed URAM and BRAM columns
Customizable memory hierarchy
50% lower power than previous gen
900K LUTs (2M LC) and 1.8M Flops
4x Larger CLB (8 LUTs to 32 LUTs, 16 Flip-Flops to 64 Flip-Flops)
Increased local routing (lower global routing)
Imux registers for pipeline & time borrowing
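The LUT, flop, and CLB-size figures above are mutually consistent, which a quick calculation shows. This is a sketch using the rounded slide numbers (900K LUTs, 1.8M flops), so the CLB counts are approximate.

```python
LUTS = 900_000            # "900K LUTs" (rounded slide figure)
FLOPS = 1_800_000         # "1.8M Flops" (rounded slide figure)
LUTS_PER_CLB = 32         # new, 4x larger CLB
FLOPS_PER_CLB = 64
OLD_LUTS_PER_CLB = 8      # previous CLB size quoted on the slide

clbs_from_luts = LUTS // LUTS_PER_CLB
clbs_from_flops = FLOPS // FLOPS_PER_CLB
old_size_clbs = LUTS // OLD_LUTS_PER_CLB   # same LUT count in 8-LUT CLBs

print(f"CLBs implied by LUT count:  {clbs_from_luts}")
print(f"CLBs implied by flop count: {clbs_from_flops}")
print(f"Equivalent 8-LUT CLBs:      {old_size_clbs}")
```

Both resource counts imply roughly the same number of CLBs, one quarter of what the same logic would need with 8-LUT CLBs.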
[Figure: Versal CLB and new CLB interconnect]
>> 12
Versal Programmable Logic DSP
[Floorplan highlight: DSP columns]
Versal DSP58
FP32/16 floating point
INT8/16/24 and CMPLX18 fixed point
1968x DSP58
Distributed DSP columns
2.8TFLOP/s FP32 Peak
11.8TOP/s (INT8) Peak
>> 13
Versal Silicon
Interfaces & IP
100G Multi-Rate MAC
• Passed internal loopback without GTY at full speed
32Gb/s Transceivers
• Passed backplane PRBS31
• 7.16ps total & 180fs random jitter
PCIe & CCIX
• Passed Gen3 compliance
• Clean link at Gen4 x4
DDR Memory
• DDR4 running at 3200Mb/s
• LPDDR4 running at 4266Mb/s
Engines
Programmable Logic
• PL state machines
• Data across NoC & AI Engines
Scalar Engines
• Boots 64-bit Linux
• A72, R5, PMC all running
AI Engines
• All 400 AI Engine Tiles functional
Network-on-Chip (NoC)
• Running error-free @ 3200 Mb/s
• Arbitration across engines
>> 14
400 AI Engine Tiles
133 TOPs (INT8) Peak
AI Engine: Array
Non-blocking Interconnect Mesh
20Tbps row x-sectional bandwidth
10 32-bit channels per column and 8 per row
Distributed Memory Hierarchy
12.5MB distributed L1 memory
Multi-bank local memory shared w/ neighboring tiles
Distributed DMA
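The 12.5MB L1 figure above can be reproduced from the per-tile numbers given elsewhere in the deck (32KB local memory per tile, 128KB addressable per core). The following is a small sketch of that arithmetic, reading "MB" as 1024 KB.

```python
TILES = 400            # AI Engine tiles in the array
LOCAL_KB = 32          # local memory per tile
ADDRESSABLE_KB = 128   # memory addressable by one core

# Total distributed L1 across the array, in units of 1024 KB.
total_l1_mb = TILES * LOCAL_KB / 1024

# Number of 32KB tile memories one core can address, i.e. its own
# plus those shared with neighboring tiles.
memories_per_core = ADDRESSABLE_KB // LOCAL_KB

print(f"Total L1: {total_l1_mb} MB")
print(f"Tile memories addressable per core: {memories_per_core}")
```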
>> 15
AI Engine: Core
Local, Shareable Memory
• 32KB Local, 128KB Addressable
32b Scalar RISC Processor
• 2 Scalar Ops / Stream Access
Vector Processor
• 512-bit SIMD Datapath
• 2 Vector Loads / 1 Mult / 1 Store
• vec128int8
• vec8fp32
[Block diagram: Memory Interface; Scalar Unit (Scalar Register File, Scalar ALU, Non-linear Functions); Vector Unit (Vector Register File, Fixed-Point Vector Unit, Floating-Point Vector Unit); Instruction Fetch & Decode Unit; three AGUs; Load Unit A, Load Unit B, Store Unit; Stream Interface]
7+ Ops per cycle VLIW
>> 16
Multi-Precision Support
[Chart: AI Data Types. Y-axis: MACs / Cycle (per core), ticks 8, 16, 32, 64, 128. Bars: 32x32 SPFP, 32x32 Real, 32x16 Real, 16x16 Real, 16x8 Real, 8x8 Real]
[Chart: Signal Processing Data Types. Y-axis: MACs / Cycle (per core), ticks 2, 4, 8, 16. Bars: 32x32 Complex, 32x16 Complex, 16x16 Complex, 16 Complex x 16 Real]
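The per-core MAC rates can be related to the 133 TOPs (INT8) array peak quoted earlier. The sketch below assumes the top-of-axis value of 128 MACs per cycle is the 8x8 rate, that one MAC counts as two operations, and that all 400 cores contribute; the clock frequency it derives is therefore implied by those assumptions, not quoted in the deck.

```python
CORES = 400
MACS_PER_CYCLE_INT8 = 128   # assumed 8x8 rate (top of the chart axis)
OPS_PER_MAC = 2             # one multiply plus one accumulate
PEAK_TOPS_INT8 = 133        # array peak quoted in the deck

# Operations the whole array can retire per clock cycle.
ops_per_cycle = CORES * MACS_PER_CYCLE_INT8 * OPS_PER_MAC

# Clock frequency that would be needed to reach the quoted peak.
implied_clock_ghz = PEAK_TOPS_INT8 * 1e12 / ops_per_cycle / 1e9

print(f"Array ops per cycle: {ops_per_cycle}")
print(f"Implied clock: {implied_clock_ghz:.2f} GHz")
```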
>> 17
AI Engine Memory Hierarchy
DRAM
L2 SRAM
L1 SRAM
Flexible XBAR
Adaptable L1 NOC with DMA
>> 18
AI Engine Memory Hierarchy
DRAM
L2 SRAM
L1 SRAM
Multicast / Broadcast
>> 19
AI Engine Memory Hierarchy
DRAM: > 64 GByte, 102 GB/s
L2 SRAM: 16.3 MByte, 1.6 TB/s
L1 SRAM: 12.5 MByte (128 kByte 4 Core Cluster), 38 TB/s
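One way to read these bandwidth figures is as the data reuse each level demands. The sketch below computes how many INT8 operations must be performed per byte fetched from each level to sustain the 133 TOPs peak. Pairing 102 GB/s with DRAM, 1.6 TB/s with L2, and 38 TB/s with L1 is an assumption of this sketch (the 102 GB/s figure matches a 256b DDR4-3200 interface).

```python
PEAK_OPS_PER_S = 133e12   # 133 TOPs (INT8) array peak

# Assumed pairing of bandwidth figures with hierarchy levels, in bytes/s.
bandwidth_bytes_per_s = {
    "DRAM": 102e9,
    "L2 SRAM": 1.6e12,
    "L1 SRAM": 38e12,
}

# Operations per byte needed so that the level is not the bottleneck.
required_ops_per_byte = {
    level: PEAK_OPS_PER_S / bw for level, bw in bandwidth_bytes_per_s.items()
}

for level, intensity in required_ops_per_byte.items():
    print(f"{level}: {intensity:.1f} ops per byte")
```

Under these assumptions, L1 needs only a few operations per byte, while data coming from DRAM must be reused on the order of a thousand times, which is why the local, shareable memories matter.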
>> 20
AI Engine: Compute Efficiency
Vector Processor Efficiency (Peak Kernel Theoretical Performance)
ML Convolutions: 95% (Block-based Matrix Multiplication, (32×64) × (64×32))
FFT: 80% (1024-pt FFT/iFFT)
DPD: 98% (Volterra-based forward-path DPD)
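To make the efficiency figure concrete for the matrix-multiplication kernel, the sketch below counts the MACs in a (32×64) × (64×32) product and converts them to cycles. It assumes INT8 data at 128 MACs per cycle on a single core and reads efficiency as ideal cycles divided by actual cycles; both are assumptions for illustration.

```python
import math

M, K, N = 32, 64, 32          # (M x K) times (K x N)
MACS_PER_CYCLE = 128          # assumed INT8 rate for one core
EFFICIENCY = 0.95             # quoted vector processor efficiency

total_macs = M * K * N
ideal_cycles = total_macs // MACS_PER_CYCLE
# Cycles at the quoted efficiency, rounded up to a whole cycle.
actual_cycles = math.ceil(ideal_cycles / EFFICIENCY)

print(f"MACs: {total_macs}")
print(f"Ideal cycles: {ideal_cycles}")
print(f"Cycles at 95% efficiency: {actual_cycles}")
```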
˃ Adaptable, non-blocking interconnect
Flexible data movement architecture
Avoids interconnect “bottlenecks”
˃ Adaptable memory hierarchy
Local, distributed, shareable = extreme bandwidth
No cache misses or data replication
Extend to PL memory (BRAM, URAM)
˃ Distributed DMA for overlapping Compute and Comm.
[Figure: timeline with Compute and Comm phases overlapped]
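The benefit of overlapping compute and communication can be sketched with a small timing model. The code below compares a strictly sequential schedule with a double-buffered one, where the DMA fills one buffer while the core works on the other. The two-buffer scheme, the function names, and the block timings are hypothetical choices for illustration, not details from the deck.

```python
def sequential_time(blocks: int, compute: int, comm: int) -> int:
    """Each block is transferred, then computed, with no overlap."""
    return blocks * (compute + comm)


def overlapped_time(blocks: int, compute: int, comm: int) -> int:
    """Double buffering: transfer of block i may overlap compute of block i-1.

    A transfer needs the previous transfer finished and a free buffer,
    i.e. the compute on block i-2 must be done.
    """
    xfer_done = [0] * blocks
    comp_done = [0] * blocks
    for i in range(blocks):
        prev_xfer = xfer_done[i - 1] if i >= 1 else 0
        buffer_free = comp_done[i - 2] if i >= 2 else 0
        xfer_done[i] = max(prev_xfer, buffer_free) + comm
        prev_comp = comp_done[i - 1] if i >= 1 else 0
        comp_done[i] = max(xfer_done[i], prev_comp) + compute
    return comp_done[-1]


# Hypothetical workload: 4 blocks, 10 time units compute, 6 units transfer.
t_seq = sequential_time(4, 10, 6)
t_ovl = overlapped_time(4, 10, 6)
print(f"Sequential: {t_seq} units, overlapped: {t_ovl} units")
```

With compute longer than transfer, only the first transfer remains exposed and the rest is hidden behind computation.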
>> 21
AI Engine: Performance Benchmark
GoogleNet Inference Performance (sub 2ms latency), Images/Second
Alveo U250 xDNN (UltraScale+ series): 4087
Versal AI Core Series*: 29250
ResNet-50 Inference Performance, Images/Second
Alveo U250 xDNN (UltraScale+ series): 2043
Versal AI Core Series*: 11812
*Versal AI Core (VC1902) projected performance
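The relative gains follow directly from the chart values. The sketch below computes the ratios; note that the Versal numbers are projections, as the slide footnote states.

```python
# Images/second from the two charts: (Alveo U250 xDNN, Versal AI Core projected).
results = {
    "GoogleNet": (4087, 29250),
    "ResNet-50": (2043, 11812),
}

speedup = {net: versal / alveo for net, (alveo, versal) in results.items()}

for net, ratio in speedup.items():
    print(f"{net}: {ratio:.2f}x")
```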
>> 22
Accelerating AI Applications on Versal
Heterogeneous Acceleration from Data Center to the Edge
Deterministic Performance & Low Latency
Scalar (Arm Dual-Core Cortex-A72, Arm Dual-Core Cortex-R5): Scalar, Sequential & Complex Compute
Adaptable: Flexible Parallel Compute, Data manipulation
Intelligent (AI Engines): ML & Signal Processing, Vector, Compute Intensive
NETWORK-ON-CHIP: Any-to-Any Connectivity
PL-to-AI Engine: TB/s of Bandwidth
Custom Memory Hierarchy: 463 x 32KB + 967 x 4KB of RAM
I/O
Video + AI
Genomics + AI
Risk Modeling + AI
Database + AI
Network IPS + AI
Storage + AI
>> 23
Accelerating 5G Wireless on Versal
5G Wireless Infrastructure
[Figure: infrastructure signal chain. Labels: Packet Processing and Wired Backhaul; Higher Layer Processing; Switching; Baseband Processing; Beam Forming & MMIO + Some Baseband Transforms; Digital Radio; ADC / DAC; Analogue Radio; Antenna Array]
Digital Radio w/ ADC/DAC
DUC: Digital Up Converter
DPD: Digital Pre-Distortion
CPRI: Common Public Radio Interface
[Figure: digital radio datapath on Versal. Top-level blocks: CPRI, DUC, CFR, DPD, ADC/DAC, DPD Update. Stage labels: Crest Factor Reduction, Shaping, Up-sample, Heterodyne, Digital Pre-distortion. Mapped across Programmable Logic (PL), AI Engine Array (through AIE I/F blocks), and Processor System (PS): APU]
Datapath blocks: four channel chains (Channel Filter, HB1 (2), HB2 (2), LTE20); HB3 (2); Mixing with four DDS; two PC-CFR stages (Peak Detect and Scale Find, Delay); HB4 (2); HB5 (5/4); CABS; 9x9 DPD kernel; DPD Filter 1/4 to 4/4; DPD LUTs; Coefficient to LUT Conversion; Coefficients, Gain; Memory Active/Shadow; Frequency Domain Measurements; Power Spectrum Estimate; DPD Output
Parameters: NC = 89, NHB1 = 23, NHB2 = 11, NHB3 = 43, NHB4 = 27, NHB5 = 41
Sample rates: 30.72 MHz, 30.72 MHz, 61.44 MHz, 122.88 MHz, 491.52 MHz, 614.4 MHz
>> 24
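The sample rates on this slide are consistent with a chain of rate-changing filters. The sketch below assumes the number after each half-band label (HB1 to HB4) is an interpolation factor of 2, that HB5 resamples by 5/4, and that the stages run in the order shown; these are readings of the diagram, not statements from the deck. The intermediate 245.76 MHz rate is derived here and does not appear on the slide.

```python
from fractions import Fraction

# Assumed stage order and rate-change factors.
stages = [
    ("Channel Filter", Fraction(1)),
    ("HB1", Fraction(2)),
    ("HB2", Fraction(2)),
    ("HB3", Fraction(2)),
    ("HB4", Fraction(2)),
    ("HB5", Fraction(5, 4)),
]

input_rate_mhz = Fraction(3072, 100)   # 30.72 MHz LTE20 input

# Output sample rate after each stage, kept exact with Fractions.
rates_mhz = []
rate = input_rate_mhz
for name, factor in stages:
    rate *= factor
    rates_mhz.append((name, rate))

for name, r in rates_mhz:
    print(f"after {name}: {float(r):.2f} MHz")

final_rate_mhz = rates_mhz[-1][1]
```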
Software Programmable
Design, Compile, Run
Frameworks
Libraries: AI Library, Vision Library, 4G/5G/Radar Library
C/C++
AI Engine Compiler
Programming Abstraction Levels (1, 2, 3)
Architecture Overlay
Data Flow w/ Xilinx libraries
Kernel Program: Data Flow w/ user defined libraries
>> 25
Unified Tool Chain for Programming
[Figure: tool flow]
Inputs: Base Platform; Application; Performance & Partitioning Constraints
Xilinx SDK: Eclipse GUI
User-Directed System Partitioner
Domains: PS, PL, AI Engine Array
Compilers: ARM C Compiler, AI Engine Compiler, Vivado (HLS, RTL, IP)
System-C Virtual Simulation Platform: QEMU, Core ISS
System-Level Performance Analysis (using core profiling)
System-Level Debugger (using core debugger)
Outputs: Binaries & Bitstream
Targets: SDK, Versal Device
>> 26
Summary
Versal is the first generation of ACAP device
– ACAP is a new class of device from Xilinx
Versal employs adaptable heterogeneous system architecture
– New SW programmable AI Engine for diverse compute acceleration workloads
– New High-bandwidth Network-on-Chip integrated w/ hardened DDR Subsystem
– Processor System in all Versal devices
– Re-architected Programmable Logic
Xilinx first 7nm device: Versal AI Core VC1902
– 133TOPs AI Engines, 12TOPs DSP Engines, and 900K LUTs
– 256b DDR4/LPDDR4, PCIe Gen4 & CCIX up to 25Gbps
– For more details, refer to: www.xilinx.com/versal
>> 27
Adaptable. Intelligent.