DevDay 2016: Artur Speth - DevOps - Microsoft Developer Division's Path into the Next Agile Era

Post on 15-Jan-2017

Category:

Technology

TRANSCRIPT

4,307

467

Spread out across 35 feature teams

Production

Development

Backlog

Requirements

Visual Studio & TFS Update 1

Visual Studio & TFS Update 2

Visual Studio & TFS Update n

VS Team Services

Code Test & Stabilize Code Test & Stabilize

Beta RTM

2 years

Planning

Customer feedback: we should change the way a feature works. We didn’t get it quite right…

… but we’re booked solid already.

2 years

S1 S2 S3 S4 S5 Stabilization S6

A

B

S7 S8

2 years

3 weeks

https://flic.kr/p/arXUyP

Alignment

Autonomy

“Let’s try to give our teams three things…. Autonomy, Mastery, Purpose”

Scenarios

Features

Stories

Tasks

Planning horizons: Sprint (3 weeks), Plan (3 sprints), Season (6 months, Spring and Fall), Scenario (18 months)

Aspirational: 60%

Planning horizons: Sprint (3 weeks), Plan (3 sprints), Season (6 months, Spring and Fall), Scenario (18 months)

Hopeful: 80%

What epics are we lighting up?

Planning horizons: Sprint (3 weeks), Plan (3 sprints), Season (6 months, Spring and Fall), Scenario (18 months)

Thoughtful: 90%

What features are planned?

Planning horizons: Sprint (3 weeks), Plan (3 sprints), Season (6 months, Spring and Fall), Scenario (18 months)

Confident: 95%

What stories have we completed? What features are shipping?

Sprint 97, Sprint 98, Sprint 99 (each: Week 1, Week 2, Week 3)

The sprint plan

What we accomplished

• Updates were large

• Months apart

• Lots of problems!

Timeline, 4/1/2010 to 4/23/2012:

5/3/2010: TFS 2010 RTM

4/23/2011: Service Deployment

8/5/2011: Service Update

9/26/2011: //BUILD 2011

12/7/2011: Service Update

1/30/2012: Service Update

2/20/2012: Service Update

3/12/2012: Service Update

4/2/2012: Service Update

Program Management Development Testing

Operations

Program Management Engineering

Operations

Engineering

Program Management Engineering

Sprint 97, Sprint 98, Sprint 99 (each: Week 1, Week 2, Week 3)

Deployment

Sprint Planning

Done

ONE

Code Test & Stabilize Code Test & Stabilize

Beta RTM

Planning

Code Complete

ON / OFF (pair repeated six times)

VSO SU1: Chicago

VSO SU0: San Antonio

VSO SU4: Amsterdam

Shared Platform Services: San Antonio

Existing experience baseline: 36% conversion to project

50% to 100% customers: conversion to project (+18%)

There’s no place like production!

Telemetry everywhere

Customer Intelligence

Business Intelligence

Operational Intelligence

Dashboard DevOps Debug Experiments

Getting the availability model right

[Chart: Sept 25th 2013 LSI, 9.25.13 2:24 PM to 10:48 PM. Series: FailedExecutionCount, SlowExecutionCount, Start, End, Availability (ID4 - Activity Only), Availability (Current). Axis ranges: 0.8 to 1 and -200 to 1600.]
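The chart legend above plots FailedExecutionCount and SlowExecutionCount next to two availability series. The slides do not give the formula, so the following is only a minimal sketch of one plausible activity-based model: availability as the share of executions in a time window that neither failed nor ran slow. The `total` field, the `Window` type, and the formula itself are assumptions for illustration, not taken from the deck.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Counters for one time window (field names are illustrative)."""
    total: int   # assumed: all executions observed in the window
    failed: int  # corresponds to FailedExecutionCount in the chart legend
    slow: int    # corresponds to SlowExecutionCount in the chart legend


def availability(w: Window) -> float:
    """Assumed model: fraction of executions that were neither failed nor slow."""
    if w.total == 0:
        return 1.0  # no activity in the window, so nothing was unavailable
    bad = min(w.failed + w.slow, w.total)  # guard against inconsistent counters
    return (w.total - bad) / w.total


healthy = availability(Window(total=1000, failed=2, slow=8))
incident = availability(Window(total=1000, failed=50, slow=150))
```

Under this assumed model, slow executions lower availability just as failed ones do, so a window in which the service answers slowly still shows up as degraded.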

Alerting is key to fast detection

Every alert must be actionable and represent a real issue with the system.

Alerts should create a sense of urgency; false alerts dilute that

Redundant alerts for the same issue

Needed to set the right thresholds and tune them often

Stateless alerts contributed to further noise
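The points above about redundant and stateless alerts can be illustrated with a small sketch. It is not the team's alerting system: `StatefulAlerter`, the alert key, the threshold, and the sample values are all hypothetical. The idea shown is that a stateful alerter remembers whether an alert is already firing and reports only state changes, while a stateless check fires on every bad sample.

```python
from typing import Dict, List, Optional


class StatefulAlerter:
    """Reports one alert per incident instead of one per bad sample."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold         # hypothetical threshold, needs tuning
        self.firing: Dict[str, bool] = {}  # alert key -> currently firing?

    def observe(self, key: str, value: float) -> Optional[str]:
        """Return "raise" or "clear" on a state change, otherwise None."""
        breached = value > self.threshold
        was_firing = self.firing.get(key, False)
        self.firing[key] = breached
        if breached and not was_firing:
            return "raise"
        if was_firing and not breached:
            return "clear"
        return None


def stateless_alert_count(samples: List[float], threshold: float) -> int:
    """A stateless check fires once for every sample over the threshold."""
    return sum(1 for v in samples if v > threshold)


# Made-up metric samples for a single alert key.
samples = [0.5, 0.9, 0.95, 0.92, 0.4, 0.3, 0.99]
alerter = StatefulAlerter(threshold=0.8)
events = [alerter.observe("memory", v) for v in samples]
raised = events.count("raise")
```

On this sample data the stateless check would fire four times, while the stateful alerter raises twice, once per distinct breach.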

Health model in action

• 3 errors for memory and performance

• All 3 related to the same code defect

• APM component mapped to feature team

• Auto-dialer engaged Global DRI

Eliminated alert noise: from ~928 alerts per week to ~22, and reduced DRI escalations by ~56%

Live Site Issues (LSIs)

[Charts: Time to Mitigate, Time to Detect; y-axis: % of Incidents]

Service Availability & Health Metrics

[Charts: Error By Source; Incidents by Severity; User Impact Minutes During Incidents [TFS Only]. Axis labels: Incident Count, % of Incidents, User Minutes. Callout markers: 1 to 4.]

1. TFS availability is on an improving trend. No Sev0/Sev1 LSIs for July.

2. App Insights switched from synthetic availability to real-user experience in Ibiza portal. A high volume of SEV-2 LSIs (72) contributed to customer impact, in addition to intermittent UX errors. (UX fixes applied on 8/11 improve availability.)

3. App Insights was impacted by 3 long-running LSIs related to ES maintenance, Ibiza updates and an Azure Storage outage.

4. TFS Service attainment (SLO) improved significantly MoM, with a focus on minimizing failed/slow commands and reviewing them in weekly LiveSite reviews.

Service status

© 2015 Microsoft Corporation. All rights reserved.
