On the second day of QCon, we had the following tracks:



Software Architecture Improvements

Applied Data Science and Machine Learning

Java Innovations - The Latest Trends in Java Technology

Taming Mobile

The Evolving Cloud

Solutions Track


You can view the details of the tracks and download the slides at . I mainly focused on the Applied Data Science and Machine Learning and The Evolving Cloud tracks.




Engineering Velocity: Shifting the Curve at Netflix by Dianne Marsh, Netflix

See also last year's presentation.


Build a Blameless culture

Freedom & Responsibility Culture

A story of bringing Python into Netflix: employees are free to bring new techniques and new tools into the company. To be a good citizen at Netflix, when an engineer introduced Python, he also needed to build the same level of tooling support for it.


Manager's Role in Netflix:

  • Context, not control
  • Loosely coupled, rightly aligned
  • Attract and retain great talent!


Developers deploy their own code (microservices).





The developer who makes the change is the right person to decide when to deploy and how to fix any issues.


Shifting the curve with culture at Netflix:

  • Free the People
  • Context not control
  • Freedom to experiment
  • Blameless Culture

Shifting the culture at Netflix:

  • Value self-service
  • Test everywhere
  • Recover quickly and easily
  • Best practice represented in tooling
  • Be cloud native
  • Respect the journey


Techniques used at Netflix

Netflix build language


Jenkins Job DSL:

  • Configuration as code
  • Groovy script
  • Scripts go into version control



Create AMI from base AMI

Image contains server and everything needed to run it

Builds the unit of deployment for test and prod

Abstracts Cloud Details



Deploys Netflix to the Cloud

Red/Black push

Developed to address deploys and rollbacks



Red/Black Push

  • Scale up new instances running the new version (the old version keeps serving)
  • Cloud Native
  • Turn on traffic to new ASG
  • Canary Testing
  • Turn off traffic to old ASG
  • Wait … analyze … roll back? (the old ASG is kept around because provisioning new instances takes time; see the sketch below)
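A minimal sketch of the steps above, assuming a toy ASG class and a canary_ok callback that stand in for real cloud and deployment APIs (Netflix uses its own tooling; this is illustrative only):

    from dataclasses import dataclass

    @dataclass
    class ASG:
        """Toy stand-in for an auto scaling group; not a real cloud API."""
        name: str
        version: str
        size: int
        serving: bool = False

    def red_black_push(old_asg, new_version, canary_ok):
        # 1. Scale up a new ASG running the new version; the old one keeps serving.
        new_asg = ASG(name=f"{old_asg.name}-{new_version}",
                      version=new_version, size=old_asg.size)
        # 2. Turn on traffic to the new ASG.
        new_asg.serving = True
        # 3. Canary-test the new group against the old baseline.
        if not canary_ok(new_asg, old_asg):
            new_asg.serving = False      # roll back; the old ASG never stopped serving
            return old_asg
        # 4. Turn off traffic to the old ASG but keep it around for a while,
        #    since provisioning fresh instances takes time if we must roll back.
        old_asg.serving = False
        return new_asg

    # Example: promote v2 only if the canary check passes.
    current = ASG(name="api", version="v1", size=10, serving=True)
    current = red_black_push(current, "v2", canary_ok=lambda new, old: True)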


Canary testing

A hot topic this afternoon.


Continuous Delivery Workflow

  • Support the journey
  • Judges between Stages
  • Represent best practices


Balance between Regional Isolation and Multi-Region Consistency

Limit Impact of Human Error


Netflix Chaos Monkey

Kills running instances

  • Simulates failures inherent to running in the cloud
  • Runs in production
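The idea is simple enough to sketch. Below is a hypothetical Python illustration of the concept (not Netflix's actual Chaos Monkey, which terminates real instances through AWS): randomly kill members of a fleet and let the service prove it keeps working.

    import random

    def chaos_monkey(instances, kill_probability=0.2, rng=random):
        """Toy Chaos Monkey: randomly 'terminates' instances from an in-memory
        list to simulate the failures inherent to running in the cloud."""
        survivors = []
        for instance in instances:
            if rng.random() < kill_probability:
                print(f"Chaos Monkey terminated {instance}")
            else:
                survivors.append(instance)
        return survivors

    # A resilient service should keep serving with whatever is left.
    fleet = ["i-001", "i-002", "i-003", "i-004"]
    fleet = chaos_monkey(fleet)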



Structure, Personalization, and Scale: A Deep Dive into LinkedIn Search


Daniel and Asif from LinkedIn talked about LinkedIn's new search platform. I think it makes sense for us as well, since we also need a search platform for our apps, documents, and messages.


Overview of LinkedIn search

  • Search helps members find and be found
  • Search covers people, jobs, and posts
  • Search is especially important for recruiters


What is unique about LinkedIn search

  • Personalized
  • Part of a larger product experience
  • Task-centric:
    • Find a job, hire top talent, find a person, …


What are our system challenges?

  • 2004: no real search; browse through your network and filter
  • Lucene (single shard)
  • 2006: Zoie real-time search (search without commits/shutdowns); a two-buffer solution handles live updates
  • A content store was also introduced, aggregating many data sources
  • 2008: Sharded search; 30M members
  • 2009: Bobo faceted search (open sourced)
  • 2010: SenseiDB (cluster management, a new query language, wrapping the existing pieces); everything above open sourced
  • 2011: Cleo (instant typeahead results), which does not use SenseiDB
  • 2013: Too many stacks
  • 2014: New search platform Galene introduced, aiming to solve all of these issues


Techniques used

Search quality pipeline

Spelling check => query tagging => vertical intent => query expansion
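As a rough sketch of what such a pipeline might look like, here is a toy Python version with made-up correction, tag, and synonym tables (LinkedIn's real components are of course far more sophisticated):

    def spell_check(query):
        corrections = {"sofware": "software", "engneer": "engineer"}  # made-up table
        return " ".join(corrections.get(t, t) for t in query.lower().split())

    def tag_query(query):
        titles = {"engineer", "manager", "recruiter"}                 # made-up tag set
        return {t: ("TITLE" if t in titles else "TERM") for t in query.split()}

    def vertical_intent(tags):
        # Guess which vertical (people, jobs, posts) the query targets.
        return "people" if "TITLE" in tags.values() else "posts"

    def expand_query(tags):
        synonyms = {"engineer": ["developer", "programmer"]}          # made-up synonyms
        return list(tags) + [s for t in tags for s in synonyms.get(t, [])]

    tags = tag_query(spell_check("sofware engneer"))
    print(vertical_intent(tags), expand_query(tags))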

Regular full index builds using Hadoop

Improved relevance

Partial live update support

  • Don't re-index the whole document if only part of it changed
  • Live update buffers
  • Snapshot index in the middle
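A minimal sketch of the idea, assuming a plain dictionary "index" (illustrative only; Zoie/Galene internals are very different): keep a static snapshot plus a small live buffer, and merge them at query time so a partial change never forces a full re-index.

    class LiveIndex:
        def __init__(self, snapshot):
            self.snapshot = dict(snapshot)   # built offline (e.g. by Hadoop)
            self.live = {}                   # recent partial updates

        def update(self, doc_id, **fields):
            # Only the changed fields go into the live buffer.
            self.live.setdefault(doc_id, {}).update(fields)

        def get(self, doc_id):
            # Query time merges the snapshot with any live updates.
            merged = dict(self.snapshot.get(doc_id, {}))
            merged.update(self.live.get(doc_id, {}))
            return merged

    index = LiveIndex({"m1": {"name": "Alice", "title": "Engineer"}})
    index.update("m1", title="Senior Engineer")   # no full re-index needed
    print(index.get("m1"))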


Early termination

  • Static global ranking
  • Consider both connection (network) results and global ranking
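A minimal sketch of early termination with a static global rank, assuming the posting list is already sorted by that rank (illustrative only; Galene's actual scoring also mixes in the connection results mentioned above):

    def top_k_with_early_termination(postings, matches, k=10):
        """postings: (doc_id, static_rank) pairs sorted by static_rank, descending."""
        results = []
        for doc_id, static_rank in postings:
            if matches(doc_id):
                results.append((doc_id, static_rank))
                if len(results) == k:
                    break   # remaining documents can only have a lower static rank
        return results

    # Example: keep the first two matching documents and stop scanning.
    postings = [("d1", 0.9), ("d2", 0.8), ("d3", 0.7), ("d4", 0.6)]
    print(top_k_with_early_termination(postings, matches=lambda d: d != "d2", k=2))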


Indexing and searching across graph entities/attributes

Single engine, single stack


Search as a service

Improved relevance support

Ranking is highly personalized

Exploring the economic graph


Machine Learning at Netflix Scale

See also last year's talk.


Aish Fenton from Research Engineering talked about machine learning at Netflix. He explained Netflix's data mining pipeline with its three timing requirements, gave a simple tutorial on the SVD algorithm, and described many interesting areas in machine learning. I suggest you watch it online.


Everything is a recommendation at Netflix

The homepage is personalized for you

75% of plays come from the homepage


Focus on ranking more than on rating prediction (people only click the top-ranked items)


Netflix Prize: $1 million

Awarded to whoever could build high-quality recommendations that beat the running system by 10%

Measured as accuracy of the predicted ratings


Similarity-based recommendation

Based on what the system has learned about you

SVD and RBM are commonly used algorithms
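As a toy illustration of the matrix-factorization idea behind SVD-style recommenders (a dense numpy SVD on a tiny made-up ratings matrix; real systems factor huge sparse matrices with ALS/SGD and treat missing ratings properly):

    import numpy as np

    # Toy ratings matrix (users x titles), 0 = unrated.
    ratings = np.array([[5, 4, 0, 1],
                        [4, 5, 1, 0],
                        [1, 0, 5, 4],
                        [0, 1, 4, 5]], dtype=float)

    # Factor the matrix and keep only the top-k latent dimensions.
    U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
    k = 2
    approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    # Predicted scores for the titles this user has not rated yet.
    user = 0
    unseen = ratings[user] == 0
    print("predicted scores for unseen titles:", approx[user][unseen])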


Netflix has a data mining framework combining different data pipelines. There are three pipelines/machine-learning phases: real-time, near real-time, and offline.

Data flows between them using different tools, such as Storm, Kafka, S3, Hadoop, etc.


Nvidia cards make the computation easier; a single machine can do the SVD computation.



Computational Patterns of the Cloud by Ines Sombra

Lead Data Engineer at Engine Yard



This talk was very interesting. Ines shared many lessons learned at Engine Yard, a very popular cloud provider. I suggest you watch it in detail if you work in the services area. The slides can be found online.


Your apps are a collection of services, connected via APIs.


"Pets vs. Cattle" is a very interesting blog post.


Cloud resources are reused:

Provision => consume => maintain => release


Our experiences + a few stories

 Everything fails

Excel at Process:

Anticipate failure and plan for it

  • Use postmortem checklists, retrospectives, and playbooks
  • Take them seriously
  • Also know that state can bite you

Everything is a recipe

Infrastructure is maintained as code

Resources are used to increase the availability and redundancy of applications


Importance of monitoring & benchmarking

  • Know your baseline
  • Alerting & monitoring are critical
  • Benchmarking is still misunderstood


What does healthy mean?

  • Many elements go into determining health
  • Our current visualizations fail us; we need better ones
  • Complexity is complexity


Resource familiarity: match resources to use cases

  • Some awareness is needed
  • Cloud resources are different from hardware-based ones
  • Capacity planning is tricky

Think about resources in fluid terms: consume & release; harvest services


App design for the cloud

  1. Surrender the file system
  2. Your app becomes an aggregate of services
  3. Consume services via APIs
  • Decoupling services allows the provisioning API to work with other providers (e.g. Azure) easily
  • It forces your API to mature


Operational experience can become siloed

  • Teams build around services
  • Knowledge boundaries
  • Geographical distribution

For example, only the team that built billing knows how money flows into the company.


Service dependencies and failure planning

  • It is difficult to assess the health of apps that consume services
  • Each service is its own (smaller) failure domain


Importance of API design and maintenance

  • API design is a core business competency
  • Prioritize maintenance
  • Retirement will happen



Distributed systems are the new norm. Distributed systems and the end of the API.


Availability & coordination

  • Failures & Latency
  • Degradation
  • Dependencies


What does it mean to be "up"?

  • Distributed systems will break in interesting and painful ways
  • Debugging the "it's slow" problem sucks
  • Distributed tracing is key


Continuous Everything

  • Tests are critical:
    • Are they fast? Can we trust them?
    • What type of tests?
    • What coverage do we have?


  • Scenario-based testing: are the databases …
    • Backing up?
    • Setting up replication?
    • Promoting replicas?


  • Many testing choices and random notes:
    • Testing is critical, and frameworks should help by streamlining choices
    • Automation helps with the process & certification
  • All planning tools suck:
    • Stop thinking a tool will fix your agile problem
  • Good CD just ships flaws faster when the direction is wrong. Is your vision clear?



Canary Analyze All The Things: How we learned to Keep Calm and Release Often


This is related to our testing-in-production principle. I suggest you watch it online for more details.



You need better testing?

Well, yeah.

"I am going to push to production, though I'm pretty sure it is going to kill the system."


Trade-off: rate of change vs. availability


Then you need better deployment!

Canary analysis is a deployment process, but it is not a replacement for any sort of software testing:

  • It is not A/B testing
  • It is not pushing 100% to production and hoping it works


Are we there yet?

  • We are not
  • You are probably not either


How you start:

  • Observability
  • Partial traffic routing
  • Decision-making


Focus on the goal:

  • Current baseline matters
  • Absolute numbers are relatively unimportant
  • Relative numbers matter:
    • Error rate
    • RPS per CPU cycle
    • Focus on system and application metrics

Categorize metrics

Weight the categories

  • Outliers are out, lying
  • Compare a group of canary machines with a group of baseline machines (see the sketch after this list)
    • Exercise without warm-up can result in injury
  • Repeat canary analysis frequently
    • Let application owners choose whether it …
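A minimal sketch of the kind of relative comparison described in this list, with invented metric names and a single tolerance threshold (Netflix's ACA classifies and weights metric categories and handles noise and outliers far more carefully):

    def canary_failures(canary_metrics, baseline_metrics, tolerance=0.10):
        """Flag metrics where the canary group degrades more than `tolerance`
        relative to the baseline group (assumes higher is worse for every
        metric); absolute values are ignored."""
        failures = []
        for name, baseline in baseline_metrics.items():
            relative_change = (canary_metrics[name] - baseline) / baseline
            if relative_change > tolerance:
                failures.append((name, round(relative_change, 3)))
        return failures

    # Invented numbers: error rate and latency regressed by more than 10%.
    baseline = {"error_rate": 0.010, "latency_ms": 120.0, "cpu_per_rps": 0.80}
    canary   = {"error_rate": 0.012, "latency_ms": 150.0, "cpu_per_rps": 0.82}
    print(canary_failures(canary, baseline))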


Cloud considerations

  • Software-defined everything
  • Incremental pricing

Bad news:

  • Capacity management
  • Unpredictable inconsistency
  • Some servers were extremely slow during peak times; it turned out to be a noisy-neighbor problem, and Netflix is its own bad neighbor


The road to get there

  • 753 services in production
  • In-house telemetry platform
  • A few metrics
  • Been there, done that manually
  • The more you add, the harder it is to make a decision:
    • Context vs. precision
    • No repeatability
    • Trending
    • Manual effort is manual

Automated canary analysis

700 canary runs per day.

Avoid deploying during the weekend: customer usage is heavy during the weekend, and developers don't work during the weekend.


Our next tricks for the configuration system:

  • A configuration GUI makes it simple
  • Deployment system integration
  • ACA all the things
  • Open Connect firmware upgrades



Canary analysis will change how you release.

Most people can start doing it now.



Weathering the Data Storm

This talk is very interesting for people in the machine learning and ads areas. She explained in detail how her team automatically generates ~5000 models for ad campaigns, and how her clustering algorithm performs 5 to 10 times better than random selection.