On the second day of QCon, we have the following tracks:

 


Software Architecture Improvements

Applied Data Science and Machine Learning

Java Innovations - The Latest Trends in Java Technology

Taming Mobile

The Evolving Cloud

Solutions Track

 

You can view the details of the tracks and download slides at https://qconnewyork.com/schedule-2014.  I mainly focused on the Machine Learning and The Evolving Cloud tracks.

 

 

Keynote

Engineering Velocity: Shifting the Curve at Netflix by Dianne Marsh, Netflix

See http://www.infoq.com/presentations/netflix-continuous-delivery for last year's presentation.

 

Build a Blameless culture

Freedom & Responsibility Culture

A story of bringing Python into Netflix.  Employees are free to bring new techniques and new tools into the company. To be a good Netflix citizen, when an engineer introduced Python to the company, he also needed to build the same tooling support around it.

 

Manager's Role in Netflix:

  • Context, not control
  • Loosely coupled, rightly aligned
  • Attract and retain great talent!

 

Developers deploy their own code (micro-services)

Rapid:

  • Innovation
  • Detection
  • Response

The developer who makes the change is the right person to decide when to deploy and how to fix any issue

 

Shifting the curve with culture at Netflix:

  • Free the People
  • Context not control
  • Freedom to experiment
  • Blameless Culture

Shifting the culture at Netflix

  • Value self-service
  • Test everywhere
  • Recover quickly and easily
  • Best practice represented in tooling
  • Be cloud native
  • Respect the journey

 

Technique used in Netflix

Netflix build language

 

Jenkins Job DSL:

  • Configuration as code
  • Groovy script
  • Scripts go into version control

 

Aminator

Creates an AMI from a base AMI

The image contains the server and everything needed to run it

Builds the unit of deployment for test and prod

Abstracts cloud details

 

Asgard

Deploys Netflix to the Cloud

Red/Black push

Developed to address deploys and rollbacks

 

Strategies

Red/Black Push

  • Scale up new instances running the new version alongside the old
  • Cloud native
  • Turn on traffic to the new ASG
  • Canary testing
  • Turn off traffic to the old ASG
  • Wait … analyze … roll back? (keep the old ASG around, because provisioning instances takes time)
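The red/black sequence above can be sketched as a small simulation. This is a hypothetical mock (`MockAsg` and `red_black_push` are invented names, not the real Asgard API, which drives the actual EC2/ELB services):

```python
# Hypothetical mock of a red/black push; MockAsg stands in for an
# AWS Auto Scaling group.

class MockAsg:
    def __init__(self, name, version, instances=3):
        self.name = name
        self.version = version
        self.instances = instances
        self.taking_traffic = False

def red_black_push(old_asg, new_version, canary_ok):
    """Bring the new version up next to the old one, shift traffic over,
    and keep the old ASG around (disabled) for fast rollback."""
    new_asg = MockAsg(old_asg.name + "-new", new_version,
                      instances=old_asg.instances)  # scale up alongside old
    new_asg.taking_traffic = True                   # turn on traffic to new ASG
    if not canary_ok(new_asg):                      # canary analysis
        new_asg.taking_traffic = False              # roll back: old ASG still live
        return old_asg
    old_asg.taking_traffic = False                  # turn off traffic to old ASG
    return new_asg                                  # old ASG lingers for a while

active = red_black_push(MockAsg("api", "1.0"), "1.1", canary_ok=lambda asg: True)
print(active.version)  # → 1.1
```

Because the old ASG is only disabled, not destroyed, rollback is just flipping traffic back, with no waiting for new instances to provision.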

 

Canary testing

A hot topic this afternoon

 

Continuous Delivery Workflow

  • Support the journey
  • Judges between Stages
  • Represent best practices

http://techblog.netflix.com/2013/09/glisten-groovy-way-to-use-amazons.html

 

Balance between Regional Isolation and Multi-Region Consistency

Limit Impact of Human Error

 

Netflix Chaos Monkey

Kills running instances

  • Simulates failures inherent to running in the cloud
  • Runs in production
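Conceptually, Chaos Monkey just terminates a random running instance. A toy sketch (the fleet dicts and `chaos_monkey` helper are invented for illustration; the real Chaos Monkey terminates instances via AWS APIs):

```python
import random

# Toy sketch: pick one running instance at random and "terminate" it,
# simulating the failures that are inherent to running in the cloud.

def chaos_monkey(instances, rng=random):
    running = [i for i in instances if i["state"] == "running"]
    if not running:
        return None
    victim = rng.choice(running)
    victim["state"] = "terminated"
    return victim

fleet = [{"id": f"i-{n}", "state": "running"} for n in range(4)]
victim = chaos_monkey(fleet)
print(victim["state"])  # → terminated
```

Running this continuously in production forces every service to tolerate instance loss as a matter of course.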

 

 

Structure, Personalization and Scale: A Deep Dive into LinkedIn Search

 

Daniel and Asif from LinkedIn talked about LinkedIn's new search platform.  I think it makes sense for us as well, since we also need a search platform for our apps, documents, and messages.

 

The overview of LinkedIn search

  • Search helps members find and be found
  • Search for people, jobs, posts
  • Search is especially important for recruiters

 

What is unique about LinkedIn search

  • Personalized
  • Part of a larger product experience
  • Task-centric:
    • Find a job, hire top talent, find a person, …

 

What are our system challenges?

  • 2004: no search; browse your network and filter
  • Lucene (single shard)
  • 2006: Zoie, real-time search (search without commits/shutdowns); a two-buffer solution while updates happen
  • A content store was also introduced, into which many data sources are aggregated
  • 2008: sharded search: 30M members
  • 2009: Bobo, faceted search (open sourced)
  • 2010: SenseiDB (cluster management, a new query language, wrapping the existing pieces); everything above open sourced
  • 2011: Cleo (instant type-ahead results); does not use SenseiDB
  • 2013: too many stacks
  • 2014: new search platform Galene introduced, aiming to solve all of these issues

 

Techniques used

Search quality pipeline

Spelling check => query tagging => vertical intent => query expansion
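The four stages compose into a simple pipeline. A toy sketch with invented stand-ins for each stage (LinkedIn's real implementations are far richer; only the composition is the point here):

```python
# Invented stand-ins for the four search-quality stages.

def spell_check(q):
    return q.replace("enginer", "engineer")          # toy correction table

def tag_query(q):
    # Label each token (hypothetical tagging scheme).
    return [(tok, "TITLE" if tok in {"engineer", "manager"} else "TERM")
            for tok in q.split()]

def vertical_intent(tagged):
    # Guess which vertical (people / jobs / posts) the query targets.
    vertical = "people" if any(t == "TITLE" for _, t in tagged) else "posts"
    return vertical, tagged

def expand_query(vertical, tagged):
    synonyms = {"engineer": ["developer"]}           # toy synonym table
    terms = [tok for tok, _ in tagged]
    for tok, _ in tagged:
        terms.extend(synonyms.get(tok, []))
    return {"vertical": vertical, "terms": terms}

def search_quality_pipeline(q):
    vertical, tagged = vertical_intent(tag_query(spell_check(q)))
    return expand_query(vertical, tagged)

print(search_quality_pipeline("software enginer"))
```

Each stage only needs to agree on the shape of its input and output, which is what lets the stages evolve independently.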

Regular full index builds using Hadoop

Improved relevance

Partial live update support

  • Don't re-index the whole document if only part of it changed
    • Live update buffers
    • Snapshot index in the middle

 

Early termination

  • Static global ranking
  • Consider connection results together with global ranking
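Early termination over a statically ranked index can be sketched as follows: since the index is pre-sorted by global rank, scanning can stop once k matches are found (function and data names here are illustrative, not Galene's actual interfaces):

```python
def search_with_early_termination(index, matches, k):
    """index: document ids pre-sorted by static global rank (best first).
    Scan in rank order and stop as soon as k matches are found -- nothing
    later in the list can outrank what we already have."""
    results, scanned = [], 0
    for doc in index:
        scanned += 1
        if matches(doc):
            results.append(doc)
            if len(results) == k:
                break                 # early termination
    return results, scanned

# 100 docs already in static-rank order; every doc matches the query,
# so only the first 10 are ever examined.
index = [f"doc{r}" for r in range(100)]
hits, scanned = search_with_early_termination(index, lambda d: True, k=10)
print(scanned)  # → 10
```

The savings grow with index size: most of the posting list is never touched for popular queries.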

 

Indexing and searching across graph entities/attributes

Single engine, single stack

 

Search as a service

Improved relevance support

Ranking is highly personalized

Exploring the economic graph

 

Machine Learning at Netflix Scale

See last year's talk at http://www.infoq.com/presentations/machine-learning-netflix

 

Aish Fenton from Research Engineering talked about machine learning at Netflix. He explained Netflix's data mining pipeline with its three timing requirements, gave a simple tutorial on the SVD algorithm, and described many interesting areas in machine learning.  I suggest you watch it online.

 

Everything is a recommendation at Netflix

The homepage is personalized for you

75% of plays come from the homepage

 

Focus on ranking more than rating prediction (people only click the top-ranked items)

 

Netflix Prize: $1 Million

For whoever could build high-quality recommendations that beat the running system by 10%

Measured by accuracy in predicted ratings

 

Similarity based recommendation

Based on what you watched

SVD and RBM are commonly used algorithms
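As a toy illustration of the SVD idea (not Netflix's actual model, and the ratings data below is invented), a low-rank approximation of a small ratings matrix fills in unrated cells:

```python
import numpy as np

# Toy user x movie ratings matrix (0 = not yet rated).  A low-rank SVD
# reconstruction fills in the gaps -- the same matrix-factorization idea,
# at toy scale, that the Netflix Prize popularized.
R = np.array([[5.0, 4.0, 0.0, 1.0],
              [4.0, 5.0, 1.0, 0.0],
              [1.0, 0.0, 5.0, 4.0],
              [0.0, 1.0, 4.0, 5.0]])

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2                                         # keep the top-2 latent factors
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]    # rank-2 approximation

# Predicted interest of user 0 in the movie they haven't rated (column 2)
# comes out much lower than for the movie they rated 5 (column 0).
print(R_hat[0, 2] < R_hat[0, 0])
```

The two latent factors end up roughly corresponding to the two taste clusters visible in the matrix, which is why the reconstruction generalizes to the unrated cells.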

 

Netflix has a data mining framework mixing different data pipelines. They have three pipelines/machine learning phases: real-time, near real-time, and offline.

Data flows through different tools, such as Storm, Kafka, S3, Hadoop, etc.

 

Nvidia cards make the computation easier; a single machine can do the SVD computation.

 

 

Computational Patterns of the Cloud by Ines Sombra

Lead Data Engineer at Engine Yard

 

 

This talk was very interesting; Ines shared many lessons learned at Engine Yard, a very popular cloud provider.  I suggest you watch it in detail if you work in the services area. The slides can be found at https://speakerdeck.com/randommood/computational-patterns-of-the-cloud-qcon-nyc-2014

 

Your apps are collections of services, connected via APIs

 

Pets vs. Cattle is a very interesting blog post

 

Cloud resources are reused:

Provision => consume => maintain => release

 

Our experiences + a few stories

 Everything fails

Excel at Process:

Anticipate failure and plan for it

  • Use postmortem checklists, retrospectives, and playbooks
  • Take them seriously
  • Also know that state can bite you

 Everything is a recipe

Infrastructure is maintained as code

Resources are used to increase the availability and redundancy of applications

 

Importance of monitoring & benchmarking

  • Know your baseline
  • Alerting & monitoring are critical
  • Benchmarking is still misunderstood

 

What does healthy mean?

  • Many elements go into determining health
  • These visualizations fail us; we need better ones
  • Complexity is complexity

 

Resource families should match use cases

  • Some awareness is needed
  • Cloud resources are different from hardware-based ones
  • Capacity planning is tricky

   Think about resources in fluid terms: compute & release. Harvesting services.

 

App design for the cloud

  1. Surrender the file system
  2. Your app becomes an aggregate of services
  3. Consume services via APIs
  • De-coupled services allow the provisioning API to work with Azure easily
  • Force your API maturity

 

Operational experiences can become siloed

  • Teams build around services
  • Knowledge boundaries
  • Geographical distribution

  The team who builds billing only knows how money flows into the company

 

Service dependencies and failure planning

  • Difficult to assess the health of apps that consume services
  • Each service is its own (smaller) failure domain

 

Importance of API design and maintenance

  • API design is a core business competency
  • Prioritize maintenance
  • Retirement will happen

 

 

Distributed systems as the new norm. Distributed systems and the end of the API.

 

Availability & coordination

  • Failures & Latency
  • Degradation
  • Dependencies

 

What does it mean to be "up"?

  • Distributed systems will break in interesting and painful ways
  • Debugging the "it's slow" problem sucks
  • Distributed tracing is the key

 

Continuous Everything

  • Tests are critical: 

Are they fast? Can we trust them?

What type of tests?

What coverage do we have?

 

  • Scenario based testing: are databases …
    • Installing?
    • Backing up?
    • Restoring?
    • Setting up replication?
    • Promoting replicas?
    • Writable?
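A scenario-based test harness along these lines might walk a database through that lifecycle. `FakeDB` here is an invented in-memory stand-in; a real framework would drive an actual install/backup/restore cycle against a live database:

```python
# Hypothetical scenario checks against a fake in-memory "database".

class FakeDB:
    def __init__(self):
        self.data = {}
        self.backup_blob = None
        self.installed = False
    def install(self):
        self.installed = True
    def write(self, key, value):
        if not self.installed:
            raise RuntimeError("not installed")
        self.data[key] = value
    def backup(self):
        self.backup_blob = dict(self.data)
    def restore(self):
        self.data = dict(self.backup_blob)

def run_scenarios(db):
    """Walk the database through the scenarios above and report results."""
    results = {}
    db.install()
    results["installing"] = db.installed
    db.write("k", "v")
    results["writable"] = db.data["k"] == "v"
    db.backup()
    results["backing_up"] = db.backup_blob is not None
    db.write("k", "junk")          # corrupt, then restore from backup
    db.restore()
    results["restoring"] = db.data["k"] == "v"
    return results

print(run_scenarios(FakeDB()))
```

The point is that each scenario exercises a whole lifecycle step, not just a unit of code.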

 

  • Many testing choices, and a random note:

Testing is critical, and frameworks should help by streamlining choices

Automation helps with the process & certification

  • All planning tools suck

Stop thinking a tool will fix your agile problem

  • Good CD just delivers flaws faster when the direction is wrong.  Is your vision clear?

 

 

Canary Analyze All The Things: How we learned to Keep Calm and Release Often

  

This is related to our testing-in-production principle. I suggest you watch it online for more details.

 

 

You need better testing?

Well, yeah.

"I am going to push to production, though I'm pretty sure it is going to kill the system"

 

Trade-off between rate of change and availability

 

Then you need better deployment!

Canary analysis is a deployment process, but it is not a replacement for any sort of software testing.

  • It is not A/B testing
  • It is not replacing 100% of production and hoping it works

 

Are we there yet?

  • We are not
  • You are probably not either

 

How you start:

  • Observability
  • Partial traffic routing
  • Decision-making

 

Focus on the goal:

  • The current baseline matters
  • Absolute numbers are relatively unimportant
  • Relative numbers matter
    • Error rate
    • RPS per CPU cycle
    • Focus on system and application metrics

Categorize metrics

Weight the categories

  • Outliers are out, lying
  • Compare a group of canary machines with a group of baseline machines
    • Exercise without warm-up can result in injury
  • Repeat canary analysis frequently
    • Let application owners choose whether it runs
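Putting the last few points together: a hypothetical scoring sketch that compares canary machines against a baseline group in relative terms and weights metric categories. The metric names, categories, and tolerance are invented; Netflix's automated canary analysis is far more elaborate:

```python
from statistics import mean

# Hypothetical canary scoring sketch; names and the 25% tolerance are
# invented for illustration.

def metric_ok(baseline_values, canary_values, tolerance=0.25):
    """Pass if the canary group's mean is within `tolerance` (relative)
    of the baseline group's mean -- relative numbers, not absolute."""
    base = mean(baseline_values)
    return abs(mean(canary_values) - base) / abs(base) <= tolerance

def canary_score(baseline, canary, weights):
    """baseline/canary: {category: {metric: [one value per machine]}}.
    Returns the weighted fraction of passing metrics (0..1)."""
    total = earned = 0.0
    for category, metrics in baseline.items():
        w = weights.get(category, 1.0)
        for name, base_vals in metrics.items():
            total += w
            if metric_ok(base_vals, canary[category][name]):
                earned += w
    return earned / total

baseline = {"errors": {"rate": [0.010, 0.012, 0.011]},
            "system": {"rps_per_cpu": [100, 98, 102]}}
canary   = {"errors": {"rate": [0.050, 0.060, 0.055]},   # ~5x the error rate
            "system": {"rps_per_cpu": [101, 99, 100]}}
print(canary_score(baseline, canary, weights={"errors": 3.0, "system": 1.0}))  # → 0.25
```

Comparing group means rather than single machines is what dampens the effect of an outlier instance, and the weights let an inflated error rate sink the score even when system metrics look fine.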

 

Cloud Consideration

  • Software-defined everything
  • Incremental pricing

Bad news:

  • Capacity management
  • Unpredictable inconsistency
  • Some servers were extremely slow during peak time. Finally, we found it was a bad-neighbor problem: Netflix is its own bad neighbor

 

The road to get there

  • 753 services in production
  • In-house telemetry platform
  • A few metrics
  • Been there, done that manually.
  • The more metrics you add, the harder it is to make a decision.
    • Context vs. precision
    • No repeatability
    • Trending
    • Manual effort is manual

Automated canary analysis

700 canary runs per day.

Avoid deploying during the weekend:

Customer usage is heavy during the weekend

Developers don't work during the weekend

 

Our next trick for configuration system:

  • Configuration GUI makes it simple
  • Deployment system integration
  • ACA all the things
  • Open connect firmware upgrade

 

Summary

Canary analysis makes your changes safer

Most people can start doing it now.

 

  

Weathering the Data Storm

This talk is very interesting for people in the machine learning and ads areas. She explained in detail how her team automatically generates ~5,000 models for ad campaigns, and how her clustering algorithm improves 5 to 10 times over random selection.