Today I attended the first day of the QCon New York 2014 conference. Here is a brief introduction to the conference:

 

Software is Changing the World. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

 

QCon starts with 2 days of tutorials on Monday and Tuesday, June 9-10 followed by the full 3-day conference from Wednesday, June 11-13. The conference will feature over 100 speakers in 6 concurrent tracks daily covering the most timely and innovative topics driving the evolution of enterprise software development today. The setting is the beautiful, centrally-located Marriott at Brooklyn Bridge in New York City.

 

The main reasons I am attending this conference are to learn the industry trends in software and service development, to learn how more people use big data and machine learning techniques in their services, and to find existing Microsoft customers and understand their use cases.

 

Here are the highlights of day 1 of the conference. On day one there were 6 tracks running concurrently:

 

Hot Technologies behind Modern Finance

Continuous Delivery

The Hyperinteractive Client

Lean Product Design

Architectures you've always wondered about

Solutions Track

 

I mostly attended the Continuous Delivery and Architectures you've always wondered about tracks. I will share the slides and talks (if possible) in the coming days.

 

Highlight

Hot techniques:

  • LinkedIn's Kafka
  • Apache ZooKeeper

 

New techniques:

  • Micro-services
  • Linux containers

 

Booths

  Application monitoring and NoSQL/other database technologies dominate the booths. Nearly all booths are related to one of these, which is also a good indicator of what is hot in the industry.

 

Application Monitoring:

  • AppDynamics

  I knew AppDynamics from the popular slides "Call of Duty: Dev Ops". AppDynamics is a very popular company for monitoring customers' applications. Their application performance management (APM) product allows people to monitor hybrid environments with Java, .NET and PHP (and a wide list of other environments). Their people showed me a demo of how they monitor an Oracle database and do performance troubleshooting on wait stats, poor query plans, etc., which reminded me that our SQL Azure performance monitoring is not as good as it should be.

 

 

  • TIBCO Software Inc.: real-time, event-enabled solutions.

 

NoSQL databases:

 

Riak: a key-value store; Yammer uses it. They are talking with Azure about hosting on Azure. They are moving to Seattle soon.

 

GridGain: in-memory computing (HPC), in-memory streaming, in-memory accelerator for Hadoop, in-memory database

 

TIBCO ActiveSpaces: a distributed, peer-to-peer in-memory data grid

  

Keynote

Today's keynote was "Whither Web Programming?" by Gilad Bracha, co-author of the Java spec. Gilad discussed several tools that can compile interactively and show the result in the browser. Since I am not a front-end expert, I just list the tools here:

You can modify code and syntax online. He mentioned reflection: reflection makes your output larger.

Has a live debugger (IDE),

serializing a thread,

 continuation to flow…

  • Leisure: a purely functional, lazy, dynamic language. Supports different themes
  • Newspeak: his hobby work. A live, modular, object-capability language. It syncs the application's code and data when it is online, so the app is always available offline

Live code editing + Mixins

Changing a mixin at runtime means changing all classes that mix it in.
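
To make the mixin point concrete, here is a minimal sketch in Python rather than Newspeak (I did not capture the keynote's own code). The class names `GreetingMixin`, `Venue` and `User` are made up for illustration. Replacing a method on the mixin while the program is running changes the behavior of every class that mixes it in, including objects that already exist:

```python
# Hypothetical mixin and classes, for illustration only.
class GreetingMixin:
    def greet(self):
        return "hello from " + type(self).__name__


class Venue(GreetingMixin):
    pass


class User(GreetingMixin):
    pass


venue, user = Venue(), User()
before = (venue.greet(), user.greet())


# "Live edit": swap the mixin's method at runtime.
def shout(self):
    return "HELLO FROM " + type(self).__name__.upper()


GreetingMixin.greet = shout

# Existing instances of every class using the mixin see the new behavior.
after = (venue.greet(), user.greet())
print(before)
print(after)
```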

 

His final thought:

Whither web programming:

  • Web apps should evolve to compete with and surpass native apps
  • The web platform must support offline work, stored programs, and many programming languages.

 

 

Foursquare: evolving from check-in to recommendation engine

Currently the company has 140 employees. Highlights:

  • ZooKeeper to connect many services together
  • Adding a new API without a deploy is quite interesting

 

Part 1: scale the data storage

They started in 2009 on MySQL and switched to PostgreSQL due to requirements. They used the typical ways of scaling out: indexing, in-memory caching and, when things became big, either 1) splitting tables so that joins are replaced with several queries or 2) replicating to read-only copies and redirecting traffic.

However, they faced the problem of outgrowing their hardware:

  1. Not enough RAM for indexes and the working data set
  2. 100 writes/second/disk

So they decided to use sharding, and evaluated the following options:

  1. Building their own sharding logic, but there are a lot of issues to handle
  2. Trying Cassandra, HBase or Mongo

They selected Mongo due to

  1. Geo-replication
  2. Flexible schema plus secondary indexes on the data
  3. Auto-balancing feature on the roadmap
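
To illustrate what "building your own sharding logic" involves, and one of the issues it brings, here is a minimal sketch of hash-based shard routing. This is not Foursquare's code: the function names, the shard counts and the use of MD5 are all my own assumptions. It shows that a naive modulo scheme remaps many keys when the number of shards changes, which is part of why an auto-balancing store was attractive.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (hypothetical scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys that land on a different shard after resizing."""
    moved = sum(
        1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count)
    )
    return moved / len(keys)


# Hypothetical user IDs; going from 4 to 5 shards moves many of them.
keys = [f"user{i}" for i in range(1000)]
print(shard_for("user1", 4))
print(moved_fraction(keys, 4, 5))
```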

From 2010 to 2011 they migrated one table at a time to Mongo. They have 15 clusters and take 1 million queries per second at peak.

 

Besides Mongo, they use Memcache, and Elasticsearch for nearby venue search and user search. They also built two data services:

  • Read-only key-value server
  • In-memory cache

 

The read-only key-value server (HFile) is a file index service. They use nightly MapReduce jobs to generate HFiles: prefix-indexed files that pre-compute commonly used query results.

They use ZooKeeper to tie in the Hadoop cluster that runs these jobs. Caching services sit on top of Mongo to avoid queries that are very expensive for Mongo.

 

Part 2: application complexity

 

The company started on PHP in 2009, then moved to Scala on the JVM, using the Lift web framework.

 

One interesting tool is called RPC tracing, used to explore API issues. It is the most inexpensive tool for getting insight into an API (DB connections, performance, troubleshooting and stack traces). It also shows RPC counts over the past week per API: if the number of RPC calls increases, it means something is wrong.

Another tool is called Throttles: dynamically switching features on and off. Features can be turned on for specific IDs, internal users, etc. with different rules. It is used for rolling out new features as well.
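
I did not see the Throttles code itself, so the following is only a rough sketch of the idea under my own assumptions: a feature switch with a global on/off flag, an ID allow-list, an internal-users rule and a percentage rollout. The class name, field names and hashing scheme are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Throttle:
    """Hypothetical feature switch; not Foursquare's actual implementation."""
    name: str
    enabled: bool = True                          # global kill switch
    allowed_ids: set = field(default_factory=set)  # always-on user IDs
    allow_internal: bool = False                  # on for internal users
    rollout_percent: int = 0                      # gradual rollout, 0..100

    def is_on(self, user_id: int, is_internal: bool = False) -> bool:
        if not self.enabled:
            return False
        if user_id in self.allowed_ids:
            return True
        if self.allow_internal and is_internal:
            return True
        # Stable bucket per (feature, user) so a user stays in or out.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode("utf-8")).hexdigest()
        return int(digest, 16) % 100 < self.rollout_percent


new_feed = Throttle("new_feed", allowed_ids={42}, allow_internal=True)
print(new_feed.is_on(42), new_feed.is_on(7), new_feed.is_on(7, is_internal=True))
```

Because the rules live in data rather than code, flipping `enabled` or raising `rollout_percent` at runtime changes behavior without a deploy, which is the property the talk emphasized.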

 

Remember the goats, i.e., the growing pains for developers:

  • Developers need to compile all the code all the time
  • Deploying all the code all the time
  • Hard to isolate cause of performance regressions and resource leaks

So the Solution: SOA infancy

  • Single codebase, multiple builds
  • Twitter's Scala-based RPC library (automatically generates the client/server-side API)

 

But the team still faced the following problems:

  • Duplication in packaging and deployment efforts
  • Hard to trace execution problems (correlating different traces)
  • Hard to define/change where things live (configs are hard-coded)
  • Networks aren't reliable (RPC calls will fail)

 

The solutions are:

  • Builds and deploys using 1) single service definition files, 2) consistent build packaging, 3) simple deployment to canary & fleet

such as ./service_releaser -j service_name

 

  • Monitoring: each application uses the same way of monitoring
    • Healthcheck endpoint over http
    • Consistent metric names
    • Dashboard for every service

Distributed tracing tools

Send all traces to a Kafka queue to summarize the traces

Each application passes a correlation ID from the parent down to its children
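
As a rough sketch of correlation-ID propagation (my own illustration, not code from the talk): the entry point creates an ID, every downstream call receives it, and each service emits a trace record carrying it. The service names are hypothetical, and a plain list stands in for the Kafka queue.

```python
import uuid

# Stand-in for the Kafka queue that collects traces (assumption).
TRACE_LOG = []


def record(correlation_id, service, parent):
    TRACE_LOG.append(
        {"correlation_id": correlation_id, "service": service, "parent": parent}
    )


def user_service(correlation_id, parent):
    record(correlation_id, "user_service", parent)


def venue_service(correlation_id, parent):
    record(correlation_id, "venue_service", parent)


def api_frontend(correlation_id=None):
    # The parent creates the ID once and hands it down to every child call.
    correlation_id = correlation_id or uuid.uuid4().hex
    record(correlation_id, "api_frontend", None)
    user_service(correlation_id, "api_frontend")
    venue_service(correlation_id, "api_frontend")
    return correlation_id


def traces_for(correlation_id):
    """Summarize one request by collecting every record with its ID."""
    return [t["service"] for t in TRACE_LOG if t["correlation_id"] == correlation_id]


request_id = api_frontend()
print(traces_for(request_id))
```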

 

  • Exception aggregation

All exceptions are aggregated and published to a single place, so full stack traces can be seen very easily

 

  • Application discovery

Use ZooKeeper + Finagle server sets to dynamically handle hostnames, etc.

 

  • Network issues:

Circuit breaking:

Fail RPC calls fast once some error-rate threshold is exceeded

Loosely based on Netflix's Hystrix
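
Here is a minimal circuit-breaker sketch to show the mechanism. It is neither Hystrix's API nor Foursquare's implementation; the class name, the sliding-window approach and all parameter names are my assumptions. Once the error rate over the last few calls crosses a threshold, calls fail fast without touching the remote service until a cooldown has passed.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, error_threshold=0.5, window=10,
                 cooldown_seconds=30.0, clock=time.monotonic):
        self.error_threshold = error_threshold
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.opened_at = None                # None means the circuit is closed

    def call(self, rpc, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                raise CircuitOpenError("failing fast, circuit is open")
            # Cooldown elapsed: close the circuit and allow calls again.
            self.opened_at = None
            self.results.clear()
        try:
            result = rpc(*args, **kwargs)
        except Exception:
            self.results.append(False)
            self._maybe_open()
            raise
        self.results.append(True)
        return result

    def _maybe_open(self):
        window_full = len(self.results) == self.results.maxlen
        error_rate = self.results.count(False) / len(self.results)
        if window_full and error_rate >= self.error_threshold:
            self.opened_at = self.clock()
```

The window must be full before the breaker can open, so a single early failure does not trip it. A real implementation would also want thread safety and a half-open state that lets only one trial call through.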

 

  • Organization changes

Smaller teams owning the front-to-back implementation of a feature

Desire to have quick deploy cycles on new API endpoints

 

Remote endpoints

Wouldn't it be cool if a developer could expose a new API without redeploying new packages?

 

Some libraries register the endpoints through ZooKeeper. Thrift.

It takes minutes for a developer to have a new API running on the official site, by using a proxy to redirect traffic to the new API

 

Benefit: tight contract for service interaction

  • JSON response
  • All HTTP parameters passed along

Clear path to breaking off more chunks from the API monolith

 

Part 3: future work

  • Further isolating services with independent storage layers
  • Completely automated continuous deployment
  • Hybrid immutable/mutable data storage
    • Mongo, HFile and cache service

jon@foursquare.com

 

Migrating to cloud native with micro-services by Adrian Cockcroft

 

Adrian recently left Netflix to help the IT industry adopt the practices built at Netflix.

 

Here are the highlights of what was learned at Netflix:

  • Speed wins in the marketplace: you may face big competitors, but speed will win
  • Remove friction from product development (key)
  • High trust, low process, no hand-offs between teams (reduce meetings, permissions, culture)
  • Freedom and responsibility culture (hard for other companies to replicate)
  • Don't do your own undifferentiated heavy lifting (reducing management cost).
  • Use simple patterns automated by tools (simple architecture review board).
  • Self-service cloud makes impossible things instant

Question: with the rapid change over the last year or 6 months, corporate IT has been learning cloud and getting up to speed.

 

I did not record everything, so here are some of the notes I took, for your reference. He talked a lot about micro-services, which seem very popular recently. I will share the slides and talks when they are available.

 

Disruptors: take what used to be expensive and learn to "waste" it in order to save money somewhere else

  Example 1: solid state disk. Past: assume random reads are expensive. Now: random reads are free, immutable writes, log-merge

SSD is packaged as disk, as PCI card, as memory storage.

 

Cloud-native storage architecture (don't build SSD, build a distributed system, but embedded into the Hadoop machines).

Cassandra scalability:

Linear scale-up

Hundreds of nodes per cluster are in common use today

Thousands of nodes per cluster have been tested.

  • One node: 300,000 IOPS read/write, 5.4 TB of SSD
  • 100 nodes: 30 million IOPS and 640 TB

 

Example 2: non-cloud product development

  Hardware provisioning is undifferentiated heavy lifting; replace it with IaaS. IaaS-based product development allows you to develop in weeks; however, SaaS can allow you to develop in days.

 

The difference between big data and BI is that big data answers unplanned questions in hours.

 

Open Space discussion on Continuous delivery

I attended the open space discussion on CD. There were two topics I was involved in, both testing related. I guess how a testing strategy fits into the overall delivery pipeline for the whole service is still an open question that many people struggle with.

 

Automatic performance testing on complex systems

In this session, one developer described his problem. His team has a very good testing strategy, focusing a lot on unit testing, plus some integration testing and a couple of end-to-end acceptance tests with UI automation. He is worried about performance regressions and wants to see whether performance can be tested in a cheap way. There were a couple of ideas from different people:

  • Testing each component might not be enough; you need to test end to end for performance
  • Have an environment similar to production and test it using tools such as JMeter
  • Monitor production

I explained the three Ds of the Yammer team: Dark Release, Dogfood and Data Insight. I suggested turning features off by default to reduce risk, using dogfooding to test your features, and building rich telemetry for the KPIs of the system.

 

Automatic vs. manual

One game company has many manual testers in its QA department to test new releases, which ship every 6 months. One person suggested reducing the number of testers (or removing them all) and investing more in test automation. The person who raised the question also mentioned that organization is the main issue, since you are working with different departments with different points of view. Again, I suggested they do A/B testing: even though they ship client applications to customers, they might still be able to use A/B testing to turn features on and off, and also do more dogfooding and telemetry.

 

 

Google backup cloud by Raymond Blum

Not a very interesting talk, but a couple of his highlights are very important for us

 

  • Redundancy does not bring recoverability: it does not mean that your data is safe. Distributed processing imposes data consolidation
  • Local copies don't protect against site outages
  • Diversity of storage technologies further guards against bugs taking out data. Google uses tapes for backup; tapes are more durable than disk
  • The only good backup is one that you have restored
  • Run continuous restores
  • Run automatic comparisons
  • Alert if there are unexpected types or rates of failures
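
To make "continuous restores plus automatic comparisons" concrete, here is a small sketch of a restore check. It is my own illustration, not Google's tooling: `verify_restores`, its parameters and the in-memory "backup" in the usage lines are all hypothetical. Each item is restored, its checksum is compared with the original, and an alert flag is raised when the failure rate is higher than expected.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_restores(originals: dict, restore, max_failure_rate: float = 0.0) -> dict:
    """Restore every item, compare checksums, and flag unexpected failure rates.

    `originals` maps item name -> original bytes; `restore` is any callable
    that returns the restored bytes for a name (both are assumptions).
    """
    failures = []
    for name, data in originals.items():
        try:
            restored = restore(name)
        except Exception as exc:
            failures.append((name, f"restore error: {exc}"))
            continue
        if checksum(restored) != checksum(data):
            failures.append((name, "checksum mismatch"))
    rate = len(failures) / len(originals) if originals else 0.0
    return {"failures": failures, "failure_rate": rate, "alert": rate > max_failure_rate}


# Hypothetical data: item "b" is corrupted in the backup, item "c" is missing.
originals = {"a": b"alpha", "b": b"beta", "c": b"gamma"}
backup = {"a": b"alpha", "b": b"bXta"}
report = verify_restores(originals, lambda name: backup[name])
print(report["failure_rate"], report["alert"])
```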

 

  • Replicated data has to be consolidated eventually and optimized for restore, even if that makes backup more complex
  • Backup and restore strategies have to scale
  • Automate! This reduces the costs
  • Coordinate the processing of thousands of tapes
  • MapReduce is really good at sharding, fault tolerance and blocking on dependencies
    • DiRT! Google has a yearly exercise on data recovery and found a lot of issues
    • Beware of circular dependencies. For example, the encryption key was backed up on tape, but without the key there is no way to read from the tape.
    • The backup team only guarantees data to within a couple of hours; other teams build on soft deletes or other techniques to meet their SLOs and SLAs

 

TestOps for Continuous Delivery

Acquia Cloud is a PaaS for PHP apps. It has

  • Multiple environments: dev, stage, prod
  • A continuous integration environment for your app
  • Special sauce for Drupal

 

Obligatory impressive numbers, 03/2014

  • 27 billion original hits per month
  • 422 TB data transfer/month
  • 8,000+ EC2 instances

A release every 1.6 days on average

  • Each release alters the infrastructure under thousands of web apps that we don't control

Our customers hate downtime. The main issue is that server configuration is software, and it is hard to test. Puppet and Chef can assist you, but you still need to invest a lot in testing it. Problem: reality is very messy! You might have launch failures or race conditions.

 

Unit tests vs. system tests

  • Unit tests work on an isolated program
  • The problem is that you cannot mock out the real workload and get accurate results
    • Server configuration interacts with the OS, network, and services

 

System tests are end to end

Apply code changes to real, running services

Exercise the infra as the apps will

 

System tests FTW:

 For infrastructure, system tests are essential

 

Test in a clone of production is not right:

  • No back-doors to make tests "easier"
  • PCI, HIPAA security requirements
  • Tests operate just as admins do:
    • Most tests operate "from a bastion":
    • Ensures the code works in production

 

Testing Strategies

  1. Basic build tests: launch VMs, run Puppet
  • Replicate a functional production environment
  • Isolation from production
  • Scan syslog for errors
  • Test config files, daemons, users, cron jobs
  • Sample failure:
      • Incorrect Puppet dependencies work while iterating on a development instance but not on a clean launch

  2. Functionality tests cover the moving parts:
  • Backups and restores
  • Message queues
  • Worker auto-scaling
  • Load balancing as the workload goes up and down
  • ELB health checks and recovery
  • Database failover
  • Monitoring and alerts
  • Self-healing

  3. Application tests: install and verify application(s)
  • Real site code
  • Real site DB (scrubbed)
  • Cause the app to exercise the infrastructure
      • Write to database, message queue, etc.
      • Verify success on the backup
      • Operate the app on degraded infrastructure

  4. Reboot tests:
  • Reboot all test servers
  • Re-run build tests
  • Re-run functional and application tests
  • Sample failures:
      • File system mounted?
      • Services restarted?
      • Database quota daemon starts before the MySQL daemon, then alerts

  5. Re-launch tests: re-launch all test servers from base images
  • Simulate server crash and recovery
  • Persistent data retained?
  • Server rejoins servers?
  • Unexpected issues
  • Re-run build, functional and application tests
  • Sample failure:
      • Non-deployable customer application prevents the relaunch from completing normally

  6. Upgrade tests
  • Also need to test upgrading existing servers: (you add a feature
  • The upgrade test dance:
      • Launch servers in a test environment on current production code
      • Run smoke tests to ensure the system is operating
      • Upgrade servers to the latest development code

  7. Testing in parallel: run the tests in parallel to optimize running time
  • Workers may alter server-wide behavior
  • Each worker needs an isolated set of servers
  • Workers that break their services need to self-destruct, or they will cause false failures

 

Management issues:

  1. Who writes the tests?
  • Our tests are as complex as, or more complex than, the production code
  • Subtle cases require white-box testing
  • First try: QA department
      • Tests took longer to write
      • Triggering specific failure scenarios requires understanding OS and code details together
      • It didn't work; they could not keep up or go deep
  • Now: Engineering
      • Every developer writes unit and system tests for their own code.
  2. Who fixes the tests?
  • Infrastructure system tests are fragile
  • Code review requires a "passing" run
  • Bugs often only occur post-commit
  • A permanent, rotating team handles failures
      • It must analyze any failure, confirm whether it is unrelated, and refer to or open a ticket for it.
      • It has the authority to revert any commit causing a failure
      • Usually it is easier to fix it instead.
  3. Who invests in the tests?

Management must accept that infrastructure system tests are:

Hard

Time-consuming

Essential

Worth it

Under-investing will bite you badly.