Let's face it - computing was created to analyze data. You rarely if ever have a program looking for data. Rather, you have data looking for code to analyze it. Machine learning represents the state-of-the-art in making sense of data. Unfortunately, for many years it has been out of reach for the common developer – until now.
This is perhaps one of the highest paid and most sought-after skills today. No question about it - this is the place to really make it big as a developer.
Figure 1: The world of machine learning
Machine learning represents the logical extension of simple data retrieval and storage. It is about developing building blocks that make computers learn and behave more intelligently.
Machine learning makes it possible to mine historical data and make predictions about future trends. Without realizing it, you are probably already using the benefits of machine learning. Search engine results, online recommendations, ad targeting, fraud detection, and spam filtering are all examples of what is possible with machine learning.
Machine learning is about making data-driven decisions. While instinct might be important, it is difficult to beat empirical data.
Once you dive deeper into the subject, you start addressing topics such as:
Supervised and unsupervised learning
Markov models and Bayesian networks and much more
The Apache Mahout project's goal is to build a scalable machine learning library.
There is an entire machine learning open-source project that you can get for free with Hadoop. You can learn more here:
Matrix factorization based recommenders
K-Means, Fuzzy K-Means clustering
Latent Dirichlet Allocation
Singular Value Decomposition
Logistic regression classifier
(Complementary) Naive Bayes classifier
Random forest classifier
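To give a feel for one entry in the list above, here is a minimal sketch of K-Means clustering in plain Python. It is only an illustration of the idea (assign each point to its nearest centroid, then move each centroid to the mean of its points); it is not how Mahout implements it, and the sample points and starting centroids are made up.

```python
def k_means(points, centroids, iterations=10):
    """Cluster 2-D points around the given starting centroids (Lloyd's algorithm)."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for px, py in points:
            distances = [(px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centroids]
            nearest = distances.index(min(distances))
            clusters[nearest].append((px, py))
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(old)
        centroids = new_centroids
    return centroids, clusters


# Made-up sample data: two visually obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

Real libraries add smarter centroid initialization, a convergence check, and distributed execution, which is where a project like Mahout earns its keep.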
I wish I had more time. I would seriously consider taking this free MIT online class, which you can find here:
Historically, machine learning has required complex software and high-end computers, along with a seasoned data scientist. What has been needed is a fully managed cloud service for this form of machine learning, also known as predictive analytics.
MAML (Microsoft Azure Machine Learning) is an Azure service. It is a web application that includes a studio called ML Studio. You use this web application to create experiments that represent your machine learning activities.
A visual composition surface is used to create a machine learning workflow. The design surface of the web app allows you to add modules. Additional modules can be authored in R.
The point of a visual design surface is to remove the complexity of creating algorithms, cleaning data, and finding features.
There are two phases to using MAML. The first phase is the experiment. That is where you start with the data and begin to clean it up, which is going to take 60% to 70% of the total time. In this phase, you will be combining data, removing rows, and eliminating columns. You will also take your model and train it. From there, the output will be scored and evaluated.
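The experiment phase described above (clean, train, score, evaluate) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the housing data is made up, the "model" is a one-feature least-squares line, and none of the function names correspond to ML Studio modules.

```python
from statistics import mean

# Made-up raw data; None marks a missing value.
raw_rows = [
    {"sqft": 1000, "agent": "A", "price": 200_000},
    {"sqft": 1500, "agent": "B", "price": 300_000},
    {"sqft": None, "agent": "A", "price": 250_000},  # incomplete row
    {"sqft": 2000, "agent": "C", "price": 400_000},
    {"sqft": 2500, "agent": "B", "price": 500_000},
    {"sqft": 3000, "agent": "A", "price": 600_000},
]


def clean(rows, keep_columns):
    """Remove rows with missing values and eliminate columns we do not need."""
    cleaned = []
    for row in rows:
        if any(row[column] is None for column in keep_columns):
            continue
        cleaned.append({column: row[column] for column in keep_columns})
    return cleaned


def train(rows, feature, label):
    """Fit label = slope * feature + intercept by ordinary least squares."""
    xs = [row[feature] for row in rows]
    ys = [row[label] for row in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


def score(model, rows, feature):
    """Apply the trained model to rows it has not seen."""
    slope, intercept = model
    return [slope * row[feature] + intercept for row in rows]


def evaluate(predictions, rows, label):
    """Mean absolute error between predictions and the known answers."""
    return mean(abs(p - row[label]) for p, row in zip(predictions, rows))


data = clean(raw_rows, keep_columns=["sqft", "price"])
train_rows, test_rows = data[:3], data[3:]  # hold some rows back for scoring
model = train(train_rows, "sqft", "price")
predictions = score(model, test_rows, "sqft")
error = evaluate(predictions, test_rows, "price")
print(model, predictions, error)
```

Notice how much of even this tiny example is data preparation rather than modeling, which matches the 60% to 70% estimate.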
In phase 2 you will operationalize the model, which means it will be put behind a web service. This will allow you to connect your machine learning model to other business processes. This is the real magic of the Azure Machine Learning offering. Operationalizing your models and exposing them to your business is a key step and is often extremely difficult with other approaches. Operationalizing Azure Machine Learning is extremely simple.
Using simple drag-and-drop gestures along with data flow graphs, you are able to set up experiments and take advantage of sophisticated algorithms without writing code.
A pool of VMs runs the machine learning algorithms under an orchestration engine, freeing the data scientist from moving data around and switching between different services.
ML Studio targets emerging data scientists. You can train 10 models in minutes, not days. You can put a predictive model into production in minutes, not weeks or months. Some customers are reporting a 10X-100X reduction in cost relative to the competition. I invite readers to go get some pricing for SAS. See http://www.sas.com/en_us/software/analytics/rapid-predictive-modeler.html.
These models can also be shared with other parts of a company. Employees can create their own workspaces, enabling reuse and cross-team collaboration. The models can be locked as well, allowing them to be reused but not modified. In other words, these can be immutable models, allowing sharing and innovation without breaking what is considered ‘golden.’
The predictive models can be shared as a service across an enterprise, leveraging Azure as the public cloud back-end. Average latency from one service in Azure to another is between 50 ms and 100 ms. This is very fast and will allow companies to leverage machine learning back-ends running predictive models from other services in Azure. For example, you can write JSON-based back ends that leverage your predictive models, allowing you to build decision-making dashboards for your business.
Machine learning algorithms are built to continually improve over time by leveraging training sets. Training sets make it possible to continually improve the robustness of your predictive model.
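As a rough idea of what a JSON-based back end calling a predictive model might look like, here is a minimal Python sketch using only the standard library. Everything specific in it is an assumption for illustration: the URL, the API key, the bearer-token header, and the `inputs`/`predictions` field names are hypothetical placeholders, so check your own published service for the real endpoint and request schema.

```python
import json
import urllib.request

# Hypothetical placeholders: substitute the values for your own published service.
SCORING_URL = "https://example.invalid/score"
API_KEY = "replace-with-your-key"


def build_scoring_request(features, url=SCORING_URL, api_key=API_KEY):
    """Package one row of feature values as a JSON POST request."""
    payload = {"inputs": [features]}  # assumed request schema
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,  # assumed auth scheme
        },
        method="POST",
    )


def parse_scoring_response(raw_bytes):
    """Pull the predictions out of a JSON response body (assumed schema)."""
    return json.loads(raw_bytes.decode("utf-8"))["predictions"]


request = build_scoring_request({"sqft": 1800})
print(request.get_method(), request.full_url)

# Sending the request is left commented out so the sketch runs offline:
# with urllib.request.urlopen(request) as response:
#     print(parse_scoring_response(response.read()))
```

A dashboard back end would call something like `build_scoring_request` for each incoming query and pass the parsed predictions on to the front end.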
R is a popular open source programming environment for statistics and data mining. The good news is that it is easily integrated into ML Studio. I have a lot of friends using functional languages for machine learning, such as F#. It's pretty clear, however, that R is dominant in this space.
Polls and surveys of data miners show that R's popularity has increased substantially in recent years. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team, of which John Chambers, the creator of the S language, is a member. R is named partly after the first names of the first two R authors. R is a GNU project and is written primarily in C, Fortran, and R.
Below is a framework that provides a way for you to think about the predictive nature of machine learning. It's all about providing insight for business decisions where limited resources are applied to grow revenue or limit expenses. This might include insights into consumer spending patterns or into optimizing the supply chain.
One great way to think about machine learning is to break down analytics into 3 questions:
What will happen?
What should I do next?
The information worker
Typically using a self-service approach using Power BI.
What to look for in a data scientist
Clear Understanding of the Scientific Method
Strong in Math and Statistics
Intellectual Curiosity and Critical Thinking
Visualization and Communication
Advanced Computing and Data Management
If you were to go back to school to study to be a data scientist, what courses would you take?
This post provided a high-level view of some of the characteristics and concepts of machine learning. In the next post, we will start playing around with the Azure portal.
Figure 2: The Azure Portal