Presto!

The Facebook team recently open-sourced a very cool distributed query engine that they concisely called Presto. Unlike, say, HBase, Presto doesn't store its own data; instead it can plug in data from a variety of sources (e.g. Cassandra) and offers an ANSI SQL query engine that distributes query processing across many nodes. Since I'm as knowledgeable about Presto as I am about making musicals, I'll refrain from trying to explain more and instead redirect you to their home page or this short article for more information. Let me instead describe how I got it running on Azure Compute and querying data from blob store!

Presto getting a little blue...

When I saw Presto I really wanted to see it in action on Azure and play with it. Of course I could've spun up a few Linux VMs and set up Presto on those (and in fact a colleague of mine started doing just that), but I was looking for a bit more fun, so I decided to get it on Blue Coffee instead. Since Presto is pretty much all pure Java, getting it running on Windows wasn't too hard: mainly extending an if statement that only let OS X and Linux in so that Windows could get through as well, and adding a Windows native library to presto-hadoop-apache2. After that it was a matter of lightly reverse engineering their scripts and configuration to get the logic right, and boom, I had it running on Azure (of course it was that easy).

Snakey charm

As explained, Presto on its own is not that interesting; it needs data! My first test catalog for it was Cassandra: I already had that running in Blue Coffee, and configuring it was relatively easy. So I added a catalog configuration class for it, then tried it out with a simple Azure service that stood up a Cassandra cluster next to a Presto one, with Presto configured to be able to query the Cassandra cluster. It worked like a charm, so it was on to more interesting things...
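Under the covers, a Presto catalog is just a properties file in Presto's etc/catalog directory, and generating one is essentially what that configuration class does. A minimal sketch for a Cassandra catalog (property names as in Presto's Cassandra connector; the host is a placeholder, and exact keys may vary across Presto versions):

```properties
# etc/catalog/cassandra.properties
connector.name=cassandra
# Comma-separated list of Cassandra nodes to contact
cassandra.contact-points={cassandra-host}
```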

Going after the blobby prize

Presto can also query data from "Hive", but it would help if I first explained why I put those quotes there. If you're not familiar with it, Hive is a Hadoop-based service that stores metadata in a database fronted by a service called the Metastore service, stores the actual data in a Hadoop-compatible file system, and has a server that translates SQL (HQL) queries into MapReduce jobs (or, in the case of Tez, pure YARN DAGs) on a Hadoop cluster to get you the results. Presto though, being the crack SQL engine that it is, doesn't use that last part (which is the bulk of Hive): it just queries the Metastore service for metadata, then does its own query processing, fetching the data directly from the Hadoop file system when it needs it.

So, since blob storage is already exposed as a Hadoop file system through WASB, this seemed very enticing: I could just spin up a measly Hive metastore service pointing to data in my Azure blob storage account, and then use Presto as a distributed SQL query engine on that data! Brilliant! Of course, I can already do this with HDInsight and Hive, but the more cool ways I can analyze my data, the merrier, right?
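In Presto terms, this boils down to a Hive catalog that points at the metastore's Thrift endpoint. A minimal sketch (connector and property names as in Presto's Hive connector for Hadoop 2; the host is a placeholder):

```properties
# etc/catalog/hive.properties
connector.name=hive-hadoop2
# Thrift endpoint of the Hive metastore service (9083 is its usual port)
hive.metastore.uri=thrift://{metastore-host}:9083
```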

Hive wasn't on Blue Coffee, so I added it (at least enough to get the Metastore service running). After that, it was just a matter of adding catalog configuration logic to Presto to hook it up with Hive, plus some fun debugging, and then I was able to spin up a Presto cluster that analyzes my blob store data. It was awesome - and since I'm a generous guy like that, in this post I want to guide you through spinning up such a cluster yourself so you can experience the awesomeness first-hand.

A quick guide to a DIY Presto-on-Azure cluster

WARNING: this is all bleeding-edge stuff so don't expect easy packaged solutions here.

As usual, you need Visual Studio and the Azure SDK. To make this tutorial easier, I'll also assume you already have an Azure SQL database with Hive metadata pointing to your data. This means: create an Azure SQL database, create an HDInsight cluster with this database configured as the Hive metastore, create your Hive tables there (make sure you use fully qualified WASB URIs as the location for these tables), and then drop the HDInsight cluster since we won't need it after that. It's also possible to back the Hive metastore service with a local Derby server and skip this step; if you're adventurous, please try that.
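For illustration, a table created this way might look like the following sketch (the table name, columns, container, and account are all placeholders I made up; the important part is the fully qualified WASB location):

```sql
-- Run in Hive on the temporary HDInsight cluster; the metadata lands in
-- the Azure SQL metastore while the data stays in blob storage.
CREATE EXTERNAL TABLE mytable (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'wasb://{container}@{youraccount}.blob.core.windows.net/data/mytable';
```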

With all this prep out of the way, let's build us a Presto service!

  1. Open Visual Studio, and create a new Windows Azure Cloud Service project. Add 3 worker roles named "HiveMetastore", "PrestoCoordinator" and "PrestoWorker" (the naming is important) and make sure you have Azure SDK version 2.3 selected.
  2. Right-click the solution, choose “Manage NuGet Packages for Solution...”, go to Online, select “Include Prerelease” instead of “Stable Only” (because you’re hard-core like that), search for “Presto”, and install “Microsoft.Experimental.Azure.Presto” into the two Presto roles.
  3. Similarly, install “Microsoft.Experimental.Azure.Hive” to the HiveMetastore role.
  4. Use the following code for the HiveMetastore's WorkerRole.cs:
    using Microsoft.Experimental.Azure.Hive; 
    namespace HiveMetastore
    {
    	public class WorkerRole : HiveMetastoreNodeBase
    	{
    		protected override HiveMetastoreConfig GetMetastoreConfig()
    		{
    			return new HiveSqlServerMetastoreConfig(
    				serverUri: "//{yourdbserver}.database.windows.net",
    				databaseName: "{yourdb}",
    				userName: "{user}@{yourdbserver}",
    				password: "{password}");
    		}
    	}
    }
    
    Filling in the appropriate values for your Azure SQL Server database.
  5. Use the following code for the two Presto services' WorkerRole.cs:
    using Microsoft.Experimental.Azure.Presto;
    using Microsoft.WindowsAzure.ServiceRuntime;
    using System;
    using System.Collections.Generic;
    using System.Linq;
    namespace PrestoCoordinator
    {
    	public class WorkerRole : PrestoNodeBase
    	{
    		protected override IEnumerable<PrestoCatalogConfig> ConfigurePrestoCatalogs()
    		{
    			// Resolve the Hive metastore role instance's IP address.
    			// (GetIPAddress is a small helper, assumed to come from the
    			// base class or be defined elsewhere in the role.)
    			var hiveNode = RoleEnvironment.Roles["HiveMetastore"].Instances
    				.Select(GetIPAddress)
    				.First();
    			var hiveCatalogConfig = new PrestoHiveCatalogConfig(
    				metastoreUri: String.Format("thrift://{0}:9083", hiveNode),
    				hiveConfigurationProperties: new Dictionary<string, string>()
    				{
    					{ "fs.azure.skip.metrics", "true" },
    					{ "fs.azure.check.block.md5", "false" },
    					{ "fs.azure.account.key.{youraccount}.blob.core.windows.net", "{yourkey}" },
    				});
    			return new PrestoCatalogConfig[] { hiveCatalogConfig };
    		}
    		protected override bool IsCoordinator
    		{
    			get { return true; } // true for PrestoCoordinator, false for PrestoWorker
    		}
    	}
    }
    
    Filling in the information for your Azure storage account, and fixing the IsCoordinator property for the coordinator/worker roles appropriately.
  6. Use the following for your ServiceDefinition.csdef:
    <?xml version="1.0" encoding="utf-8"?>
    <ServiceDefinition name="{yourservice}" xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition" schemaVersion="2014-01.2.3">
      <WorkerRole name="PrestoCoordinator" vmsize="Large">
        <Imports>
          <Import moduleName="Diagnostics" />
        </Imports>
        <Endpoints>
          <InputEndpoint name="HTTP" protocol="tcp" port="8080" localPort="8080" />
          <InternalEndpoint name="Dummy" protocol="tcp" port="8020" />
        </Endpoints>
        <LocalResources>
          <LocalStorage name="DataDirectory" cleanOnRoleRecycle="false" sizeInMB="102400" />
          <LocalStorage name="InstallDirectory" cleanOnRoleRecycle="true" sizeInMB="256" />
        </LocalResources>
      </WorkerRole>
      <WorkerRole name="PrestoWorker" vmsize="A6">
        <Imports>
          <Import moduleName="Diagnostics" />
        </Imports>
        <Endpoints>
          <InternalEndpoint name="HTTP" protocol="tcp" port="8081" />
        </Endpoints>
        <LocalResources>
          <LocalStorage name="DataDirectory" cleanOnRoleRecycle="false" sizeInMB="102400" />
          <LocalStorage name="InstallDirectory" cleanOnRoleRecycle="true" sizeInMB="256" />
        </LocalResources>
      </WorkerRole>
      <WorkerRole name="HiveMetastore" vmsize="Small">
        <Imports>
          <Import moduleName="Diagnostics" />
        </Imports>
        <Endpoints>
          <InternalEndpoint name="Thrift" protocol="tcp" port="9083" />
        </Endpoints>
        <LocalResources>
          <LocalStorage name="DataDirectory" cleanOnRoleRecycle="false" sizeInMB="1024" />
          <LocalStorage name="InstallDirectory" cleanOnRoleRecycle="true" sizeInMB="256" />
        </LocalResources>
      </WorkerRole>
    </ServiceDefinition>
    
    Notice that I use A6 for the PrestoWorker role: Presto is very memory-hungry, and that size gave me the best results.

And that should be that. This defines a service that exposes the Presto port, which you can interact with using the Presto CLI (IMPORTANT: this is naively unsecured, so if you have sensitive data on your Azure account, please don't deploy this). You should be able to try this out in your Azure emulator first:

  1. Download the CLI jar from here.
  2. Run it like this:
    java -jar presto-cli-0.74-SNAPSHOT-executable.jar --server http://localhost:8080
    
  3. Try it out:
    use catalog hive;
    show tables;
    select * from mytable limit 10;
    
  4. And if all goes well, you can deploy it (again, heed the security warning above) and interact with it from your machine the same way, except using {yourservice}.cloudapp.net instead of localhost.

Parting thoughts

I really like Presto so far, and I think it's a very promising technology (I have no ties to the team that developed it, and would love to chat with them at some point). These are very early days both for Presto itself (at least in the open world, outside its promising start within Facebook) and for my experimentation with it on Azure. I can already see very fast query performance for certain types of queries (faster than I could get with Hive on HDInsight), and hopefully I'll get to play with it more and see what I can do with it.