Initial sips

There's an explosion of awesome OSS projects happening in the big data analysis space right now. A big chunk of them follow a similar pattern: they're released as Apache projects under the Apache Software Foundation, they're typically written in Java or at least a JVM language like Scala, and even though the JVM is platform-independent, they're typically developed and tested on Linux with helper scripts written in bash and/or Python. Over in Redmond, a bunch of us at Microsoft thought about all the ways our customers can use these tools on Azure, and as a result we already have a few options, including:

  1. You can use HDInsight, which offers a bunch of these tools in a convenient managed service: Hadoop (YARN, MapReduce), Hive, Pig, HBase, ...
  2. You can provision a Linux VM in Azure (yes, we'll still love you), or a bunch of them, and deploy your favorite services in there. You can even use your favorite automation tools, such as Chef, to automate that deployment.

These are all excellent and supported avenues, and if they serve your needs you should definitely pursue them. However, some of us started thinking about what it would look like to run these tools in an Azure cloud service. I started experimenting with code to deploy and run some of these projects as worker roles in Azure services, and in the spirit of openness I put these experiments up on GitHub and as NuGet packages. This is very much still in the early experimentation phase, but in this post I wanted to show you how you can use these packages to deploy a Cassandra cluster in Azure in your own cloud service. My hope is that those of you who love living on the bleeding edge will try it and give us feedback on how useful this seems, on the general approach, and on whatever else comes to mind.

Adventurous gulps (tutorial)

So let's walk through creating a web site that uses our own Cassandra cluster as a backing data store. To follow along, you'll need Visual Studio (I'm using VS 2013 Ultimate because I work at Microsoft and they don't mind giving us our own software, but I think Express should work), the Azure SDK, and ideally an Azure subscription, though you can just debug it on the Azure Emulator locally.

  1. First, create a new Cloud project. Add a worker role named CassandraNode, and an ASP.NET web role called FrontEnd. To simplify things I chose a simple Web Forms template with no authentication.
  2. Install the Microsoft.Experimental.Azure.Cassandra NuGet package into the CassandraNode project. In the Package Manager Console window:
    Install-Package Microsoft.Experimental.Azure.Cassandra -IncludePrerelease -ProjectName CassandraNode
  3. The CassandraNode role will need some local storage resources. Right-click the CassandraNode role in the cloud project, choose Properties, open the Local Storage tab, and add two resources: an InstallDir sized 256 MB, and a DataDir sized 1024 MB.
  4. It will also need to expose some TCP ports for communication. In the same properties page, in the Endpoints tab, expose three Internal endpoints: Storage at port 7000, RPC at port 9160 and NativeTransport at port 9042.
  5. Put the following code as the definition of your WorkerRole class for the CassandraNode role:
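    using System.IO;
    using System.Linq;
    using Microsoft.WindowsAzure.ServiceRuntime;
    // Also import the namespaces from the package installed in step 2 that
    // provide JavaInstaller, CassandraConfig and CassandraNodeRunner.
    
    public class WorkerRole : RoleEntryPoint
    {
      private CassandraNodeRunner _runner;
      // Hands control to the Cassandra runner that OnStart() prepared.
      public override void Run()
      {
        _runner.Run();
      }
      // Installs Java and Cassandra and writes the configuration before Run() is called.
      public override bool OnStart()
      {
        // Local storage resources declared in step 3.
        var installDir = RoleEnvironment.GetLocalResource("InstallDir").RootPath;
        var dataDir = RoleEnvironment.GetLocalResource("DataDir").RootPath;
        var javaInstaller = new JavaInstaller(Path.Combine(installDir, "Java"));
        javaInstaller.Setup();
        // Every instance of this role is a cluster node; take the address
        // of each instance's first endpoint.
        var nodes =
          RoleEnvironment.CurrentRoleInstance.Role.Instances.Select(
            i => i.InstanceEndpoints.First().Value.IPEndpoint.Address.ToString());
        // All of Cassandra's data lives under the DataDir resource.
        var config = new CassandraConfig(
          clusterName: "AzureCluster",
          clusterNodes: nodes,
          dataDirectories: new[] { Path.Combine(dataDir, "Data") },
          commitLogDirectory: Path.Combine(dataDir, "CommitLog"),
          savedCachesDirectory: Path.Combine(dataDir, "SavedCaches")
        );
        _runner = new CassandraNodeRunner(
          jarsDirectory: Path.Combine(installDir, "Jars"),
          javaHome: javaInstaller.JavaHome,
          logsDirctory: Path.Combine(dataDir, "Logs"),
          configDirectory: Path.Combine(installDir, "conf"),
          config: config);
        _runner.Setup();
        return base.OnStart();
      }
    }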
    
    This code should set up Cassandra on startup: it places the OpenJDK and the Cassandra jar files in the install directory, configures Cassandra to form a cluster with the rest of the role's nodes, and puts its data in the data directory.
  6. Now that we're done with the Cassandra role, let's play with it! Let's create a web site to store all the myths about Cassandra. First, we need to talk to Cassandra from the web role. To do that, we'll use the CassandraCSharpDriver NuGet package:
    Install-Package CassandraCSharpDriver -ProjectName FrontEnd
  7. Now add the following two classes to store/retrieve our data from Cassandra:
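    using System;
    using System.Collections.Generic;
    using System.Linq;
    using Cassandra;
    using Microsoft.WindowsAzure.ServiceRuntime;
    
    public class Myth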
    {
      public string Name { get; set; }
      public string Description { get; set; }
    }
    
    public static class Myths
    {
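      // Both are created on first use, so no connection is opened until a page needs it.
      private static readonly Lazy<ISession> _session =
        new Lazy<ISession>(InitSession);
      private static readonly Lazy<PreparedStatement> _insertStatement =
        new Lazy<PreparedStatement>(PrepareInsertStatement);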
    
      public static void AddMyth(Myth myth)
      {
        var bound = _insertStatement.Value.Bind(myth.Name, myth.Description);
        _session.Value.Execute(bound);
      }
    
      public static IEnumerable<Myth> GetAllMyths()
      {
        return _session.Value.Execute("SELECT * FROM allmyths")
          .Select(r => new Myth()
          {
            Name = (string)r["name"],
            Description = (string)r["description"]
          });
      }
    
      private static PreparedStatement PrepareInsertStatement()
      {
        return _session.Value.Prepare(
          "INSERT INTO allmyths(name, description) VALUES (?, ?)");
      }
    
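      private static ISession InitSession()
      {
        // Discover the Cassandra nodes through the role environment and use
        // their addresses as contact points for the driver.
        var cassandraRole = RoleEnvironment.Roles["CassandraNode"];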
        var cassandraHosts = cassandraRole.Instances.Select(
          i =>
            i.InstanceEndpoints.First().Value.IPEndpoint.Address).ToArray();
        var builder = Cluster.Builder()
          .AddContactPoints(cassandraHosts)
          .WithPort(9042)
          .WithDefaultKeyspace("myths");
        var cluster = builder.Build();
        var session = cluster.ConnectAndCreateDefaultKeyspaceIfNotExists();
        session.Execute(
          "CREATE TABLE IF NOT EXISTS allmyths(" +
          "name text PRIMARY KEY, description text)");
        return session;
      }
    }
    
    This code sets up a session with Cassandra (after discovering its nodes), creates a keyspace and table for the data, and inserts/retrieves data from there.
  8. Finally, add some UI. Since I'm not exactly the world's foremost expert on web UI, I'll be light on details here: add a web form called "AddMyth.aspx" with a couple of TextBox controls (NameText and DescriptionText) and a submit button with this simple logic:
    Myths.AddMyth(new Myth()
    {
      Name = NameText.Text,
      Description = DescriptionText.Text
    });
    Response.Redirect("Default.aspx");
    
    And then add a web form called "ShowMyths.aspx", with the following simple GridView:
    <asp:GridView id="AllMyths" runat="server"
      DataSourceID="MythsDataSource" />
    <asp:ObjectDataSource ID="MythsDataSource"
      runat="server"
      SelectMethod="GetAllMyths" TypeName="FrontEnd.Myths"
    />
    
    And then link to both forms from Default.aspx.
  9. If you're going to test it in the cloud, you need to make Cassandra a multi-node cluster. Either increase the instance count for the CassandraNode role in the properties UI for the Cloud configuration, or edit the ServiceConfiguration.Cloud.cscfg file directly:
    <Role name="CassandraNode">
      <Instances count="3" />
    

And we're done! You can launch this web site in the debugger on your local Compute emulator and things should work, and/or deploy it to the cloud and see Cassandra operating in Azure. If it doesn't work, you can set the Diagnostics level to "All information" in the properties page for the CassandraNode role, and check out the logs from Cassandra in the WADLogsTable in your storage account to see what went wrong.
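
If a node fails to start, it can also help to check what steps 3 and 4 left in the CassandraNode entry of ServiceDefinition.csdef. The fragment below is a minimal sketch, not the exact file Visual Studio generates. It assumes the standard service definition schema, omits everything that isn't related to those two steps, and simply mirrors the names, sizes and ports chosen above:

    <WorkerRole name="CassandraNode">
      <!-- Sketch only: other elements and attributes are omitted. -->
      <LocalResources>
        <!-- Step 3: the local storage resources read in OnStart() -->
        <LocalStorage name="InstallDir" sizeInMB="256" />
        <LocalStorage name="DataDir" sizeInMB="1024" />
      </LocalResources>
      <Endpoints>
        <!-- Step 4: internal TCP endpoints for Cassandra -->
        <InternalEndpoint name="Storage" protocol="tcp" port="7000" />
        <InternalEndpoint name="RPC" protocol="tcp" port="9160" />
        <InternalEndpoint name="NativeTransport" protocol="tcp" port="9042" />
      </Endpoints>
    </WorkerRole>

The resource names matter because the worker role code looks them up by name ("InstallDir" and "DataDir"), and the NativeTransport port has to match the 9042 that the FrontEnd code passes to WithPort.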

So... what other flavors have you got?

Hope this gave you a taste of how to run Cassandra as a cloud service in Azure using these packages (though I have to emphasize again: this is experimental, and if you want a supported way, please just install a supported distribution of Cassandra on Azure VMs). Other than Cassandra, I also have packages to help deploy Apache Kafka and Apache ZooKeeper. I'm also hoping to start putting other packages in there, as well as testing and making sure this approach is resilient to all the failure modes that happen in the cloud. If this all sounds particularly interesting to you, I'd love to hear your thoughts, so please leave a comment on this post.