Using the Data Science VM within your classes

The Microsoft Data Science Virtual Machine (DSVM) and Deep Learning Virtual Machine (DLVM) are customized VM images on Microsoft’s Azure cloud, built specifically for doing data science. They have many popular data science and other tools pre-installed and pre-configured to jump-start building intelligent applications for advanced analytics. They are available on Windows Server and Linux.

We offer Windows editions of the DSVM on Windows Server 2016 and Windows Server 2012.
We offer Linux editions of the DSVM on Ubuntu 16.04 LTS and CentOS 7.4.

In most scenarios institutions deploy a specific number of virtual machines and give students SSH logins to the infrastructure to undertake data science activities. As a rule of thumb, I advise 20 students per NC6 VM.
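To make the sizing rule concrete, here is a minimal sketch of the arithmetic. The function name `vms_needed` and the sample class sizes are purely illustrative; the only figure taken from the guidance above is 20 students per NC6 VM, and you may want a different ratio for heavier workloads.

```python
import math

# Rule of thumb from the text above: about 20 students share one NC6 VM.
STUDENTS_PER_VM = 20


def vms_needed(num_students: int, students_per_vm: int = STUDENTS_PER_VM) -> int:
    """Return how many VMs to deploy so no VM exceeds the student ratio."""
    if num_students < 0:
        raise ValueError("num_students must not be negative")
    if students_per_vm <= 0:
        raise ValueError("students_per_vm must be positive")
    # Round up: a partly filled VM still has to be deployed.
    return math.ceil(num_students / students_per_vm)


if __name__ == "__main__":
    for class_size in (15, 20, 45, 120):
        print(f"{class_size} students -> {vms_needed(class_size)} VM(s)")
```

Rounding up means a class of 45 gets three VMs rather than two overloaded ones.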

On an Azure virtual machine (VM), including the Data Science Virtual Machine (DSVM), you create local user accounts while provisioning the VM. Users then authenticate to the VM by using these credentials. If you have multiple VMs that you need to access, this approach can quickly become cumbersome, because you have to manage a separate set of credentials for each one. Common user accounts and management through a standards-based identity provider let you use a single set of credentials to access multiple resources on Azure, including multiple DSVMs. Most universities now have some form of Active Directory support, for example through Office 365, so you can use Azure Active Directory (Azure AD) or on-premises Active Directory to authenticate users on a standalone DSVM or a cluster of DSVMs in an Azure virtual machine scale set. You do this by joining the DSVM instances to an Active Directory domain. See https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-common-identity

Building custom classroom/Lab VM images

You can also do some really neat things, such as developing and deploying extensions for the Data Science VMs; see https://github.com/Azure/DataScienceVM/tree/master/Extensions

Azure Resource Manager (ARM) templates provide the capability to define extensions for resources such as virtual machines. VM extensions are scripts that run during the deployment of a VM to install additional software, reconfigure the VM to your needs, or comply with specific IT policies your institution may have.

These extensions currently include

DSVM for Linux (Ubuntu) with Intel's BigDL deep learning framework extension
Azure IoT Edge on Data Science Virtual Machine
Fast.AI extension for Azure Data Science Virtual Machine

Setting up DSVM

There are a lot more details and examples on DSVM setup in education at https://aka.ms/faculty and https://blogs.msdn.microsoft.com/uk_faculty_connection/?s=Data+Science, and you can read more at https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/

Here is a quick overview of the DSVM and DLVM images, which are available as templates. For more details see /en-us/azure/machine-learning/data-science-virtual-machine/overview

Some Top Tips for disk space

Ubuntu Linux (the VM image comes with two disks):

OS disk: 50 GB.
Second disk: 100 GB (mounted on /data).

Free space: about 20 GB on the OS disk, and there should still be about 64 GB free on the second disk.

Windows 2016 DSVM:
One OS disk only, which is 127 GB.
There should be about 40 GB free.
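The free-space figures above are approximate and will shrink as students install packages and download data, so it can be worth checking them on a running VM. The following is a minimal sketch using only the Python standard library. The helper name `free_space_gib` is illustrative, and the paths `/` and `/data` assume the Ubuntu DSVM layout described above (on Windows you would pass a drive such as `C:\\` instead).

```python
import os
import shutil
from typing import Optional

GIB = 1024 ** 3  # bytes per gibibyte


def free_space_gib(path: str) -> Optional[float]:
    """Return the free space in GiB on the disk holding `path`.

    Returns None if the path does not exist (for example, /data on an
    image that has no second disk).
    """
    if not os.path.exists(path):
        return None
    return shutil.disk_usage(path).free / GIB


if __name__ == "__main__":
    # Assumed mount points for the Ubuntu DSVM: OS disk and second disk.
    for mount_point in ("/", "/data"):
        free = free_space_gib(mount_point)
        if free is None:
            print(f"{mount_point}: not present on this machine")
        else:
            print(f"{mount_point}: {free:.1f} GiB free")
```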

Regarding storage on the VM, there are a few options, both on the Linux DSVM itself and via mounted storage. Most VM types also have a temporary local SSD that is not persistent across reboots. It is a good place for scratch space and is often much faster than the OS disk, which is typically backed by an external blob (in the same data center).

For working with large datasets you need very large extra disks, such as a 1 TB blob-backed disk. For large datasets (>15 GB) you might need to run a few extra steps to make this work on a Linux DSVM. See /en-us/azure/virtual-machines/linux/add-disk#connect-to-the-linux-vm-to-mount-the-new-disk

On an NC6 (v1) you can only attach an HDD. On NC6 v2 and v3 you can attach an SSD.

Another option, if you just need a large amount of space, is standard Azure Files or Blobs, which may be good enough. Here is the tip in our GitHub repo for mounting storage to the DSVM: https://github.com/Azure/DataScienceVM/blob/master/Tips/MountStorageonLinuxDSVM.md

Lastly, you also have a very large local non-persistent SSD. You can copy large datasets there and work with them directly from the local SSD. The only thing to note is that the data may not survive reboots, so you have to re-download the files to the local SSD after a reboot, or keep a backup in blob storage or elsewhere. This is the fastest disk you have on the VM, since it is both local and an SSD.

Adding a public DNS Name to your Virtual Machine

Once you have the machines set up, you can assign a static DNS name by following these instructions: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/portal-create-fqdn

To find out what the DNS names of the servers are, see https://docs.microsoft.com/en-us/azure/dns/dns-getstarted-cli

You can run the following command via the Azure CLI or Cloud Shell. It gives a nicely formatted table; the key thing is that you need to know the resource group name and the machine name.

az vm show -g ResourceGroup -n MachineName -d -o table

Example

PS Azure:\> az vm show -g datajam -n lsds11 -d -otable

Name    ResourceGroup    PowerState      PublicIps    Fqdns                              Location    Zones
------  ---------------  --------------  -----------  ---------------------------------  ----------  -------
lsds11  datajam          VM deallocated               lsds11.uksouth.cloudapp.azure.com  uksouth
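If you manage many class VMs, you may want to collect these values in a script rather than read the table by eye. The sketch below is one possible approach, not an official tool: it asks the same command for JSON output (`-o json`) and extracts a few fields. The helper names `summarize_vm` and `show_vm` are hypothetical, and the JSON key names (`powerState`, `publicIps`, `fqdns`) are assumed to mirror the table columns shown above, so check them against the output of your own CLI version. `show_vm` also assumes `az` is on the PATH and that you are already logged in.

```python
import json
import subprocess


def summarize_vm(details_json: str) -> dict:
    """Pick out the DNS-related fields from `az vm show -d -o json` output."""
    details = json.loads(details_json)
    return {
        "name": details.get("name"),
        "power_state": details.get("powerState"),
        # Empty strings (e.g. no public IP while deallocated) become None.
        "public_ips": details.get("publicIps") or None,
        "fqdn": details.get("fqdns") or None,
    }


def show_vm(resource_group: str, vm_name: str) -> dict:
    """Run the Azure CLI for one VM and summarize the result.

    Assumes the `az` executable is installed and authenticated.
    """
    result = subprocess.run(
        ["az", "vm", "show", "-g", resource_group, "-n", vm_name,
         "-d", "-o", "json"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the CLI reports a failure
    )
    return summarize_vm(result.stdout)


if __name__ == "__main__":
    # Resource group and VM name taken from the example above.
    print(show_vm("datajam", "lsds11"))
```

Keeping the parsing in `summarize_vm` separate from the CLI call makes it possible to test the field handling against saved JSON without touching Azure.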

Giving Access to DSVM via a Jupyter Notebook

You can also simply let your students use Azure Notebooks (https://notebooks.azure.com) to build their Jupyter notebook experiments and then run those experiments on the Data Science VMs you have built. If you have an Office 365 account provided by your school or university, Azure Notebooks will discover the DSVMs that you have access to and list them in a menu under the Run button. Choose the DSVM that you would like to connect to and enter the user name and password of the Linux account on the DSVM.

See https://blogs.msdn.microsoft.com/uk_faculty_connection/2018/12/10/microsoft-azure-notebooks-and-additional-compute-capacity-via-connecting-to-data-science-vms/

Similarly, if you are using the Google Colab service, you can also easily access DSVM resources; see https://github.com/Azure/DataScienceVM/blob/master/Tips/ConnectGoogleColabToDSVM.md