There is good reason to be worried about the "Internet of Things" on current course and trajectory. Both the IT industry as well as manufacturers of "smart products" seem to look at connected special-purpose devices and sensors as a mere variation of the information technology assets like servers, PCs, tablets, or phones. That stance is problematic as it neglects important differences between the kinds of interactions that we're having with a phone or PC, and the interactions we're having with a special-purpose devices like a gas valve, a water heater, a glass-break sensor, a vehicle immobilizer, or a key fob.
Before I get to a proposal for how to address the differences, let's take a look at the state of things on the Web and elsewhere.
PCs, phones, and tablets are primarily interactive information devices. Phones and tablets are explicitly optimized around maximizing battery lifetime, and they preferably turn off partially when not immediately interacting with a person, or when not providing services like playing music or guiding their owner to a particular location. From a systems perspective, these information technology devices are largely acting as proxies towards people. They are "people actuators" suggesting actions and "people sensors" collecting input.
People can, for the most part, tell when something is grossly silly and/or could even put them into a dangerous situation. Even though there is precedent of someone driving off a cliff when told to do so by their navigation system, those cases are the rarest exceptions.
Their role as information gathering devices allowing people to browse the Web and to use a broad variety of services, requires these devices to be "promiscuous" towards network services. The design of the Web, our key information tool, centers on aggregating, combining, and cross referencing information from a myriad of different systems. As a result, the Web's foundation for secure communication is aligned with the goal of this architecture. At the transport protocol level, Web security largely focuses on providing confidentiality and integrity for fairly short-lived connections.
User authentication and authorization are layered on top, mostly at the application layer. The basic transport layer security model, including server authentication, builds on a notion of federated trust anchored in everyone (implicitly and largely involuntarily) trusting in a dozen handfuls of certification authorities (CA) chosen by their favorite operating system or browser vendor. If one of those CAs deems an organization trustworthy, it can issue a certificate that will then be used to facilitate secure connections, also meaning to express an assurance to the user that they are indeed talking to the site they expect to be talking to. To that end, the certificate can be inspected by the user. If they know and care where to look.
This federated trust system is not without issues. First, if the signing key of one of the certification authorities were to be compromised, potentially undetected, whoever is in possession of the key can now make technically authentic and yet forged certificates and use those to intercept and log communication that is meant to be protected. Second, the system is fairly corrupt as it takes all of $3 per year to buy a certification authority's trust with minimal documentation requirements. Third, the vast majority of users have no idea that this system even exists.
Yet, it all somehow works out halfway acceptably, because people do, for the most part, have common sense enough to know when something's not quite right, and it takes quite a bit of work to trick people into scams in huge numbers. You will trap a few victims, but not very many and not for very long. The system is flawed and some people get tricked, but that can also happen at the street corner. Ultimately, the worst that can happen – without any intent to belittle the consequences – is that people get separated from some of their money, or their identities get abused until the situation is corrected by intervention and, often, some insurance steps in to rectify these not entirely unexpected damages.
Special-purpose devices, from simple temperature sensors to complex factory production lines with thousands of components inside them are different. The devices are much more scoped in purpose and even if they may provide some level of a people interface, they're largely scoped to interfacing with assets in the physical world. They measure and report environmental circumstances, turn valves, control servos, sound alarms, switch lights, and do many other tasks. They help doing work for which an information device is either too generic, too expensive, too big, or too brittle.
If something goes wrong with automated or remote controllable devices that can influence the physical world, buildings may burn down and people may die. That's a different class of damage than someone maxing out a stolen credit-card's limit. The security bar for commands that make things move, and also for sensor data that eventually results in commands that cause things to move, ought to be, arguably, higher than in an e-commerce or banking scenario.
What doesn't help on the security front is that machines, unlike most people, don't have a ton of common sense. A device that goes about its day in its programmed and scheduled ways has no notion of figuring when something it not quite right. If you can trick a device into talking to a malicious server or intermediary, or into following a network protocol redirection to one, it'll dutifully continue doing its work unless it's explicitly told to never do so.
Herein lies one of the challenges. A lot of today's network programming stacks and Web protocols are geared towards the information-oriented Web and excellently enable building promiscuous clients by default. In fact, the whole notion of REST rests on the assumption that the discovery and traversal of resources is performed though hypertext links included in the returned data. As the Web stacks are geared towards that model, there is extra work required to make a Web client faithful to a particular service and to validate, for instance, the signature thumbnail of the TLS certificate returned by the permitted servers. As long as you get to interact with the web stack directly, that's usually okay, but the more magic libraries you use on top of the Web stack basics, the harder that might get. And you have, of course, and not to be underestimated in complexity, to teach the device the right thumbnail(s) and thus effectively manage and distribute an allow-list.
Generally, device operators will not want to allow unobserved and non-interactive devices that emit telemetry and receive remote commands to be able to stray from a very well-defined set of services they're peered with. They should not be promiscuous. Quite the opposite.
Now – if the design goal is to peer a device with a particular service, the federated certificate circus turns into more of a burden than being a desired protocol-suite feature. As the basic assumptions about promiscuity towards services are turned on their head, the 3-6 KByte and 2 network roundtrips of certificate exchange chatter slow things down and also may cost quite a bit of real money paying for precious, metered wireless data volume. Even though everyone currently seems to assume Transport Layer Security (TLS) being the only secure channel protocol we'll ever need, it's far from being ideal for the 'faithful' connected devices scenario.
If you allow me to take you into the protocol basement for a second: That may be somewhat different if we could seed clients with TLS RFC5077 session resumption tickets in an out-of-band fashion, and have a TLS mode that never falls back to certs. Alas, we do not.
Connected and non-interactive devices not only differ in terms of the depth of their relationship with backend services, they also differ very much in terms of the interaction patterns with these services when compared to information-centric devices. I generally classify the interaction patterns for special-purpose devices into the categories Telemetry, Inquiries, Commands, and Notifications.
While Telemetry and Inquiries are device-initiated, their mirrored pattern counterparts, Command and Notifications, are service-initiated – which means that there must be a network path for messages to flow from the service to the device and that requirement bubbles up a set of important technical questions:
Most current approaches that I'm running into are trying to answer the basic addressing question with traditional network techniques. That means that the device either gets a public network address or it is made part of a virtual network and then listens for incoming traffic using that address, acting like a server. For using public addresses the available options are to give the device a proper public IPv4 or IPv6 address or to map it uniquely to a well-known port on a network address translation (NAT) gateway that has a public address. As the available pool of IPv4 addresses has been exhausted and network operators are increasingly under pressure to move towards providing subscribers with IPv6 addresses, there's hope that every device could eventually have its very own routable IPv6 address. The virtual network approach is somewhat similar, but relies on the device first connecting to some virtual network gateway via the underlying native network, and then getting an address assigned within the scope of the virtual network, which it shares with the control system that will use the virtual network address to get to the device.
Both of those approaches are reasonable from the perspective of answering the first, basic addressing question raised above, and if you pretend for a moment that opening inbound ports through a residential edge firewall is acceptable. However, things get tricky enough once we start considering the other questions, like devices not being in the house, but on the road.
Roaming is tricky for addressing and even trickier if the device is switching networks or even fully mobile and thus hopping through networks and occasionally dropping connections as it gets out of radio range. While there are "Mobile IP" roaming standards for both IPv4 (RFC3344) and IPv6 (RFC6275), but those standards rely on a notion of traffic relaying through agents and those are problematic at scale with very large device populations as the relay will have to manage and relay traffic for very many routes and also needs to keep track of the devices hopping foreign networks. Relaying obviously also has significant latency implications with global roaming. What even the best implementations of these standards-based approaches for roaming can't solve is that you can't connect to a device that's outside of radio coverage and therefore not connected, at all.
The very same applies to the challenge of how to reliably deliver commands and notifications to power-constrained devices. Those devices may need to survive on battery power for extended periods (in some cases for years) between battery recharges, or their external power source, like "power stealing" circuits employed in home building automation devices, may not yield sufficient power for sustained radio connectivity to a base station. Even a vehicle battery isn't going to like powering an always-on radio when parked in the long-term airport garage while you're on vacation for 2 weeks.
So if a device design aims to conserve power by only running the radio occasionally or if the device is mobile and frequently in and out of radio coverage or hopping networks, it gets increasingly difficult to reach it naively by opening a network connection to it and then hoping for that to remain stable if you're lucky enough to catch a moment when the device is indeed ready to talk. That's all even assuming that the device were indeed having a stable network address provided by one of the cited "Mobile IP" standards, or the device was registering with an address registration/lookup service every time it comes online with a new address so that the control service can locate it.
All these approaches aiming to provide end-to-end network routes between devices and their control services are almost necessarily brittle. As it tries to execute a command, the service needs to locate the device, establish a connection to it, issue the command and collect the command feedback all while, say, a vehicle drives through a series of tunnels. Not only does this model rely on the device being online and available at the required moment, it also introduces a high number of tricky-to-diagnose failure points (such as the device flipping networks right after the service resolved its address) with associated security implications (who gets that newly orphaned address next?), it also has inherent reliability issues at the application layer since all faults that occur after the control system has sent the command, do introduce doubt in the control system on whether the command could be successfully executed; and not all commands are safe to just blindly retry, especially when they have physical consequences.
For stationary power constrained or wirelessly connected devices, the common approach to bridging the last meters/yards is a hub device that's wired to the main network and can bridge to the devices that live on a local network. The WLAN hub(s) in many homes and buildings are examples of this as there is obviously a need to bridge between devices roaming around house and the ISP network. From an addressing perspective, these hubs don't change the general challenge much as they themselves need to be addressable for commands they then ought to forward to the targeted device and that means you're still opening up a hole in the residential firewall, either by explicit configuration or via (don't do this) UPnP.
If all this isn't yet challenging enough for your taste, there's still security. Sadly, we can't have nice and simple things without someone trying to exploit them for malice or stupid "fun".
All information that's being received from and sent to a device must be trustworthy if anything depends on that information – and why would you send it otherwise? "Trustworthy communication" means that information is of verifiable origin, correct, unaltered, timely, and cannot be abused by unauthorized parties in any fashion. Even telemetry from a simple sensor that reports a room's temperature every five minutes can't be left unsecured. If you have a control system reacting on that input or do anything else with that data, the device and the communication paths from and to it must be trustworthy.
"Why would anyone hack temperature sensors?" – sometimes "because they can", sometimes because they want to inflict monetary harm on the operator or physical harm on the facility and what's in it. Neglecting to protect even one communication path in a system opens it up for manipulation and consequential harm.
If you want to believe in the often-cited projection of 50 billion connected devices by 2020, the vast majority of those will not be classic information devices, and they will not be $500 or even $200 gadgets. Very many of these connected devices will rather be common consumer or industry goods that have been enriched with digital service capabilities. Or they might even just be super inexpensive sensors hung off the side of buildings to collect environmental information. Unlike apps on information devices, most of these services will have auxiliary functions. Some of these capabilities may be even be largely invisible. If you have a device with built-in telemetry delivery that allows the manufacturer or service provider to sense an oncoming failure and proactively get in touch with you for service – which is something manufacturers plan to do – and then the device just never breaks, you may even not know such a capability exists, especially if the device doesn't rely on connectivity through your own network. In most cases, these digital services will have to be priced into the purchase price of the product or even be monetized through companion apps and services as it seems unlikely that consumers will pay for 20 different monthly subscriptions connected appliances. It's also reasonable to expect that many devices sold will have the ability to connect, but their users will never intentionally take advantage these features.
On the cost side, a necessary result from all this is that the logic built into many products will (continue to) use microcontrollers that require little power, have small footprint, and are significantly less expensive than the high-powered processors and ample memory in today's information devices – trading compute power for much reduced cost. But trading compute power and memory for cost savings also means trading cryptographic capability and more generally resilience against potential attacks away.
The horror-story meme "if you're deep into the forest nobody will hear your screams" is perfectly applicable to unobserved field-deployed devices under attack. If a device were to listen for unsolicited traffic, meaning it listens for incoming TCP connections or UDP datagrams or some form of UDP-datagram based sessions and thus acting as server, it would have to accept and then triage those connection attempts into legitimate and illegitimate ones.
With TCP, even enticing the device to accept a connection is already a very fine attack vector, because a TCP connection burns memory in form of a receive buffer. So if the device were to use a network protocol circuit like, for instance, the WizNet W5100 used one the popular enthusiast tinker platform Arduino Ethernet, the device's communication capability is saturated at just 4 connections, which an attacker could then service in a slow byte-per-packet fashion and thus effectively take the device out. As that happens, the device now also wouldn't have a path to scream for help through, unless it made – assuming the circuit supports it – an a priori reservation of resources for an outbound connection to whoever plays the cavalry.
If we were to leave the TCP-based resource exhaustion vector out of the picture, the next hurdle is to establish a secure baseline over the connection and then triaging connections into good and bad. As the protocol world stands, TLS (RFC5246) and DTLS (RFC6347) are the kings of the security protocol hill and I've discussed the issues with their inherent client promiscuity assumption above. If we were indeed connecting from a control service to a device in an outbound fashion, and the device were to act as server, the model may be somewhat suitable as the control service will indeed have to speak to very many and potentially millions of devices. But contrary to the Web model where the browser has no idea where the user will send it, the control system has a very firm notion of the devices it wants to speak to. There are many of those, but there no promiscuity going on. If they play server, each device needs to have its own PKI certificate (there is a specified option to use TLS without certificates, but that does not matter much in practice) with their own private key since they're acting as servers and since you can't leak shared private keys into untrusted physical space, which is where most of the devices will end up living.
The strategy of using the standard TLS model and having the device play server has a number of consequences. First, whoever provisions the devices will have to be a root or intermediate PKI certification authority. That's easy to do, unless there were any need to tie into the grand PKI trust federation of today's Web, which is largely anchored in the root certificate store contents of today's dominant client platforms. If you had the notion that "Internet of Things" were to mean that every device could be a web server to everyone, you would have to buy yourself into the elite circle of intermediate CA authorities by purchasing the necessarily signing certificates or services from a trusted CA and that may end up being fairly expensive as the oligopoly is protective of their revenues. Second, those certificates need to be renewed and the renewed ones need to be distributed securely. And when devices get stolen or compromised or the customer opts out of the service, these certificates also need to get revoked and that revocation service needs to be managed and run and will have to be consulted quite a bit.
Also, the standard configuration of most application protocol stacks' usage of TLS tie into DNS for certificate validation, and it's not obvious that DNS is the best choice for associating name and network address for devices that rapidly hop networks when roaming – unless of course you had a stable "home network" address as per the IPv6 Mobile IP. But that would mean you are now running an IPv6 Mobile relay. The alternative is to validate the certificate by some other means, but then you'll be using a different validation criterion in the certificate subject and will no longer be aligned with the grand PKI trust federation model. Thus, you'll be are back to effectively managing an isolated PKI infrastructure, with all the bells and whistles like a revocation service, and you will do so while you're looking for the exact opposite of the promiscuous security session model all that enables.
Let's still assume none of that would matter and (D)TLS with PKI dragged in its wake were okay and the device could use those and indeed act as a server accepting inbound connections. Then we're still faced with the fact that cryptography computation is not cheap. Moving crypto into hardware is very possible, but impacts the device cost. Doing crypto in software requires that the device deals with it inside of the application or underlying frameworks. And for a microcontroller that costs a few dollars that's non-negligible work. So the next vector to keep the device from doing its actual work is to keep it busy with crypto. Present it with untrusted or falsely signed client certificates (if it were to expect those). Create a TLS link (even IPSec) and abandon it right after the handshake. Nice ways to burn some Watts.
Let's still pretend none of this were a problem. We're now up at the application level with transport layer security underneath. Who is authorized to talk to the device and which of the connections that pop up through that transport layer are legitimate? And if there is an illegitimate connection attempt, where do you log these and if that happens a thousand times a minute, where do you hold the log and how do you even scream for help if you're pegged on compute by crypto? Are you keeping an account store in the device? Quite certainly not in a system whose scope is more than one device. Are you then relying on an external authentication and authorization authority issuing authorization tokens? That's more likely, but then you're already running a token server.
The truth, however inconvenient, is that non-interactive special-purpose devices residing in untrusted physical spaces are, without getting external help from services, essentially indefensible as when acting as network servers. And this is all just on top of the basic fact that devices that live in untrusted physical space are generally susceptible to physical exploitation and that protecting secrets like key material is generally difficult.
Here's the recipe to eradicate most of the mess I've laid out so far: Devices don't actively listen on the network for inbound connections. Devices act as clients. Mostly.
What I've discussed so far are considerations around the Network and Transport layers (RFC1122, 1.1.3) as I'm making a few general assumptions about connectivity between devices and control and telemetry collections systems, as well as about the connectivity between devices when they're talking in a peer-to-peer fashion.
First, I have so far assumed that devices talk to other systems and devices through a routable (inter-)network infrastructure whose scope goes beyond a single Ethernet hub, WLAN hotspot, Bluetooth PAN, or cellular network tower. Therefore I am also assuming the usage of the only viable routable network protocol suite and that is the Internet Protocol (v4 and v6) and with that the common overlaid transport protocols UDP and TCP.
Second, I have so far assumed that the devices establish a transport-level and then also application-level network relationship with their communication peers, meaning that the device commits resources to accepting, preprocessing, and then maintaining the connection or relationship. That is specifically true for TCP connections (and anything riding on top of it), but is also true for Network-level links like IPSec and session-inducing protocols overlaid over UDP, such as setting up agreements to secure subsequent datagrams as with DTLS.
The reason for assuming a standards-based Network and Transport protocol layer is that everything at the Link Layer (including physical bits on wire or through space) is quite the zoo, and one that I see growing rather than shrinking. The Link Layer will likely continue to be a space of massive proprietary innovation around creative use of radio frequencies, even beyond what we've seen in cellular network technology where bandwidth from basic GSM's 9.6Kbit/s to today's 100+ MBit/s on LTE in the last 25 years. There are initiatives to leverage new "white space" spectrums opened up by the shutdown of Analog TV, and there are services leveraging ISM frequency bands, and there might be well-funded contenders for licensed spectrum emerging that use wholly new stacks. There is also plenty of action on the short-range radio front, specifically also around suitable protocols for ultra-low power devices. And there are obviously also many "wired" transport options over fiber and copper that have made significant progress and will continue to do so and are essential for device scenarios, often in conjunction with a short-range radio hop for the last few meters/yards. Just as much as it was a losing gamble to specifically bet on TokenRing or ARCnet over Ethernet in the early days of Local Area Networking, it isn't yet clear what to bet on in terms of protocols and communication service infrastructures as the winners for the "Internet of Things", not even today's mobile network operators.
Betting on a particular link technology for inter-device communication is obviously reasonable for many scenarios where the network is naturally scoped by physical means like reach by ways of radio frequency and transmission power, the devices are homogeneous and follow a common and often regulation-imposed standard, and latency requirements are very narrow, bandwidth requirements are very high, or there is no tolerance for failure of intermediaries. Examples for this are in-house device networks for home automation and security, emerging standards for Vehicle-To-Vehicle (V2V) and Vehicle-To-Infrastructure (V2I) communication, or Automatic Dependent Surveillance (ADS, mostly ADS-B) in Air Traffic Control. Those digital radio protocols essentially form peer meshes where everyone listens to everything in range and filters out what they find interesting or addressed specifically at them. And if the use of the frequencies gets particularly busy, coordinated protocols impose time slices on senders.
What such link-layer or direct radio information transfers have generally struggled with is trustworthiness – allow me to repeat: verifiable origin, correct, unaltered, timely, and cannot be abused by unauthorized parties in any fashion.
Of course, by its nature, all radio based communication is vulnerable to jamming and spoofing, which has a grand colorful military history as an offensive or defensive electronic warfare measure along with fitting countermeasures (ECM) and even counter-countermeasures (ECCM). Radio is also, especially when used in an uncoordinated fashion, subject to unintended interference and therefore distortion.
ADS-B, which is meant to replace radar in Air Traffic Control doesn't even have any security features in its protocol. The stance of the FAA is that they will detect spoofing by triangulation of the signals, meaning they can tell whether a plane that say it's at a particular position is actually there. We should assume they have done their ECM and ECCM homework.
IEEE 1609 for Wireless Access in Vehicular Environments that's aiming to facilitate ad-hoc V2V and V2I communication, spells out an elaborate scheme to manage and use and roll X.509 certificates, but relies on the broad distribution of certificate revocation lists to ban once-issued certificates from the system. Vehicles are sold, have the telematics units replaced due to malfunction or crash damage, may be tampered with, or might be stolen. I can see the PKI's generally overly optimistic stance on revocations being challenging at the scale of tens if not hundreds of million vehicles, where churn will be very significant. The Online Certificate Status Protocol (OCSP, RFC6960) might help IEEE 1609 deal with the looming CRL caching issues due to size, but then requires very scalable validation server infrastructure that needs to be reachable whenever two vehicles want to talk, which is also not acceptable.
Local radio link protocols such as Bluetooth, WLAN (802.11x with 802.11i/WPA2-PSK), or Zigbee often assume that participants in a local link network share a common secret, and can keep that secret secret. If the secret leaks, all participants need to be rolled over to a new key. IEEE 802.1X, which is the foundation for the RADIUS Authentication and Authorization of participants in a network, and the basis of "WPA2 Enterprise" offers a way out of the dilemma of either having to rely on a federated trust scheme that has a hard time dealing with revocations of trust at scale, or on brittle pre-shared keys. 802.1X introduces the notion of an Authentication (and Authorization) server, which is a neutral third party that makes decisions about who gets to access the network.
Unfortunately, many local radio link protocols are not only weak at managing access, they also have a broad history of having weak traffic protection. WLAN's issues got largely cleaned up with WPA2, but there are plenty of examples across radio link protocols where the broken WEP model or equivalent schemes are in active use, or the picture is even worse. Regarding the inherent security of cellular network link-level protection, it ought to be sufficient to look at the recent scolding of politicians in Europe for their absent-mindedness to use regular GSM/UMTS phones without extra protection measures – and the seemingly obvious result of dead-easy eavesdropping by foreign intelligence services. Ironically, mobile operators make some handsome revenue by selling "private access points" (private APNs) that terminate cellular device data traffic in a VPN and that the customer then tunnels into across the hostile Internet to meet the devices on this fenced-off network, somehow pretending that the mobile network somehow isn't just another operator-managed public network and therefore more trustworthy.
Link-layer protection mechanisms are largely only suitable for keeping unauthorized local participants (i.e. intruders) from getting link-layer data frames up to any higher-level network logic. In link-layer-scoped peer-to-peer network environments, the line between link-layer data frames and what's being propagated to the application is largely blurred, but the previous observation stays true. Even if employed, link-layer security mechanisms are not much help on providing security on the network and transport layers, as many companies are learning the hard way when worms and other exploits sweep through the inside of their triply-firewalled, WPA2 protected, TPM-tied-IPSec-protected networks, or as travelers can learn when they don't have a local firewall up on their machine or use plaintext communication when connecting to the public network at a café, airport, or hotel.
Of course, the insight of public networks not being trustworthy has led many companies interconnecting sites and devices down the path of using virtual private network (VPN) technology. VPN technology, especially when coming in the form of a shiny appliance, makes it very easy to put a network tunnel terminator on either end of a communication path made up of a chain of untrustworthy links and networks. The terminator on either end conveniently surfaces up as a link-layer network adapter. VPN can fuse multiple sites into a single link-layer network and it is a fantastic technology for that. But like all the other technologies I discussed above, link-layer protection is a zoning mechanism, the security mechanisms that matter to protect digital assets and devices sit at the layers above it. There is no "S" for Security in "VPN". VPN has secure virtual network cables, it doesn't make the virtual hub more secure that they plug into. Also, in the context of small devices as discussed above, VPN is effectively a non-starter due to its complexity.
What none of these link-layer protection mechanisms help with, including VPN, is to establish any notion of authentication and authorization beyond their immediate scope. A network application that sits on the other end of a TCP socket, where a portion of the route is facilitated by any of these link layer mechanisms, is and must be oblivious to their existence. What matters for the trustworthiness of the information that travels from the logic on the device to a remote control system not residing on the same network, as well as for commands that travel back up to the device, is solely a fully protected end-to-end communication path spanning networks, where the identity of the parties is established at the application layer, and nothing else. The protection of the route at the transport layer by ways of signature and encryption is established as a service for the application layer either after the application has given its permission (e.g. certificate validation hooks) or just before the application layer performs an authorization handshake, prior entering into any conversations. Establishing end-to-end trust is the job of application infrastructure and services, not of networks.
The findings from this discussion so far can be summarized in a few points:
The conclusion I draw from these findings is an approach I call "Service Assisted Communication" (SAC). I'm not at all claiming the principles and techniques being an invention, as most are already broadly implemented and used. But I do believe there is value in putting them together here and to give them a name so that they can be effectively juxtaposed with the approaches I've discussed above.
The goal of Service Assisted Communication is to establishing trustworthy and bi-directional communication paths between control systems and special-purpose devices that are deployed in untrusted physical space. To that end, the following principles are established:
The manifestation of these principles is the simple diagram on the right. Devices generally live in local networks with limited scope. Those networks are reasonably secured, with link-layer access control mechanisms, against intrusion to prevent low-level brute-force attacks such as flooding them with packets and, for that purpose, also employ traffic protection. The devices will obviously observe link-layer traffic in order to triage out solicited traffic, but they do not react to unsolicited connection attempts that would cause any sort of work or resource consumption from the network layer on up.
All connections to and from the device are made via or at least facilitated via a gateway, unless the device is peered with a single service, in which case that service takes on the role of the gateway. Eventual peer-to-peer connections are acceptable, but only if the gateway permits them and facilitates a secure handshake. The gateway that the device peers with may live on the local network and thus govern local connections. Towards external networks, the local gateway acts as a bridge towards the devices and is itself connected by the same set of principles discussed here, meaning it's acting like a device connected to an external gateway.
When the device connects to an external gateway, it does so by creating and maintaining an outbound TCP socket across a network address translation boundary (RFC2663), or by establishing a bi-directional UDP route, potentially utilizing the RFC5389 session traversal utilities for NAT, aka STUN. Even though I shouldn't have to, I will explicitly note that the WebSocket protocol (RFC6455) rides on top of TCP and gets its bi-directional flow capability from there. There's quite a bit of bizarre information on the Interwebs on how the WebSocket protocol somehow newly and uniquely enables bi-directional communication, which is obviously rubbish. What it does is to allow port-sharing, so that WebSocket aware protocols can share the standard HTTP/S ports 80 (RFC2616) and 443 (RFC2818) with regular web traffic and also piggyback on the respective firewall and proxy permissions for web traffic. The in-progress HTTP 2.0 specification will expand this capability further.
By only relying on outbound connectivity, the NAT/Firewall device at the edge of the local network will never have to be opened up for any unsolicited inbound traffic.
The outbound connection or route is maintained by either client or gateway in a fashion that intermediaries such as NATs will not drop it due to inactivity. That means that either side might send some form of a keep-alive packet periodically, or even better sends a payload packet periodically that then doubles as a keep-alive packet. Under most circumstances it will be preferable for the device to send keep-alive traffic as it is the originator of the connection or route and can and should react to a failure by establishing a new one.
As TCP connections are endpoint concepts, a connection will only be declared dead if the route is considered collapsed and the detection of this fact requires packet flow. A device and its gateway may therefore sit idle for quite a while believing that the route and connection is still intact before the lack of acknowledgement of the next packet confirms that assumption is incorrect. There is a tricky tradeoff decision to be made here. So-called carrier-grade NATs (or Large Scale NAT) employed by mobile network operators permit very long periods of connection inactivity and mobile devices that get direct IPv6 address allocations are not forced through a NAT, at all. The push notification mechanisms employed by all popular Smartphone platforms utilize this to dramatically reduce the power consumption of the devices by maintaining the route very infrequently, once every 20 minutes or more, and therefore being able to largely remain in sleep mode with most systems turned off while idly waiting for payload traffic. The downside of infrequent keep-alive traffic is that the time to detection of a bad route is, in the worst-case, as long as the keep-alive interval. Ultimately it's a tradeoff between battery-power and traffic-volume cost (on metered subscriptions) and acceptable latency for commands and notifications in case of failures. The device can obviously be proactive in detecting potential issues and abandon the connection and create a new one when, for instance, it hops to a different network or when it recovers from signal loss.
The connection from the device to the gateway is protected end-to-end and ignoring any underlying link-level protection measures. The gateway authenticates with the device and the device authenticates with the gateway, so neither is anonymous towards the other. In the simplest case, this can occur through the exchange of some proof of possession of a previously shared key. It can also happen via a (heavy) X.509 certificate exchange as performed by TLS, or a combination of a TLS handshake with server authentication where the device subsequently supplies credentials or an authorization token at the application level. The privacy and integrity protection of the route is also established end-to-end, ideally as a byproduct of the authentication handshake so that a potential attacker cannot waste cryptographic resources on either side without producing proof of authorization.
The current reality is that we don't have many if any serious alternatives to TLS/DTLS or SSH for securing this application-level connection or route today. TLS is far from being a perfect fit for many reasons I laid out here, and not least because of the weight in footprint and compute effort of the TLS stack which is too heavy for inexpensive circuitry. SSH is a reasonable alternative from the existing popular protocol suites, but suffers from lack of a standardized session resumption gesture. My hope is that we as an industry fix either of those to make it a better fit for the connected devices scenarios or we come up with something better. Here's a summary of criteria.
The result of the application-level handshake is a secure peer connection between the device and a gateway that only the gateway can feed. The gateway can, in turn, now provide one or even several different APIs and protocol surfaces, that can be translated to the primary bi-directional protocol used by the device. The gateway also provides the device with a stable address in form of an address projected onto the gateway's protocol surface and therefore also with location transparency and location hiding.
The device could only speak AMQP or MQTT or some proprietary protocol, and yet have a full HTTP/REST interface projection at the gateway, with the gateway taking care of the required translation and also of enrichment where responses from the device can be augmented with reference data, for instance. The device can connect from any context and can even switch contexts, yet its projection into the gateway and its address remains completely stable. The gateway can also be federated with external identity and authorization services, so that only callers acting on behalf of particular users or systems can invoke particular device functions. The gateway therefore provides basic network defense, API virtualization, and authorization services all combined into in one.
The gateway model gets even better when it includes or is based on an intermediary messaging infrastructure that provides a scalable queuing model for both ingress and egress traffic.
Without this intermediary infrastructure, the gateway approach would still suffer from the issue that devices must be online and available to receive commands and notifications when the control system sends them. With a per-device queue or per-device subscription on a publish/subscribe infrastructure, the control system can drop a command at any time, and the device can pick it up whenever it's online. If the queue provides time-to-live expiration alongside a dead-lettering mechanism for such expired messages, the control system can also know immediately when a message has not been picked up and processed by the device in the allotted time.
The queue also ensures that the device can never be overtaxed with commands or notifications. The device maintains one connection into the gateway and it fetches commands and notifications on its own schedule. Any backlog forms in the gateway and can be handled there accordingly. The gateway can start rejecting commands on the device's behalf if the backlog grows beyond a threshold or the cited expiration mechanism kicks in and the control system gets notified that the command cannot be processed at the moment.
On the ingress-side (from the gateway perspective) using a queue has the same kind of advantages for the backend systems. If devices are connected at scale and input from the devices comes in bursts or has significant spikes around certain hours of the day as with telematics systems in passenger cars during rush-hour, having the gateway deal with the traffic spikes is a great idea to keep the backend system robust. The ingestion queue also allows telemetry and other data to be held temporarily when the backend systems or their dependencies are taken down for service or suffer from service degradation of any kind. You can find more on the usage of brokered messaging infrastructures for these scenarios in a MSDN Magazine article I wrote a year back.
An "Internet of Things" where devices reside in unprotected physical space and where they can interact with the physical world is a very scary proposition if we solely rely on naïve link and network-level approaches to connectivity and security, which are the two deeply interwoven core aspects of the "I" in "IoT". Special-purpose devices don't benefit from constant human oversight as phones and tablets and PCs do, and we struggle even to keep those secure. We have to do a better job, as an industry, to keep the devices secure that we want to install in the world without constant supervision.
"Trustworthy communication" means that information exchanged between devices and control systems is of verifiable origin, correct, unaltered, timely, and cannot be abused by unauthorized parties in any fashion. Such trust cannot be established at scale without employing systems that are designed for the purpose and keep the "bad guys" out. If we want smarter devices around us that helping to improve our lives and are yet power efficient and affordable, we can't leave them alone in untrustworthy physical space taking care of their own defenses, because they won't be able to.
Does this mean that the refrigerator cannot talk to the laundry washing machine on the local network? Yes, that is precisely what that means. Aside from that idea being somewhat ludicrous, how else does the washing machine defend itself from a malicious refrigerator if not by a gateway that can. Devices that are unrelated and are not part of a deeply integrated system meet where they ought to meet: on the open Internet, not "behind the firewall".