How Cloudflare competes in FaaS

November 29, 2018

devgtm

How do you compete with AWS? Even if you set your sights a little lower in the cloud infrastructure space, do Azure, Google Cloud Platform or Aliyun look much easier? These are big companies with a lot of money, a lot of engineers and a lot of infrastructure. They also have a suite of existing cloud services, letting them act as a “one-stop shop” for customers.

To look at one specific slice, all of these providers (and others) have some kind of serverless offering in the Functions-As-A-Service space: Lambda, Cloud Functions and so on. This is a hot area that offers scale-to-zero, no management of the environment, and rapid development.

Back in March, Cloudflare broadly opened up their own FaaS-style offering, Workers. Cloudflare are no minnow: likely to IPO soon with a $3bn+ market cap, they’re a serious company that provides world-class CDN and proxy infrastructure. That said, they’re still much smaller than the firms they’re going up against.

When faced with a competitor that has more resources than you, you have to concentrate your efforts, ideally towards an area where the competition are weak. It’s a simple concept, but hard to do in practice, and I think Cloudflare have executed it well. What they focused on was latency.

Targeting the stack

In FaaS land, one of the most dreaded phrases is “cold start”. Due to the scale-to-zero nature of the product, there is not necessarily a server ready to take your request when it comes in.

The standard FaaS stack looks something like:

Business Logic -> App Framework -> App Server -> Host Environment -> Compute

The big platforms all had compute offerings developed while fighting over Platform-As-A-Service and Infrastructure-As-A-Service deals. That world has converged on containers, meaning the host environment is often a container running on top of general compute infrastructure.

Within those containers is usually some kind of app server, managed by the FaaS, that loads in the third party framework, and finally the code that the customer writes.

Keeping these containers spun up is relatively cheap, but not free, and the scale-to-zero promise of FaaS products is hard to fulfil economically. Each provider makes substantial investments in infrastructure code to make this cheaper and faster, such as Amazon’s recently open-sourced Firecracker.

The first request to a function requires instantiating the framework and business logic within an appropriate app server container, which takes some time. If another request comes in while the first one is going through a cold start, another instance is cold started, and so on. Once requests arrive frequently enough they can be served from existing hot instances, but the cold starts hit the tail latency in an unpleasant way.
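To make the cold start behaviour concrete, here is a toy simulation of the description above. It is only a sketch under my own assumptions: the names (`ColdStartConfig`, `simulateLatencies`), the fixed cold start and execution times, and the idle timeout are all hypothetical, and no real provider schedules instances this simply.

```typescript
// Hypothetical parameters for a toy FaaS platform (not taken from any real provider).
interface ColdStartConfig {
  coldStartMs: number;   // time to spin up a container and load framework + code
  execMs: number;        // time to run the business logic once the instance is hot
  idleTimeoutMs: number; // how long an idle instance stays warm before being reclaimed
}

// Returns the latency seen by each request, given arrival times in ascending order.
function simulateLatencies(arrivalsMs: number[], config: ColdStartConfig): number[] {
  // For each instance ever started, the time at which it next becomes idle.
  const instanceFreeAt: number[] = [];
  const latencies: number[] = [];

  for (const t of arrivalsMs) {
    // Look for an instance that is idle now and has not yet been reclaimed.
    let warmIndex = -1;
    for (let i = 0; i < instanceFreeAt.length; i++) {
      const idleFor = t - instanceFreeAt[i];
      if (idleFor >= 0 && idleFor <= config.idleTimeoutMs) {
        warmIndex = i;
        break;
      }
    }

    if (warmIndex >= 0) {
      // Hot path: reuse the instance and pay only the execution time.
      instanceFreeAt[warmIndex] = t + config.execMs;
      latencies.push(config.execMs);
    } else {
      // Cold path: every busy or reclaimed instance forces a fresh cold start.
      const latency = config.coldStartMs + config.execMs;
      instanceFreeAt.push(t + latency);
      latencies.push(latency);
    }
  }
  return latencies;
}

// Two requests arrive close together, then traffic picks up a second later.
const sampleLatencies = simulateLatencies([0, 100, 1000, 1010, 1100], {
  coldStartMs: 500,
  execMs: 20,
  idleTimeoutMs: 60000,
});
console.log(sampleLatencies); // → [520, 520, 20, 20, 20]
```

The first two requests both pay the cold start because the second arrives while the first instance is still starting. Those are the requests that show up in the tail of the latency distribution.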

Cloudflare’s Workers, in contrast, are based on V8’s Isolates, and so run without a container. This is an architectural choice that costs a fair bit of flexibility: the language available is JavaScript (and things that can compile to JavaScript). There’s no way to load in another binary, or extend to deploying containers directly.

But they are fast — there’s no container to be spun up, so they’re cheap to execute. There’s basically no cold start cost.

This is very hard for a big provider to replicate, as it would involve them sacrificing a lot of things that larger clients focus on. Those enterprise clients are the ones bringing large (often offline) compute and storage workloads to their platform, and the accompanying revenue.

Those sacrifices include things like the ability to load in native libraries, which opens up a whole new set of flexible options: frameworks like Zeit build on this to allow you to deploy any sort of language, even outside of the supported app servers. This, and the broader language support, is great for convincing a CIO with a team of server-side Java developers to try FaaS, as they can reuse a lot of their existing code.

In contrast, Cloudflare’s offering appeals primarily to web developers. Focusing on a specific target audience lets them craft the product towards that audience in a way that’s also hard to compete with — the basic design of Workers is modeled on Service Workers to feel familiar to folks coming from the client side.
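For anyone who hasn't written a Service Worker, the shape being borrowed looks roughly like the sketch below: a listener for a "fetch" event that answers with a response. The helper names (`greetingFor`, `handleRequest`, `FetchLikeEvent`) are mine and purely illustrative, and the registration is guarded so the file also loads in runtimes without a global `addEventListener`. Treat it as an approximation of the style, not as Cloudflare's reference code.

```typescript
// Hypothetical helper: derive a greeting from the request path.
function greetingFor(pathname: string): string {
  const name = pathname.replace(/^\/+/, "") || "world";
  return "Hello, " + name + "!";
}

// Build the response for a single incoming request.
function handleRequest(request: Request): Response {
  const pathname = new URL(request.url).pathname;
  return new Response(greetingFor(pathname), {
    status: 200,
    headers: { "content-type": "text/plain" },
  });
}

// Assumed shape of the event handed to "fetch" listeners in a
// Service-Worker-style runtime; declared here so the sketch is self-contained.
interface FetchLikeEvent {
  request: Request;
  respondWith(response: Response | Promise<Response>): void;
}

// Register the handler only where the runtime exposes a global addEventListener.
const scope = globalThis as any;
if (typeof scope.addEventListener === "function") {
  scope.addEventListener("fetch", (event: FetchLikeEvent) => {
    event.respondWith(handleRequest(event.request));
  });
}
```

There is no server to configure and no framework to load: the whole program is an event listener, which is exactly what a front-end developer already knows from the browser.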

Edge Latency

Coming back to the Function stack, the triggering source that kicks off the execution is an event. One of the most popular event sources for all providers is an HTTP request, often via an API gateway. All the cloud providers have one of these API gateways, but they also have message bus products and other event source generators.

Cloudflare just has HTTP triggers — again because of their focus on web developers, but also because of the alignment with their network. Cloudflare’s workers run on their servers as part of their general edge cache. Rather than deploy to a specific region, as with the large providers, deploying a Cloudflare worker is like deploying a rule for their cache — it goes to their entire network. This means the processing happens close to the requester, which drives down geographical latency.
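A rough bit of arithmetic shows why proximity matters. The sketch below assumes light travels through optical fibre at about 200 km per millisecond (roughly two thirds of its speed in a vacuum) and ignores routing, queueing and TLS handshakes, so it gives a lower bound on round-trip time rather than a measurement. The function name and the example distances are hypothetical.

```typescript
// Assumption: signal speed in optical fibre, about 200,000 km/s.
const FIBRE_KM_PER_MS = 200;

// Lower bound on network round-trip time for a given one-way distance.
function minRoundTripMs(distanceKm: number): number {
  return (2 * distanceKm) / FIBRE_KM_PER_MS;
}

// A nearby edge location versus a single region on another continent.
console.log(minRoundTripMs(50));   // → 0.5
console.log(minRoundTripMs(6000)); // → 60
```

Real numbers will be worse in both cases, but the gap between tens of kilometres and thousands of kilometres is something no amount of server-side optimisation can close.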

This is hard to duplicate. AWS do have a similar option with Lambda@Edge that ties into their CDN, CloudFront, but it’s not the core of their product and hence the developer experience suffers. Lambda also leads the pack in terms of the regions it can be deployed in versus the other big FaaS providers. A focused startup like Cloudflare can have a single set of services to deploy everywhere, but a large cloud provider will need to turn up services region-by-region, which takes time.

Principles

There is a lot of technically interesting stuff in Cloudflare’s workers, and I hope they find success with the product. For me though, the really interesting move is how they have focused their efforts on developing capabilities that hit weak spots in the existing offerings.

That can look obvious or easy, but it takes courage: every feature they don’t have will be a checkbox unchecked on some platform comparison document inside a large company, or a call-out by a competing salesperson. It’s also true that there will be a whole host of situations that are just easier to implement on a FaaS from one of the other providers. I’m certainly not suggesting that Workers is the globally optimal choice.

The benefit is the clear water they have moved into — if there is a big enough market of businesses with needs that are met by Workers, or who are convinced by the latency wins, they have a product that is well differentiated from their competition.

For Cloudflare, the fallback case is likely that Workers stays as a kind of rich cache configuration language, which is the kind of acceptable outcome that makes a bet worth taking.

In consumer technology, VCs often refer to a “kill zone” around competing directly with one of the big tech companies, and that sometimes feels like it applies to their developer offerings too. In both cases there are ways of building fantastic, differentiated products that do compete directly, but it requires this kind of asymmetric approach, agility, and focus.