Minimizing the Cost of SaaS Operations
Posted by Bob Warfield on March 29, 2010
SaaS software is much more dependent on being run by the numbers than conventional on-premises software because the expenses are front loaded and the revenue is back loaded. SAP learned this the hard way with its Business ByDesign product, for example. If you run the numbers, there is a high degree of correlation between a low cost of delivering the service and high growth rates among public SaaS companies. It isn’t hard to understand: every dollar spent delivering the service is a dollar that can’t be spent to find new customers or improve the service.
So how do you lower your cost to deliver a SaaS service?
At my last gig, Helpstream, we got our cost down to 5 cents per seat per year. I’ve talked to a lot of SaaS folks and nobody I’ve yet met got even close. In fact, they largely don’t believe me when I tell them what the figures were. The camp that is willing to believe immediately wants to know how we did it. That’s the subject of this “Learnings” blog post. The formula is relatively complex, so I’ll break it down section by section, and I’ll apologize up front for the long post.
Attitude Matters: Be Obsessed with Lowering Cost of Service
You get what you expect and inspect. Never a truer thing said than in this case. It was a deep-seated part of the Helpstream culture and strategy that Cost of Service had to be incredibly low. So low that we could exist on an advertising model if we had to. While we never did, a lot was invested in the critical up front time when it mattered to get the job done. Does your organization have the religion about cutting service cost, or are there 5 or 6 other things that you consider more important?
Go Multi-tenant, and Probably go Coarse-grained Multi-tenant
Are you betting you can do SaaS well enough with a bunch of virtual machines, or did you build a multi-tenant architecture? I’m skeptical about your chances if you are in the former camp unless your customers are very very big. Even so, the peculiar requirements of very big customers (they will insist on doing things their way and you will cave) will drive your costs up.
Multi-tenancy lets you amortize a lot of costs so that they’re paid once and benefit a lot of customers. It helps smooth loads so that as one customer has a peak load others probably don’t. It clears the way to massive operations automation which is much harder in a virtual machine scenario.
Multi-tenancy comes in a lot of flavors. For this discussion, let’s consider fine-grained versus coarse-grained. Fine grain is the Salesforce model. You put all the customers together in each table and use a field to extract them out again. Lots of folks love that model, even to a religious degree that decrees only this model is true multi-tenancy. I don’t agree. Fine grained is less efficient. Whoa! Sacrilege! But true, because you’re constantly doing the work of separating one tenant’s records from another. Even if developers are protected from worrying about it by clever layering of code, it can’t help but require more machine resources to constantly sift records.
Coarse-grained means every customer gets their own database, but these many databases are all on the same instance of the database server. This is the model we used at Helpstream. It turns out that a relatively vanilla MySQL architecture can support thousands of tenants per server. That’s plenty! Moreover, it requires fewer machine resources and it scales better. A thread associated with a tenant gets access to the one database right up front and can quit worrying about the other customers right then. A server knows that the demands on a table only come from one customer and it can allocate CPUs table by table. Good stuff, relatively easy to build, and very efficient.
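To make the contrast concrete, here is a minimal sketch of both models. It is an illustration under stated assumptions, not Helpstream's code: SQLite in-memory databases stand in for MySQL, and the `tickets` table, the tenant names, and every function name are hypothetical.

```python
import sqlite3

# Fine-grained: all tenants share one table, so every query must filter
# by a tenant column to separate one tenant's rows from another's.
shared = sqlite3.connect(":memory:")
shared.execute("CREATE TABLE tickets (tenant_id TEXT, title TEXT)")
shared.executemany(
    "INSERT INTO tickets VALUES (?, ?)",
    [("acme", "Login fails"), ("globex", "Slow page"), ("acme", "Reset password")],
)

def fine_grained_titles(tenant_id):
    rows = shared.execute(
        "SELECT title FROM tickets WHERE tenant_id = ? ORDER BY rowid",
        (tenant_id,),
    )
    return [title for (title,) in rows]

# Coarse-grained: one database per tenant. The tenant is resolved once,
# up front, and queries after that never mention the tenant at all.
tenant_dbs = {}

def open_tenant_db(tenant_id):
    if tenant_id not in tenant_dbs:
        db = sqlite3.connect(":memory:")  # hypothetical stand-in for one MySQL database
        db.execute("CREATE TABLE tickets (title TEXT)")
        tenant_dbs[tenant_id] = db
    return tenant_dbs[tenant_id]

def coarse_grained_titles(tenant_id):
    db = open_tenant_db(tenant_id)
    return [title for (title,) in db.execute("SELECT title FROM tickets ORDER BY rowid")]

open_tenant_db("acme").executemany(
    "INSERT INTO tickets VALUES (?)", [("Login fails",), ("Reset password",)]
)
open_tenant_db("globex").execute("INSERT INTO tickets VALUES (?)", ("Slow page",))
```

Both `fine_grained_titles("acme")` and `coarse_grained_titles("acme")` return the same two titles. The difference is where the tenant boundary lives: in a `WHERE` clause on every query, or in the choice of database made once per request.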
The one downside of coarse grain I have discovered is that it’s hard to analyze all the data across customers because it’s all in separate databases. Perhaps the answer is a data warehouse constructed especially for the purpose of such analysis that’s fed from the individual tenant schemas.
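A rough sketch of that warehouse idea follows. It assumes one SQLite database per tenant as a stand-in for per-tenant MySQL schemas. The `all_tickets` table and the `load_warehouse` function are hypothetical names, and this is not something the post says was actually built.

```python
import sqlite3

def make_tenant_db(titles):
    """Create a hypothetical per-tenant database holding a few tickets."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE tickets (title TEXT)")
    db.executemany("INSERT INTO tickets VALUES (?)", [(t,) for t in titles])
    return db

tenant_dbs = {
    "acme": make_tenant_db(["Login fails", "Reset password"]),
    "globex": make_tenant_db(["Slow page"]),
}

# The warehouse re-introduces a tenant column, but only for analysis.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE all_tickets (tenant_id TEXT, title TEXT)")

def load_warehouse(tenant_dbs, warehouse):
    """Copy every tenant's rows into the warehouse, tagged with the tenant."""
    with warehouse:  # one transaction: commit on success, roll back on error
        warehouse.execute("DELETE FROM all_tickets")  # full reload, safe to rerun
        for tenant_id, db in tenant_dbs.items():
            rows = db.execute("SELECT title FROM tickets").fetchall()
            warehouse.executemany(
                "INSERT INTO all_tickets VALUES (?, ?)",
                [(tenant_id, title) for (title,) in rows],
            )

load_warehouse(tenant_dbs, warehouse)

# A cross-tenant question that no single tenant database can answer.
counts = dict(
    warehouse.execute("SELECT tenant_id, COUNT(*) FROM all_tickets GROUP BY tenant_id")
)
```

The full reload keeps the sketch simple and rerunnable. A real feed would more likely load incrementally, which is a design choice beyond what this post covers.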
Go Cloud and Get Out of the Datacenter Business
Helpstream ran in the Amazon Cloud using EC2, EBS, and S3. We had help from OpSource because you can’t run mail servers in the Amazon Cloud: the IPs are already largely blacklisted due to spammers using Amazon. Hey, spammers want a low cost of ops too!
Being able to spin up new servers and storage incrementally, nearly instantly (usually way less than 10 minutes for us to create a new multi-tenant “pod”), and completely from a set of APIs radically cuts costs. Knowing Amazon is dealing with a lot of the basics like the network infrastructure and replicating storage to multiple physical locations saves costs. Not having to crawl around cages, unpack servers, or replace things that go bad is priceless.
Don’t mess around. Unless your application requires some very special hardware configuration that is unavailable from any Cloud, get out of the data center business. This is especially true for small startups who can’t afford things like redundant data centers in multiple locations. Unfortunately, it is a hard to impossible transition for large SaaS vendors that are already thoroughly embedded in their Ops infrastructure. Larry Dignan wrote a great post capturing how Helpstream managed the transition to Amazon.
Build a Metadata-Driven Architecture
I failed to include this one in my first go-round because I took it for granted people build Metadata-driven architectures when they build Multi-tenancy. But that’s only partially true, and a metadata-driven architecture is a very important thing to do.
Metadata literally means data about data. For much of the Enterprise Software world, behavior is controlled by code, not data. Want some custom fields? Somebody has to go write some custom code to create and access the fields. Want to change the look and feel of a page? Go modify the HTML or AJAX directly.
Having all that custom code is anathema, because it can break, it has to be maintained, it’s brittle and inflexible, and it is expensive to create. At Helpstream, we were metadata happy, and proud of it. You could get on the web site and provision a new workspace in less than a minute; it was completely automated. Upgrades for all customers were automated. A tremendous amount of customization was available through configuration of our business rules platform. Metadata gives your operations automation a potent place to tie in as well.
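As a small illustration of the custom-fields case, the sketch below keeps each workspace's field definitions as data and uses one generic routine to validate records against them. Everything in it is hypothetical (`FieldDef`, `WORKSPACE_FIELDS`, `validate`, the workspace names); it shows the general idea, not Helpstream's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldDef:
    """One field, described as data rather than as custom code."""
    name: str
    py_type: type
    required: bool = False

# Per-workspace metadata. Giving a customer a custom field means adding
# a row here, not writing or deploying new code.
WORKSPACE_FIELDS = {
    "acme": [FieldDef("title", str, required=True), FieldDef("severity", int)],
    "globex": [FieldDef("title", str, required=True), FieldDef("region", str)],
}

def validate(workspace, record):
    """Check a record against its workspace's metadata; return error strings."""
    errors = []
    fields = {f.name: f for f in WORKSPACE_FIELDS[workspace]}
    for name, field in fields.items():
        if name not in record:
            if field.required:
                errors.append(f"missing required field: {name}")
        elif not isinstance(record[name], field.py_type):
            errors.append(f"{name} must be {field.py_type.__name__}")
    for name in record:
        if name not in fields:
            errors.append(f"unknown field: {name}")
    return errors
```

The same `validate` function serves every workspace. Appending `FieldDef("region", str)` to the `"acme"` list is the whole "customization", and it is also something an automated provisioning or upgrade script can do without touching code.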
Open Source Only: No License Fees!
I know of SaaS businesses that say over half their operating costs are Oracle licenses. That stuff is expensive. Not for us. Helpstream had not a single license fee to pay anywhere. Java, MySQL, Lucene, and a host of other components were there to do the job.
This mentality extends to using commodity hardware and Linux versus some fancy box and an OS that costs money too. See for example Salesforce’s switch.
Automate Operations to Death
Whatever your Operations personnel do, let’s hope it is largely automating and not firefighting. Target key areas of operational flexibility up front. For us, this was system monitoring, upgrades, new workspace provisioning, and the flexibility to migrate workspaces (our name for a single tenant) to different pods (multi-tenant instances).
Every time there is a fire to be fought, you have to ask several questions and potentially do more automation:
1. Did the customer discover the problem and bring it to our attention? If so, you need more monitoring. You should always know before your customer does.
2. Did you know immediately what the problem was, or did you have to do a lot of digging to diagnose? If you had to do digging, you need to pump up your logging and diagnostics. BTW, the most common Ops issue is, “Your service is too slow.” This is painful to diagnose. It is often an issue with the customer’s own network infrastructure for example. Make sure to hit this one hard. You need to know how many milliseconds were needed for each leg of the journey. We didn’t finish this one, but were actively thinking of implementing capabilities like Google uses to tell with code at the client when a page seems slow. Our pages all carried a comment that told how long it took at the server side. By comparing that with a client side measure of time, we would’ve been able to tell whether it was “us” or “them” more easily.
3. Did you have to perform a manual operation or write code to fix the problem? If so, you need to automate whatever it was.
This all argues for the skillset needed by your Ops people, BTW. It also argues to have Ops be a part of Engineering, because you can see how much impact there is on the product’s architecture.
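The “us” or “them” comparison from item 2 could look something like the sketch below. The comment format (`server-time-ms`), the function names, and the 50% threshold are all assumptions made for illustration; the post does not say what the real comment looked like, and this capability was never finished at Helpstream.

```python
import re

# Hypothetical format for the server-side timing comment embedded in each page.
SERVER_TIME_RE = re.compile(r"<!--\s*server-time-ms:\s*(\d+)\s*-->")

def split_latency(page_html, client_total_ms):
    """Split the client-observed time into server time and everything else."""
    match = SERVER_TIME_RE.search(page_html)
    if match is None:
        return None  # page carries no timing comment
    server_ms = int(match.group(1))
    outside_ms = max(client_total_ms - server_ms, 0)  # network, browser, etc.
    return {"server_ms": server_ms, "outside_ms": outside_ms}

def blame(page_html, client_total_ms, threshold=0.5):
    """Return 'us' if the server accounts for at least `threshold` of the total."""
    parts = split_latency(page_html, client_total_ms)
    if parts is None or client_total_ms <= 0:
        return "unknown"
    share = parts["server_ms"] / client_total_ms
    return "us" if share >= threshold else "them"

page = "<html><body>...</body><!-- server-time-ms: 120 --></html>"
slow_for_customer = split_latency(page, 2000)
```

Here the server spent 120 ms of a 2000 ms page load, so most of the delay happened outside the server and `blame(page, 2000)` points at “them”. Note that “them” lumps together the customer’s network, the public Internet, and the browser, so it narrows the search rather than ending it.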
Hit the Highlights of Efficient Architecture
Without going down the rathole of premature optimization, there is a lot of basic stuff that every architecture should have. Thread pooling. Good clean multi-threading that isn’t going to deadlock. Idempotent operations and good use of transactions with rollback in the face of errors. Idempotency means if the operation fails you can just do it again and everything will be okay. Smart use of caching, but not too much caching. How does your client respond to dropped connections? How many round trips does the client require to do a high traffic page?
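Here is one minimal way to combine idempotency with transactional rollback, sketched with SQLite. The `accounts` and `applied_ops` tables, the `op_id` scheme, and the function names are hypothetical; there are other ways to get idempotency, and this is not a description of what Helpstream did.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " seats INTEGER NOT NULL CHECK (seats >= 0))"
)
db.execute("CREATE TABLE applied_ops (op_id TEXT PRIMARY KEY)")
db.execute("INSERT INTO accounts VALUES ('acme', 10)")
db.commit()

def add_seats(db, op_id, account, delta):
    """Apply the change at most once per op_id.

    Returns True if applied now, False if this op_id was already applied.
    """
    with db:  # commits on success, rolls back if an exception escapes
        seen = db.execute(
            "SELECT 1 FROM applied_ops WHERE op_id = ?", (op_id,)
        ).fetchone()
        if seen:
            return False  # a retry of an operation that already succeeded
        db.execute("INSERT INTO applied_ops VALUES (?)", (op_id,))
        db.execute(
            "UPDATE accounts SET seats = seats + ? WHERE name = ?",
            (delta, account),
        )
        return True

def seat_count(db, account):
    return db.execute(
        "SELECT seats FROM accounts WHERE name = ?", (account,)
    ).fetchone()[0]

# The client did not hear back, so it sends the same operation again.
add_seats(db, "op-1", "acme", 5)
add_seats(db, "op-1", "acme", 5)  # no effect the second time
```

After both calls `seat_count(db, "acme")` is 15, not 20. If the update fails partway, for example a change that would push `seats` below zero and trip the `CHECK` constraint, the transaction rolls back the `applied_ops` row too, so nothing is half-recorded.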
We used Java instead of one of the newer slower languages. Sorry, didn’t mean to be pejorative, and I know this is a religious minefield, but we got value from Java’s innate performance. PHP or Python are pretty cool, but I’m not sure they are what you want to squeeze every last drop of operations cost out of your system. The LAMP stack is cheap up front, but SaaS is forever.
Carefully Match Architecture with SLAs
The Enterprise Software and IT World is obsessed with things like failover. Can I reach over and unplug this server and automatically failover to another server without the users ever noticing? That’s the ideal. But it may be a premature optimization for your particular application. Donald Knuth says, “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.”
Ask yourself: how much is enough? We settled on 10 minutes with no data loss. If our system crashed hard and had to be completely restarted, it was good enough if we could do that in less than 10 minutes with no loss of data. That meant no failover was required, which greatly simplified our architecture.
To implement this, we ran a second MySQL replicated from the main instance and captured EBS backup snapshots from that second server. This took the load of snapshotting off the main server and gave us a cheaper alternative to a true full failover. If the main server died, it could be brought back up again in less than 10 minutes with the EBS volume mounted and away we would go. The Amazon infrastructure makes this type of architecture easy to build and very successful. Note that with coarse-grained multi-tenancy, one could even share the backup server across multiple multi-tenant instances.
Don’t Overlook the Tuning!
Tuning is probably the first thing you thought of with respect to cutting costs, right? Developers love tuning. It’s so satisfying to make a program run faster or scale better. That’s probably because it is an abstract measure that doesn’t involve a customer growling about something that’s making them unhappy.
Tuning is important, but it is the last thing we did. It was almost all MySQL tuning too. Databases are somewhat the root of all evil in this kind of software, followed closely by networks and the Internet. We owe a great debt of gratitude to the experts at Percona. It doesn’t matter how smart you are, if the other guys already know the answer through experience, they win. Percona has a LOT of experience here, folks.
Long-winded, I know. Sorry about that, but you have to fit a lot of pieces together to really keep the costs down. The good news is that a lot of these pieces (metadata-driven architecture, cloud computing, and so on) deliver benefits in all sorts of other ways besides lowering the cost to deliver the service. Probably the thing I am most proud of about Helpstream was just how much software we delivered with very few developers. We never had more than 5 while I was there. Part of the reason for that is our architecture really was a “whole greater than the sum of its parts” sort of thing. Of course a large part was also that these developers were absolute rock stars too!