
Elasticsearch in AWS

If you’re using AWS and you need Elasticsearch, you face a choice.

  1. Use the managed service: https://aws.amazon.com/elasticsearch-service/
  2. Just stand up some nodes and run your own.

Which one should you choose? Here are some thoughts on the matter. I’ve been using, running and migrating data on Elasticsearch for the last 2 years, so this should be sound judgement.

Firstly, let’s consider cost. The pure hardware cost is very similar: in AWS managed mode you’re still paying for the same hardware as you would when running your own. I haven’t seen any premium for running ES on that kit.

Access controls. 

In standalone mode you have the usual security groups or VPC and all the richness of the usual AWS controls. In managed mode you don’t. This was surprising, but basically you can either grant access to individual credentials or whitelist IPs.

In the first case you need to sign each request. It’s rather inconvenient, although it works with the Python aws-requests-auth package.
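To give a feel for what "signing each request" involves, here is a minimal sketch of the key-derivation step of AWS Signature Version 4, using only the standard library. The function names and the sample values are made up for illustration, and the service name "es" is an assumption about how the managed service is addressed. In practice you would let a signing library do all of this for you.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, msg: str) -> bytes:
    """Return the raw HMAC-SHA256 digest of msg under key."""
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date_stamp: str, region: str,
                       service: str = "es") -> bytes:
    """Derive a SigV4 signing key by chaining HMACs over date, region and service.

    date_stamp is YYYYMMDD. The "es" default is an assumption for illustration.
    """
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def sign(signing_key: bytes, string_to_sign: str) -> str:
    """Return the hex signature that goes into the Authorization header."""
    return hmac.new(signing_key, string_to_sign.encode("utf-8"),
                    hashlib.sha256).hexdigest()
```

The full process also builds a canonical request and a "string to sign" from the method, path, headers and payload hash before calling something like `sign`. That part is omitted here, and it is the reason a plain curl or an off-the-shelf client does not work against a credential-protected cluster.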

In the second case you literally whitelist the IPs that will have access to your cluster. This sounds insane, but it’s the only sensible way to interact with ES, because you can whitelist a small number of IPs running nginx and have nginx forward your requests to ES. A very good howto can be found here.

And this is the option I would recommend: nginx, protected by the usual AWS controls, forwarding your requests to ES. But of course you lose the load balancing built into most clients.

Monitoring and management.

In standalone mode you have a variety of tools and plugins for monitoring and managing your data and your cluster. For me those are separate things. Both vitally important.

In managed mode you get some cluster management. I say some because AWS has done an excellent job of distilling all the good practices of running ES and capturing the essence in a handful of choices. I loved it. It also has some metrics around the cluster, which was nice.

The management options I liked included:

  1. Separate data and master nodes
  2. Field data cache limits
  3. Automated periodic snapshots

These are all the things I would set on my own cluster straight away.

BUT, and this is a biggie: in managed mode you have zero, and I mean no, tools to help you manage your data. You’re stuck with plain REST. kopf or similar is invaluable for looking at your indices, changing your replicas, settings, templates, etc. All of that is missing. You can’t even run kopf someplace else, because the REST calls it uses to configure itself query cluster state and are disabled. So it doesn’t even start :(
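To make "plain REST" concrete, here is a rough sketch of what changing the replica count on one index looks like without a tool like kopf. It uses the standard index settings endpoint (`PUT /<index>/_settings`). The endpoint URL, index name and helper name are hypothetical, and the sketch assumes the call goes out from a whitelisted IP (or through the nginx proxy), so no request signing is shown.

```python
import json
import urllib.request


def build_replica_update(endpoint: str, index: str, replicas: int) -> urllib.request.Request:
    """Build (but do not send) a PUT request that sets number_of_replicas on one index."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{endpoint.rstrip('/')}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical endpoint and index name, for illustration only.
request = build_replica_update("https://search-example.invalid", "logs-2016", 2)
# To actually send it: urllib.request.urlopen(request)
```

Every index, template and setting change becomes a hand-built call like this one, which is fine for a script but painful for day-to-day poking around.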

So you’re at the mercy of AWS to manage your data, upgrade your cluster and do many other things that can be quite tricky. From my experience all of these bits require great care. For me that was a “walk-away” in negotiation terms. It killed all the niceness of bringing up a cluster with a few easy choices and a few clicks.

Clients.

In managed mode you’re only allowed REST. In standalone it’s whatever you like :)

Conclusion.

I like AWS managed Elasticsearch. It’s a great product. Unfortunately it’s a bit raw. It’s also VERY slow at applying configuration changes, including entitlements.

Too much is restricted without adequate alternatives being added. So if you want a quick cluster to prototype something and you don’t want to manage it, it’s right for you. Otherwise I’d wait a bit and see where this goes.
