Scaling zulily’s Infrastructure in a Pinch, with Salt

At zulily, we strive to delight our customers with the best possible experience, every day.  Our daily customer experience involves offering thousands of new products each morning, all of which comes together thanks to our technology, and impeccable coordination across the organization. As our product offerings dramatically change on a daily basis, quickly scaling our infrastructure to meet variable demand is of critical importance.  In this article, we will provide an overview of zulily’s SaltStack implementation and its role in our infrastructure management, exploring patterns and practices which enhance our automation capabilities.

 

Let’s start with a bit of context, and not the jinja kind

Our technology team embraces a DevOps approach to solving technical challenges, and many of our engineers are “full stack”. We have several product teams developing and supporting both external and internal services, with a variety of application stacks.  All product teams have developers of course, and a few have dedicated DevOps engineers.  We also have a small, dedicated infrastructure team.

 

zulily has seen phenomenal growth since it’s inception, and what was initially a tech team of one, quickly became a tech team of a few, rapidly evolving into a tech team of many product teams and engineers, which is where we find ourselves today.  With these changes and growth over time, it became apparent our infrastructure team was perhaps not the ideal team for managing all components and configurations across the entire Technology organization.

 

To elaborate further on this point, our product teams have overlapping stacks but with variations, and many teams have vastly different components comprising their stacks.  Product teams know their application stacks best, so instead of attempting to have a small team of infrastructure engineers managing all configs and components, we needed to empower product teams to be able to take ownership, by providing them with self-service options.

 

Enter SaltStack to address our organization growth, which we have found to be very approachable, with its simple-to-grasp state and pillar tree layouts, use of yaml, and customization possibilities with python.  Salt is a key component in our technology stack enabling our product teams to take control of their system configurations, keeping us moving forward quickly and accomplishing our goals.

 

saltenv == tenant (mostly), and baseless?

Like many initiatives and projects at zulily, we’ve taken a unique approach to our use of salt environments. It has worked out exceptionally well for our tech organization and we are excited to share our approach to multi-tenancy with salt.

 

Each product team has its own salt and pillar trees, salt environments map to tenants essentially. For example, we have environments with names such as “site”, we do not use salt environment names such as “dev” and “prod”.

 

But what about “real” environments?  We are able to manage those too, thanks to our strict and metadata-rich host-naming convention, paired with salt’s state and pillar tree layouts and top.sls and targeting capabilities.  Our hostnames have the following format:

 

<product_team>-<function>-<node_number>.<location_code>.<environment>.zulily.com

 

Also related to our host names, each minion has custom grains set for all of these fields, and these grains are quite useful in many of our states!

 

We have found that the majority of states are the same across (real) environments, and environment specifics can instead be managed through pillar targeting.  By keeping all of a team’s states and pillar data within just two git repositories, we have found we are overall more DRY than we would have been with separate git repositories (per real environment).

 

Additionally, salt states may be extended and overridden, which may be useful for different (real) environments when necessary.  So instead of having a flat state tree, we have sub-directories such as ‘core’, ‘dev’ and ‘prod’.  Our approach is to place just about everything under core, but use environment sub directories when we must have environment-specific states, or when we simply wish to extend or override states residing in core. If parent states in core must be modified, it is important to consider the ramifications for any environment-specific children.  We generally don’t do a lot of extending and overriding at zulily, and instead focus on placing environment specifics within targeted pillar data, as previously mentioned.

 

We have the same layout in our pillar trees for consistency, but note that pillar keys must be unique and have no hierarchy when retrieved, however, hierarchy is important for pillar top.sls targeting!

 

Reviewing the following state tree example illustrates our layout approach for a “provision environment”:

 

│── core
│   │── aliases
│   │  │── files
│   │  │   └── aliases
│   │  │── init.sls
│   │  │── map.jinja
│── dev

│── prod
│── top.sls

 

But wait, if a highstate is run, what happens and couldn’t this be dangerous?  Running a highstate does have the potential to be dangerous, if a product team accidentally targets *their* very specific MySQL states to ‘*’ for example, a separate team’s database server could result in a serious outage.  To mitigate the risk of an incident such as this occurring, pushes to all of our state and pillar repositories are subject to inspection by a git push constraint that deserializes the top.sls yaml and evaluates all targets.  The targeting allowed in our top.sls files is very restrictive, with only a subset of target types allowed, and non-relevant environment references are disallowed.  Also worth noting is that only very specific, authorized team members have write access to our salt and pillar product team repositories, a member of the site team may not write to the infrastructure team’s salt and pillar repositories.

 

Also worth mentioning, one additional layer of risk mitigation we have in place is that all of our users append “saltenv=<product_team>” to their salt-calls, always.
We do have additional environments which are not-tied to any specific project team, known as base, provision and periodic.  The base environment is empty!  The latter two are critical to our operations, we’ll explain this next.

 

Less salt (highstate runs)

In our experience at zulily, we’ve learned that the vast majority of our salt states really only need to run just once, or rather infrequently.  So our standard practice for product teams is to run highstates only once per week  or on an as-needed basis, which we do very cautiously.  It goes against the traditional wisdom of converging at least hourly, but in the end, we have had consistent environments and greater stability with this approach.  It is nearly inevitable that even the most senior automation engineer will make a bad push to master at some point, and a timed hourly run could pick up on that, with potentially disastrous consequences.  Configuration management is a powerful thing, and we have found our approach to highstating to be the appropriate balance for zulily.

 

Now, getting to zulily’s two important non-product team “environments”…

 

The first of which is known as “provision”.  States in the provision environment provide the most basic packages and configurations with reasonable defaults, which work for most product teams, most of the time.  What is very particular about the provision environment is that a “provision highstate” is only run once!  That’s correct we almost never re-run any of these states once an instance goes into production.  There just really isn’t a need, and more importantly, there may be conflicts with subsequent customizations by product teams, and we would really rather avoid unnecessary subsequent configuration breakage.

 

To limit ourselves to a single provision hightstate, our provision top.sls targeting requires that a grain be set to True, known as “in_provisioning”.  When an instance has been provisioned, we remove the grain — a provision highstate will never run again, as long as the grain remains absent.  Very seldom, we have had to roll out updates to a few individual states within provision, which we accomplish very cautiously with specific states.sls jobs.

 

We have recently open sourced a sampling of many of our basic states used at provision time, please have a look at our github project known as alkali.

 

The second non-product team “environment” is known as periodic.  While our standard is to run a full product team environment highstate once per week, some changes need to get out in near realtime.  For zulily, these types of changes are limit to states addressing resources such as posix users and groups, sudoers, iptables rules, and ssh key management.  Periodic highstates are cron’d every few minutes at present with saltenv=periodic of course.  We are however moving to triggered periodic highstates, as cron’d periodic highstate runs may block other jobs.

 

State development workflow

We have done a significant amount of state develop at zulily, and for the most part, this has occurred within Vagrant environments.  Vagrant has worked very well for us, but more recently we are beginning to leverage docker containers for this purpose.  For more information on how we are doing this, please check out a project we just released, known as buoyant.

 

Given our salt development environment, whether Vagrant or docker, we typically iterate on states working out of our home directories (synced folders or docker volumes), preferably in a branch.  Once state and pillar files are ready, we merge into master and configure very restrictive and precise targeting at first, or simply remove or disable existing targeting.  This gives us full control over our rollout process across (real) environments, which limits the risk of a service disruption, we know exactly which hosts are executing which states and when.

 

Pushes to master branches for all salt and pillar git repositories are integrated within just a few minutes with our current automation, and then ready for targeted execution across relevant minions.

 

zulily’s salt masters are controlled by a centralized infrastructure team, and product teams are restricted from running “salt” commands, they do not have access to our masters.  They do however have all the control, and only the control they need! Product teams use simple, custom scripts that leverage fabric to execute remote commands on their minions, most notably salt-call (with saltenv specified of course!).

 

Other salt-related open source projects zulily has released

Outside of the aforementioned alkali and buoyant projects, we have recently released four community formulas:

 

 

All of these projects are in their early stages, a bit heavy on the jinja in some cases, and very Ubuntu-specific for the most part at this time. They have however shown good promise for us at zulily, we didn’t want to wait any longer to share them with the community. Our hope is they will already be useful to some, and worthy of iterating on going forward.

 

Coloring outside of the lines

One of zulily’s core values is to “color outside of the lines,” and our use of SaltStack is no exception.  Many of the patterns we use are uncommon, and our approach to environments in particular may not be the first idea that comes to mind for the typical salt user.  Our use of salt and its inherent simplicity and flexiblity have enabled us to decentralize our configuration management, providing multi-tenancy and product team isolation.  With self-service capabilities in place, our product teams are empowered to move at a quick cadence, keeping pace with what we call “zulily time” around the office. We’ve had great success with SaltStack at zulily, and we are pleased to share some of our projects and patterns with the community.

 

Happy salting!