Facebook–zulily Joint Hack Day

zulily launches over 9,000 unique styles – more than the size of an average Costco – every day. This causes unique time and scale challenges for the Marketing team as they rapidly create, place and manage ads. Facebook and zulily’s Business and Technology teams have been working together to build a platform – the Acquisition Marketing Platform (AMP) – to automate the ad management process on Facebook.

During this partnership, we realized that both companies share a passion for enabling the engineering teams to move fast and put new, exciting features in the hands of customers. In that spirit, we hosted a joint hack day to see how we can further advance the AMP toolset.

Joint FB-ZU Hack day participants: David Whitney (FB), Leo Hu (ZU) and Tim Moon (ZU). Not in picture: Omar Zayat (FB), Justin Sadoski (FB), Mason Bryant (ZU) and Rahul Srivastava (ZU)

The key wins for the hack day:

  • End-to-end automation: Currently, Marketing uses AMP to create the ad assets and Facebook’s PowerEditor to target and place the ads. The joint team built a prototype, utilizing the FB Graph and Ads APIs, to automate the ad process entirely within AMP. This is a huge win for the team! Marketing no longer has to switch between tools and now has a central place to manage all FB ads. This feature will further enable Marketing to scale up their ad creation process. The AMP Engineering team is now working on building a production version of the prototype.
  • Ad objects hierarchy management: FB ads have a hierarchy of campaign, ad-set and ad. The joint team spent time diving deep into the API to manage this hierarchy. Building this understanding is critical to automating management of campaigns, ad sets and reporting on key metrics.
  • API-driven ad Targeting: There are many ways to define the target of an ad on FB: interest, website customer audience, custom audience, etc. The team used the Targeting API to dynamically change the target range of an ad. This enables zulily to build some really interesting scenarios that can leverage real-time data in its systems; as we track the cost and conversion of every ad that is launched, we can now automate the increase or decrease the targeting reach via the API calls.
  • Account management: In October, FB released a new model to manage business ad accounts. The team spent time revamping the existing account model. This was a huge help and removed numerous access- and account-related roadblocks to further development. It would have taken us days of back-and-forth over email and conference calls to adopt the new model. During the hack day, this was completed in hours!

We had a fun and highly productive day-and-a-half of hacking. It was amazing to see the cool ideas and products that materialized from this event. The AMP team is super excited (and the Marketing team even moreso!) to build all the features that are now on their backlog. We are looking forward to future FB-ZU hack days!

Simulating Decisions to Improve Them

One of the jobs of the Data Science team is to help zulily make better decisions through data. One way that manifests itself is via experimentation. Like most ecommerce sites, zulily continuously runs experiments to improve the customer experience. Our team’s contribution is to think about the planning and analysis of those tests to make sure that when the results are read they are trustworthy and that ultimately the right decision is made.

Coming in hot

As a running example throughout this post, consider a landing page experiment.  At zulily, we have several landing pages that are often the first thing a visitor sees after they click an advertisement on a third-party site. For example, if a person was searching for pet-related products, and they clicked on one of zulily’s ads, they might land here.  Note: while that landing page is real, all the underlying data in this post is randomly generated.

The experiment is to modify the landing page in some way to see if conversion rate is improved.  Hopefully data has been gathered to motivate the experiment but, please, just take this at face-value.

Any single landing page is not hugely critical, but in aggregate they’re important for zulily, and small improvements in conversion rates or other metrics can have a large impact on the bottom line. In this example, the outcome metric (what is trying to be improved) is conversion rate, which is simply the number of conversion over the total number of visitors.

Thinking with the End in Mind

Planning an experiment consists of many things, but often the most opaque part is the implications associated with power. The simple definition of power: given some effect of the treatment, how likely will that effect be detectable. The implication here is that the more confident one wants to be in their ability to detect the change, the longer the test needs to run. How long the test needs to run has a direct bearing on the number of tests a company can run as a whole and which tests should be given priority… unless you don’t care about conflating treatments, but then you have bigger problems :).

Power analysis, however, is a challenge. For anything beyond a simple AB test, a lot needs to be thought through to determine the appropriate test. Therefore, it is often easier to think about the data, then work backwards through the analysis, and then the power.

Ultimately power analysis boils down to the simple question: for how long does a test need to run?

To illustrate this, consider the example experiment, where the underlying conversion rate for landing page A (the control) is 10%, and the expected conversion rate of the treatment is 10.5%. While these are the underlying conversion rates, due to randomness the realized conversion rate will likely be different, but hopefully close.

Imagine that each page is a coin, and each time a customer lands on the page the coin is flipped. Even though it’s known a priori that the underlying conversion rate for page A is 10%, if the coin is flipped 1000 times, it’s unlikely that it will be “heads” 100 times.  If you were to run two tests, you would get two different results, even though all variables are the same.

The rest of the post walks through the analysis of a single experiment, then describes how to expand that single experiment analysis into an analysis of the decision making process, and finally discusses a couple examples of complications that often arise in testing and how they can be incorporated into the analysis of the decision making.

A Single Experiment

For instance: flipping the “landing page” coin, so to speak, 1000 times for each page, A and B. In this one experiment, the realized conversion rates (for fake data) are in the bar chart below.


That plot sure looks convincing, but just looking at the plot is not a sufficient way to analyze a test. Think back to the coin flipping example; since the difference in the underlying “heads” rate was only 0.5%, or 5 heads per 1000 flips, it wouldn’t be too surprising if A happened to have more heads than B in any given 1000 flips.

The good news is that statistical tools exist to help make it possible to understand how likely the difference observed was truly due to an underlying difference in rates, due to randomness.

The data collected would look something like:


where Treatment is the landing page treatment, and Converted is 0 for a visit without a conversion and 1 for a visit with a conversion.  For an experiment like this, that has a binary outcome, the statistical tool to choose is logistic regression.

Now we get to see some code!

Assuming the table from above is represented by the “visits” dataframe, the model is very simple to fit in statsmodels.

The output:


This is a lot information, but the decision will likely be based on only one number: in the second table, the interception between “C(Treatment)[T.B]” and “P>|z|”. This is one minus the probability that the difference between the two conversion rates is actually different, or the significance level. The convention is that if that value is less than .05, the difference is significant. Another number worth mentioning is the coefficient of the treatment. This is how much change is estimated. It’s important, because even if the outcome was significant it’s possible to have been a significant negative coefficient, and then the decision is worse than just not accepting a better page, since we’ll accept a worse landing page.

In this case the significance level is greater than .05, so the decision would be that the difference observed is not indicative of an actual difference in conversion rates. This is clearly the wrong decision. The conversion rates were specified as being different, but that difference cannot be statistically detected.

This is ultimately the challenge with power and sample sizes. Had the experiment been run again, with a larger sample, it is possible that we would have detected the change and made the correct decision.  Unfortunately, the planning was done incorrectly and only 1000 samples were taken.

Always Be Sampling

Although the wrong decision was made in the last experiment, we want to improve our decision making.  It is possible to analyze our analysis through simulation. It is a matter of replicating — many times over — the analysis and decision process from earlier. Then it is possible to find out how often the correct decision would be made given the actual difference in conversion rate.

Put another way, the task is to:

  1. Generate a random dataset set for the treatment and control group based on the expected conversion rates.
  2. Fit the model that would have been from the example above.
  3. Measure the outcome based on the decision criteria; here it’ll just be a significant p-value < 0.05.

And now be prepared for the most challenging part: the simulation. These three steps are going to be wrapped in a for loop, and the outcome is collected in an array. Here’s a simple example in python of how that could be carried out.

Running that experiment 500 times, with a sample size of 1000, would yield a correct decision roughly 4.2% of the time.  Ugh.

This is roughly the power the of the experiment at a sample size of 1000.  If we did this experiment 500 times, we would rarely make the correct decision.  To correct this, we need to change the experiment plan to generate more trials.

To get a sense for the power at different sample sizes we choose several possible sample sizes, then run the above simulation for those samples. Now there are two for loops: one for to iterate through the sample size, and one to carry out the analysis above.

Here is the outcome of the decisions at various sample sizes.  The “Power” column is the proportion of time a correct decision was made at the specified sample size.


Not until 10^4.5 samples — roughly 31,000 — does the probability of making the correct decision become greater than 50%.  It is now a matter of making the business decision about how necessary it is to detect the effect.  Typically it is around 80%, in the same way that the significance level is normally around 5%… convention.  It would be easy to repeat this test for several intermediate sample sizes, between 10^4.5 and 10^5, to determine a sample size that has the power the business is comfortable with.

Uncertainty in Initial Conversion Rate

The outcome of the experiment was given a lot of criticism, but the underlying conversion rate was (more or less) taken for granted. The problem is there’s probably a lot of error in the estimated effect of the treatment before the experiment, and some error in the estimated effect of the control, since the control is based on past performance and the treatment is based on a combination of analysis and conjecture.

For example, say we had historical data that indicated that 1,000 out of 10,000 people had converted for the control thus far, and we ran a similar test to the control recently, so we have some confidence that 105 out of 1,000 people would convert.

If that was the prior information for each page the distribution for conversion rate for each landing page over 1,000 experiments could look like:


Even though it appears that landing page B does have a higher conversion rate on average, its distribution around that average is much wider.  To factor in that uncertainty, we can rerun that model with but instead of assuming a fixed conversion rate, we can sample from the distribution of the conversion rate before each simulation. Here’s the outcome, similar to above, of the proportion of times we’d make the correct decision.

Sadly our power was destroyed by the randomness associated with the uncertainty between landing pages.  Here’s the same power by sample size table as above.  For example, at 10^5.0 it is likely that conversion rate for landing page B was less than landing page A.


An alternative route in a situation like this is the use of a beta-binomial model to continue to incorporate additional data to the initial conversion rates.

More Complicated Experiments

The initial example was a very simple test, but more complex tests are often useful.  With more complex experiments, the framework for planning needs to expand to facilitate better decision making.

Consider a similar example to the original one with an additional complication. Since the page is a landing page, the user had to come from somewhere.  These sources of traffic are also sources of variation. Just like how any realized experiment could vary from the expectation, any given source’s underlying conversion rate could also could also vary from the expectation. In the face of that uncertainty, it would be a good idea to run the test on multiple ads.

To simplify our assumptions, consider that the expected change in conversion rate is still 0.5%, but across three ads the conversion rate varies individually by -0.01%, 0.00% and +0.01% due to the individual ad-level characteristics.

For example, this could be outcome of one possible experiment with two landing pages and three ads.


Thankfully statsmodels has a consistent API so just a few things need to change to fit this model:

  • Use gee instead of logit. This is a general estimating equation.  It enables a correlation within groups for a GLM to be fit, or, for these purposes, a logit regression with the group level variances taken into consideration.
  • Pass the groups via the “groups” argument.
  • Specify the family of the GLM; here it’s binomial with a logit link function (the default argument).

Those changes would like this:


The decision criteria here is similar to the first case, so we cannot say anything about the effect of the landing page, and likely this test would not roll out.  Now that the basic model is constructed, we follow the same process to estimate how much power the experiment would have at various levels of sample size.


Experiments are challenging to execute well, even with these additional tools.  The groups that have sufficient size to necessitate testing are normally sufficiently large and complex that wrong decisions can be made.  Through simulation and thinking about the decision-making process, it is possible to quantify how often a wrong decision could occur, its impact, and how to best mitigate the problem.

(By the way, zulily is actively looking for someone to make experimentation better, so if you feel that you qualify, please apply!)

Meet a zulily developer: Beryl


Who are you, and what do you do at zulily?

I am Beryl, a Software Engineer on the EMS (Event Management Systems) Team. I work on the internal tools that help other zulily teams set up and manage sales.

When did you join zulily?

I started at the end of July, 2013 — almost a year and a half ago.

What’s a typical day like for you?

Every day I am usually spending some time working on my assigned projects, reviewing designs or code from other members of my team and supporting users. The work I do on projects depends on the project I am working on as well as at what stage that project is at. Various things I might do for project work include drawing up user interface designs, attending meetings with our users, coding and testing.

What is one of your favorite projects you finished and launched to production?

My favorite would have to be a rewrite of a page that provided an audit for what we sell. The page had become extremely slow, the data source had been deprecated and was often inaccurate and it was not tailored to what our internal users needed to do. When we realized this was taking up a huge amount of time for our users, I was able to talk to them to figure out their intended workflow and propose a new design. Within a fairly short period of time we had a new page up and running that everyone was happier with and that provided accurate data.

What was it like in the early days?  Tell us a crazy story.

When I first joined, I spent some time looking around the application I work on. It was very surprising when I came across a couple of error/info messages that included pictures of my coworkers for humorous effect. Some still exist, but I think they’re slowly starting to disappear…

What gets you excited about working for zulily?

The sense of community I get from other employees. As a developer on internal tools, I get a lot of opportunity to meet with employees from a bunch of different parts of the company. Additionally, I am enjoying getting to know engineers across all tech teams through our small, but growing, women-in-tech community. Even though the company has grown so much since I first started, it still manages to feel just as small!

Google Compute Engine Hadoop clusters with zdutil

Here at zulily, we use Google Compute Engine (GCE) for running our Hadoop clusters. Google has a utility called bdutil for setting up and tearing down Hadoop clusters on GCE. We ran into a number of issues when using the utility and were using an internally patched version of it to create our Hadoop clusters. If you look at the source, bdutil is essentially a collection of bash scripts that automate the various steps of creating a GCE instance and provisioning it with all the necessary software needed to run Hadoop. One major issue we found with bdutil was that there is no way to provision a Hadoop cluster where the datanodes do not have external IP addresses. For clusters with many datanodes — the kind we typically run — this means we end up running against our quota of external IP addresses. Additionally, there is no reason for the datanodes to have external IP addresses as they should not be accessible to the public.

We decided to stop patching bdutil and write our own utility to provision a Hadoop cluster. The utility is called zdutil and you can find it on our GitHub page. Here’s how it works:

  • First, GCE instances are created for the namenode and all datanodes in your Hadoop cluster.
  • Then, any persistent disks that you requested are created and attached to the instances.
  • If you have have any tags that you would like to be applied to the namenode or datanodes, the tags are added to the instances. This saves you from having to manually tag every single instance in your cluster or write your own script to do so.
  • Next, all of the required setup scripts to provision the namenode and datanodes are copied to a GCS bucket of your choosing. The namenode then provisions itself.
  • Once it completes, it copies (via scp) all scripts needed for datanode provisioning to each datanode and then each datanode will provision itself.
  • Once all datanodes have been provisioned, the namenode will start the Hadoop cluster.

If you deploy the datanodes with either external or ephemeral IP addresses, they will have internet access as determined by the rules of your GCE network. If you deploy the datanodes with “none” for the IP address, they will proxy through the namenode using Squid. You don’t have to configure any of this yourself; zdutil will take care of the details for you, including installing and provisioning Squid on your namenode. It is also important to be aware that Google’s version of the Google Cloud Storage Connector currently does not support proxying. If you use zdutil, it will install our fork of the GCS Connector which does support proxying by adding the following properties to your Hadoop core-site.xml configuration file: fs.gs.proxy.host and fs.gs.proxy.port.

If you have any need for zdutil, please use it and give us your feedback. At the moment we only support Debian-based images and we only support Hadoop version 1. If you would like to see another OS supported or Yarn support, please add an issue to the GitHub page.