Skype Experimentation Configuration Service
I was brought in to help an internal project at Skype that was in danger of not even getting off the ground. Many tools existed for A/B testing on the web, but at the time no commercial apps or services could handle A/B testing across web apps, shipped desktop software, native mobile apps, and internal cloud services. The mission of ECS (the Experimentation Configuration Service) reflected a notion of testing broader than just A/B for user experience work. Video teams hoped to test which codecs worked more effectively in different regions, and cloud teams might want to slowly roll out a new service, testing for scaling issues. ECS supported these cases by managing configurations that switched features on and by collecting telemetry used for evaluations.
2014: Before
Initially I was brought in to help with usability and visual design, acting as the first and sole designer, with the ability to connect to part of the Skype team in Redmond. The engineers in question didn't have much inclination toward some aspects of web design, so it was important to build credibility with the team and help with some aspects of styling and engineering. The PM assigned had not identified users, constructed personas, or developed a notion of key tasks or uses apart from the engineering requirements.
The first task: identify actual users, and understand and document their motivations, stories, and journeys. This started as a walk-the-office endeavor, as our users were all internal, though some were geographically dispersed. I interviewed product managers, data analysts, and engineering leads for back end, front end, and mobile. This led to a few interesting discoveries:
1) PMs might manage the process and set deliverables, but they would likely only dip into the tool to monitor things, to set everything in motion, or to specifically assign or remove treatments for individuals. This became important later on, as it was somewhat missed in the initial engineering specs.
2) Engineering leads typically did most of the work of uploading and setting configurations. Testing shipped software meant fully building multiple code paths and toggling them, so the configurations could get big. The configurations also tied tightly to engineered code.
3) Test conflicts presented an ever-present danger for both A/B and rollout: the mechanism for partitioning test cohorts, adopted from another group's test structure, could result in two teams essentially working against each other through configurations. This also meant the tool should help teams answer "what happened?" and "who got what?": if you have 20 teams each rolling something out or testing something else, things conflict.
4) New features always bring both interested and frightened users and subjects. A VIP asks to "try things out." Some internal or external user might recoil in shock when their beloved feature moves and want things back the way they were.
5) Rollouts of new products formed a much bigger use case than A/B. There was always a danger that a new bit of engineering would crash at scale, or that some heavily tested bit of UI would prove a market flop. Rollouts also allowed for controlled scaling of new server frameworks or tech that simply couldn't be sufficiently tested in a lab environment.
6) Telemetry collection often didn't happen well, and teams experienced ongoing issues with how it affected performance and battery drain on mobile (not a strict usability issue, but it relates to the difference between A/B users and rollout users, as A/B tests might make heavier use of telemetry). It's frustrating if the great feature doesn't actually shine because telemetry ate users' batteries.
7) Statisticians cared mostly about telemetry, but would not set up tests. We still included them in usability testing. In this context the PMs also got feeds of metrics set up by engineering leads.
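The cohort-conflict problem above can be sketched in a few lines. This is my own illustration, not ECS's actual partitioning mechanism: assume each team hashes users into a shared bucket space and independently claims a range of buckets. Two teams claiming overlapping ranges end up treating the same users.

```python
import hashlib

def bucket(user_id: str, buckets: int = 100) -> int:
    """Deterministically map a user id to one of `buckets` cohort slots."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

# Two teams independently carve up the same bucket space:
users = [f"user{i}" for i in range(1000)]
team_a = {u for u in users if bucket(u) < 20}        # A/B test on buckets 0-19
team_b = {u for u in users if 10 <= bucket(u) < 30}  # rollout to buckets 10-29
conflicted = team_a & team_b  # these users receive both treatments at once
```

With 20 teams each claiming ranges like this, overlaps are inevitable unless the tool tracks allocations centrally, which is why "who got what?" had to be answerable in the UI.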
So from that we had regular users coming from two major backgrounds with distinct things to do:
PMs typically were interested in controlling the actual kickoff (pushing start) and controlling who accessed their new stuff (a VIP wants to see something on their box). Eng leads handled creation and uploading of configurations, setting the A/B/C etc. configs, and getting scheduling done. Initial tasks were: set up A/B, set up rollout, who got what config?, include me!, leave me out!, start and stop.
Early ‘paper prototypes’
With some idea of users, motivations, and tasks, the time came to wireframe flows. I did this in Balsamiq, in wide use within the organization at that time. Informally, I had a few PMs and leads at the office click through the mockup and give feedback. Skype's principal testing labs were out of office, but a lightweight Mac screen-recording tool took care of it. The wireframes validated the big-picture concept of a large grid with configs as columns, and separate horizontal sections for allocating users, editing the configuration, and reviewing telemetry.
Part of the wireframe
From here I started visual design studies to give engineering something more concrete to sell. The wireframes validated some changes they were not going to like: specifically, formatting the configuration as table rows rather than a JSON blob, and the desire to edit those rows inline. In consumer software, engineers and the product team would accept this, but on an 8-person, internally focused team, with perhaps 2 front-end engineers, it required persuasion. The engineers' initial try solved this with a popup for editing everything in a big form that did not even bother to validate that the JSON ingested properly! We would make use of a form to efficiently jump-start the first config, if for no other reason than to provide a space to parse the initial JSON-based configuration blob and make it human-readable for validation. But a popup would not prove an ideal place to edit configs which might extend to hundreds of attributes.
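The rows-instead-of-blob argument comes down to a simple transformation. A minimal sketch, with a made-up config (the attribute names here are hypothetical, not real ECS settings): flatten the nested JSON into dotted-path/value pairs, which is exactly the shape a grid of editable rows needs.

```python
import json

def flatten(obj, prefix=""):
    """Flatten a nested JSON config into (path, value) rows for a grid."""
    rows = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            rows.extend(flatten(value, path))
    else:
        rows.append((prefix, obj))
    return rows

blob = '{"video": {"codec": "h264", "maxBitrate": 1200}, "newUi": false}'
rows = flatten(json.loads(blob))
# [("video.codec", "h264"), ("video.maxBitrate", 1200), ("newUi", False)]
```

Each row then becomes one inline-editable grid cell per treatment column, instead of one giant textarea per treatment.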
First/early visual design studies + behavioral redline for edit behavior.
The engineering team made some decisions on frameworks that complicated development.
The first visual design studies were a mixed bag, but enabled the team to learn a few things about our users, PMs and Eng leads. The studies we conducted focused on just the principal screens which would answer questions. As a one-person team working "lean," you take certain shortcuts or run out of time. Some of the key bits:
1) Our users didn't value the kinds of telemetry we could show in-app, but they still wished to see telemetry. The data scientists felt they needed significant amounts of time to make sense of telemetry, given the number of overlapping tests going on, the very specific test plans, and the largish number of users they required for statistical significance. Teams using ECS to roll out new features didn't care because they had their own stats and telemetry pages; A/B-type feeds made less sense to them.
2) My color choices were awful.
3) Search or “how did I get this config?” would likely not make MVP but adding and removing from a test would.
Formal usability tests of the engineering alpha
The second set of visual mockups fell more into the category of redlines to support the build-out of a second engineering mule, which happened simultaneously with design. The good news was that we were going to test something soon with real data in it; users respond much more effectively when looking at their own projects in a service like this. After redlining the inline edit behaviors, I ended up working with one of the UI engineers, restyling the Kendo UI controls to fit spec. At the end of this, we had the kernel of the app, and it could load (but not yet run) real configurations. This meant we could do more formal task analysis.
With the go-ahead of the engineering manager (engineering managers in this part of the organization carried more weight than PMs), I wrote formal test protocols for the workflow. We would test with PMs and engineering leads across Skype, which meant testing in London and Prague, where key teams worked. The test setup used my laptop with a screen recorder. Video-supported findings lent credibility both within the engineering team and up the PM chain of command.
We tested 15 people split among PMs, engineering leads, and BI statisticians. We asked them to load a configuration, edit a treatment, and set up and start a test. As part of the closing questionnaire, I also showed a set of color combinations (remember how I said my first color choices needed, uh, refinement?).
Some things we learned
I presented findings to the team and higher ups in PM for buy-in. Some highlights:
- All subjects successfully set up a test in about 10 minutes (9:44 median, to be exact).
- There were some language differences between rollout users and A/B test users (and never call anything a parameter).
- People used really long names.
- Subjects didn’t know what to expect when they hit a critical action like start or cancel.
- The MVP had partial inline edit, which made editing more awkward than it should have been. Based on observations of how people worked, it was essential to follow through with full inline edit, proper copy-paste, and a few other niceties. This was bittersweet for me: they got there, they built the design, but only after the end of my contract and after a reshuffle on the engineering team.
- The importance of configuration merge and conflict issues: the ability, and necessity, of blending and combining multiple configurations assigned to a cohort instead of segmenting users into separate cohorts. In the ECS world, baselines (or defaults) might involve multiple overlapping configurations with regional or demographic tweaks. (How you allocate users to configs is essential in terms of results, and for rollouts it's impossible to avoid overlaps.)
- We settled on better colors.
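The merge point above can be made concrete with a small sketch. This is my own illustration of the general idea, assuming a simple precedence rule (later layers override earlier ones, key by key); the attribute names and layering order are hypothetical, not ECS's actual semantics.

```python
def merge_configs(base: dict, override: dict) -> dict:
    """Recursively merge `override` onto `base`; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged

# A cohort's effective config is a blend of overlapping layers:
baseline = {"video": {"codec": "h264"}, "newUi": False}
regional = {"video": {"maxBitrate": 800}}  # e.g. a bandwidth-constrained region
treatment = {"newUi": True}                # the A/B treatment under test

effective = merge_configs(merge_configs(baseline, regional), treatment)
# {"video": {"codec": "h264", "maxBitrate": 800}, "newUi": True}
```

The design question the testing exposed was exactly this: which layer wins, and how the UI shows a user "who got what" once several layers combine.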
A number of other usability issues presented themselves in the minimum viable product. Some resulted from only partially completed workflows, others from shortcuts. The raw results and video evidence helped make the case to pursue the showstoppers. We also took the opportunity to do product discovery on a feature set later known as "Test in Production"; the extra data and interactivity helped that conversation.
At this time the PM on this project left the company, and I took on whatever PM duties I hadn't already until a replacement came in. The focus at this point was the big architectural change of configuration merge, which the testing had exposed, and pursuing some of the other short-term changes that could get the product to "market" and build a user base. Some teams had already adopted the "headless" product with engineering support, so it was essential to get the UI out and get traction. That's not strictly part of the UI design story, but if you have no users and no traction you cannot fix or extend the design. We got the MVP to a beta that we could open up.
Back to design tasks
A week or two later I was able to get back to a few design tasks: some revised color and layout, a high-fidelity (HTML) mockup showing the complex fixed-header/fixed-first-column scrolling behavior, and later some further studies for multiple baseline configurations, essentially how to keep the treatments or configs (columns) in the grid from getting out of hand. About two weeks after this, we had a temporary, soon-to-be-permanent PM online.
There were a few other design tasks related to alternate ways of managing multiple projects, but we didn't get to test them before the contract ended. The most notable was a kind of vision statement for a comprehensive one-stop place for deployments/rollouts and monitoring in the visual and structural context of Microsoft's upcoming redesign of Azure. This wasn't so much a piece of design as a quick draft of ideas I'd been thinking about for a long time.
Part of the TIP vision sketches
What did I accomplish?
The overall UX structure and underlying product structure I advocated was validated and fully realized in later versions. It was a sound design. I was able to persuade a divided engineering team to adopt a technical approach that led to a better customer experience. The product became a candidate for adoption across Microsoft: all teams at Skype adopted it, and we also got adoption from some of the big MSFT teams like Office 365. It's pretty rare that the chance comes to make something you might use yourself.
At the end of the contract the PM team invited me to interview for a PM role; some time later the design team sent a similar invitation.