We often talk about the builder ethos here at Contentful. This approach recognizes that all employees can be builders – technical or otherwise. With the right tech, time and collaboration opportunities, builders are best suited to solve an organization’s most pressing problems. The goal of the builder ethos is to create the best experience for your customers, but the principle applies to realizing the best working experience for your employees too.
Our security and infrastructure engineering teams had a chance to do both as we worked together to solve a Web Application Firewall (WAF) issue causing one too many headaches. And the results were remarkable; we’ve doubled the speed of our Content Management API by greatly reducing latency for our users across every region, and we’ve since avoided 13 potential incidents relating to regional outages.
Security and reliability are topics of utmost importance to Contentful and our customers. WAF is just one of the many components covered by our ISO 27001 certification. If you're unfamiliar with the term, you may be wondering what a WAF is. As defined by Cloudflare, a WAF helps protect web applications by filtering and monitoring HTTP traffic between a web application and the internet. A WAF is not designed to defend against all types of attacks, which is why it is one part of a suite of tools we use together to create a holistic defense against a range of attack vectors.
Contentful’s APIs were designed with a security mindset from the beginning, which is why we separate our Read and Write APIs. Our write APIs – Content Management and Preview APIs (CMA & CPA) – use a WAF as the first line of defense, with auditing and filtering rules that control which traffic reaches our APIs on the origin side. Security is built into our design process, and the security team actively participates at every stage of development, from idea to prototype to post-deployment testing and bug hunting.
Our previous WAF solution was a third-party tool that met ISO requirements, but as we expanded our security coverage over time, we found it wouldn’t support our internal workflows and expectations. Its monitoring capabilities, for instance, weren’t sufficient for alerting us to issues when they arose – and made troubleshooting more difficult.
There were times when our engineers wouldn't be alerted to a problem early enough and would be woken up in the middle of the night for an emergency fix. Fortunately, processes were in place to support and maintain a low time to resolution as a standard service to our customers and users, but keeping our engineers healthy and happy is something we like to prioritize as well. This is something I knew we could fix.
When I first encountered this issue, I was fairly early in my tenure as a security engineer at Contentful and growing more familiar with Contentful’s view of the builder ethos. I saw this as an opportunity to solve a current issue for the team, connect with engineers across the organization and learn more about how our infrastructure is set up. That is what I’ll be focusing on in this post: learnings on effective collaboration.
What collaboration looked like for us
Collaboration is a great buzzword. It’s just nebulous enough to mean everything and nothing at the same time. So here are some details on exactly what collaboration looked like on this project.
Stepping out of silos
At this stage, the infrastructure and security teams were distinct entities tackling the same problem from different angles. The infrastructure team was focused on keeping the platform available to customers, while my team's priorities were to detect and stop attackers. Sitting in our silos, we solved individual issues that sometimes didn't align with the other team's interests. So we took a step back and talked to one another.
Both teams agreed that we wanted to have better reliability, with a robust solution that didn’t require short-term workarounds to fix a problem. Simply getting problems out on a common table allowed us to work toward shared goals.
Aligning the way we do things
To make sure this project worked, we had to make sure we worked together. I’m most accustomed to working autonomously and asynchronously. My style is to maintain very detailed notes, even down to random thoughts about something I could potentially investigate later. I’m an avid commenter on Jira tickets and like to leave all my work in a state that offers a clear and accurate depiction of its current status.
Meanwhile, my counterpart on our infrastructure team appreciates live syncs where we can align in real time, and then go off and work autonomously. So at the very beginning of the project, we had a level-setting discussion about how we would accommodate our different working styles. We decided on a primarily asynchronous working process that included regular syncs throughout the week. This ensured we were aligned in advance of our larger group’s Agile/Scrum backlog grooming sessions.
Solving the bigger problems
My team focused on security issues; the infrastructure team’s focus was on a reliability issue. When we discussed the problem together, we realized it was actually a technology issue. The tool we were using didn’t offer the features we needed. We were missing things like a robust Terraform provider that would allow us to build more complex rules on top of the tool. Everything previously had to be made within the solution’s user interface (UI), which was too prescriptive for our engineers to tailor the application to our business needs. We needed a tool that better accommodated proper engineering procedures around review, auditing and infrastructure as code.
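As a rough illustration of why rules-as-code matters for review and auditing, here is a minimal Python sketch. It is not our actual configuration and not a Terraform provider; the `Rule` type, the rule names, and the `validate` and `evaluate` helpers are all hypothetical. The point is only that rules kept as plain data in version control can be reviewed in a pull request and checked automatically before they ship, which a UI-only workflow makes hard.

```python
from dataclasses import dataclass

ALLOWED_ACTIONS = {"block", "log"}


@dataclass(frozen=True)
class Rule:
    """A hypothetical WAF rule expressed as reviewable data."""
    name: str
    method: str       # HTTP method the rule applies to
    path_prefix: str  # rule matches paths starting with this prefix
    action: str       # "block" or "log"


# Hypothetical rule set; in practice this would live in version control.
RULES = (
    Rule("block-admin-delete", "DELETE", "/admin", "block"),
    Rule("log-all-writes", "POST", "/", "log"),
)


def validate(rules):
    """Checks a CI job could run before any rule change is merged."""
    names = [rule.name for rule in rules]
    if len(names) != len(set(names)):
        raise ValueError("duplicate rule name")
    for rule in rules:
        if rule.action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action in rule {rule.name!r}")


def evaluate(rules, method, path):
    """Return the action of the first matching rule, or "allow"."""
    for rule in rules:
        if rule.method == method and path.startswith(rule.path_prefix):
            return rule.action
    return "allow"


validate(RULES)
```

With this in place, a change to `RULES` shows up as an ordinary diff, and `validate` fails the build if someone introduces a duplicate name or an unsupported action.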
Our previous solution also sat in front of our infrastructure and acted as our Content Delivery Network (CDN). This design can be acceptable for other workloads, but it didn’t meet our high availability expectations: if the WAF failed, the entire app would fail with it. We needed a solution that could safely fail open, so a WAF failure wouldn’t be felt by our customers.
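For readers new to the term, failing open means that when the inspection layer itself is unavailable, traffic is let through instead of being rejected. The Python sketch below illustrates the idea only; it does not describe how our WAF is implemented. `WafUnavailable`, `Decision`, `decide`, and the two example inspectors are hypothetical names made up for this illustration.

```python
from dataclasses import dataclass
from typing import Callable


class WafUnavailable(Exception):
    """Raised when the (hypothetical) WAF cannot return a verdict."""


@dataclass
class Decision:
    allowed: bool
    reason: str


def decide(request: dict,
           inspect: Callable[[dict], bool],
           fail_open: bool = True) -> Decision:
    """Ask the WAF whether to block; fall back if the WAF is down."""
    try:
        blocked = inspect(request)
    except WafUnavailable:
        # The WAF itself failed. Fail open: let the request through
        # so customers are not affected. Fail closed: reject it.
        if fail_open:
            return Decision(True, "waf-unavailable-fail-open")
        return Decision(False, "waf-unavailable-fail-closed")
    if blocked:
        return Decision(False, "blocked-by-rule")
    return Decision(True, "passed-inspection")


def healthy_inspect(request: dict) -> bool:
    """Toy inspector: block bodies containing a script tag."""
    return "<script>" in request.get("body", "")


def broken_inspect(request: dict) -> bool:
    """Toy inspector that simulates a WAF outage."""
    raise WafUnavailable("inspection timed out")
```

Calling `decide({"body": "hello"}, broken_inspect)` allows the request with the reason `waf-unavailable-fail-open`, while passing `fail_open=False` rejects it. Failing open trades a short window of reduced filtering for availability, which is only reasonable when the WAF is one layer among several defenses.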
Sharing the outcomes
Collaboration is often focused on getting things done together, but it should also be extended to the celebration — big or small — of the things you’ve achieved together.
Through our collaboration, we identified a new solution for our WAF, eliminated some of the late-night pings to our infrastructure team and also improved performance for our customers. Our solution doubled the speed of our Content Management API by greatly reducing latency for our users across every region. We also avoided 13 potential incidents relating to regional outages. We’ve written more complex security rules and now have a more in-depth set of signals to work with. These outcomes were richer and more impactful than anything we could have done hunkered down in our respective silos.
Summing up, solving an immediate problem turned out to be an excellent learning experience. It was also a valuable opportunity to lay the groundwork for new relationships and expertise that will streamline problem-solving in the future. And for those interested, I’ll be writing another post detailing the technical side of our WAF redesign in the near future.