Enabling And Empowering A Continuous Delivery Culture

Continuous Delivery
UXDX Europe 2018

At UXDX, Ciaran O'Connell, Senior Director of Engineering at Houghton Mifflin Harcourt, discusses how infrastructure is provisioned and whether control sits with individual teams or is centralised.
Ciaran also talks through how costs are managed across the organisation.

Thank you, Olivia. And thank you, everyone. I know some of you are still having your lunch; it's a lunch break for some, so thank you for attending this. In the context of this presentation, enabling and empowering a continuous delivery culture, I want to talk about the principles in the context of the journey that my engineering teams have undertaken over the last three years, which for us has meant going from zero to continuous delivery. In doing so, I'll talk a little about the technology path we've taken and the challenges we've overcome, and then, crucially, the cultural challenges: the mind shift our engineering teams had to make to go from a monolithic culture to a DevOps culture. I'll also talk about the challenges we still have. We've had such good success over the last three years that I think it's a good story to tell. I know our problems aren't unique, and some of the solutions and principles outlined here may be familiar as well, but hopefully you'll find the decisions we've made on the journey and the patterns we've adopted useful. So very briefly, who we are. A lot of people may not have heard of HMH, even though we have a huge number of users. As Olivia said, we're targeted at the education industry, right down to the student and teacher in the classroom, with 13,000 districts, 3 million teachers and 50 million students. Nearly all of our market is in the US, but our biggest technology arm is here in Dublin. We have 130 engineers working on the flagship learning platform, which I'll touch on briefly.
That learning platform provides the teacher and student experience and all these kinds of capabilities, as well as giving a teacher the ability to personalize instruction for individual groups of students, and giving the students themselves the ability to take personalized learning, get recommendations and ultimately get better outcomes. So there's a lot going on there. There's a lot of technology there. Where we are at the moment is quite cutting edge. We're sitting on a microservices platform: a collection of services with big data pipelines and content pipelines being pushed in. Those services communicate with asynchronous protocols and run in Docker containers, and they front a pretty modern application using React and an Angular application. So a lot has happened for us in the last three years to get to where we are now. Where we came from is important, I think, because we pride ourselves as a company on being 195 years old. We started as a pure publishing company. So books such as, for example, Lord of the Rings were published by HMH, right up to our friend here: those of you with young kids might be familiar with Curious George, which is a children's book. So up until three or four years ago we were the left-hand side: books, digital content, multiple platforms. The right-hand side is, as I've described, the learning platform. So we in the engineering group had a lot to do in three years to get from where we were, from great content to great outcomes. Our journey started with monolithic applications. I know Nick, earlier on, spoke a little about moving to the cloud, and we were very much monolithic applications. We knew we needed to innovate to meet the demands of our increasing user base and build the new platform. And we did a lot of good things initially.
So we moved to the cloud, we moved to AWS, and we created a pretty standard stack that could scale and deploy automatically. We had all kinds of structures in place on the cloud, but we hit a very big problem: a bottleneck in being able to spin up environments. We had a lot of engineering teams, 15 or 16 scrum teams, creating more services, and a centralized operational team that would spin up our environments. In a company like ours, with 4,000 employees of which maybe 10% are engineering, that structure would be resistant to change and ultimately led to a lot of bottlenecks. That, in turn, led to latency in how we were evolving within the engineering group. With a lot of handoffs, it's always somebody else's problem. So even though we'd moved to the cloud, we still had a lot of those problems: different environments, testing as a separate group, and all of that. We knew we had problems, and with the business demands on us we had to say to ourselves, look, we need to take a step back. So we in the engineering leadership came back and settled on the principles of a continuous delivery environment. What are the principles for DevOps? It starts with controlling your infrastructure. When I say control, I mean the engineering teams: the engineers have control of their infrastructure. Spinning up an environment and controlling how that environment works, whether it's in the cloud or not, happens completely within the engineering structure, in terms of how you allocate resources and how the cluster might work on top of the cloud. Having full control of your infrastructure programmatically is going to accelerate your continuous delivery. We knew we needed to do that. The second is controlling your deployment. It would take us months to go through the environments to deploy something.
We wanted this to happen instantly. We have to deploy with a huge amount of confidence and reliability through our pipeline, and we have to be able to roll back quickly. Ideally, as I think Nick mentioned in the previous presentation, you have canary deployments, so you can deploy at different levels in production. But from our perspective, controlling your deployments is a must for the engineering team. Third point: control your applications or your services. We have multiple services and web front-end applications in the learning platform. What I mean by control is that these applications, even if they run in a larger environment, are self-contained in some ways: they can be containerized, they have a bounded context, they are observable and monitorable, and they have all the right kinds of quality checks in and around them. That's an absolute must for a continuous delivery culture. We needed to create the infrastructure, the tools and the software for the applications to be controlled by engineering teams the way we wanted in this culture.
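To make the "control your deployments" principle concrete, here is a minimal sketch of a deploy step that rolls back when a health check fails. It is an illustration only, not HMH's tooling: the `Environment` class, the `deploy_with_rollback` function and the stand-in health check are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Environment:
    """A hypothetical deployment target that remembers what it is running."""
    name: str
    current_version: str
    history: List[str] = field(default_factory=list)


def deploy_with_rollback(env: Environment, new_version: str,
                         health_check: Callable[[str], bool]) -> bool:
    """Deploy new_version; restore the previous version if the check fails.

    Returns True when the new version stays live, False after a rollback.
    """
    previous = env.current_version
    env.current_version = new_version
    env.history.append(f"deployed {new_version}")

    if health_check(new_version):
        return True

    # The check failed, so put the last known-good version back immediately.
    env.current_version = previous
    env.history.append(f"rolled back to {previous}")
    return False


if __name__ == "__main__":
    staging = Environment(name="staging", current_version="1.4.0")

    # Stand-in health check (an assumption for the demo): any version
    # whose tag ends in "-broken" is treated as unhealthy.
    def is_healthy(version: str) -> bool:
        return not version.endswith("-broken")

    deploy_with_rollback(staging, "1.5.0", is_healthy)
    deploy_with_rollback(staging, "1.6.0-broken", is_healthy)
    print(staging.current_version)
    print(staging.history)
```

In a real pipeline the health check would probe the running service, and a canary deployment would apply the same decision to a small share of traffic before going wider.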

The next is to trust your system. That's the point where we now have a pipeline. We're able to deploy, and we have a system we can trust: when multiple teams push software into it, there are enough quality checks and enough happening within that system that we can take a build to the next environment with confidence, because that's the only way of building this into a company. So when you have the first three steps, you have to trust your system. The last part is controlling your destiny, and that's as much cultural as anything else. It can be one of the most difficult things to prioritize and solve. It means giving teams empowerment: an understanding within engineering teams of what the definition of done is, of the quality of the service they're building, and of what they're delivering, and taking ownership of that. So we listed those five principles as an engineering group, and it's easy to do that. But how do we put that into practice? We came up with different patterns, but in our case we did have to be quite draconian. We sent off a team of five or six engineers, with engineering backing, as a kind of skunkworks project, which is the guy to the right, and they went off to start building the platform as a service. We'll talk about a bit of that in a minute, but they started building the infrastructure as code and the environment. And then they got to a place where they had a cluster with the right kind of tooling and capabilities on top of AWS that would allow applications to start using it.
It was a pretty tough few months for them, because from an engineering perspective they were starting to work on a completely different stack with completely different ways of working, where it's no longer someone else's problem to provision infrastructure or control your applications. It becomes: you as an engineer need to understand DevOps. There was a lot of upskilling, plus a lot of integration questions: how will this work, will it perform, will it scale? And we hit success. We got it over the line, and with big success, but in a funny way that was only the starting point for us. It wasn't enough to say, well, you've delivered that, so every other team should do the same. We hit the cultural and engineering challenges. First, the upskilling: huge upskilling for engineers to take these sorts of actions themselves, to understand their service fully and how to put it onto the platform. Secondly, the cultural side: all the teams becoming self-empowered, so that quality, production and everything in the whole continuous delivery cycle is their ownership, not somebody else's. QA is not somewhere else; it all falls under the umbrella of engineering. In effect, we are no longer using an operational team. Everything was being built back into the team. So that was the challenge. The line we use there is: deliberately create and deliver tools to make the right thing the easy thing. When you cannot make the right thing easy, you must make the wrong thing increasingly difficult. That's the pragmatic way you have to approach it. It's not enough to have governance that says, use our platform. The tooling created as part of the platform, the quality controls, the automated tests we're bringing in, all of that is an attempt to make the right thing the easy thing to do. And if not, the wrong thing has to be tough.
It has to be a difficult course of action for engineers to take, because we need a safe platform: ultimately, we're still serving 3 million teachers and 50 million students. So this was our approach. It worked over time, to the point where we have full continuous delivery. We'll talk a little about where we are, but that was the approach we had to take: a draconian way to set it up, and then a pragmatic, behavioral way of pushing it out to teams. So briefly, for the technologists, what was our platform? Infrastructure as code, as I mentioned, gives control, transparency and predictability. We built it on top of a Mesos cluster. That allows us to program pools of instance resources, and it gives applications the ability to share pools of resources on the same EC2 instances. Having this level of control can be very cost-effective. I think the saying is that some cloud providers can be a little like banks: if you don't have a plan for your money, they'll create a plan for you. So for us, especially in the market we're in, the education industry, we wanted full programmatic control over how we manage our instances. We wanted to build containerized capabilities for our multiple applications within that. So we brought in Jenkins as our CI process to build Docker images and deploy them as individual, isolated containers sharing kernel resources. Also, if we want to bring in new infrastructure, such as a new MySQL or PostgreSQL database, we use Terraform to provision it, so engineers can make a pull request and provision the infrastructure.
And with that, we saw huge benefits in scaling, horizontal scaling in particular. We are also now able to satisfy our principles of controlling infrastructure, controlling deployments and containerizing applications, which means we can test around them in isolation: tests can run within the container, and the containers are portable across environments, which is what Docker gives you.
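The cost argument for sharing pools of resources can be illustrated with a small packing sketch. This is not how Mesos schedules work internally; it is a plain first-fit-decreasing heuristic, and the service names and CPU figures are invented for the example.

```python
from typing import Dict, List


def pack_services(cpu_demands: Dict[str, float], instance_cpus: float) -> List[List[str]]:
    """Place services onto shared instances using first-fit-decreasing.

    Each inner list holds the services that share one instance.
    """
    instances: List[List[str]] = []
    free: List[float] = []  # remaining CPU per instance, same order as instances

    # Largest services first, so big ones are not stranded at the end.
    ordered = sorted(cpu_demands.items(), key=lambda kv: kv[1], reverse=True)
    for name, cpus in ordered:
        if cpus > instance_cpus:
            raise ValueError(f"{name} needs more CPU than one instance offers")
        for i, remaining in enumerate(free):
            if cpus <= remaining:
                instances[i].append(name)
                free[i] = remaining - cpus
                break
        else:
            # No existing instance has room, so start a new one.
            instances.append([name])
            free.append(instance_cpus - cpus)
    return instances


if __name__ == "__main__":
    # Hypothetical services and CPU requests, for illustration only.
    demands = {
        "assignments": 2.0,
        "reports": 1.5,
        "content": 1.0,
        "search": 0.5,
        "auth": 0.5,
        "notifications": 0.5,
    }
    shared = pack_services(demands, instance_cpus=4.0)
    print(f"one instance per service: {len(demands)} instances")
    print(f"shared pool: {len(shared)} instances")
    for placement in shared:
        print(placement)
```

With one instance per service, the six hypothetical services above would need six instances; packed onto shared 4-CPU instances they fit on two.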
We brought in the Aurora scheduler to run these containers, so applications use Aurora scripts to configure how they will run on the platform. And with Aurora, these services are kept running continuously. So we don't have downtime, though we may have degradation, and our services come back up if they have glitches. So the key point is, once we have our technology stack for the platform, the focus shifts to the pipeline. And the pipeline is key. Your continuous integration pipeline may be great, and you may have all the deployment capabilities that can run quickly, but there's no point in deploying rapidly if code goes in, moves to the next stage and fails. Then we're back where we started, whatever means you have to create infrastructure and all the rest. So it starts, as we know, with test automation. Test automation is key, especially in a very integrated environment like ours. In our learning platform, behind a front-end application you could have 20 or 25 services with different capabilities, and lots and lots of instances. They all have to come together in one experience that has to work 100% within the classroom, fully reliably. Test automation starts with unit tests, but the automated tests have to cover functional testing right from the front end. For the front end there are great tools out there: you can use WebDriver or Protractor, depending on what type of application stack you have. We're using both Angular and React. You can use BrowserStack for browser testing. At the API level, we use Gatling, and we use SuperTest pretty heavily. The point of all of this is that there are lots and lots of tools, and a lot of you will be familiar with them, but they all must run as part of this pipeline. They have to be part of your automation.
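The idea that every test tool must run as part of the pipeline, and that nothing moves on after a failure, can be sketched as a simple stage gate. This is an assumption-laden illustration rather than a Jenkins configuration: `run_pipeline` and the stage names are hypothetical, and each lambda stands in for invoking a real suite such as unit, API or browser tests.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a callable that returns True when the stage passes.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (promotable, log). The build may move to the next
    environment only when every stage has passed.
    """
    log: List[str] = []
    for name, check in stages:
        passed = check()
        log.append(f"{name}: {'passed' if passed else 'FAILED'}")
        if not passed:
            # Fail fast: later stages are skipped and the build is not promoted.
            return False, log
    return True, log


if __name__ == "__main__":
    # Hypothetical stages; each lambda stands in for a real test suite run.
    stages: List[Stage] = [
        ("unit tests", lambda: True),
        ("api tests", lambda: True),
        ("browser tests", lambda: False),
        ("load tests", lambda: True),
    ]
    promotable, log = run_pipeline(stages)
    print("promote to next environment:", promotable)
    for line in log:
        print(line)
```

Ordering the cheap, fast stages first keeps feedback quick, since a failing unit test stops the run before the slower browser and load stages start.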
It's not enough to have unit tests; automation is built in. Automation is key, and the team owns it. That's the way we solve it. We do hire specific quality engineers, but the team owns the quality of its service. The definition of done for our service includes full test automation. We may have quality engineers, and they may be doing some automation, but ultimately they're there to make sure the practices are embedded and the definition of done is followed. We expect the engineers to be writing, and helping to write, these automation tests as well. So it's very much owned by the team. Similarly, speed and stability.

We need to deploy quickly, fail fast and get fast feedback, and strive for quick and stable execution so that every merge can go into production. We'll make a pull request, and a huge number of tests get executed straight away. Our tests are containerized, and nothing gets through to the next environment without its whole automation suite executing. So we have full confidence that we can get to the next environment. You may have other integrations, a lot of content integrations for example, that need to be tackled at that point. But we have full confidence in moving between environments with the amount of automation we're doing, and in practice every merge can go into production quickly. We've had a great experience with that: we can do a production deployment every day, even with quite a complex platform. We did six in one day a couple of months ago, for a huge deadline that had lots of changes. So we've had huge success in being able to move with speed and stability, really just by following this process. And monitoring is extremely important. Especially in a microservices architecture, you have lots and lots of discrete services. Every service needs to be observable, needs to be monitored and needs to be owned by the team, because you're releasing to production. Engineers own this right through. Even in quite a big organization like ours, we don't expect a support team or someone else to pick these things up independently. The engineers have ownership when these things go to production. So your service must be highly observable. Runscope is a common monitoring tool, which we have built heavily into our operational platform and into our applications. We have various dashboards combining all of these things to make the service as observable as we can get it. We can set it up to send PagerDuty alerts or text messages from Runscope.
That's if the service degrades. We rarely have a service that goes down, but there could be degradation or something else that needs to be looked at. Security is really important: shift left on security. In a continuous delivery environment you don't want a security team taking a look at your application at the end of a cycle. There are so many modern tools out there that it doesn't have to happen at the end; you want it built into your pipeline. There are some great tools like Checkmarx, and those scans need to happen in your pipeline. The more you can do in your pipeline, the more naturally it happens, with all of this trust built in, rather than expecting someone to go off and write a test. So shifting security left is really important. Then results: test automation reports. That's par for the course. Everything needs to be exposed as reports and be highly visible to teams. And the last part, which for us is quite a big thing, is data-driven decisions. Metrics are a big part of it, for every part of your service. CPU usage and all of that needs to be communicated. We store a lot of that in InfluxDB, and we visualize it in Grafana. But equally there's Google Analytics and the like. If you've got a product owner as part of your team, you want feedback on how your software is being used, and in a continuous delivery culture you want it quickly. You should be able to see straight away how your customers are using your software; you shouldn't be waiting for someone. In our case, we invest a lot of time in student behavior in our software, because it's a big part of the whole cycle for us to deliver outcomes. So we track user behaviors, and we track user experience through Google Analytics.
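As a rough illustration of turning tracked behavior into a decision, the sketch below counts usage events per shipped feature and lists the features with no recorded use. It does not call Google Analytics or InfluxDB; the `summarise_usage` function, the feature names and the event list are all hypothetical stand-ins for data exported from whatever analytics store is in place.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def summarise_usage(shipped: Set[str],
                    events: Iterable[str]) -> Tuple[Dict[str, int], List[str]]:
    """Count events per shipped feature and list the features nobody used.

    Events naming a feature that is not in `shipped` are ignored.
    """
    counts = Counter(event for event in events if event in shipped)
    usage = {feature: counts[feature] for feature in sorted(shipped)}
    unused = [feature for feature, n in usage.items() if n == 0]
    return usage, unused


if __name__ == "__main__":
    # Hypothetical feature names and events, for illustration only.
    shipped_features = {"recommendations", "grouping", "reports"}
    tracked_events = ["reports", "grouping", "reports", "legacy-widget"]

    usage, unused = summarise_usage(shipped_features, tracked_events)
    print(usage)
    print("not used in this period:", unused)
```

A product owner could review a summary like this after each release to decide what to improve or retire.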
So all of those things are important for data-driven decisions, because that way you can go back in again and say, well, that's not being used, and so on. So those are the tools. What are the mindset changes that need to be made? The philosophical side is really, really important, and it's an ongoing challenge for us all of the time, because we shifted the culture in the space of two years.

But we still have challenges. Empowerment and engagement: we talk about it a lot. You need to give empowerment to the teams. Agile engineering teams, Scrum teams, everyone wants empowerment, to have control of how they deliver their software. You want all of that within the principles of agile development. And we constantly use the word empowerment to ensure the teams are comfortable with what they sign up to. Teams own what they deliver. Teams understand what quality is and what it means for a service to be deployed. Especially when you've got multiple teams all coming together on the same application, you need everyone signing up to ownership and accountability for what they're delivering, so that it's not somebody else's problem. That's a very key thing if you want to deliver quickly. Accountability and transparency: use any tooling you have. In our company we use Slack quite a lot, and we integrate with Slack so that things get pushed into Slack channels if anything fails in the build. So people know pretty quickly when there's something wrong in our system, and the engineer in question will probably know fairly fast. We try to improve the culture in our engineering teams to make sure that everyone is transparent about what they do in an engineering team, that nothing is being hidden, and all of those sorts of things. And from a management perspective, over to the right, it is important to set expectations: the quality of service and the definition of done. You do not want it to be 100% decided by the team. You have to have some level of expectation, because that's how you get agreement on what the definition of done is.
There might be differences in the definition of done between some teams, but you do need a baseline. If one team is saying security is not important to us, and we're going to bring some third-party library in, and they're not going to fail the build further down, well, that's going to be a problem. So you do need to set expectations for the quality of service. We do a lot of this through working groups, engineer to engineer. We try to empower the teams to do that rather than driving it from management. There might be some high-level boundaries that we put in place, but the teams talk to each other. They ask: what are the agreed principles in our definition of done? What can get to production? If some tests have failed, do we all agree there's no chance we go to production? All of those kinds of rules. That's important because, if we're going to give teams the power to deliver to production, they need an expectation of the quality of service they're delivering. And finally, the culture of celebrating success, failing fast and learning is really important for innovation and for continuous delivery. I think we're all good at celebrating success. But sometimes we do need to try stuff pretty quickly. In a microservices architecture especially, we've tried certain technology stacks and said, well, that didn't work particularly well. And that can be okay, as long as it fails fast and we learn from the experience quickly. So we try to instill that culture, because you'll move quicker. We move pragmatically, we as a group feel that we're innovating faster, and we give engineers a little more power to try something out, so long as the failing is fast.
And it will be, with all of the tools that I mentioned in place. But I think the culture should be okay with failure as long as we learn through our cycles. So that's all I had. I want to leave a few minutes for questions. So thank you.