This post is a re-post of my original LinkedIn post.
Over the past few years, I had the unique opportunity to see a start-up, TubeMogul, go through hyper-growth, an IPO, and an acquisition by a Fortune 500 company, Adobe. In this journey, I was exposed to many technical challenges and worked on systems at an astonishing scale: over 350 billion real-time bidding requests a day. It allowed me to build strong personal opinions on the role of an SRE and how SREs can help transform an organization. I'm lucky enough to work with a talented team of SREs who keep pushing the limits of innovation while executing through chaos.
As I flew back from the ML for DevOps summit in Houston, which Adobe sponsored, I took the time to reflect on some of the ways our SRE teams excel at their job and how they leverage machine learning and self-healing principles to scale their day-to-day operations.
I.T. systems, with the broad adoption of public and private clouds, grow more complex over time. The hyper-adoption of microservices and the rise of loosely coupled distributed systems are obvious factors, and you can see how IoT devices, edge computing, and the like factor into the mix.
The point being, it is increasingly difficult for a single individual to understand the space in which a product evolves and lives. One cannot assume to know it all; humans quickly reach their cognitive limits. So, how do SREs overcome this limit? Below is my take on the top 5 machine learning and self-healing techniques SREs use to scale and operate increasingly complex environments.
Over the past decade, I had the privilege to build a massive-scale infrastructure at a small start-up called TubeMogul. We went through an IPO and an acquisition by a Fortune 500 company, Adobe. It was quite a privilege to present my team's accomplishments at the OpenStack Summit 2017 in Boston. We built a fully automated infrastructure that enables our team to leverage a multi-cloud environment with cloud-bursting capabilities. Check out the presentation on SlideShare/YouTube and our interview on #TheCube.
Today, I had the privilege to present my team's work at USENIX LISA15. TubeMogul grew from a few servers to over two thousand servers handling over one trillion HTTP requests a month, each processed in less than 50 ms. To keep up with this fast growth, the SRE team had to implement an efficient continuous delivery infrastructure that allowed them to perform over 10,000 Puppet deployments and 8,500 application deployments in 2014. In this presentation, we cover the nuts and bolts of the TubeMogul operations engineering team and how they overcame these challenges.
During Puppet Camp Paris, I had the privilege to present the continuous delivery workflow of TubeMogul's operations engineering team. In a few years, we went from a few servers to over two thousand nodes fully managed by Puppet. In our presentation, we went over the challenges we faced, as well as the implementation of our workflow to improve our day-to-day operations while still moving fast.