This post originally appeared on LinkedIn.
Over the past few years I had the unique opportunity to see a start-up, TubeMogul, go through hyper-growth, an IPO, and an acquisition by a Fortune 500 company, Adobe. On this journey I was exposed to a lot of technical challenges, and I worked on systems at an astonishing scale, i.e. over 350 billion real-time bidding requests a day. It allowed me to build some strong personal opinions on the role of SREs and how they can help transform an organization. I'm lucky enough to work with a talented team of SREs who keep pushing the limits of innovation while executing through chaos.
As I flew back from the ML for DevOps summit in Houston that Adobe sponsored, I took the time to reflect on some of the ways our SRE teams excel at their job and how they leverage machine learning and self-healing principles to scale their day-to-day operations.
I.T. systems, with the broad adoption of public and private clouds, get more complex over time. The hyper-adoption of microservices and the rise of loosely coupled distributed systems are obvious factors, though you can see how IoT devices, edge computing, and the like factor into the mix.
The point being, it is increasingly difficult for a single individual to understand the space in which a product evolves and lives. One cannot assume to know it all; humans quickly reach their cognitive limits. So, how do SREs overcome this limit? Below is my take on the top five machine learning and self-healing techniques used by SREs to scale and operate increasingly complex environments.
With Ganglia, graphing a large number of servers has never been so easy. Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and grids. Ganglia lets you write modules of any kind in C/C++ or Python, and you can also use the command-line tool gmetric together with the scripting language of your choice. The problem with gmetric is that you can't keep your data organized by group, and it gets harder to poll values efficiently. A few months ago I needed to monitor some JMX values returned by a Java daemon.
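As a sketch of the gmetric approach, the script below pushes a single JVM metric; the metric name and value are made up, and in a real setup the value would be polled from the Java daemon over JMX (the `--name`/`--value`/`--type`/`--units` options are standard gmetric flags):

```shell
# Illustrative gmetric push: metric name and value are examples only;
# in practice the value would be read from the Java daemon over JMX.
HEAP_USED=123456789

# Build the command first so it can be logged or inspected
CMD="gmetric --name jvm_heap_used --value $HEAP_USED --type uint32 --units bytes"
echo "$CMD"

# Only send the metric when gmetric is actually installed
if command -v gmetric >/dev/null 2>&1; then
    $CMD
fi
```

A cron entry running such a script every minute is enough to get the value graphed, but metrics pushed this way are not organized under a group in the Ganglia web UI, which is exactly the limitation mentioned above.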
My previous post was made a long time ago, so here is a draft that I finally decided to publish. Let's see how to secure some of your data with an encrypted block device using losetup and dd.
The steps will be:
- Create an image with dd
- Build a new device on top of the image with an encryption algorithm, using losetup
- Format the device using mkfs.ext3
- Mount the device and start using it!
Of course, once the device is mounted, your data is readable by anyone who has access to the mounted directory.
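Put together, the steps look like this; the paths, sizes, and the aes cipher are examples, and the `losetup -e` option relies on the old cryptoloop kernel support (on recent systems, cryptsetup/LUKS is the usual replacement):

```shell
# 1. Create a 100 MB backing image with dd
dd if=/dev/zero of=secret.img bs=1M count=100

# Steps 2-4 need root, cryptoloop support, and a terminal for the
# passphrase prompt, so guard them
if [ "$(id -u)" -eq 0 ] && [ -t 0 ]; then
    # 2. Attach the image to a loop device with an encryption algorithm
    #    (losetup prompts for a passphrase)
    losetup -e aes /dev/loop0 secret.img

    # 3. Format the encrypted device with ext3
    mkfs.ext3 /dev/loop0

    # 4. Mount the device and start using it
    mkdir -p /mnt/secret
    mount /dev/loop0 /mnt/secret
fi
```

To detach, umount /mnt/secret and run `losetup -d /dev/loop0`; without the passphrase, the raw image file is just noise.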
Here is a quick picture of the impact of MySQL index misuse. At work, our developers shipped a new release of their search engine using MySQL fulltext indexes; unfortunately, they didn't implement it correctly. The result was a huge load on all our database servers. To track down the trouble, I redirected the SQL search traffic to a dedicated server, checked the slow queries, and reproduced them with EXPLAIN. It didn't take long to find that the search query made invalid use of the fulltext index and the MATCH ... AGAINST syntax. The fulltext index was a multiple-column fulltext index, and in that case you have to specify ALL the columns present in the index, otherwise MySQL won't use it...
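To make the pitfall concrete, here is a hedged sketch (the table and column names are invented): with a fulltext index spanning (title, body), MATCH has to list exactly those columns for the index to be used, and in BOOLEAN MODE MySQL even accepts a column list that matches no index and silently falls back to a full table scan, which is what crushes the servers.

```sql
-- Hypothetical table: the fulltext index covers (title, body)
CREATE TABLE articles (
    id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(200),
    body  TEXT,
    FULLTEXT KEY ft_title_body (title, body)
) ENGINE=MyISAM;

-- Broken: in BOOLEAN MODE this column list matches no fulltext index,
-- so MySQL falls back to a full table scan
EXPLAIN SELECT * FROM articles
WHERE MATCH (title) AGAINST ('+mysql' IN BOOLEAN MODE);

-- Correct: MATCH lists all the columns of the index, and EXPLAIN
-- shows the fulltext index being used
EXPLAIN SELECT * FROM articles
WHERE MATCH (title, body) AGAINST ('mysql');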
Here is a quick post about a cheap SAN and secure backup architecture for a small platform (5-10 servers). In this case study we will see how to use Network Block Devices (NBD) and soft RAID with mdadm. I designed it for my personal web platform, which is a small one. Recently I ran out of free space on my backup server, and I hated the idea of renting a more expensive server to store my backups while I had plenty of unused space on my other servers.
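As a hedged sketch (hostnames, ports, and paths are invented, and the one-line nbd-server invocation is the old syntax; newer versions read a config file instead): each storage server exports a backing file over NBD, and the backup host attaches those exports and mirrors them with mdadm:

```shell
# On each storage server: create a backing file to export (unprivileged)
dd if=/dev/zero of=export0.img bs=1M count=100

# The NBD and mdadm steps need root and the nbd kernel module, so guard them
if [ "$(id -u)" -eq 0 ] && [ -e /dev/nbd0 ]; then
    # On each storage server: export the backing file over NBD (old syntax)
    nbd-server 2000 "$PWD/export0.img"

    # On the backup host: attach the remote exports as local block devices
    nbd-client storage1.example.com 2000 /dev/nbd0
    nbd-client storage2.example.com 2000 /dev/nbd1

    # Mirror the two network block devices into a soft-RAID1 array
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nbd0 /dev/nbd1

    # Format the array and mount the mirrored backup space
    mkfs.ext3 /dev/md0
    mkdir -p /mnt/backup
    mount /dev/md0 /mnt/backup
fi
```

If one server goes away, the RAID1 keeps the backups available on the surviving member, and mdadm can re-sync the mirror once the export comes back.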