WHY DO WE NOT AUTOMATE AS MUCH AS WE SHOULD?
If we analyze IT operations and their components, we see that a great deal can be automated. In fact, almost all of it can, but it takes both the will and the time to do so. This is where most operations automation efforts fail: the will might be there, but the time is always lacking. As a result, many opportunities are left unfinished or postponed until they vanish.
A second factor that keeps us from automating as much as we could is the constant change that operations witnesses. Every day, new technology is rolled out, applications are updated and upgraded, and operating systems are updated and upgraded. This constant change kills automation initiatives. And of course, there are always organizational implications: you do not have the right person in your team for automation, people are overeager or underperform, or whatever other organizational issues you face. These can also snuff out an automation initiative.
WHY SHOULD WE AUTOMATE?
As with all factors that keep us from doing something, these obstacles can be turned into assets, and automation is no exception. My approach is simple: I embrace automation, because the benefits outweigh the drawbacks. First, I set myself a goal: the less we touch a system manually, the better. Of course, I did not invent this; large organizations like Google, AWS and others swear by this principle, because human labor is the most expensive and the most error-prone part of operations.
Another important reason to automate is your SLAs. If you have SLAs for the availability and performance of your environment, it is good practice to automate your controls. You might say that you have monitoring and an incident process in place, and that this covers it. Up to a point you are right, but automation takes it much further. Automation gives you control over your SLAs, which monitoring and incidents alone do not. With proactive controls of your environment and a shorter MTTR, automation reduces both unavailability and the number of incidents.
WHAT TO AUTOMATE?
Once your mindset is right and you embrace automation as a way to make operations more efficient, you have to choose what to automate. The answer depends directly on your own operations environment, so I can only answer for myself. I find it important to see automation as a long-term investment rather than as point solutions for particular problems. And I mean really long-term, lasting through the changes in your environment. This means two things:
- You will have to adapt your automation to the changes in your environment (so, the job is never finished)
- You should not hesitate to start automating, even if the result is only useful for a limited time without adaptations. You know that things will change.
I start with the obvious: the daily controls that answer questions such as:
- Did all my backups run correctly last night?
- Do all my applications work (are they healthy)?
- Can all users log in to my environment?
- Did all my scheduled jobs run correctly?
- Have we been compromised (security checks)?
- ...
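The morning controls in the list above are good first candidates because each one boils down to a small, independent check. As an illustration only, here is a minimal Python sketch of such a control run, assuming each control is a function that returns a pass/fail flag with a short detail. The function names, the 24-hour backup window and the HTTP health check are hypothetical choices, not a prescribed design.

```python
# Minimal sketch of a morning control run (hypothetical names throughout).
# Each control is a function returning (ok, detail); one crashing control
# must not stop the others from running.
import urllib.request
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Tuple

Check = Callable[[], Tuple[bool, str]]


@dataclass
class ControlResult:
    name: str
    ok: bool
    detail: str


def backup_is_fresh(last_backup: datetime, now: datetime,
                    max_age: timedelta = timedelta(hours=24)) -> Tuple[bool, str]:
    """Pass when the last successful backup is younger than max_age."""
    age = now - last_backup
    return age <= max_age, f"last backup {age} ago"


def http_is_healthy(url: str, timeout: float = 5.0) -> Tuple[bool, str]:
    """Pass when the URL answers with an HTTP 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300, f"HTTP {resp.status}"
    except OSError as exc:  # URLError, HTTPError and timeouts all derive from OSError
        return False, str(exc)


def run_controls(controls: Dict[str, Check]) -> List[ControlResult]:
    """Run every control in order and collect one result per control."""
    results = []
    for name, control in controls.items():
        try:
            ok, detail = control()
        except Exception as exc:  # a broken control is itself a failed control
            ok, detail = False, f"control crashed: {exc}"
        results.append(ControlResult(name, ok, detail))
    return results
```

In practice you would register one control per question, for example `run_controls({"backups": lambda: backup_is_fresh(last_backup_time, datetime.now())})`, where `last_backup_time` is a hypothetical value read from your backup tool. Catching exceptions per control is deliberate: a control that crashes is reported as a failure instead of hiding the results of the others.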
Why do we do these controls first? Because they are directly linked to your SLAs. For instance, if your application does not work in the morning, you will have an incident, and it is better to know this before a large number of users start to complain. If your backup did not run, you will not be able to restore yesterday’s work … So, with automation, you not only increase efficiency, you are also better able to control your SLAs and be proactive.
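The link between these controls and your SLAs is easiest to see with a little arithmetic. The Python sketch below uses illustrative figures, not measurements: a 30-day month, a hypothetical 99.9 % availability target, and the same three incidents, once detected by users and once detected earlier by an automated control. Availability is computed as (period - downtime) / period and MTTR as the average outage duration.

```python
# Worked SLA example with illustrative figures (not measurements):
# a 30-day month, a hypothetical 99.9 % availability target, and the same
# three incidents handled reactively (users complain first) or proactively
# (an automated morning control detects them first).
from typing import List

MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes
TARGET = 0.999                # hypothetical SLA target


def availability(period_minutes: float, outages: List[float]) -> float:
    """Fraction of the period during which the service was up."""
    return (period_minutes - sum(outages)) / period_minutes


def mttr(outages: List[float]) -> float:
    """Mean time to repair: average outage duration in minutes (0 if none)."""
    return sum(outages) / len(outages) if outages else 0.0


def downtime_budget(period_minutes: float, target: float) -> float:
    """Minutes of downtime the SLA target allows in the period."""
    return period_minutes * (1 - target)


reactive = [60.0, 45.0, 30.0]   # outage minutes, detected by users
proactive = [15.0, 10.0, 10.0]  # same incidents, detected by an automated control

if __name__ == "__main__":
    print(f"budget: {downtime_budget(MONTH_MINUTES, TARGET):.1f} min")
    for label, outages in (("reactive", reactive), ("proactive", proactive)):
        a = availability(MONTH_MINUTES, outages)
        print(f"{label}: {a:.4%} up, MTTR {mttr(outages):.1f} min, SLA met: {a >= TARGET}")
```

With these figures the 99.9 % target allows about 43 minutes of downtime per month: the reactive case (135 minutes) misses the target, while the proactive case (35 minutes) stays within it.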
Once the obvious is done, you will want to automate more complex tasks, such as updates and upgrades of your operating systems, automated deployments of your applications and so on. Anybody can think of candidates for automation; among them are the following:
- Automated updates and upgrades of operating systems
- Automated security patching
- Automated control of your environment for high availability and switchovers
- Automated creation and release of resources in your cloud
- Automated set-up (and orchestration) of resources like servers, VMs, operating systems, containers, storage …
- Automated application deployments
- Automated controls after updates, upgrades and deployments
- Automated problem resolution
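Several of these candidates, such as automated controls after changes and automated problem resolution, follow the same pattern: check, remediate if needed, check again, and escalate to a human when that does not help. A minimal Python sketch of that generic pattern follows; `check` and `remediate` are hypothetical callables you would supply (for example a health probe and a service restart), and this is not a feature of any specific tool.

```python
# Sketch of a self-healing control: check, remediate, re-check, escalate.
# `check` and `remediate` are hypothetical callables supplied by you.
import time
from typing import Callable


def check_and_remediate(check: Callable[[], bool],
                        remediate: Callable[[], None],
                        attempts: int = 2,
                        wait_seconds: float = 0.0) -> str:
    """Return 'healthy', 'remediated' or 'escalate'."""
    if check():
        return "healthy"          # nothing to do
    for _ in range(attempts):
        remediate()               # e.g. restart a service (hypothetical)
        time.sleep(wait_seconds)  # give the service time to come back
        if check():
            return "remediated"
    return "escalate"             # hand over to monitoring / incident process
```

Limiting the number of attempts is intentional: an automated fix that keeps failing should end up in your monitoring or incident process rather than loop forever.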
HOW TO AUTOMATE
When you automate, your engineers will be eager to write scripts. That is good, but you should put those scripts into a tool that helps your engineers organize their automation. You also need something that executes these scripts from a central location on all hosts in your network. We chose Ansible as our central automation tool, but there are others that do the job equally well.
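To give an idea of what central execution looks like, the sketch below starts an Ansible playbook from Python on the control node. It assumes `ansible-playbook` is installed and uses only its standard `-i` (inventory), `--limit` and `--check` (dry run) options; the playbook and inventory names are hypothetical.

```python
# Sketch: launching an Ansible playbook from a central control node.
# Assumes ansible-playbook is installed; paths and host groups are hypothetical.
import subprocess
from typing import List, Optional


def build_playbook_command(playbook: str, inventory: str,
                           limit: Optional[str] = None,
                           check: bool = False) -> List[str]:
    """Build the ansible-playbook command line as an argument list."""
    cmd = ["ansible-playbook", "-i", inventory, playbook]
    if limit:
        cmd += ["--limit", limit]  # restrict the run to some hosts or groups
    if check:
        cmd.append("--check")      # dry run: report changes without making them
    return cmd


def run_playbook(playbook: str, inventory: str, **kwargs) -> int:
    """Run the playbook and return its exit code (0 means success)."""
    completed = subprocess.run(build_playbook_command(playbook, inventory, **kwargs))
    return completed.returncode
```

For example, `run_playbook("patch.yml", "inventories/prod", limit="webservers", check=True)` would show which changes a hypothetical patch playbook would make on the web servers without applying them.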
OK, so far so good: we have scripts and a script management tool. With these, you can execute your scripts whenever you want and see the results. Very nice for our engineers. But that is not all that operations and automation are about. I want to be able to execute these scripts on a regular basis or have them triggered by events. So, in addition to the management tool, I need a scheduler that runs my tasks on a schedule or in response to events. We chose Rundeck as our scheduler, and the result is scheduled tasks that run without human intervention. Together, Ansible and Rundeck make an excellent operations tool. And is that the end of it? I do not think so; I am still missing the results of my automation.
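For the event-driven side, a Rundeck job can also be started through Rundeck’s HTTP API, for example from a monitoring alert. The sketch below uses only the Python standard library to build and send such a call. The endpoint shape (`POST /api/<version>/job/<id>/run`, authenticated with an `X-Rundeck-Auth-Token` header, with job options in a JSON `options` object) follows the Rundeck API as far as I know; check it against the API reference for your Rundeck version. The server URL, job ID, token and option names are hypothetical.

```python
# Sketch: triggering a Rundeck job from an event (e.g. a monitoring alert).
# Server URL, API version, job ID, token and options are hypothetical;
# verify the endpoint against your Rundeck version's API reference.
import json
import urllib.request
from typing import Dict


def build_run_request(base_url: str, api_version: int, job_id: str,
                      token: str, options: Dict[str, str]) -> urllib.request.Request:
    """Build a POST request that asks Rundeck to run one job."""
    url = f"{base_url.rstrip('/')}/api/{api_version}/job/{job_id}/run"
    body = json.dumps({"options": options}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "X-Rundeck-Auth-Token": token,
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


def trigger_job(req: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and return Rundeck's JSON answer about the execution."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it easy to test the call without a running Rundeck server.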
THE RESULT OF YOUR AUTOMATION
Now, what should the result of your automation be? "Job done" is a good starting point, but it gives you little visibility into your automation. "Job succeeded/failed" is already better: it indicates where something went wrong, so that you can verify and take action. But is that sufficient?
In my opinion, it is not. As a manager, I am interested in the results of my automation and want to see at a glance whether it succeeded, what the results were and what the exceptions are. For those exceptions, I want to see the remediations. This may sound like a lot, but it is not too difficult with current technologies.
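What "seeing at a glance" could look like is sketched below in Python: each task reports whether it succeeded, what went wrong and what remediation was applied or is needed, and the results are condensed into a short summary with a success-rate KPI at the top. The record fields, the wording and the KPI are hypothetical examples, not a fixed format.

```python
# Sketch: a one-glance summary of the night's automation.
# Field names, wording and the success-rate KPI are hypothetical examples.
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    task: str
    succeeded: bool
    detail: str = ""       # what went wrong, if anything
    remediation: str = ""  # what was (or must be) done about a failure


def summarize(results: List[TaskResult]) -> str:
    """Return a short text report: KPI line first, then each exception."""
    total = len(results)
    failed = [r for r in results if not r.succeeded]
    rate = (total - len(failed)) / total if total else 1.0
    lines = [f"Automation summary: {total - len(failed)}/{total} succeeded ({rate:.0%})"]
    for r in failed:
        lines.append(f"  EXCEPTION {r.task}: {r.detail}")
        lines.append(f"    remediation: {r.remediation or 'none recorded, needs follow-up'}")
    return "\n".join(lines)
```

Because the output is plain text, the same summary can be mailed every morning or shown on a portal.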
I use my monitoring for exception handling and the output of my other tools for my reports. This might not be much at first, but it gives you an idea of what to expect. Once this works, I consolidate the reports into one overview (cleaning it up, making it more readable, setting the KPIs I need …) that is sent to my mailbox every day at the start of the day. That way, I am aware of what is going on and what the operational challenges are for the day. We also publish this overview on a portal and project it on one of the operational screens (TV).
CONCLUSION
To conclude, automation is a combination of the manager’s perseverance, the right engineers and the right tools within a pragmatic process organization. As for tools, it is important to realize that they can and must work together, and that the same set applies to every automation process: you need an automation tool, a scheduler and a monitoring tool to implement automation that goes beyond a few basic scripts.
When you start, think about the organizational implications. Do I have the right skills for automation (system engineers)? If not, should I seek help? Are my processes mature enough to be automated? Are my use cases clear enough to be automated? If you are not sure, it is better to seek the appropriate help than to start a journey without a clear vision.