Everybody is talking about how "Agile" makes you succeed.
However, Agile is not a silver bullet. Because it is an empirical process, there are many failures on the way to success.
In this blog, I want to share the failures from which I have learned so that others may also learn.
Back in my days as a Six Sigma Master Black Belt, I learned and taught that every process can be modelled as a function. We write, on an abstract level, "y=f(x)". We use tools like SIPOC and Ishikawa Diagrams ("Fishbone") to model the relationship between inputs and outputs.
Since most processes have some variation, we accept that in practice, the function really implies "y=f(x+e)", where e is an error term within certain tolerance limits, typically impacting the accuracy of the model by no more than 5%. Because e is so small and we can't control it anyway, we ignore it in the model.
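To make "we ignore e" a bit more tangible, here is a minimal Python sketch. Everything in it is an assumption for illustration: the process function `process` (a simple proportional model), the 5% tolerance applied to the input, and the helper names. It only shows that, for such a well-behaved process, the output never strays further from the ideal than the tolerance allows.

```python
import random


def process(x: float) -> float:
    """Hypothetical deterministic process model: y = f(x)."""
    return 3.0 * x


def noisy_process(x: float, tolerance: float, rng: random.Random) -> float:
    """Same process with an error term: y = f(x + e), where |e| <= tolerance * |x|."""
    e = rng.uniform(-tolerance, tolerance) * x
    return process(x + e)


def max_relative_deviation(x: float, tolerance: float, runs: int, seed: int = 0) -> float:
    """Largest relative gap between the noisy output and the ideal output over many runs."""
    rng = random.Random(seed)
    ideal = process(x)
    return max(
        abs(noisy_process(x, tolerance, rng) - ideal) / abs(ideal)
        for _ in range(runs)
    )


if __name__ == "__main__":
    worst = max_relative_deviation(x=10.0, tolerance=0.05, runs=1000)
    print(f"worst relative deviation: {worst:.4f}")
```

As long as the worst deviation stays inside the tolerance, treating the process as plain "y=f(x)" is a reasonable simplification. The rest of this post is about what happens when it isn't.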
But what do you do when y=f(x) is simply an invalid assumption?
That's where agility comes into play.
The first difference between product manufacturing and software development is that product manufacturing is reproducible and repeatable. You know the exact process to apply to certain inputs to get the predetermined output. This is the "simple" world based on the Cynefin Framework.
But nobody in their right mind would build the same software the same way twice. Rather, if you have it, you'd use it. Or you'd make some changes. Either way, the output is not identical.
What typically happens is that once a customer (user) sees a piece of software, they get ideas of what else they need. In this case, it's more like "y=f(x,y)": Given a result, the customer defines the new result. The process isn't reproducible anymore, because the output of each stage is input again. As a Six Sigma practitioner, I already had some issues with outputs that were process inputs: SIPOC becomes confusing, but it was workable.
At this point, it makes sense to work incrementally, because you can't know the final output until you have seen intermediate output. This is why agile frameworks typically apply backlogs rather than finalized requirement specification documents. We accept that the user may need to gain some understanding of the system to decide what has value and what doesn't. We also accept that the user may change their opinion on anything that has only been a vague idea so far.
Failure to implement checks and balances will move the team into the Complex domain within a couple of weeks. Validation is not a one-time initial activity but must actually become stricter throughout the project, lest errors accumulate and disorder lead straight into a Chaotic environment.
There is no way to reach the Simple domain unless you acquire a good crystal ball that lets you accurately predict the future. (If you have one, you probably shouldn't develop software but play the stock market.)
You get into a mess when you only assume that the customer likes the result, but can't know for certain whether they do. The definition of quality may be unavailable until after the result is available to users. This scenario is fairly typical in web development, so as developers we have to guess what customers will accept. When users don't like something, we made an error. And there's no way of knowing that it was an error until we deployed the feature. We have to live with the fact that "y=f(x,y,e)". We can't eliminate the error term from our model; rather, we have to accommodate the ever-present risk. We can only try to minimize risk by getting frequent feedback. The more time between two deployments and the more content within a single deployment, the more likely we made some critical error in the user experience.
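A back-of-the-envelope calculation shows why batch size matters. The sketch below rests on two loud assumptions that are mine, not measured data: every change has the same probability of being rejected by users, and changes are independent of each other. The function name and the 10% figure are purely illustrative.

```python
def risk_of_at_least_one_miss(changes: int, miss_probability: float) -> float:
    """Probability that at least one of `changes` independent changes is a miss.

    Assumes each change is rejected by users with the same probability
    `miss_probability`, independently of the others: 1 - (1 - p)^n.
    """
    if changes < 0:
        raise ValueError("changes must not be negative")
    if not 0.0 <= miss_probability <= 1.0:
        raise ValueError("miss_probability must be between 0 and 1")
    return 1.0 - (1.0 - miss_probability) ** changes


if __name__ == "__main__":
    # Hypothetical 10% chance that users dislike any single change.
    for batch_size in (1, 5, 20):
        risk = risk_of_at_least_one_miss(batch_size, 0.10)
        print(f"{batch_size:>2} changes per deployment -> risk {risk:.2f}")
```

Under these assumptions, a deployment with a single change carries a 10% risk, five changes carry roughly 41%, and twenty changes roughly 88%. Real changes are neither independent nor equally risky, but the direction of the effect is the point.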
Processes like A-B testing and continuous delivery become critical success factors.
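For readers who have never evaluated an A-B test, here is a minimal sketch of one common way to do it: a two-proportion z-test using only the Python standard library. The function name and the visitor and conversion numbers are made up for illustration, and real experiments need more care (sample size planning, stopping rules) than this shows.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Compare the conversion rates of variants A and B.

    Returns (z, p_value), where p_value is two-sided under the normal approximation.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if standard_error == 0.0:
        # Both variants converted nobody or everybody: no evidence of a difference.
        return 0.0, 1.0
    z = (p_b - p_a) / standard_error
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical numbers: 1000 visitors per variant.
    z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between the variants is unlikely to be pure chance. It does not tell you why users prefer one variant, which is why this complements talking to users rather than replacing it.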
While you cannot completely eliminate the randomness, creating a high-speed feedback mechanism, such as actively involving users in every minor detail produced, minimizes the effect of errors and effectively permits working much as you would in a Complicated environment.
The absence of control processes which deal with randomness may cause disorder to quickly shift the team into the Chaotic domain.
The worst thing that can happen is that yesterday's truth may be today's error. Customers may change their preferences on a whim. Imagine you work in a project where yesterday you were requested to build a web portal for e-commerce, and today the customer wants a document management system instead. Any plan is invalidated when you can't rely on your requirements, user stories or feature requests - or whatever you call them. Your model becomes a form of "y=f(e)", where neither x, your input, nor y, your previous output, are relevant indicators of success or failure.
This is where Waterfall projects with long planning phases may drift: By the time the plan is realized, demand has shifted due to factors outside the project's sphere of control. An example would be a Waterfall team building a well-defined, perfectly fine PHP platform over the past 2 years, meeting all business requirements, only to find out that the newly hired CTO has just announced that all newly launched systems must be implemented in pure Java.
The only good news about the Chaotic domain is: You don't have to be afraid that it gets worse. Everything the team does may be waste.
The best way to deal with the Chaotic domain is moving away from it by delivering iteratively to minimize the effect of uncontrollable randomness on results. Frequent releases, no more than a couple of weeks apart, provide the opportunity to move into the Complex domain.
The nicest statistical process model doesn't help when there is no feasible way to keep error terms under control, or when any model you come up with has relevant variables which cannot be controlled. Six Sigma techniques for statistically modelling processes are useful when the process is repeatable and reproducible (good R+R), but they won't get you far in a domain where R+R is neither given nor desired.
Imagine you're a sysop. Your systems are up and running, so you sit back and chill.
Yeah, I said, "Imagine". Now, I mean, seriously. Not like you'd have time for that. Every platform requires frequent maintenance, you wouldn't want any service to deteriorate. While you're juggling security updates, intrusion alerts, overflowing storage, network load and whatnotever could possibly go wrong on the average working day, those nasty Scrum teams come around and ask for a configuration change, giving you only a couple days to implement.
They can't even formulate a proper Request for Change including a detailed risk assessment and simply want you to "install that one module" and because you think that the new storage module is more important, they request "just give us the root password and we'll do it ourselves".
Anyways. That's beyond reasonable. Someone will compromise the server in one way or another and everyone will point fingers at you. Worst case, you'll get the sack. No chance in hell freezing over!
Ok, that was "Why are admins always so slow" for all of you developers out there.
Now, let's imagine you're the developer. You've got your story backlog, you've got the code lined up and are ready to deploy. Inside your Scrum team, you can simply prioritize the issue, and everyone will swarm to solve it within the business day. Unfortunately, this time you need a change to the system configuration, a certain module needs to be installed. And that means you have to go into the lion's den of sysadmin. And probably they'll ask to have everything written up formally, only to shove it into some tracking tool where it will rot until a date far beyond sprint review. Chances are, your story won't make it, because they'll be too slow.
That is the flip side of the coin. Now, who is wrong?
This dilemma exists in many companies and the result is never-ending finger-pointing and blame games. But - that's not how our story needs to end.
How can you synchronize Development and Operations without infringing on either, while satisfying both? What can Development offer to Operations to make their life easier, how can the mutual pain be eased?
In small startups, you can simply move the admin role inside the sphere of responsibility for the Scrum team, but as soon as the organization grows, difficult questions will arise: How do we keep servers secure across teams, how do we maintain dozens of individual systems which can go berserk every minute, etc?
Some people go as far as claiming "#NoOps: Developers can do everything!", but let's be honest. You're not very likely to find someone who can do the highly specialized full-time job of a developer and still be a highly specialized full-time admin, while at the same time having time to keep up with new business requirements and user activity on the platform - and still working on an affordable payscale. On the other hand, distributing administration within the team may simply not be feasible, because too many half-competent (no offense) developers run a risk of making a dangerous modification to the servers.
The main result I have seen so far coming out of #NoOps claims is servers that are hardly maintained, with configurations that are insecure, incomprehensible and run a tremendous risk of falling apart any minute. Many developers don't even understand why they shouldn't simply log in as root. And that's just the tip of the iceberg. It goes on with logging in via telnet, creating tablespaces with "autoextend" and more. (I apologize to all the admins out there for the head bruises incurred by facepalming too hard and to all the developers out there who don't understand where the problem is.)
Development must understand the pain of Operations rather than pointing fingers - and Operations must understand the pain of Development.
DevOps is the solution: Let's just admit that administration is necessary and that developers are developers. And then work out solutions from there. Rather than usurping admin activity into the dev team, let's facilitate a positive information exchange, foster mutual understanding and simplify the interaction.
For instance, you can automate the configuration management with CI, or you can go as far as fully automating deployment and eliminating the risks that always come with major releases. Development can relieve Operations of any activity that really no admin wants to do (e.g. working through tedious, poorly written instruction manuals), and in turn, Operations can provide platforms where change is smooth to implement and easy to track.
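One way to picture "smooth to implement and easy to track" is to express the wanted configuration as data and let a script work out the difference, instead of handing over the root password. The following Python sketch is only an illustration of that idea: the `Change` type, the `plan_changes` function and the module names are hypothetical and not taken from any real configuration management tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    """One reviewable step: what to do, to which module, and to which version."""
    action: str  # "install", "change_version" or "remove"
    module: str
    version: Optional[str]


def plan_changes(desired: dict[str, str], installed: dict[str, str]) -> list[Change]:
    """Compare the desired state with the installed state and list the steps needed.

    Both arguments map module name -> version. Nothing is executed here;
    the result is a plan that Operations can review, approve and log.
    """
    changes = []
    for module, version in sorted(desired.items()):
        if module not in installed:
            changes.append(Change("install", module, version))
        elif installed[module] != version:
            changes.append(Change("change_version", module, version))
    for module in sorted(installed.keys() - desired.keys()):
        changes.append(Change("remove", module, None))
    return changes


if __name__ == "__main__":
    # Hypothetical module names and versions.
    desired = {"image-resizer": "2.1", "storage-driver": "5.0"}
    installed = {"storage-driver": "4.8", "legacy-mailer": "1.0"}
    for change in plan_changes(desired, installed):
        print(change)
```

The design choice worth noting is that the plan is empty when nothing needs to change, so running it again is harmless. Development gets a fast, repeatable way to ask for "that one module", and Operations gets a written record of every request without anyone filling in a form by hand.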