Actionable summary of this article if you have only a few seconds to spare:
Reasoning from first principles, operations management at its core is a cycle of: 1) detecting the events that need attention, 2) resolving them and 3) systemically improving the process.
Operators spend most of their time on 1) and 2) above, with little time left for systemic improvements.
Detection - understanding what the operations team should focus on - is usually manual and reactive, which makes it error-prone and inefficient. “Delegating” this to a system (e.g. through dashboard alerts, automatic monitoring) makes the detection proactive and improves the overall cycle, given it’s the input for the whole system.
The resolution process, despite requiring collaboration from several teams, is usually handled ad hoc (via email, Slack, etc.). Adopting an incident management system helps both by increasing the effectiveness and efficiency of the day-to-day and by informing future improvements.
Systemic improvement is typically deprioritized, given little time is left after 1) and 2) above. By automating detection and implementing an incident management system, you get the time and data needed to run systemic improvements.
Improving the 3 steps above enables you to close the feedback loop in operations, significantly improving your ability to scale and drive value effectively.
Ops managers are the day-to-day heroes keeping business running
However, a day in the life of an ops manager is not for the faint of heart. If you work in Ops, this typical day is likely familiar. You have <insert your dashboarding tool of choice here> open with at least 3-5 charts that you routinely check. You check these dashboards for the same things every day, for at least 2-3 hours a day. Then, as you identify events that require your attention, you start the resolution process. Most of the time this means triggering an email chain to the 3-4 people who need to be in the loop, or acting on something related to the event. Multiply this by (at least) 3 and you get a sense of a normal day. Lastly, you start thinking about how to make more systemic improvements - this usually happens outside “normal” working hours, as the day-to-day above takes all the time.
Let’s break down the components and look under the hood.
Detection is hard
Like any process, the inputs are key as they influence the underlying steps. At its most basic, detection implies there is 1) a change in an underlying state (ideally tracked & stored as data), 2) an observer that monitors for relevant changes and 3) an event that is created and needs acting upon.
Operators are adept at knowing what to look for - they can articulate the relevant monitoring rules, given that they (usually) apply them manually. There are more complex cases that might require prediction models (e.g. churn monitoring), but even then a skilled operator can describe with a high degree of accuracy the rules / indicators for when a customer is at risk of churning.
That said, consistent, accurate detection is hard. This is largely due to 3 factors:
It’s manual & repetitive, thus error-prone. An operator has to monitor multiple dashboards and / or thousands of rows in spreadsheets when only a fraction of that information is relevant (e.g. an order delayed by more than 3 days, a bad review that contains certain feedback).
It’s non-standard - unlike tech stack observability (e.g. error logs, CPU usage, site downtime) operations monitoring is very specific to each sub-vertical or even business.
It’s fluid - given the ever-changing nature of operations, the rules which are used to detect relevant events today might not be the same next month / quarter.
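To make the rule-based nature of detection concrete, here is a minimal Python sketch of an “order delayed by more than 3 days” monitor. The order records, field names, dates and threshold are all illustrative assumptions, not a real system:

```python
from datetime import date

# Hypothetical order records; field names and dates are made up.
orders = [
    {"id": 1, "promised": date(2024, 5, 1), "delivered": date(2024, 5, 2)},
    {"id": 2, "promised": date(2024, 5, 1), "delivered": date(2024, 5, 6)},
    {"id": 3, "promised": date(2024, 5, 3), "delivered": None},  # still open
]

def detect_delayed(orders, threshold_days=3, today=date(2024, 5, 8)):
    """Flag orders delivered (or still open) more than threshold_days late."""
    events = []
    for o in orders:
        end = o["delivered"] or today  # open orders counted as of today
        delay = (end - o["promised"]).days
        if delay > threshold_days:
            events.append({"order_id": o["id"], "delay_days": delay})
    return events

alerts = detect_delayed(orders)
```

In a real setup, a check like this would run on a schedule against live data, and each returned event would be routed to an owner rather than waiting to be spotted on a dashboard.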
Events vs. Metrics
Ray Dalio nicely explained the economy as a series of transactions. Similarly, we could simplify operations to a series of individual events. We are used to reviewing OKRs, KPIs and dashboards, which are aggregated events (combined to generate a metric that is tracked / displayed). To exemplify: a delayed order is an event, while the count of all delayed orders in a period divided by total orders is the % of delayed orders. At the risk of stating the obvious with this distinction, I noticed the two concepts sometimes get conflated.
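A minimal sketch of the distinction, assuming a made-up event log in which every order record carries a delayed flag - the events are the individual records, and the metric is just an aggregation over them:

```python
# Illustrative event log: each order is an individual event.
# Every 5th order is marked delayed purely for the example.
events = [
    {"order_id": i, "delayed": i % 5 == 0}
    for i in range(1, 101)
]

# The metric is an aggregation over the events: % of delayed orders.
delayed_count = sum(1 for e in events if e["delayed"])
pct_delayed = delayed_count / len(events) * 100
```

The dashboard shows only `pct_delayed`; the 20 underlying events, each individually actionable, are hidden inside it.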
Needless to say, the more granular you can intervene (i.e. at the event level), the more impact you can drive for the business.
While smaller companies / startups tend to operate at an event level, most companies run operations at the aggregated level due to the high volume of events and the lack of automated monitoring. For example, the operations team would monitor the delayed orders metric and, if it gets above a certain amount, intervene.
However, at the aggregation level valuable detail and opportunities can be lost. Keeping with the above example, whilst there might not be enough capacity to manage each delayed order, there might be value in detecting and managing high-impact delays (first-time or repeat users, high order value). Detecting those specific cases from the aggregate metric is, in most cases, impactful.
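Keeping the events around, rather than only the aggregate, turns this kind of selective intervention into a simple filter. A hypothetical sketch, with made-up fields and an assumed value threshold:

```python
# Hypothetical delayed-order events with attributes that signal impact.
delayed_orders = [
    {"id": 1, "order_value": 30.0, "first_time_user": False},
    {"id": 2, "order_value": 250.0, "first_time_user": False},
    {"id": 3, "order_value": 45.0, "first_time_user": True},
]

HIGH_VALUE = 200.0  # assumed threshold for a "high-value" order

# Even without capacity to handle every delay, the high-impact
# subset (first-time users, high order value) is worth managing.
high_impact = [
    o for o in delayed_orders
    if o["first_time_user"] or o["order_value"] >= HIGH_VALUE
]
```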
The (mostly) reactive nature of detection
We can map how detection is achieved on a spectrum from reactive to proactive.
On the most reactive end of the spectrum, detection is done by customers (internal or external) - e.g. customers flag that food delivery orders are not delivered. Needless to say, this is not ideal and can be very costly (in lost revenue, higher costs or decreased customer satisfaction).
As we move up the spectrum we have (Excel / Google) sheets, (operational) dashboards and customized project management tools. These work up to a point - as the business grows, it becomes harder to stay on top of all the events. At that stage detection starts to take up a lot of time - imagine checking thousands of rows of the same data structure for the 5-10 events that require attention.
Lastly, we have proactive detection systems - the best examples are tools like Datadog for tech functions (e.g. tech observability, monitoring). In ops, monitoring is usually done by the tech teams by implementing alerting into their core product - whilst effective, this is very costly from a resource perspective and not flexible, as it can’t be managed by operators.
Thus, a good detection system should be automated, fast and easy to manage / change by operators themselves. Relying on the (ever constrained) technical folks likely means that the system will always lag behind what the business needs.
The added bonus of this flexibility in detection is that it allows operators to test hypotheses quickly by setting up monitoring for new / overlooked areas of the business that could drive a lot of value if properly managed.
Resolution is opaque
Once detection is handled, the actual work of driving value starts. Individual event resolution looks different from addressing an aggregated metric.
At the event level - without automatic detection and assignment - the person who detects an event needs to determine what is the best resolution path. Resolution of that event is straightforward in the fortunate (but rare) case when the person noticing the event is also the owner and can resolve the event without support. If most operational events were like this, then transparency in this step would not be crucial.
However, in most cases, two things happen:
Resolution requires coordination from multiple people / functions.
Resolution path is unknown. When faced with a new issue that requires problem solving (and hence unknown steps to resolve), the collaboration described above multiplies significantly, leading to a lot of time spent on managing that process.
The above coordination is typically managed by email, Slack and other tools, obscuring the real work required to keep the business running and making it hard to move quickly (there is a name for this - “collaboration tax”).
On the aggregated metric level the complexity increases as the resolution also requires analysis. In the above example of delayed orders, when the metric jumps (i.e. there is an increase in delayed orders) the ops manager starts analyzing the issue which means looking at the event level to determine if there is a root cause that explains the increase in delays. Are delays increasing for a specific region? Product line? Delivery provider? User group? Or is it systemic, with no pattern?
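That analysis step amounts to slicing the underlying events by candidate dimensions and looking for a dominant bucket. A toy sketch with made-up delay events and dimensions:

```python
from collections import Counter

# Hypothetical delayed-order events tagged with candidate dimensions.
delays = [
    {"region": "north", "provider": "A"},
    {"region": "north", "provider": "B"},
    {"region": "north", "provider": "A"},
    {"region": "south", "provider": "A"},
]

def top_segment(events, dimension):
    """Count delays per value of one dimension; return the biggest
    bucket and its share of all delays."""
    counts = Counter(e[dimension] for e in events)
    value, n = counts.most_common(1)[0]
    return value, n / len(events)

region, share = top_segment(delays, "region")
```

If one segment accounts for most of the delays (here, one region holds 75% of them), the issue is likely localized; if every slice looks evenly spread, the cause is probably systemic.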
Resolution in this case is further delayed because of this extra step, which might leave value on the table - imagine the case when a large share of those extra delays are for first-time buyers or high-value orders.
Being able to monitor & identify (from the aggregate) those specific cases can enable operations teams to react in real time & unlock significant value - in this case, by improving retention for first-time buyers or improving customer satisfaction for high-value clients.
Systemic improvement is typically deprioritized (unless it becomes crucial)
The day-to-day described above is intense - ops teams have enough work to fill a 24/7 schedule. Prioritization thus becomes crucial - when the BAU operations take all the available time, systemic improvement is deprioritized.
Compounding this problem is the lack of data to inform this work - imagine being able to sit down with the ops team every 2 weeks and do an operational retrospective, essentially answering a few critical questions:
What are we spending our time on?
What is the impact that we deliver?
What is triggering this work? Can we reduce the volumes or improve the outcome?
With manual detection and ad-hoc / opaque resolution, these retrospectives become impossible and only happen when there is a critical need that requires fundamental re-thinking. However, proactive monitoring and centralized incident management can provide the data and time needed for these retrospectives.
Closing the feedback loop
The three components described above - flexibility in detection, order in resolution and data for systemic improvements - enable operators to experiment quickly in their operations. This enables them to observe what is happening at a granular level, which in turn helps them define how to address / resolve those events. I think of this as rapid iteration in operations - define a hypothesis, start monitoring to confirm it, test a resolution path and review the results - refine and repeat until you get to a monitor & process that drives value for your business.
Feedback loops such as the one described above are super powerful when applied consistently and with good insights. In most operational contexts, feedback loops are hard to build given the challenges we discussed above (manual / time consuming detection, opaque resolution, no time / data for system improvements & reviews).
Lastly, a quick thought on automation. While it makes a lot of sense to automate as much of the business processes as possible, automation is hard, as you need to know the precise steps that are required. For example, tools like Zapier automate simple processes, but complex ones require discovery / mapping first. The alternative: start monitoring, test resolution pathways and then automate.
We’re building a platform that can help you with the 3 stages described above - drop us a note if you want to chat!