Plexxi on integrations: If it’s a one-off, you’re doing it wrong

September 11th, 2013 | Plexxi

The energy behind things like SDN and DevOps is focused on driving long-term operating expense down through heavy automation. Automation within a silo is meaningful, particularly for infrastructure where change is either frequent or complex. In the former case, it reduces repetitive tasks; in the latter, it lessens the likelihood of human error.

But the real power of automation is at the edges. Wherever two silos come together (storage and networking, or compute and networking, for example), the boundaries tend to be hard edges. We don't see a lot of orchestration tools that effectively manage across these boundaries. In fact, within customer environments, these domains are owned by different teams under different management with different budgets and priorities. When things need to happen at these hard organizational and infrastructure edges, the work tends to be more manual and arduous.

Integration today is treated as a bunch of one-off events. When the vendor does them for you, they typically charge you for the professional services engagement. When the integration is done, they leave the engagement and take with them all the expertise they used to do it. The next time you need an integration? Give them a call and start writing the check.

But the problem is that integration is not a one-off event. It happens over and over. If you really want to make that orchestration buzzword real, you need to integrate. And integrate. And integrate some more. For contextual information to span these silos, you need more than just two things to be integrated. You need an infrastructure fabric of interrelated elements that all speak a common language. And if that infrastructure fabric is multi-vendor, then the only hope for a common language is to create a data shim layer of sorts: a data services engine.

And by the way, not only is this integration not a one-off event, it isn't even done when the integration is done. Integration is a continuous activity over the life of a solution. Things change. Correlations change. Workflows are created. And deprecated. These changes, sometimes subtle and sometimes not, will continue. This means that integration is less a specific thing and more a state. If the infrastructure is too rigid, it ceases to be useful on day 2. If it is too loosely defined, it cannot be applied with any specificity.

This is why the answer to the problem cannot be "We have an API for that." APIs are interfaces, but they are notoriously selfish. An API says "Talk to me, but speak my language." If you have a very clear center of the universe, this makes sense. But if the ecosystem is really a set of interrelated entities, how does this work? 

This is all to say that there has been a gap in IT for some time now. That gap has gone unnoticed because we were mostly content with our silos. As trends like the software-defined data center (or any of the SD* derivatives) take off, they will expose the gap. Companies that expect their software-defined solutions to be production-worthy will quickly find that readiness extends well beyond basic capability. Considering the operator experience and being context-friendly will absolutely be critical. And this data services functionality will ultimately be the glue that makes orchestration possible.

Let's take an example. An end user working at some remote branch office is trying to submit an expense report. When she enters her data into the expense tool and clicks submit, she gets the spinning wheel of death. The application hangs. Because entering expense report data is such a colossal pain, she is frustrated after the second time the application dies. She calls IT, and the first-level support guy quickly passes her on to someone more experienced. What happens next?

It isn't clear whether the issue is an application issue, a network issue, or possibly even human error. In fact, troubleshooting the problem is not trivial. If the issue isn't easily reproducible, it might never really get examined. Individuals might try common fixes ("Have you tried rebooting your computer?") with the hope that they get lucky. The application guy looks and sees no error messages. He updates the ticket and assigns it to the networking girl. She looks at the network and sees no issues. She bounces the ticket back to the application guy. After a couple of days, the ticket is closed because the problem seems to have gone away. 

Troubleshooting these issues is hard because the data needed to do a thorough job usually doesn't exist, and even when it does, it is scattered across a distributed set of resources, each owned by a different team. When a low-level support person gets the ticket, they look at the data on the resources they own and try to figure out what is wrong.

There should be a better way.

The foundational key to these types of scenarios is that the collective infrastructure is really a heterogeneous set of state consumers and producers. Every element in the infrastructure produces some state (if it can be logged, it is probably state) and consumes some state (configuration is the most common but not the only type). All of this information exists (not always persistently), but there is no single place where the state information can be easily queried and used to do something meaningful.
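As a rough mental model (every name and value below is invented for illustration), that state is just tagged records that today live scattered across devices and tools. Pulled into one place, it becomes something you can actually query:

```python
# Illustrative only: what scattered infrastructure state might look like
# if it were pulled into one queryable place. All producers, keys, and
# values here are invented for the example.
state = [
    {"producer": "switch-leaf-3", "kind": "counter", "key": "Gig0/1.errors", "value": 1042},
    {"producer": "expense-app-vm", "kind": "log", "key": "http.500", "value": 7},
    {"producer": "san-array-2", "kind": "config", "key": "lun.42.owner", "value": "finance"},
]

# With everything queryable in one place, a cross-silo question
# ("what have the expense app and its access switch reported?")
# becomes a one-liner instead of a ticket bounced between teams.
relevant = [s for s in state if s["producer"] in ("expense-app-vm", "switch-leaf-3")]
for record in relevant:
    print(record["producer"], record["key"], record["value"])
```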

Ideally, anything that consumes this state to either determine or actually do something is a data service. Data services could do logging, monitoring, compliance, provisioning, help desk ticket workflow, billing, and so on. To facilitate all these data services, we need some central point of integration. We need a data services engine.

The important part here is that the data services engine cannot exist only within a silo. The problem that needs to be solved exists at the silo boundaries, not within the silos themselves.

The data services engine ideally becomes the source of truth for all current state. Anyone who is either a producer or a consumer subscribes to the data services engine, indicating which subset of the total state is interesting. They can then publish or consume updates. This is a pretty basic application of a message bus.
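To make that concrete, here is a minimal sketch of the subscribe/publish pattern in Python. The class name, topic strings, and payloads are all assumptions for illustration; a real engine would sit on a durable, networked bus rather than an in-process dictionary.

```python
from collections import defaultdict
from typing import Callable, Dict, List

class DataServicesEngine:
    """Toy in-process message bus standing in for a data services engine."""

    def __init__(self) -> None:
        # topic -> list of subscriber callbacks
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[dict], None]) -> None:
        """A consumer declares which subset of the total state it cares about."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, state: dict) -> None:
        """A producer pushes a state update; every interested consumer sees it."""
        for callback in self._subscribers[topic]:
            callback(state)

# Example: a help-desk data service watching for interface errors.
engine = DataServicesEngine()
engine.subscribe("network.interface", lambda s: print("ticket candidate:", s))
engine.publish("network.interface", {"device": "leaf-3", "errors": 1042})
```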

Because these producers/consumers of state will all do so in their own way, the data services engine serves not only as a means of collecting and reporting data but also as the great normalizer of information. Anything that comes into the services engine needs to be groomed – essentially formatted in a common way. And on egress, that same data needs to be reported in whatever format the subscribing infrastructure requires. 
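A sketch of that grooming step might look like the following. The vendor payloads and field names are invented; what matters is the shape of the flow: normalize into one common record on ingress, render into whatever a subscriber wants on egress.

```python
from datetime import datetime, timezone

# Hypothetical payloads from two different producers; the field names
# are made up, but the mismatch between them is the point.
VENDOR_A_PAYLOAD = {"ifName": "Gig0/1", "ifErrors": 17, "ts": "2013-09-11T14:02:00Z"}
VENDOR_B_PAYLOAD = {"port": "eth0", "err_count": 17, "epoch": 1378908120}

def groom_ingress(payload: dict) -> dict:
    """Normalize any producer's payload into the engine's common record."""
    if "ifName" in payload:
        return {"interface": payload["ifName"],
                "errors": payload["ifErrors"],
                "time": payload["ts"]}
    if "port" in payload:
        return {"interface": payload["port"],
                "errors": payload["err_count"],
                "time": datetime.fromtimestamp(payload["epoch"], tz=timezone.utc).isoformat()}
    raise ValueError("unrecognized producer format")

def format_egress(record: dict, fmt: str) -> str:
    """Render the common record in whatever shape a subscriber expects."""
    if fmt == "syslog-ish":
        return f"{record['time']} IFACE={record['interface']} ERRORS={record['errors']}"
    if fmt == "csv":
        return f"{record['interface']},{record['errors']},{record['time']}"
    raise ValueError("unrecognized subscriber format")

common = groom_ingress(VENDOR_A_PAYLOAD)
print(format_egress(common, "csv"))
```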

The subtle point here is that by having a common central format and a means of grooming data on ingress to and egress from that central point, you are essentially creating a common data services model that can be easily extended to different types of data services, whether device-specific (like configuration for a particular type of gear) or more overarching (cross-ecosystem troubleshooting for anything associated with a DNS server).
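As a small illustration of that extensibility (both services and every record below are hypothetical), the same normalized stream can feed a device-specific service and a cross-ecosystem one, and neither needs to know how the underlying state was originally produced:

```python
# Two hypothetical data services built on the same normalized records.
# Neither one needs to know which vendor or silo produced the state.

class DeviceConfigService:
    """Device-specific: pushes a baseline config when new gear joins."""
    def on_state(self, record: dict) -> None:
        if record.get("event") == "device_joined":
            print(f"provisioning baseline config for {record['device']}")

class DnsTroubleshootingService:
    """Overarching: correlates anything in the ecosystem tagged as DNS-related."""
    def on_state(self, record: dict) -> None:
        if "dns" in record.get("tags", []):
            print(f"correlating DNS-related event from {record['source']}")

# Both services consume the same stream of normalized records.
records = [
    {"event": "device_joined", "device": "leaf-3", "tags": []},
    {"event": "lookup_timeout", "source": "branch-resolver", "tags": ["dns"]},
]
for record in records:
    for service in (DeviceConfigService(), DnsTroubleshootingService()):
        service.on_state(record)
```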

There is obviously more than this required to fully automate and orchestrate workloads, but it would be a decent start.
