This is just my view of the most critical issues in the world of Data, AI, Analytics, and DevOps related to Data:
First of all, DataOps is a new term related to the data and analytics management lifecycle – Gartner considers DataOps one of the key emerging technologies at the moment (although the term is still defining itself, as is often the case in tech).
So, on to the subject:
1. SEE THE BIG PICTURE – currently we’re still in the phase where AI / Digital departments in organizations are treated as something very unique and in many cases have the status of pilot projects. However, the realization is in the air that all the different analytics tools must a) align with the core organization KPIs and b) be aggregated into some big picture.
Therefore, one of the main challenges here is finding a way to standardize and aggregate the data. Basically, an organization doesn’t want to end up with dozens of dashboards and analytics tools, each covering a small part of its business, while yet another team of analysts tries to make sense of it all, juggling many different data sources and methodologies along the way.
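To make the standardization idea concrete, here is a minimal sketch of mapping heterogeneous source records onto one shared schema. All the names (`crm`, `web`, the field names) are hypothetical and only illustrate the pattern, not any specific tool:

```python
# Each source calls the same business concepts by different names.
# A per-source mapping translates raw fields into one canonical schema,
# so downstream dashboards all speak the same language.
SOURCE_MAPPINGS = {
    "crm": {"customer_id": "id", "revenue_usd": "rev"},
    "web": {"customer_id": "uid", "revenue_usd": "amount"},
}

def standardize(source, row):
    """Map one source-specific row onto the shared schema."""
    mapping = SOURCE_MAPPINGS[source]
    return {canonical: row[raw] for canonical, raw in mapping.items()}

rows = [
    ("crm", {"id": "c1", "rev": 120.0}),
    ("web", {"uid": "c2", "amount": 35.5}),
]
unified = [standardize(src, row) for src, row in rows]
```

Once every source is forced through one mapping layer, the "team of analysts reconciling dozens of dashboards" problem becomes a schema-maintenance problem instead, which is far more tractable.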
2. SECURITY – I personally think that this is the #1 challenge, but the realization of this has not come yet and maybe won’t come in 2019 (but it will happen eventually). Based on the old paradigms, many people in the industry still care much more about uptime SLAs than about security. For some reason, it is hard for many to fully accept the idea that a single major security incident is far worse than even a full day or two of downtime.
Speculating a bit here, but the reasons may be psychological. Possibly, uptime SLAs feel much more under one’s control and can be easily expressed by a certain number of nines, while the popular opinion is still that it’s not possible to give a clear “SLA percentage” for security, and that it is very difficult (even in the better cases where organizations truly care about security) to create a reliable workflow that addresses new vulnerabilities continuously. So it may be that many people still think about security in all-or-nothing terms (i.e., either we implemented DDoS protection or we didn’t; either we set up the antivirus or we didn’t, etc.) – an approach that completely misses the standard of iterative, continuous improvement.
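The "number of nines" framing really is seductively concrete – each extra nine translates directly into an allowed-downtime budget, which is exactly the single number security lacks. A quick sketch of that arithmetic:

```python
def allowed_downtime_minutes_per_year(nines):
    """Minutes of downtime per year permitted by an SLA of N nines.
    E.g. 3 nines = 99.9% availability."""
    availability = 1 - 10 ** (-nines)
    minutes_per_year = 365 * 24 * 60
    return (1 - availability) * minutes_per_year

for n in (2, 3, 4, 5):
    print(f"{n} nines -> {allowed_downtime_minutes_per_year(n):.1f} min/year")
```

Three nines buys you under nine hours of downtime per year; there is no comparable formula that tells you how many breaches per year your security posture "permits", which may be exactly why SLAs get the attention.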
In relation to DataOps there is an immense number of security challenges that many people try to ignore or defer for as long as possible. To name a few: the Data Obfuscation problem – how we keep sensitive data from bouncing around in our tools while preserving a balance between obfuscation and keeping the data meaningful; the Data Audit and Access Control problem – who is allowed to do what, and who actually did what, with the data; the Data Provenance (or Origin) problem – can we even use these data, and how can we keep track of provenance across our various Big Data tools. All the standard DevOps security problems (perimeter security, user authorization, etc.) obviously need to be addressed as well. Finally, regulations like GDPR and HIPAA should also be mentioned here (and more regulation is probably on its way, which is a challenge in itself).
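As an illustration of the obfuscation-vs-meaning balance, here is a minimal pseudonymization sketch: the sensitive part of a value is replaced with a stable token (so joins and counts still work), while a non-identifying part is kept for analytics. The salt handling and field choices are assumptions for the example, not a production recipe:

```python
import hashlib

SALT = b"example-salt"  # in practice a secret managed outside the dataset

def pseudonymize(value):
    """Replace a sensitive string with a stable, irreversible token.
    The same input always yields the same token, so aggregation
    and joining across tables still work."""
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:12]

def obfuscate_email(email):
    """Tokenize the identifying local part but keep the domain,
    which is often useful for analytics (e.g. corporate vs consumer)."""
    local, _, domain = email.partition("@")
    return f"{pseudonymize(local)}@{domain}"

record = {"user": "alice@example.com", "purchase": 42.0}
safe = {**record, "user": obfuscate_email(record["user"])}
```

The trade-off the text describes is visible even here: keeping the domain preserves analytical value but also leaks information, and deciding where that line sits is exactly the hard part.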
To be clear, the main challenge in Security is that there are not enough security experts in the market to cover these needs at the current scale of operations. Maybe that is part of the reason many people in the industry try to ignore the Security challenge as a whole.
One last interesting thing about security: DataOps may actually help a great deal here, as Security Management is in many respects very similar to Data Management. So Security in general should be treated as an integral part of DataOps, and DataOps as a necessary element in solving the global Security challenge.
3. CONTROLLED DEPLOYMENTS – This challenge is about how to ensure that new data model deployments or online retraining still meet the product quality standard. A popular (and very wrong) belief is that if we label more data for the classifier, keep the same algorithm as before, and retrain, the end result will have better accuracy than before. While such an improved outcome is usually the case, it is absolutely not guaranteed; and even if we observe an overall accuracy improvement, it is possible that some very popular use case will be broken.
Let me give an example (I will exaggerate, but it gives the idea): imagine that you have a chatbot and it has “Hi” classified as a greeting. Now imagine that after retraining the classifier, overall accuracy (measured by some F-score) has improved, but the “Hi” case is now broken. Would this be acceptable for production?
The challenge here is to understand that not all data entries are created equal (they carry very different weight, for many reasons) and to create an efficient, controlled system for managing and catching such issues.
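One common way to capture the "not all entries are equal" idea is a must-pass regression set that gates every retrained model: the overall metric has to improve, AND a curated list of high-weight examples (like "Hi") has to stay correct. A minimal sketch, with a toy lookup-table model standing in for a real classifier:

```python
# Curated high-weight cases that a new model must never break,
# regardless of how the aggregate metric moves.
MUST_PASS = [
    ("Hi", "greeting"),
    ("Hello there", "greeting"),
    ("Cancel my order", "cancellation"),
]

class ToyModel:
    """Stand-in for a real classifier: a simple lookup table."""
    def __init__(self, table):
        self.table = table
    def predict(self, text):
        return self.table.get(text, "unknown")

def gate_deployment(model, overall_f1, baseline_f1):
    """Approve a retrained model only if the aggregate metric has not
    regressed AND every must-pass example is still classified correctly."""
    if overall_f1 < baseline_f1:
        return False
    return all(model.predict(text) == label for text, label in MUST_PASS)

# A retrained model whose overall F-score improved (0.92 vs 0.90)
# but which regressed on "Hi" – exactly the chatbot example above.
retrained = ToyModel({
    "Hello there": "greeting",
    "Cancel my order": "cancellation",
})
```

With such a gate in place, the improved-F-score-but-broken-"Hi" model is rejected automatically instead of being discovered by users in production.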
4. DATA VERSION CONTROL – while this is tightly related to the previous challenge, as it helps with deployments, Data Version Control is actually framed a little differently. The main goals here are to allow the system to a) roll back or roll forward data components at any time with full transparency; b) allow branching of the data repositories (i.e., the ability to work in parallel on two datasets of the same type, where one is, say, newer than the other); and c) support the audit process.
It is important to note that since we are dealing with “Big” data in many cases, the amounts of data may be huge, and it is far from obvious (in many cases this is an unsolved problem) how to put such data under version control – or what else can be done here to improve the situation.
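The goals a)–c) above can be sketched in miniature with a content-addressed store: every dataset state gets an id derived from its contents, branches are just named pointers to ids, and rolling back means moving a pointer. This is only a toy to show the shape of the idea – real Big Data versioning has to avoid copying the data at all, which is precisely the unsolved part:

```python
import hashlib
import json

def snapshot_id(dataset):
    """Content-address a dataset: identical contents -> identical id,
    which gives transparency and supports auditing."""
    payload = json.dumps(dataset, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:16]

class DataRepo:
    """Minimal version store: commits, branches, roll-back by pointer."""
    def __init__(self):
        self.objects = {}   # version id -> dataset contents
        self.branches = {}  # branch name -> version id

    def commit(self, branch, dataset):
        vid = snapshot_id(dataset)
        self.objects[vid] = dataset
        self.branches[branch] = vid
        return vid

    def checkout(self, branch):
        return self.objects[self.branches[branch]]

    def rollback(self, branch, vid):
        self.branches[branch] = vid  # roll back = move the pointer

repo = DataRepo()
v1 = repo.commit("main", [{"x": 1}])
repo.commit("experiment", [{"x": 1}, {"x": 2}])  # branch: newer data in parallel
```

In this toy, `main` and `experiment` coexist without interfering, and any past state is one `rollback` away – the version-control properties the text asks for, minus the scale problem.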
With all the above, I’m sure we will see more and better tools, more people tackling these challenges, more buzz about DataOps, and exciting things on the market. Happy 2019!