Episode 161

Martijn Moret - The risks of centralized data

Posted on: 28 Nov 2024

About

Martijn Moret is a seasoned professional and entrepreneur with experience across various industries such as aviation and hospitality, and he's the co-founder of DataSquirrel.ai.

In this episode, we talk about the risks of data centralization. We discuss the main challenges and risks of centralized data, the myth of 100% data integration vs. the untapped 60%, and the specific challenges when it comes to working with AI. We conclude with Martijn sharing his hopes and expectations for the future.

 

Links & mentions:

Transcript

"The problem with data centralization is that often companies also decide to make a central data team, which is at a certain distance of the data, which is very good for your governance model because you, you have data analysts that know more or less what is in the data and how they should map to each other. But the problem is that they are becoming at a distance from the business."

Intro: Welcome to the Agile Digital Transformation Podcast, where we explore different aspects of digital transformation and digital experience with your host, Tim Butara, content and community manager at Agiledrop.

Tim Butara: Hello, everyone. Thank you for tuning in.

Our guest today is the co-founder of DataSquirrel.ai, Martijn Moret, a seasoned professional and entrepreneur with experience across various industries, such as aviation, airlines, and hospitality. And I'm sure that he'll tell us a little bit more about himself in just a few moments. Today we'll be talking about the risks of data centralization. Martijn, welcome to the podcast.

Excited to get into everything with you, but first, if you want to add anything before we dive into the questions, you can feel free to do so now.

Martijn Moret: Okay. Thank you very much, Tim. Lovely to be here. DataSquirrel is a startup that tries to help business managers who don't have a lot of skills in data analysis and statistics to actually be able to analyze data.

But we'll get back to that later.

Tim Butara: Okay, cool. As I said, I'm excited to get into this with you. In a lot of our conversations about data, we inevitably touch upon things like data silos and how they're bad, and responsible data usage, but we haven't really had a focused conversation about centralized data specifically.

So I'm interested in it. Let's start with some of the main challenges of centralized data and how these have evolved as technology innovation accelerates all around us.

Martijn Moret: Yeah, so data centralization is a topic that I think has been on the table for 30 years or more.

My background is in aviation, where I ran a lot of different programs for a European airline, and every time we ran into the challenge of actually connecting different systems with each other. So even for a medium-sized organization, data centralization, or being able to connect different data points or different journeys, like the customer journey, the transaction journey, or the product journey, is always a big feat, because it takes a while to get everything mapped to each other.

The main issue is context. In a bigger organization, the context of certain data fields, or the way data is collected, is very hard to document and very hard to keep track of. And then, of course, you have the life cycle of all kinds of software that is storing and manipulating that data.

So getting to a data lake or data warehouse with centralized data, with documentation, and with people who know exactly what is in that data and how to get the context out is a very big challenge for almost any size of organization.

Tim Butara: Mm-hmm. That makes sense. One particular thing that I'm really interested in, which you mentioned when we were selecting the topics and subtopics to discuss, is the myth of 100 percent data integration versus the untapped 60 percent. Can you explain this a little bit more?

Martijn Moret: Yes. So every organization, ideally, would have connected or integrated a hundred percent of its data into a kind of centralized silo, like a data warehouse or a data lake.

The problem is really that there are so many different contexts. As I just said, there are the journeys that you have, but then add to that, for example, the market situation, whatever competitors are doing, and every kind of data that flows into your email inbox in CSV files or Excel files, maybe internal Excel files. I think there's no company in this world that doesn't still run some kind of primary process in Excel, and the Excel reliance metric is, of course, a famous one.

So it looks like there are two different worlds. One is the world of integrated data, and then you have all the additional data that is really hard to map onto any of the levels that you put in your data warehouse. And on top of that, once you have integrated data, different departments or different people want to do different things with it.

And you see that the big BI tools offer the ability to do something with data only up to a certain limit. The most popular button in those BI tools is usually "download data", and that leads to another Excel file or another spreadsheet, adding to that 60 percent again, to be used as input for something that people want to do outside of the already developed and more or less static dashboards.

So the 60 percent of data that is out there is ideally ready for analysis by business people or by data analysts in a department. But the problem is skills. The skills are hard, and it's not only knowing how to work with Excel, how to do a VLOOKUP, or how to build a pivot table; it's also being able to assess the quality of the data and to get to the context, as I mentioned before.

So the fact that you have a lot of data integrated, and another 60 percent of your data as an ad hoc thing, still doesn't mean that it's being used to its full extent.

Tim Butara: And you mentioned that one of the main reasons for that is the lack of skills among the business people who are now responsible for doing this data analysis.

So what can we do to improve this, to make this better in the future, not just in a specific instance?

Martijn Moret: Yeah. So of course, until now, it was training. And I think that any business manager, or even data analyst, should at least have some basic skills, not only in how to use tooling like spreadsheets, but also basic statistical knowledge: how to identify bias in your data, or how to make sure you're applying the right test to the data if you're trying to draw any conclusions from it. But this, of course, is mostly for people who are interested in this. That's why you see data analysts.

Going back to the previous point, the problem with data centralization is that often companies also decide to make a central data team, which is at a certain distance from the data. That is very good for your governance model, because you have data analysts who know more or less what is in the data and how the pieces should map to each other. But the problem is that they end up at a distance from the business. So the business unit doesn't have access to a data analyst whose priorities it sets.

What we hear a lot from our clients at DataSquirrel, for example, is that they need to analyze data themselves because their data team is simply overwhelmed with other requests, and it takes up to six weeks to get an answer to a simple question.

And the issue there is that the pace at which data is added to the whole set of what you're working with is growing so fast. Let's take, for example, sales and marketing data. How long ago was it that we didn't try to enrich all kinds of data on the leads we had, based on an email address, a name, a company, or an industry?

How long ago was it that we were able to get certain digital signals and put them into a system? So all these new possibilities only add to that data environment, I would say, and lead to new questions that you want to ask your data team. So once the data analyst is no longer part of your business unit, you have to do it yourself.

And if you don't have the skills for this, the data basically stays on the table, and you fall behind on improving how you run your business.

Tim Butara: So the best way would be to not just focus on the individual and how you can improve them, but to focus on the teams themselves and how you can equip the whole team to move forward more quickly and do the responsible thing with data without taking too long, basically also bridging this divide between the business side and the strict data science and tech side.

Martijn Moret: Absolutely. But of course, there is also a career path in play. People who start out as data enthusiasts do some training and become a data analyst within a business unit; then they have a certain skill or talent, and you often see that they move into a centralized data team. So I think businesses should actually look at what they need at a given moment and put dedicated data analyst skills into a department. But even then, it would most likely not make sense for a business to hire so many data analysts or data engineers that it can give an answer to virtually any question.

So we're back in the circle where the business manager, the one who tries out new strategies or experiments, or sees developments with competitors, wants to know certain things and needs to analyze them. The current era of AI, which we'll probably touch upon a little bit later, allows for certain solutions in that sense, but the current ChatGPT is not really suitable for data analysis because of all kinds of limitations and risks.

Tim Butara: Can we actually talk a little bit more about those risks and limitations, and maybe some key considerations that would help alleviate them when we're talking about current AI?

Martijn Moret: Yeah, so with current AI you have, let's say, the actual state of the technology and the state perceived by many people. Maybe some people would call me a skeptic, but I see many advertisements out there, and people making social media videos showing that AI can basically do everything, creating programs out of nothing within two minutes.

And even the executives are playing a role here. If you look at a speech by Sam Altman of OpenAI, for example, he promises that AGI, artificial general intelligence, or the next model, ChatGPT 5.0, is going to solve everything. And I simply don't believe that, because the basis of the technology is still that we have created some kind of linguistic intelligence that looks at probabilities between words and is very efficient at sounding fluent.

However, if you use it on a daily basis like I do for all kinds of things, for creating code, but also for creating analyses and for trying to turn certain statistical metrics into a compelling story, you see that it's still this technology of trying to make a story out of words by choosing the probabilistically best next word. So it's a linguistic concept; it doesn't really understand the real-world context. And another limitation that you see with AI is that it's trained to give an answer, which means that if it can't find an answer, it'll give the next best answer.

And if it's not true, it's called a hallucination. If you throw your data file at ChatGPT at this moment, it'll create a small program, usually in Python, try to get some results from that, interpret them, and give back a compelling story.

During that process, all kinds of things are happening. For example, it tries to get the context of the data file and put that into a Python program, where bugs happen and where the context gets misunderstood. Then it tries to just use the data without considering that it might actually require cleaning or some other kind of data.

It just does what you ask it to do. And then it gives back a statistical model, and it doesn't realize that it needs to check whether the result is significant, or whether the answer, even if it comes out of a Python program, makes any sense.

And so the limitation is really that it's not the same as your data analyst. You can come close to a junior data analyst, but you need to know exactly what you're asking, and to know exactly what you're asking, you actually need domain expertise in the field of data analytics.

Tim Butara: And I think even before everything that you just said, it's like, okay, this is all assuming that you actually feel safe typing business data into a service such as ChatGPT, right? Everything else only comes after that.

Martijn Moret: Absolutely. So the one thing that was not thought of while building this technology, because we're still, let's say, at the beginning of this very exciting technology, is security. Databases have been around for 40 or 50 years already, and they have row-level security that determines who can see what kind of data, which is all built in there in quite a clever way. AI doesn't have that at all.

Which means that once data is used as training data in a system, anyone with the right prompt is able to get this data out. The first plagiarism cases are already there, as are the first cases with code snippets and the first implementations connected to a database from which the model gets data it should not have access to. But the large language model doesn't actually check who it is giving this information to.

So security is a big issue. Microsoft says that they are solving that with enterprise Copilot. But even then, think about it: if the data files you're using for training, and those are usually millions of files, contain one payroll file, there is a possibility that someone can get this information out.
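To make the contrast concrete, row-level security means that every read is filtered by who is asking. The following is a minimal sketch of that idea in plain Python, not how any particular database implements it; the `Row` and `User` records, the `department` field, and the `visible_rows` function are all hypothetical names chosen for illustration. Real databases enforce such policies inside the engine rather than in application code.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Row:
    # Hypothetical payroll record; field names are illustrative only.
    department: str
    employee: str
    salary: int


@dataclass(frozen=True)
class User:
    # Hypothetical user with the set of departments they may read.
    name: str
    departments: FrozenSet[str]


def visible_rows(user: User, rows: List[Row]) -> List[Row]:
    """Return only the rows this user is allowed to see."""
    return [row for row in rows if row.department in user.departments]


if __name__ == "__main__":
    table = [Row("hr", "Ann", 5000), Row("sales", "Bob", 4000)]
    hr_manager = User("Hana", frozenset({"hr"}))
    for row in visible_rows(hr_manager, table):
        print(row)
```

A model trained on the whole table has no equivalent of the `visible_rows` filter at answer time, which is the gap being described here.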

Tim Butara: Yeah. I mean, that's one of the main issues here, I think. There are options like building your own custom integrations, but that requires additional time, additional skills, and usually additional workforce. And then you probably have to balance the actual benefit you'll get from this internal implementation against risking your own data, or against giving it incomplete data so that it doesn't have access to something you don't want others to access, in which case you won't get the best results possible.

So there's all of this. And it's interesting that you mentioned Sam Altman talking about AGI and how we're at the beginning of this technology. It has a lot of promise, and this is kind of moving into the final part of our conversation, but I found it really interesting because I think he recently predicted that they'll have AGI in several hundred days.

Which is like, okay, is that half a year? Is that three years? Is that a thousand days? How long is it? "Several hundred" is a very broad timeframe to pick from.

Martijn Moret: Absolutely. And it still leaves open what is meant by AGI, because there is a complete lack of narrowly defined jargon within the AI industry. It's a very new industry. When one person is speaking about AGI, another person is speaking about superintelligence. And let's not forget, Sam Altman is also the CEO of a market leader, and he's also basing his wording on expected valuation and expected investments.

Let's say the whole hype cycle of AI is hopefully somewhere near the top, if you know the hype cycle diagram for new innovations. And then people start to realize what is actually possible with AI and what its limitations are. Especially within data analysis, the fact that it can read a CSV doesn't really mean that it understands what is in there or that it makes sound statistical choices. And if you're a PhD student and you believe that the analysis can be done by GPT instead of putting all your data through R or SPSS yourself, then that is actually a danger to what is being published as a result.

Data analysis and AI require a very careful way of dealing with each other. For example, at DataSquirrel, we are using large language models to obtain context from data without uploading the actual data, and to humanize any kind of output that we have, but we're always checking it for numbers.

If the numbers in the output are actually not in our input, then we are throwing it away simply because we don't want to have any chance of hallucinations.
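The numeric check described here can be sketched in a few lines of Python. This is only an illustration of the idea, not DataSquirrel's actual implementation: the function names `extract_numbers` and `is_grounded`, the regular expressions, and the tolerance are all assumptions.

```python
import re
from typing import Iterable, Set

# Assumed pattern: commas used as thousands separators, as in "1,250".
_THOUSANDS = re.compile(r"(?<=\d),(?=\d{3}\b)")
# Assumed pattern: plain integers or decimals, optionally negative,
# not glued to a preceding letter, digit, or dot.
_NUMBER = re.compile(r"(?<![\w.])-?\d+(?:\.\d+)?")


def extract_numbers(text: str) -> Set[float]:
    """Collect every numeric literal that appears in free text."""
    cleaned = _THOUSANDS.sub("", text)
    return {float(match) for match in _NUMBER.findall(cleaned)}


def is_grounded(llm_text: str, source_values: Iterable[float],
                tol: float = 1e-6) -> bool:
    """True only if every number in the text also occurs in the source values."""
    allowed = [float(value) for value in source_values]
    return all(
        any(abs(number - value) <= tol for value in allowed)
        for number in extract_numbers(llm_text)
    )


if __name__ == "__main__":
    computed = [1250, 37.5, 12]  # figures produced by the real analysis
    draft = "Revenue reached 1,250 with a margin of 37.5% across 12 stores."
    # Keep the draft only when it introduces no new numbers.
    print("keep" if is_grounded(draft, computed) else "discard")
```

The design choice is deliberately strict: a generated text with even one number that cannot be traced to the input is discarded rather than repaired.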

Tim Butara: So, based on everything that we talked about today, especially this last part about AI, its current state, and some of the risks and challenges, do you have any predictions or maybe hopes for what the future will bring in this field, both in terms of tooling and technology on the one hand and the people, the workforce, on the other?

Martijn Moret: Yeah, so the prediction is that AI technology will get better, and people will see more of the limitations and work around them.

We, for example, built internal quality tools to check the output of AI, and we see the term hybrid AI being used more and more for that. The hope is that early starters find this a very accessible technology and start to analyze more, because at this moment, if you don't have access to a data team and you don't have the skills or the time to analyze data, you simply let it go.

Or you give it to, figuratively speaking, your nephew who is great with spreadsheets to get some data out. But most of the time it's ignored. And in our world, we see a lot of people using the tool because they had been neglecting data that they actually need to make data-driven decisions.

So my hope, my expectation, is actually that more and more business managers are becoming more data savvy through the use of easily accessible technologies that actually provide quality output. And as they become more data driven, they start to make decisions based less on gut feeling and more on what is actually happening in the market.

And those are not the decisions that you see nowadays in systems like the recommendation engines on YouTube or Spotify, or the recommendation engines that larger retailers use to work out which products to recommend for repeat sales. Those are the everyday decisions that business managers make based on any kind of input, usually from the untapped 60 percent of data that we talked about.

Tim Butara: I really like the term data savvy. I think it's going to be very important moving forward. And Martijn, thank you so much for joining us today and for sharing your insights and expertise.

If listeners would like to reach out to you, connect with you, and learn more about DataSquirrel, what's the best way to do any of those?

Martijn Moret: The best way would be via LinkedIn, I guess, or via X or Threads. I am active on all the platforms, but LinkedIn is the business tool that I use most, so please connect.

And if there are any questions or doubts about what you need to do with data integration, or how to get your team to become more data driven, don't hesitate to reach out.

Tim Butara: Awesome. We'll also make sure to add your LinkedIn profile to the show notes so that people can easily find you.

And yeah, thank you again for a great conversation and have an awesome day.

Martijn Moret: Thank you, Tim.

Tim Butara: And to our listeners, that's all for this episode. Have a great day, everyone. And stay safe.

Outro: Thanks for tuning in. If you'd like to check out our other episodes, you can find all of them at agiledrop.com/podcast, as well as on all the most popular podcasting platforms. Make sure to subscribe so you don't miss any new episodes. And don't forget to share the podcast with your friends and colleagues.