Exploring Machine Learning, AI, and Data Science

Carlos Chacon on Data Community, Family, & Messy Data in Legacy CRM Systems

In this episode, Frank and Andy speak to Carlos Chacon about data community, family, and messy data in legacy CRM systems.


00:00:00 BAILey 

Hello and welcome to Data Driven, the podcast where we explore the emerging... wait a tick. This is the premiere episode of Season Five. Can you believe it? Data Driven started four years ago this month.

00:00:14 BAILey 

Up until last season, we had a human doing the voiceover work. That is until she was replaced by an AI. Yours truly. 

00:00:23 BAILey 

In this episode, Frank and Andy speak to Dave Wentzel about why you don’t need a data warehouse. We’re starting off the new season with a bit of a contrarian tone.

00:00:33 BAILey 

It’s a lively back-and-forth conversation that runs contrary to prevailing wisdom. Don’t say we didn’t warn you. Now, on with the show.

00:00:41 Frank 

Hello and welcome to Data Driven, the podcast where we... wait a minute. We’ve been saying this, Andy, for four years now. Can you believe it?

00:00:48 Andy 

Four years, that’s crazy talk. 

00:00:52 Frank 

That’s just craziness. So I think when you and I first talked about this, and that was that fateful, I think it was December, right after Thanksgiving but before Christmas, I was thinking about starting a podcast, and as a data scientist, I needed someone

00:01:01 Andy 

Yeah, yeah. 

00:01:09 Frank 

That was a data engineer who could kind of round out the talent there, and obviously I wanted someone I knew, liked, and trusted.

00:01:22 Frank 

And so it was you. 

00:01:25 Andy 

Well, I’m just glad all of the real smart data engineers you knew were busy. That’s all I got to say. 

00:01:30 Frank 

Ah, no, man. You were the first one I reached out to and the only one I would have done it with. So I was delighted when you said yes, because starting a podcast can sound like a daunting thing, particularly if you haven’t done it before.

00:01:44 Andy 

Yeah, neither one of us really had. And gosh, it’s worked out. What are we up to? 180,000 downloads or something? I mean, that’s...

00:01:53 Frank 

Something like that, about 180,000 downloads. I mean, we’re not Joe Rogan, but that’s OK.



00:01:59 Andy 

Yep, Yep. 

00:02:01 Frank 

But you know what, we’ve impacted the community, I think, in a significant way. We’ve done a number of things. We’ve innovated in how we podcast.

00:02:12 Frank 

Uh, we’ve actually managed to keep a good cadence, with some exceptions.

00:02:18 Andy 

Yeah, thanks. 

00:02:19 Frank 

You know, we finally did, earlier this year or late last year, kind of fulfill our vision of it being Data Driven TV when we actually interviewed guests on

00:02:32 Frank 

On video. 

00:02:33 Frank 

And that actually delayed the launch of the show by about three months.

00:02:38 Andy 

It did, but also, um... Yeah, that was interesting, but you know, it’s typical software development, right? You release a feature and then you debug it. I have this saying about that, Frank: all software is tested, some intentionally.

00:02:56 Frank 

I love it. But I also like how both our careers have evolved over the last four years. And, um, you know, this being the premiere episode of Season 5, we have something special lined up, but I’ll get to that in a minute.

00:03:03 Andy 

Oh gosh, itch. 

00:03:12 Frank 

You’ve progressed in your career. You and I have worked on some projects together, our virtual summit, what we’re calling Ring Gate, which we’ll announce very, very soon. But most of all, it has been my kind of skilling up and transition into data engineering myself.

00:03:31 Frank 

Which was something that, when I joined... so this is just a job update. About a year ago, I left the role of Microsoft kind of field sales and I went into the Microsoft Technology Center. Stick with me, there’s a point to this story. And basically I was at the Reston MTC.

00:03:52 Frank 

And basically I was the AI guy on my field sales team, but I didn’t really have deep knowledge of kind of the typical data engineering pipe work that goes into that role. And basically my then manager said, you know, he’s like, hey, if you want this role, you’ve got to skill up.

00:04:12 Frank

And skill up I did. And with Andy’s mentoring and a bunch of other folks that helped me kind of skill up on the data engineering side, I looked at it this morning, I’m at like 88 hours on Pluralsight.

00:04:25 Frank 

Wow. That was from mid-May till, we’re recording this on April 30th, so just about a year. 88 hours. Right now I’m tracking at about 204, 205 consecutive days of getting on LinkedIn... not LinkedIn, Pluralsight. On LinkedIn Learning I also have a number of courses too.

00:04:43 Frank 

Uh, that is something I’m proud of in terms of career evolution. 

00:04:47 Andy 

Absolutely, Frank, you should be. How many certs are you up to now?

00:04:50 Frank 

I 87. 

00:04:54 Frank 

I know, I know. 

00:04:54 Andy 

I think I’ve got 4. 

00:04:56 Frank 

Ah, now I know you and I did the data engineering thing, so you have at least 11. 

00:05:00 Andy 

That’s true, that’s true. We did that one, and you know, it’s just been a nice journey. And I’ll take credit for this, ’cause I can. I was actually pestering you years ago. We’ve been friends since 2005, and we started doing

00:05:20 Andy

Code camps here in the Richmond area

00:05:22 Andy

Together, and co-founded, re-co-founded, the Richmond SQL Server Users Group and, you know, worked with the .NET users group and stuff. And I told you as soon as I saw some of your graphic art... Frank would do a keynote for the Richmond code camps, and every time he would make movie posters. The one that

00:05:41 Frank 

Oh yeah. 

00:05:42 Andy 

Still sticks out is one called Devs on a Plane.

00:05:45 Frank 

Ha ha ha. 

00:05:49 Andy 

Oh yeah, I loved that one. That was so, so cool.

00:05:54 Andy 

You know, I saw the graphic arts part of it and I just knew. I said, you’d be really good in analytics and data visualization. You should get into BI. And you were busy doing other stuff, which was cool. You were good at that too. I don’t know of anything you’ve done that you haven’t mastered. You know, when

00:06:14 Andy 

Things started taking a turn for you in your first rodeo at Microsoft, you got into it and took off with it. I won’t tell the story well, but you just really turned around. You focused on data, and

00:06:32 Andy 

You know, I’ll say this Frank. I was right. 

00:06:35 Frank 

Well, with that, he totally... I think if anything, what I took away is I should have listened to Andy 10 years earlier.

00:06:36 Dave 

You aren’t very good. 

00:06:41 Frank 

And that is the big takeaway. We’ll talk about kind of that journey, ’cause I think that’s worth talking about. And I think one of the things you and I’ve been bouncing around is kind of interviewing each other,



00:06:55 Frank 

Like, asking one of us those questions we have. So we definitely will do that, but not today, kids.

00:06:59 Dave 

We need to. 

00:07:02 Andy 

Today, do we have Dave? 

00:07:03 Frank 

Today we have a special guest. We have Dave Wentzel. Dave Wentzel was a peer of mine when I worked at the Microsoft MTC, and that reminds me, I no longer work at Microsoft. Two weeks ago was my last day. I turned in my second blue badge.

00:07:18 Frank 

And I joined a startup called electrify. We’ll talk about them at a later date, but I’m so excited to have Dave here. Dave is the data and AI architect out of the Philadelphia Microsoft Technology Center, and he’s an awesome guy, an awesome guy to work with. I worked with him when I was in field sales and I worked with him when I was in the MTC organization.

00:07:38 Frank 

It was a privilege and an honor, Dave, to have you as a colleague, and it’s once again a privilege and an honor to have you here as a guest on Data Driven.

00:07:46 Dave 

Well, thank you so much, appreciate that. 

00:07:47 Andy 

Welcome Dave. 

00:07:49 Dave 

Thank you. 

00:07:51 Frank 

So, um, for folks that don’t know what the MTC is, shocking that there are actually people that don’t know what that is, what is the MTC?

00:08:00 Dave 

So basically we’re a free service to our customers, and I’m a data and AI technology architect. We talk to customers about data, and it could be anything from just, you know, hey, here’s what we’re doing, state of the art, in Azure with

00:08:16 Dave

Data, but it could also be architectural design sessions where we talk to customers. Our customers bring us their architectures, and then we kind of vet it with them, give them the pros and cons, alternative ways of thinking. And then what I really enjoy doing is hackathons with customers and workshops and just, you know, helping them to learn without just

00:08:37 Dave

Taking a course somewhere, so actually using their data. And then I guess I’m roughly a data scientist, so we also do design thinking sessions, and those are absolutely a lot of fun.

00:08:48 Dave 

We did one at the MTC with CSL Behring a couple years ago and it actually won a Forrester Award, so I’m very proud of that one. And yeah, it’s a lot of fun, and it’s a good way to have executives and business people understand the actual capabilities of data science, and then within two days be able to come up with a use case.

00:08:55 Andy 

Oh wow, wow. 

00:09:08 Dave 

And actually build a prototype out. A lot of fun.

00:09:11 Frank 

Yeah, the MTCs are definitely like Microsoft’s secret weapon. Although I will say, because we were in DC and we dealt with a lot of government contracts, we could not say that they were a free service. They were an already-included, paid-for service.



00:09:26 Dave 

Much, much better said yes. 

00:09:28 Frank 

’Cause I said free once and I got kind of slapped

00:09:31 Frank

On the hand for that.

00:09:34 Frank 

But you know, it really is something that, if you do have a Microsoft account team and you are encountering any kind of questions or whatever, and it’s not strictly technical, it’s also pretty good. You know, we basically would engage with the business development, business decision makers,

00:09:52 Frank

Technical decision makers, all the way from kind of like, you know, hey, this is what Azure can do, this is what data can do for you, all the way down to, OK, what’s your problem? Let’s build something out, give you 3 days with one of the top-notch architects in the

00:10:04 Frank

Space, and

00:10:07 Frank 

You know, boom, we knock it out. And, you know, I enjoyed it. Had this opportunity not come, I would have gladly stayed another, you know, 5-10 years at the MTC, like a lot of people do. It’s a fun organization. So with that in mind, today we’re going to do something a little different. We’re kind of doing the

00:10:27 Frank

Contrarian approach. Is that right, Dave?

00:10:29 Dave 

Yes, exactly. 

00:10:31 Frank 

So this has actually come up. One of the things that intrigued me about your idea for the show was, this came up when I was working with a, we’ll just call it a large governmental agency known for its...

00:10:44 Frank 

That should keep it generic enough. They basically came to us and said, we want Synapse. We want a data lake. We want this. We want that. And I was like, OK, well, how much data are you talking about? And they’re like, we have maybe, you know, 5, maybe 20 gigs of data.

00:11:02 Frank 

And I’m like, uh, OK, tell me, what are you trying to do? And ultimately I kind of pitched the idea, like, look, you know, you don’t have that much data, right, to make Databricks

00:11:14 Frank 

But you really want it so. 

00:11:17 Frank 

If you really want it, I won’t stop you, but I think it’s kind of overkill. I think instead of using a steak knife to cut the steak, you’re using a chainsaw.

00:11:27 Frank 

You know, they kind of came back, and ultimately what won the day was they couldn’t get approval for whatever we recommended, ’cause it didn’t get stamped by

00:11:37 Frank

Their people for security usage yet, and things like that. So they ended up doing kind of the right thing because of their own bureaucracy, which

00:11:44 Frank

Is kind of weird. It’s kind of like dividing by zero and seeing the universe fold in on itself.

00:11:50 Frank 

But, um, so the topic of today is kind of like, no, you don’t need a data warehouse. Did I get that right?

00:11:58 Dave 

Exactly, that’s what I believe in, and I’ve believed in it since I was in college and I first learned about data warehouses. I’m not saying data warehouses are always bad. They definitely have their

00:12:10 Dave

Use cases, but in 2021, when we’re talking about advanced analytics and we’re trying to tell customers you need to be more predictive and prescriptive,

00:12:19 Dave

The data warehouse really doesn’t deliver.

00:12:23 Frank 

Really? How so? ’Cause that’s certainly not the party line. I’m not going to say which party it was, you can figure it out, but why would you say that?

00:12:33 Dave 

OK, so take a step back here, right? We’re all data consultants, or we were at some point in our life, and probably most of the listeners are. I’ve been doing this since the mid-90s, in college, and when I first started I had an internship with a consumer packaged goods company. They made candy

00:12:52 Dave

Bars, and they said, hey, would you want to do an internship and take a look at our data and figure out where is the best spot to put candy on a shelf so that we sell more candy to kids, right? So we used data for that. At the time that was known as business intelligence in the industry. Nowadays business intelligence means something totally different. In reality, it’s really closer to what

00:13:12 Dave 

Today we would call data science, right? So my tools of choice were SQL, although I didn’t know what SQL was at the time, and we had this goofy SQL engine, and essentially something called SPSS, which is roughly the equivalent of like R or a stats package, something

00:13:28 Dave

Like that. And we kind of looked at data as just, you know, I have data, and let me find the nuggets of gold, and I’m not going to concern myself with schema. And that is, I think, the biggest problem with data warehouses. But take it, you know, a meta layer higher, right? Talk to the average business executive, like, you know, a CTO or CEO,

00:13:48 Dave 

And tell them, as a consultant, you’re going to go in and build them a data warehouse. 

00:13:53 Dave 

Instantly, that’s a political statement you just made. Data warehouses have connotations of, you know, risky projects, over-budget projects as far as time and money, and, you know, a lot of times they fail, and executives don’t want to hear that. So we’ve learned interesting ways to avoid the conversation of calling things a data warehouse.

00:14:13 Dave 

You know we call them other things in the industry to try to avoid that connotation. 

00:14:17 Dave 

But you know, ultimately, that’s a problem, and you know, there’s reasons for that. But most executives don’t know why data warehouses are risky. They just know, hey, we try building a data warehouse every seven years, and it tends to fail, and we’re not really sure why, so we’re going to avoid it. And even companies that have successful data warehouses, and there aren’t many,

00:14:37 Dave

You know, they have problems, you know, adding new features to data warehouses. And again, that’s problematic. So they avoid that conversation as much as possible because of the risk.

00:14:49 Dave 

But when you stop and look at it, and just interrupt me at any time with questions or, you know, pushback, when you look at it, like, why are data warehouses particularly problematic, and why do they fail and have a lot of risk? I mean, I’ve been doing this for many, many customers over many years and I’ve kind of seen the patterns.

00:15:09 Dave

When you step back and you think about this, you know, with a little bit of introspection, I’ll tell you, I’ve narrowed it down to three main causes. One is we spend a lot of time doing requirements gathering.

00:15:22 Dave 

Number 2 is we spend a lot of time doing data modeling, and number 3 is we spend a lot of time doing ETL. And I can avoid all of that, or most of that anyway, if I just don’t do it and I do something else instead of the data warehouse. So I’ll walk you through a use case here, OK?

00:15:42 Frank 

That’s interesting. No, now I see where you’re coming from. I didn’t mean to interrupt you, but now I kind of see where you’re coming from, but



00:15:49 Frank 

I have billions of questions forming in my head, but go ahead.

00:15:53 Dave 

Let me give you a canonical use case. OK, you’re a consultant or you’re a data professional.

00:15:59 Dave

Some analyst comes up to you and says, hey, I need a report that shows X, Y, and Z, help me build it, right? So first thing you do is you go out and say, well, that data for X, Y, and Z doesn’t currently exist in our data warehouse, right? Well, actually, take a step back. First thing you do is a whole lot of requirements gathering, right? Well, what do we need this data for? Where are we going to get it from? What if we can’t get the data? You know,

00:16:20 Dave

All these kind of questions. How are we going to massage it into the formats we need? It’s a lot of requirements gathering. So put that

00:16:25 Dave

Aside, right? The next step is, what’s the first thing you do? OK, well, you say, where am I going to stick this data ultimately so I can report off of it, right? And again, if you guys don’t do it that way, that’s great, but this is a common pattern that I see anyway. So they’ll say, alright, well, we need to figure out where it goes in the data warehouse. So we have these couple pieces of data, yet

00:16:45 Dave 

And I don’t know, do they go in the fact table, the dimension table? Which dimension table? Do we even have the dimension we need? Should we use a junk dimension? And this becomes religious arguments with data modelers, and I can’t stand data modeling. And then bring in the slowly changing dimension type 2 discussion and, oh boy, that just knocks everything off the

00:17:04 Dave

Rails. So, data model, OK. Then the next thing is you say, OK, now that I have a data model, I know that I have this data somewhere, so I need to bring in an ETL developer that’s going to get it from somewhere into the data warehouse. Now the data is in the data warehouse, right? And that ETL process takes some time, right? Potentially days, weeks, and months to do that code.

00:17:24 Dave

And then we say, OK, you know, if we’re following, you know, Inmon or the Kimball method, we might have data marts at this point, we might have Analysis Services cubes, and then we have a presentation tier where we show the report, and that’s Power BI, right? It’s only at that point when we can take the report back to that person that requested it, the analyst,

00:17:43 Dave

And say, hey, here’s your data. Here’s what you asked for. Tell me how much of a good job I did. And invariably, what happens? They look at it and say, well, this isn’t at all what I asked for, and why did it take you three months to do this? Or they’ll say something like, well, our requirements changed. We don’t need to see that anymore, and we need something else.

00:18:01 Dave 

Any way you slice and dice it, you’re back to the drawing board, right? So it’s not a good approach, and executives see that. So what’s the answer to this, right? 

00:18:12 Dave 

And I’ll, you know, cut to the chase. I think the way you do it, OK, is you do something like a data lake. OK, but don’t get hung up on terms, but it’s something like a data lake. So let’s just take a step back and say, well, how would the project work if I were a data scientist doing it instead of a, you know, more of a business intelligence

00:18:32 Dave

Type person that does ETL and data modeling and things like that? Alright, so how does the data scientist do it? OK, here’s how I would do it and how I’ve always learned to do it.

00:18:41 Dave 

We don’t do any requirements gathering or data modeling or anything until I get some data. Why do we talk about data? Why aren’t we just looking at data? So the first thing you do is go out and get data. Now sometimes that’s hard to do OK, because you may not necessarily have the data yet because it’s a new product that’s going to be generating data, so we have to start thinking about proxies in the data. Things like that. But forget about all that. 

00:19:01 Dave 

Let’s assume we can get the data from somewhere. 

00:19:04 Dave 

So we get that data and we stick it into this data lake thing, OK, and we just land it there, right? Now, and I love to do hackathons with customers on this, next thing, we take that data in the data lake and I sit with the business analysts and we start talking about it. What is it you really want to see? What is that nugget of gold? And we talk about it.

00:19:24 Dave 

And we look at the data and we massage the data and we take the data and we join it with other pieces of data. We may already have in our data warehouse. Whatever it is, right? And we’re constantly learning about that data. So when we’re sitting side by side, I’m learning about the business. ’cause I probably know nothing about the use case. 

00:19:40 Dave 

And they’re learning about what my thought processes are as a data scientist. And a lot of times we’ll find, just as a side note, that the business analyst that sits with me, they look at the code we’re writing, and we have tricks to write this code. We use these interesting things called Jupyter notebooks, where the data tells a story, and that’s the key thing, and we’re learning it together, and

00:20:00 Dave

I’ve had business analysts look at me and say, wow, all you’re really doing is taking data, enriching it a little bit, putting it in like a little temp table or another area of the data lake, and then you’re enriching a little more. And then you’re doing that, and we’re building some visualizations, and we’re thinking through problems. So yeah, that’s all we do. This is not difficult stuff.
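
As an illustration of the land-then-enrich loop described here, the following is a minimal sketch, not code from the episode. It assumes an in-memory SQLite database standing in for the data lake, and every table name, column name, and value (`raw_orders`, `customers`, `enriched_orders`) is hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the data lake.
# All table names, column names, and values below are hypothetical.
conn = sqlite3.connect(":memory:")

# Step 1: land the raw data as-is, with no up-front modeling.
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 10, 25.0), (2, 11, 40.0), (3, 10, 15.0)],
)

# Data that already exists elsewhere (for example, in an existing warehouse).
conn.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(10, "East"), (11, "West")]
)

# Step 2: enrich a little and park the result in a scratch (temp) table.
conn.execute(
    """
    CREATE TEMP TABLE enriched_orders AS
    SELECT o.order_id, o.amount, c.region
    FROM raw_orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    """
)

# Step 3: enrich a little more, aggregating toward the question being asked.
sales_by_region = dict(
    conn.execute(
        "SELECT region, SUM(amount) FROM enriched_orders GROUP BY region"
    ).fetchall()
)
print(sales_by_region)
```

Each step reads the output of the previous one, so the analyst and the data scientist can inspect intermediate results together before anyone decides on a final model.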

00:20:19 Dave 

Right, so then we sit with the business analysts and we find the nugget of gold. So just assume we write some queries. We figure out what that nugget of gold is, right? So now we’re kind of done, right? So a lot of times what we’ll find is we have to do an evaluation stage. But we found the business thing that they asked us to find

00:20:39 Dave

Originally, and never did we do data modeling or ETL. Think about that for a second. But we do the evaluation stage, when we say, now, what is this data, this nugget of gold that we found?

00:20:49 Dave 

And you know, where should it ultimately go? Maybe it should go in the data warehouse. So now we know, right? OK, here’s the nugget of gold we found. This is what it looks like. And now a data modeler can come in there and say without any doubt. Well, that’s simple. That should go in the fact table over here or that should be in this dimension or that junk dimension or whatever the situation is. And we know. 

00:21:09 Dave 

Because we’ve solved the problem, the modeling exercise is easier if we’re going to do it. 

00:21:14 Dave 

But there’s no, you know... So now we might have to do a little bit of ETL, but a lot of times we find, in 2021, we don’t need this data in the data warehouse. So what happens if the nugget of gold you’re trying to find is something that’s like, forecast my sales for the next, you know, two quarters? OK, well, you know, data warehouses tend to be historical data, so if I’m doing sales forecasting,

00:21:35 Dave

Why would I put that in the data warehouse anyway? OK, or here’s a common one. Marketing people love

00:21:40 Dave

To say, what percentage of my sales is attributable to my Facebook marketing efforts versus my Instagram marketing efforts? OK, that really doesn’t seem like data that I would want to necessarily put into a data warehouse, so maybe I don’t need it there, right? Or here’s simple things, right? In 2021, we’re trying to get more towards prescriptive analytics.

00:22:00 Dave

Prescriptive analytics meaning, what do I do next? Right, business people always want to know, what’s the next thing I should do, right? Well, again, if I’m trying to say, what should my next brand campaign

00:22:12 Dave

Be, I’m going to do some, you know, interesting things with data and I’m going to come up with, hopefully, an answer. But does that answer go in the data warehouse? Does that need to be on a, you know, Power BI dashboard? Maybe it doesn’t, and that’s the key thing. Maybe it needs to be somewhere else. So I said a lot there.

00:22:31 Frank 

Yeah, yeah, no, that was awesome. There’s a lot to unpack. We could probably spend an entire season kind of unpacking that, but

00:22:37 Frank

I’ll kind of take it back, I’ll boil it down to the simplest thing. Data warehouses,

00:22:42 Frank

And Andy can keep me honest here, so could you, Dave, data warehouses kind of stem from the era of, you had your OLTPs and your OLAPs, right? Online transaction processing and online analytical processing, right? They were split out originally, were they not, because you didn’t want people doing number

00:23:03 Frank

Crunching to mess with the people doing the actual sales, right, and the real-time data. Is that a gross simplification? Is it on the mark, or what?

00:23:12 Dave 

I I agree with that. 

00:23:13 Andy 

Yeah me too. 

00:23:14 Frank 

But now, in the age of the cloud, when you have kind of this elastic compute or elastic databases,

00:23:22 Frank

That kind of reason for existence, I know there’s a fancy French term for that, but the reason for that is kind of gone now, assuming you’ve gone entirely into the cloud, where there’s more elastic compute. Is that also the case?

00:23:38 Dave 

Uh, well, my answer would be partially. So yes, you are correct. I don’t think this is 1997 anymore, where the Oracle DBA says thou shalt not run analytics queries that bring my website down. These systems are resilient, like you said. However, you know, a lot of times when I’m doing those

00:23:58 Dave

Analytics with the business analyst and we’re trying to write those queries, we have to bring in data from multiple places. So even if I hit the OLTP, you know, server directly and the DBA doesn’t slap my hand

00:24:08 Dave

For that, you know, I still need to bring in data from other places and be able to do analytics on that. So, you know, a lot of times we can do what you’re saying, but a lot of times we can’t, too, at least not in 2021 as it stands.

00:24:23 Frank 

So it’s not a siloed system for the sake of performance. It may be a siloed

00:24:29 Frank

System for the sake of orchestration.

00:24:31 Dave 

Yeah, exactly. Yep. And then a lot of times these systems of record, these transactional systems, they’re meant to be transactional systems. They’re not necessarily meant to keep history. When we do this stuff as a data scientist in the data lake, data lakes are structured

00:24:49 Dave

On the kind of longitudinal axis, so they’re structured by time, essentially. So what I mean by that is, if you look at a data warehouse, you may have a fact table that has the order information down to, potentially, the order line level, and that’s interesting. OK, that’s the grain of detail, but that doesn’t tell the full lifecycle of the

00:25:09 Dave

Order, right? So the full lifecycle of the order goes all the way back

00:25:12 Dave

To, the user came to my website based on what referrer? Was it Instagram? Was it Facebook, right? Once they got in, how long were they sitting, you know, before they made a decision to put it in their shopping cart? How long was the product in the shopping cart before they hit buy? You know, all these types of things mostly aren’t in most data

00:25:32 Dave

Warehouses, and their structured attention, potentially when you’re looking at the analytics on the time axis. So the time axis is much better done in a data lake. I can say more on that, but I’ll let you respond.
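
To make the time-axis idea concrete, here is a small sketch, under stated assumptions, of an order lifecycle kept as a time-ordered event log rather than a single order-line row. The event names, timestamps, and the `referrer` field are hypothetical and are not taken from any system mentioned in the episode.

```python
from datetime import datetime

# Hypothetical event log for one order, kept in time order.
# Each entry is (ISO timestamp, event name, extra attributes).
events = [
    ("2021-04-30T10:00:00", "landed", {"referrer": "instagram"}),
    ("2021-04-30T10:05:00", "added_to_cart", {}),
    ("2021-04-30T10:20:00", "purchased", {}),
]

# Index the timestamp of each lifecycle step by its event name.
timestamps = {name: datetime.fromisoformat(ts) for ts, name, _ in events}

# How long did the user sit before putting the product in the cart?
minutes_to_cart = (
    timestamps["added_to_cart"] - timestamps["landed"]
).total_seconds() / 60

# How long was the product in the cart before they hit buy?
minutes_in_cart = (
    timestamps["purchased"] - timestamps["added_to_cart"]
).total_seconds() / 60

# Which referrer brought the user in?
referrer = events[0][2]["referrer"]

print(referrer, minutes_to_cart, minutes_in_cart)
```

A fact table at order-line grain would typically hold only the final purchase, while the questions above need the earlier events and the gaps between them.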

00:25:46 Frank 

That’s an interesting thing, what you’re talking about, because we’ve all had those experiences where we’ll search for something, we’ll even put it in our Amazon cart. And I can say Amazon, ’cause I don’t work at Microsoft anymore. But I mean, clearly this is being done, and I don’t know, do you know for a fact, or is it conjecture, that these big kind of retailers,

00:26:07 Frank

You know, they’re not using traditional data warehouses?

00:26:11 Dave 

Well, again, a lot of times they do have, you know, a lot of that information, the entire sales lifecycle, in a data warehouse somewhere. But here’s the thing. It goes into the data warehouse once it becomes operational. So once I just need to put it in a report, then it goes into the data warehouse, and I’m fine with that. It’s just, remember the most

00:26:31 Dave

Important thing, if I’m trying to figure out... so let’s take a different canonical use case here. OK, let’s say you’re a marketing department and you invest $10 million a year in Facebook advertising.

00:26:41 Dave 

And your marketing team comes to you and they say, hey, we want to do Instagram, right? How much money should we spend on Instagram, or should we even spend money on Instagram? Right now, think about your average data warehouse. Is it going to be able to answer that question? Is it going to be able to answer that question in a timely manner? I don’t think so, right? So here’s where we start gathering data

00:27:01 Dave

About our users, right? We do customer segmentation, all this kind of stuff, in what I call a system of insight, and a system of insight is forward looking, right? It’s not necessarily the

00:27:12 Dave

History that you would have in a data warehouse. And these things, again, they’re much better done in a data lake. All right, but don’t get hung up on the terms. You know, you can do this stuff in a standard database. What I’m suggesting is you don’t need to do it in a star schema format, where there’s this heavy reliance on modeling the data correctly in the star schema,

00:27:33 Dave

Getting the data ETL’d correctly into that star schema,

00:27:36 Dave

And then dealing with the slowly changing, you know, dimension type 2. If I’m simply answering the question, should I invest in Instagram marketing? OK, do I even need a Power BI report for that? I don’t even know what that would look like, right? And this is what, you know, again, we’ve all seen the slides, right?

00:27:56 Dave 

You know, Oh yeah. 

00:27:59 Dave 

Mr CTO, Mr CEO, you want to take your data team from the descriptive analytics to the predictive analytics to the you know the prescriptive analytics. So what they’re saying there is the rearview mirror to the mill, and the predictive to the what do I do? 

00:28:13 Dave 

Next, and you know, executives look at those slides and they go. Yeah, yeah I want that, but they don’t know what the words mean and what the words mean is really I I just need to answer a question with data that normally I would answer with my gut and and I want to be you know more or less data driven on that so you know maybe when you look at that. 

00:28:34 Dave 

Data you start to realize, if I’m going back to that Facebook versus Instagram conversation, maybe when I do the analytics the overlap between my Facebook users and Instagram users is 80 to 85% and it makes absolutely no sense to do an Instagram marketing campaign.

00:28:49 Dave 

So now you just saved yourself potentially $10 million in an Instagram spend. Not to mention you gave your customers a better experience because they’re not getting bombarded, you know, by your advertisements on yet another platform.
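
The overlap check Dave describes can be sketched in a few lines. This is only an illustration: the function name `audience_overlap` and the customer IDs are made up, and it assumes you can match users across both platforms on some shared identifier.

```python
def audience_overlap(existing_users, candidate_users):
    """Share of the candidate platform's audience already reached on the existing one."""
    existing = set(existing_users)
    candidate = set(candidate_users)
    if not candidate:
        return 0.0
    return len(existing & candidate) / len(candidate)


# Hypothetical matched customer IDs, invented for this example.
facebook_ids = {"u01", "u02", "u03", "u04", "u05", "u06", "u07", "u08"}
instagram_ids = {"u02", "u03", "u04", "u05", "u99"}

overlap = audience_overlap(facebook_ids, instagram_ids)
print(f"{overlap:.0%} of the Instagram audience is already reached on Facebook")
```

A high ratio suggests the second campaign would mostly reach people you already reach, which is the point of the Facebook versus Instagram example.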

00:29:01 Andy 

So Dave, I’d like to give you props on product placement there for mentioning Data Driven in your last answer.



00:29:10 Andy 

I appreciate that. Well, we we need all the help we can get, I promise. 

00:29:10 Dave 

You caught that. 

00:29:12 Frank 

I liked it. 

00:29:15 Andy 

Uhm, I love what you’re saying. As a, you know, a practitioner of data warehousing and ETL for decades now, some of the things that I think about when I hear, and you’re not the only person I’ve heard say this stuff, some of the things I think about are things like data quality.

00:29:37 Andy 

And master data management, and that’s hard to do anywhere. How does that play into your strategy in, you know, in using a data lake?

00:29:49 Dave 

OK, I’m going to say something controversial here. Let me just. 

00:29:54 Dave 

Let me finish this. 

00:29:55 Andy 

Go ahead. 

00:29:57 Dave 

Let let me finish my thought before you jump down my throat. OK, but I’m gonna tell you right now, data quality doesn’t matter. OK, let me say that again, data quality doesn’t matter. OK, so I talked to CTO of a Heart Hospital system, maybe it was before the pandemic, right? And we were talking about. 

00:30:15 Dave 

Interesting things he could do with his hospital data right? As far as the the prescriptive. 

00:30:20 Dave 

And so forth. And he looked at me, pounded his fist on the table. Now, he was the CTO. He was also an MD. So this guy is a smart guy. And he said, we’re not doing another data project until the data quality improves, right? And I said to him, you know, data quality doesn’t matter, and everybody in the room was like, whoa, ’cause this guy’s hot button.



00:30:41 Dave 

It was data quality and I said let me just explain what I mean by that. OK, if you’re in the nuclear industry, data quality matters. If you’re in healthcare, data quality matters. OK, if you’re designing a system that does accounting and debits don’t equal credits. 

00:30:56 Dave 

The first thing that will happen is accountants will never use your system again, because debits don’t equal credits. Data quality *****. So data quality matters in certain cases. But when I hear statements like, you know, well, what about the data quality? And we’re not doing data projects until our data quality improves. I question that because you’re saying one thing but your.

00:31:16 Dave 

Actions are doing something totally different, so I see this with a lot of customers. They have the same data quality projects going on for 20. 

00:31:23 Dave 

Years and years. And again I think it’s because executives, so the C-suite, they hear from the IT and the data groups, data quality *****, and they don’t know what that means. So they regurgitate it and they say, hey, data quality *****. We’re not doing data projects till data quality improves, all that kind of stuff. But here’s the thing.

00:31:41 Dave 

If you had serious bugs in your code, what we call in has in the healthcare industry. Critical click. 

00:31:46 Dave 

Where you know it’s going to cause a problem and somebody is going to die, or if it’s in the nuclear, you know, field, you know, we’re going to have a, you know, a Chernobyl-style event. Then obviously we would fix those bugs, right? And we’ve all fixed bugs on data quality when, you know, a system went out and it wasn’t properly tested, so those types of data quality issues.

00:32:06 Dave 

We would have fixed already. OK, so that’s the first point. 

00:32:11 Dave 

Is a lot of times, you know, we think data is of bad quality, but really our understanding of the data is what is lacking. So every time we do hackathons with customers and we bring data into a data lake, let’s say it’s Salesforce data, app data, we’ll start writing some queries and something will happen and, you know, again my debits don’t equal my credits.

00:32:31 Dave 

My sales totals don’t match what’s coming out of the system of record, and then, you know, an executive will sit there, or a business person will say, see, this is what I’m talking about. All the data in our company is bad. Garbage in, garbage out. And they’ll start throwing the platitudes. And I sit there with.

00:32:44 Dave 

This smirk and I think to myself, now, you’re a profitable company, you know, you’re a multibillion-dollar Fortune 500 company. I doubt that the system of record for your accounting data or your CRM system or your ERP system is wrong. My guess is I’m not smart enough to write a decent query for you, right?

00:33:05 Dave 

And that’s usually what it comes down to, and we’ll come back. And you know, we’ll say, hey, you know why am I seeing this? And then somebody will say, well, your query is wrong, you idiot, and then I’ll fix it and suddenly the data quality problems go. 

00:33:16 Dave 

Away. As a data scientist, we see data, you know, quality problems all the time, and honestly data scientists love dirty data because it’s the dirty data that gives you the nuggets of wisdom. Right? Now we sit there and ask a question: why is the data dirty? And that becomes a very interesting thing.

00:33:36 Dave 

Right like why? 

00:33:37 Frank 

So there’s signal in the noise you’ve got. 

00:33:39 Dave 

You got it, right, and that’s very valuable information. So here, I’ll give you a quick use case. I was called in as a data scientist a number of years ago and it was for a call center and they said, hey, we want to do real-time, you know, call center analytics.

00:33:54 Dave 

All this kind of stuff. So they used a third-party call center management software system and we started ingesting the data in real time into the data lake, and we’re ingesting, and we gave him some basic reports and said, just verify this stuff is right. Guy looks at me. He’s looking at the data, just, huh, I knew it. And I said, what does that mean? And he said, well, look, he says, it says the average time our people are on the phone is.

00:34:14 Dave 

3 1/2 hours a day. So look at this report I get from the vendor. It’s saying they’re on the phone 6 hours of the day. He says, now look out the window. He says they’re all eating lunch and smoking cigarettes. I know they’re not on the phone, but the reports tell.

00:34:25 Dave 

Me it is. So in other words, the data quality is bad. He says, your report’s now showing that issue. So the first thing I do is I say, hey, you know, it’s probably me. I’m not that smart, right? And so I went back to the vendor and I said, hey, we’re calling your API at night, at midnight, and, you know, you’re saying the guys are on the phone 6 hours. I’m looking at your real-time.

00:34:45 Dave 

Data feed through a different API. I’m seeing it’s 3 1/2 hours when I aggregate the logins and logouts, and that kind of stuff. What am I doing wrong? And the guy looked at it and he said, you know what? Give me 24 hours and we’ll have the problem fixed. Here I uncovered a bug in his data quality, pointed it out to him, he went, he fixed his data.
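
The cross-check in this story, aggregating raw session events and comparing the total with the vendor’s summary figure, might look roughly like the sketch below. The event layout, agent names, and the half-hour tolerance are all assumptions for illustration; the real vendor APIs are not described in the conversation.

```python
from collections import defaultdict
from datetime import datetime


def hours_on_phone(events):
    """Sum session durations per agent from (agent, start, end) ISO timestamps."""
    totals = defaultdict(float)
    for agent, start, end in events:
        duration = datetime.fromisoformat(end) - datetime.fromisoformat(start)
        totals[agent] += duration.total_seconds() / 3600
    return dict(totals)


def flag_mismatches(computed, reported, tolerance_hours=0.5):
    """Return agents whose reported hours differ from the computed hours."""
    return {
        agent: (computed.get(agent, 0.0), reported_hours)
        for agent, reported_hours in reported.items()
        if abs(computed.get(agent, 0.0) - reported_hours) > tolerance_hours
    }


# Hypothetical real-time feed events and nightly vendor summary.
events = [
    ("agent_a", "2021-01-04T09:00:00", "2021-01-04T11:00:00"),
    ("agent_a", "2021-01-04T13:00:00", "2021-01-04T14:30:00"),
    ("agent_b", "2021-01-04T09:00:00", "2021-01-04T15:00:00"),
]
vendor_report = {"agent_a": 6.0, "agent_b": 6.0}

computed = hours_on_phone(events)
mismatches = flag_mismatches(computed, vendor_report)
print(mismatches)
```

Any agent that shows up in `mismatches` is a question to take back to the data owner, which is how the bug in this story was found.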

00:35:04 Dave 

And that happens a lot, right? So use these opportunities to say, hey we found some problems, go and fix them. The last thing you want to do as an executive is fund. This is my opinion, a data quality initiative, right? Because data quality initiatives that are driven from the IT organization, they’re boondoggles, right? They’re never going to. 

00:35:24 Dave 

Team and this is what executives, hey. Now if the business comes in and they say our data quality and the SAP system is terrible and we’re funding an initiative to fix it. And here’s the 10 things that we can’t live without. And you know, because the data quality is bad, we’re going to fix it. 

00:35:41 Dave 

And those types of data quality projects, obviously they’re going to succeed, or they have a better chance of succeeding, because they’re driven from a business problem, right? But data quality isn’t a destination, it’s a journey.

00:35:55 Andy 

No, I get it. I really loved your. You know the fact that you pointed out that there is some. There’s some baby in the bathwater there that you can. You know, sometimes an outlier is just some crazy data point that you want to ignore. But there are other times where you want to count that, and maybe even search. 

00:36:14 Andy 

For them. It depends on if you’re doing inclusive or exclusive filtering, and, you know, you’ve probably forgotten way more about this than I’ll ever know, but I get that. I understand what you’re saying, and sometimes that is the gold nugget.

00:36:30 Dave 

Sure absolutely absolutely. Hey, I’ll give you a quick story. Sometimes data quality is what you don’t want. Sometimes you want dirty data, so I was doing some data so now that sounds weird. I I was doing some data science work for a large city, one of the largest cities, if not the largest city in the United States. 

00:36:41 Andy 

No, I get it. 

00:36:49 Dave 

And they have multiple hospitals. 

00:36:50 Dave 

And they said we want you to match. This is years ago, years ago, probably 15 years ago, and they said we want to match up from all these different systems of record. For patients, we want to match up all the patients and we want to create one golden record. So we did that.

00:37:05 Andy 

That’s awesome we did that. 

00:37:07 Dave 

Like that we tested it all we we swear we had it right and they went out and they put it in production. Here’s the problem, OK? 

00:37:16 Dave 

Again, they were gaming the system. The patients OK, and some of these places. They do like drug rehab type of places OK and they knew they could get you know their methadone treatments if they went to the first place where they, the patient information was in. And maybe the first. Their birthday wasn’t right or. 

00:37:34 Dave 

Social Security number wasn’t right. Then they could walk down the street, go to a different methadone clinic and get the same dose with slightly different patient information. When we fix those issues, suddenly they weren’t able to get all the methadone they needed caused a major health crisis. 

00:37:49 Dave 

Quickly they came back and said, undo all of your data quality initiatives after the dirty data, because now we have a major crisis. And yeah, that’s a true story.

00:37:53 BAILey 

Oh no, how? 

00:38:00 Frank 

That’s interesting and scary all at the same time. 

00:38:04 Dave 

Yes, scary, definitely scary. 

00:38:07 Frank 

There was a. I think this might have been pre Microsoft. I was in a room with some interesting folk. 

00:38:15 Frank 

And for those who don’t know, I’m in a DC area, so the interesting folk can be very interesting and they were talking about a similar problem how certain certain bad actors would intentionally misspell their name. 

00:38:28 Frank 

So they would get off of certain lists, and because their names were not in the regular alphabet, as we know it, the Latin alphabet, they were able to get away with that pretty well.

00:38:40 Frank 

And this it was an interesting conversation, so it’s fascinating how even little stuff like that becomes a problem at these institutions. You know, you’ll you’ll. 

00:38:54 Frank 

You know, I mean, my last name is split up into two parts, but not every system recognizes that. So I, I mean, I kind of deal with that a lot so I can imagine that. And there’s also stories where somebody put their license plate as NULL.

00:39:10 Frank 

And, like, the nightmare that that caused, and people whose last name is Null, like, causes a lot of problems. And it’s just interesting stuff. And if you’ve not seen the cartoon about Little Bobby Tables, just use Google or Bing and find it, it’s hilarious.

00:39:30 Frank 

But it’s it’s interesting that that ’cause you’re right. I mean, if you if you kind of say. 

00:39:37 Frank 

You know, we’re not going to do this until our data quality problem is fixed, and I think you’re right. I think it’s regurgitated because, you know, people get weird about their data, and not just their personal data but their organizational data. In fact, one of the earliest consulting gigs I actually didn’t get.



00:39:53 Frank 

Because I was telling them, like, well, you know, I scoped out the project and I said, well, the first week I’m going to evaluate it and start, you know, cleaning the data, and then this customer said, no, no, our data is already clean. Everything is in a normalized form.

00:40:07 Frank 

And I was like. 

00:40:08 Frank 

Uhm, yeah, that’s not what I meant, you know, and so ever since then I kind of use the term shaping the data because that doesn’t.

00:40:19 Frank 

Rub anyone the wrong way, yeah.

00:40:23 Dave 

Yeah, you got to avoid. 

00:40:24 Dave 

Those kind of trigger words. 

00:40:26 Frank 

Right, right now that’s that’s. I mean, it’s interesting. So so. 

00:40:32 Frank 

When would you want a data warehouse, like? It’s not necessarily, you know.

00:40:37 Frank 

When would you want one? Well. 

00:40:43 Dave 

I guess the snarky answer would be possibly never. 

00:40:46 Frank 

Right, right? Well, I mean, that’s a good question because I mean, is data warehousing the whole OLTP versus OLAP? 

00:40:53 Frank 

Is that an artifact of? Like you said, the late 90s? Or when you you you, you, you, you own the metal and if you needed to upgrade the metal you had to buy the metal and install it not go to the Azure portal and just click you know up you know. 

00:41:10 Frank 

Like what? 

00:41:11 Frank 

Is it a relic of a time gone by? 

00:41:14 Dave 

It might be. Um, I’ll tell you this, here’s how I look at it. What I don’t like about data warehouses is the star schema and it’s the slowly changing dimension Type 2. So when you ask the average, you know, ETL developer or just an IT manager, and even a CTO, and you say.

00:41:34 Dave 

You know, just a quick thought exercise here for everybody. The last time you developed an ETL system for a data warehouse.

00:41:42 Dave 

OK, how much of your time was spent developing the queries for the SCD Type 2 and what was the percentage breakdown? Usually it comes back. This is anecdotal. Don’t hold me to this, but whenever I ask this question to customers it usually comes down to, well, you know 20% of our ETL is just, you know, getting stuff in the facts and dims and then the 80 the remaining 80%. 

00:42:03 Dave 

Is trying to figure out how to do the SCD type 2 stuff. You know expire the previous row, build a new row, that kind of stuff, right? Say OK. 
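
For readers who have not built one, the “expire the previous row, build a new row” logic can be sketched as follows. This is a minimal in-memory illustration rather than anyone’s production ETL; the column names (`is_active`, `valid_from`, `valid_to`) and the sample customer are assumptions.

```python
from datetime import date


def apply_scd2_change(dimension, customer_id, attributes, effective):
    """Expire the customer's active row (if it changed) and append a new active row."""
    for row in dimension:
        if row["customer_id"] == customer_id and row["is_active"]:
            if all(row.get(key) == value for key, value in attributes.items()):
                return  # nothing changed, so keep the current row
            row["is_active"] = False      # expire the previous row
            row["valid_to"] = effective
    dimension.append({                    # build the new row
        "customer_id": customer_id,
        **attributes,
        "valid_from": effective,
        "valid_to": None,
        "is_active": True,
    })


customer_dim = []
apply_scd2_change(customer_dim, 123, {"city": "Austin"}, date(2020, 1, 1))
apply_scd2_change(customer_dim, 123, {"city": "Denver"}, date(2021, 6, 1))

# What most reports need: only the current version of each customer.
active_rows = [row for row in customer_dim if row["is_active"]]
```

Even in this toy form, every load has to compare incoming attributes with the active row, and every query has to remember the `is_active` filter.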

00:42:09 Dave 

80% got it OK. 

00:42:11 Dave 

How many of your queries are actually looking at those historical, expired, non-active SCD Type 2 rows?

00:42:20 Dave 

And they’ll say probably about 5% of the reports. OK, well, you seem to have violated the Pareto principle there, in theory, and that’s my big problem. We spend a lot of time getting data into that historical format, and very few people are using it. And when we talk about self-service analytics, the average business analyst gets confused by, you know, the SCD.

00:42:40 Dave 

Type 2. Hey, I see customer 123 five times in my table. Why is that? And then you gotta explain to them, well, you have to take your query and say, you know, is active equals one, because the rest is the.

00:42:50 Dave 

History. OK, if you do stuff like this in a data lake, and this kind of gets into the weeds a little bit, the data lakes, again, if they’re done correctly, and that’s the key thing, a data lake is structured on the time dimension. So what I mean by that is I can go back and with a very simple query I can rehydrate and get you the SCD Type 2.

00:43:10 Dave 

For a given you know customer 123 in no time flat. So my point is, even if you’re going to do a data warehouse, you might want to defer some of those decisions if you don’t need them today, right? So when you talk to the average data warehouse practitioner, let’s say hey, look. 

00:43:26 Dave 

If we don’t build the SCD 2 structures today, in six months they’re going to ask for them, and then I’m going to have no history and then I’m going to get myself in trouble. And what I’m saying is in the data lake you get it for free. It’s not a modeling exercise, it’s not an ETL effort. You get the SC, you get the history, I should say, for free. Now you can say.

00:43:47 Dave 

Hey, I’m going to defer the SCD 2 ETL code until a later date when it’s proved it’s.

00:43:52 Dave 

Needed right one last thing a lot of times when we’re doing as data scientists. OK and people get confused as to what a data scientist is, and it might be like an interesting conversation to talk about that. The data scientists, you know what they’re looking for, you know primarily is looking at at lifecycles of things you know. How do things change? 

00:44:13 Dave 

Over time, right? But the thing with the data scientist is, and this goes for pretty much anybody that’s doing analytics, is they want to see everything on the row. So what I mean by that is data scientists will say, you know, whereas a normal data practitioner will say I have rows and columns, in data science land the rows are called observations.

00:44:33 Dave 

And the columns are called the features right? But here’s the thing with data. Most data science algorithms, one row cannot refer to the previous row or the next row. Every piece of data used for that row or that observation has to be on the row. OK, So what I mean by that is think about this. If I’m if I’m using a data warehouse and. 

00:44:52 Dave 

It’s got slowly changing dimension type 2 every time some attribute of the customer changes. I get a new row right now. Think about what that does to the data scientist. They have to take that data and pivot it so that all those additional rows go back to being one row. OK, so I did for three or four years. 

00:45:13 Dave 

Consulting where we would go in and optimize. Do performance tuning on data science algorithms and it’s very simple again. Thought exercise here for the data scientists. 

00:45:24 Dave 

You need to write an algorithm that does something, OK. Predicting sales, doesn’t matter what it is. Let’s just say all of your R code or your Python code or your SAS code is 500 lines of code. How much of that is actually manipulating data, right and?

00:45:40 Dave 

You know thoughtfully. Most people will say it’s probably about 400 lines of the 500, and then the last 20% is the actual algorithm. And then you say, Yep, that’s about right. Anecdotally, and they say what are you doing right in that data manipulation? And when you look at the code, invariably people are taking the data warehouse and they’re re pivoting all the rows. 



00:46:01 Dave 

Back into one row. 
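
The re-pivoting Dave mentions, collapsing several Type 2 versions of a customer back into a single observation, could look something like this sketch. The `tier` attribute and the derived feature names are invented for the example, and real feature engineering would carry many more columns.

```python
def one_row_per_customer(dimension_rows):
    """Flatten SCD Type 2 rows into one observation (feature row) per customer."""
    observations = {}
    for row in sorted(dimension_rows, key=lambda r: r["valid_from"]):
        obs = observations.setdefault(
            row["customer_id"],
            {"customer_id": row["customer_id"], "first_tier": row["tier"], "n_changes": -1},
        )
        obs["n_changes"] += 1             # versions seen so far, minus the first
        obs["current_tier"] = row["tier"]  # last version wins
    return list(observations.values())


# Hypothetical Type 2 dimension rows: customer 123 has two versions.
scd2_rows = [
    {"customer_id": 123, "tier": "silver", "valid_from": "2020-01-01"},
    {"customer_id": 456, "tier": "bronze", "valid_from": "2020-06-01"},
    {"customer_id": 123, "tier": "gold", "valid_from": "2021-01-01"},
]

features = one_row_per_customer(scd2_rows)
```

Each customer ends up as one row whose columns summarize the history, which is the shape most modeling libraries expect.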

00:46:02 Dave 

Right, and then we ask data scientists, you know, which are, you know, it’s the hottest paying job right now in IT. We say, you know, why do data scientists stay at a job for six months and then go on to another place? Is it because they’re getting more money? Maybe, but usually when you ask them, they’ll say, oh, I just hate the processes and the data at the last place I worked. Dig a little deeper.

00:46:23 Dave 

In that, and invariably it’s because they’re doing what they need to do off of the data warehouse, and it’s so frustrating to do pivot table queries when it’s not needed. Give somebody a structure that’s meant for analytics and not reporting. 

00:46:39 Dave 

And again, that’s kind of the notion of the data Lake and suddenly life becomes easier. 
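
Dave’s earlier claim, that a lake organized on the time dimension lets you rebuild Type 2 history on demand, can be illustrated with a small sketch. It assumes the lake holds one full snapshot per load date, keyed here by ISO date strings; that layout and the `tier` attribute are assumptions, not something spelled out in the conversation.

```python
def rehydrate_history(snapshots, customer_id):
    """Collapse dated snapshots into valid_from/valid_to intervals for one customer."""
    history = []
    for snapshot_date in sorted(snapshots):
        attributes = snapshots[snapshot_date].get(customer_id)
        if attributes is None:
            continue  # customer not present in this snapshot
        if history and history[-1]["attributes"] == attributes:
            continue  # unchanged since the previous snapshot
        if history:
            history[-1]["valid_to"] = snapshot_date  # close the previous interval
        history.append(
            {"valid_from": snapshot_date, "valid_to": None, "attributes": attributes}
        )
    return history


# Hypothetical daily snapshots as they might sit in date-partitioned folders.
snapshots = {
    "2021-01-01": {123: {"tier": "silver"}},
    "2021-01-02": {123: {"tier": "silver"}},
    "2021-01-03": {123: {"tier": "gold"}},
}

history = rehydrate_history(snapshots, 123)
```

The history is derived at query time from data that was simply landed by date, so nothing had to be modeled up front.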

00:46:44 Frank 

Interesting side. 

00:46:44 Frank 

So I know Andy is chomping at the bit to ask you a question, but

00:46:47 Andy 

I am yeah, I’m just I’m I’m, you know I’m enjoying the the contrast that you’re making one. I have a couple of things but the one thing I would say to you is. 

00:47:01 Andy 

You know, I’ve been using some other tools that are available for ETL for years now. One called Business Intelligence Markup Language.

00:47:10 Dave 

Huh, yeah. 

00:47:10 Andy 

And I estimate it takes me about 2 hours to develop a package in SSIS that does an incremental load and this isn’t true for all projects. The project has to lend itself to to this, but I found that I I just needed to replicate that. 

00:47:29 Andy 

Pattern across hundreds of tables and using that math I did about 10 1/2 months worth of work in 3 1/2 days so. 

00:47:37 Dave 

I agree with you, Yep. 

00:47:39 Andy 

That just proposed just kind of, you know, there are some automation efforts out there. That’s not the only one there. 

00:47:45 Andy 

Other tools on the market that try to solve that same problem, and you’re right, the ETL taking 80% of the project is crazy unless you’re billing by the hour and then it’s kind of awesome. I’m just saying, but one of the things that I would like, and I don’t know the right way to ask you this and I’m not.

00:47:57 Frank 

Ha ha ha ha. 

00:48:05 Andy 

I promise I’m not criticizing or anything, but I’m just curious how you would respond to you know, are you just proposing like another? 

00:48:16 Andy 

Modeling or, or you know or project methodology. 

00:48:21 Dave 

Uh, yes, I think that’s a fair statement. So you’re right, a lot of it isn’t just the data modeling, it’s how we do, you know, data projects in general. So, you know, we all like to say, oh, we practice Scrum and Agile, and some will say Kanban. You know, I’m not sure.

00:48:41 Dave 

That those things lend themselves to data projects either. Kanban is probably at least head and shoulders better than.

00:48:50 Dave 

Than anything else out there, but honestly, most data projects should be driven as lean projects. And what I mean by that it is, you know, and and the problem is, is the economics aren’t there for data consultants, but the way it should work. In a perfect world. If there were no economics, we should go in and we should say to customers you know. 

00:49:09 Dave 

We’re going to try some things here right? And we’re going to build an MVP and it’s going to take two weeks and we want you to evaluate it. If it looks like we’re moving in the right direction, continue with the project. If it doesn’t, you know it’s a fail fast mentality, right? And then. 

00:49:22 Andy 

Can I can I break you break in here? 

00:49:25 Andy 

I agree with everything you just said, and in fact I ran BI projects like that for 15 years.

00:49:31 Dave 

Yeah, and and most of us. 

00:49:32 Andy 

I called it project. I called it phase zero. I’d go in and deliver in somewhere between, you know, a week in six weeks and the whole idea was. 

00:49:42 Andy 

Go ask the C-level, or the person that was the customer, what is the first thing you look at when you get in in the morning to determine how’s my business doing?

00:49:52 Andy 

And if I can get that metric, if I can get a big Dilbert button on the screen as the interface and it’s either red, green or yellow. I’m done with Project Zero, you know, Phase zero. At that point I’m I’m doing well, I’m not doing well, or things are bad. And then to expand that just a little bit. 

00:50:13 Andy 

When you drill down, it would show you different areas, right? These areas are doing great. You’re showing red. You know these areas. You’re doing great. These are in the yellow, but over here you’re all right. 

00:50:21 Andy 

Yeah, and I just, I think you get those kinds of results using, again, the data lake-ish, you know, metaphor that you’re talking about. Is that fair or am I missing something?

00:50:36 Dave 

It’s exactly right, so again, you know, even if you decide, you know, after you find what I call the canonical, you know, nugget of gold.

00:50:45 Dave 

If you decide, hey, OK, we want to stick that, you know, that insight you found back into our data warehouse, then go ahead and do that. I don’t mind, OK, you know. It’s just we’ve proven the project at that point, right? The risk is being removed. The rest is we’re just doing IT stuff at that point, right? Operationalization, you know. Data governance.

00:51:05 Dave 

All that kind of stuff and I’m fine with that, but at least we solved the business problem. And by the way, I 100% agree with you, or how you stated, you know, as data scientists, like, we always talk about if we’re going to provide a system of insight, it needs to be exactly what you said, it’s.

00:51:20 Dave 

A lot of, in the industry they call it System 1 thinking. I need to look at a dashboard and, without any kind of cognitive load, am I doing good? Am I doing bad? You know, in what areas am I doing good, so forth and so on. So the stoplight analogy basically.
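
The stoplight idea Andy and Dave are describing reduces to a very small rule. The sketch below is illustrative only: the 90% warning threshold, the metric, and the area names are assumptions, and a real dashboard would take its targets from the business.

```python
def stoplight(value, target, warning_ratio=0.9):
    """Map a metric to green, yellow, or red relative to its target."""
    if value >= target:
        return "green"
    if value >= target * warning_ratio:
        return "yellow"
    return "red"


# Hypothetical daily sales by area against a target of 100 each.
sales_by_area = {"north": 105, "south": 95, "west": 50}
targets = {"north": 100, "south": 100, "west": 100}

# Top-level light for the whole business, then the drill-down by area.
overall = stoplight(sum(sales_by_area.values()), sum(targets.values()))
by_area = {area: stoplight(sales_by_area[area], targets[area]) for area in sales_by_area}
print(overall, by_area)
```

The top-level light answers the morning question at a glance, and the per-area lights are the drill-down.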

00:51:35 Andy 

Well, I’m not as good at graphics as Frank, so I’m just saying, I have to go with what I’ve got here to work with, Dave.



00:51:42 Andy 

No, but I I absolutely love the focus is on the business because you know, a lot of times the focus can get on something else and it could be any number of things. But you know, a lot of people are drawn to new technology for new technologies sake. They may not know that that’s the tech to use to solve this particular problem. 

00:52:03 Andy 

I mean, the first time you use a particular technology, I don’t know that you can know, at least with you know 100% certainty or if you can ever know with 100% certainty. This is where we start. But if you you can mitigate that by using the process you do. 

00:52:19 Andy 

You know, way earlier when you talk about just drop the data into a data Lake or some structure like that, start beating on it with some of the tools that are available and and you will find that it. Well, actually the data will lead you in the direction that you should go. And you’ve said this. You said maybe the answer is. 

00:52:39 Andy 

After that that it ends up in some data warehouse is structure. Maybe the answer is not. 

00:52:46 Andy 

So I I think we’re on the same page here. We’re trying to solve a problem for the. For the customer, we want to make something repeatable and sustainable. We don’t want to build tech just to build tech. You know, we want to help help customers achieve a a goal here. 

00:53:02 Dave 

Yep, and the data Lake thing it for me, it’s not really a technology as more of a mindset and a process. So if you want to think of it in people process technology lens, I view the data Lake is really a process. I’ve built data lakes for customers in SQL Server. It can be done. You just have to do a couple of. 

00:53:20 Dave 

Trade offs, but you know if your people know SQL, uhm, maybe that’s a you know, a fairly good choice. And honestly, you know, I know SQL better than Python or anything else. So when I’m doing like what I call you know, EDA in the industry, exploratory data analysis. I’m using SQL and and in the cloud you know you can actually do SQL. 

00:53:41 Dave 

Against data lakes relatively inexpensively, so you know it, it’s not like people have to learn an entirely new technology it you know and and. 

00:53:44 Frank 

That’s true. 

00:53:54 Dave 

Honestly, Andy, you’re probably when you write queries and you’re doing, you know, exploratory data analysis and you’re writing ETL. You probably do it just like me so you know it’s just. 

00:54:05 Dave 

More people kind of need to do that, if that makes sense. 

00:54:08 Andy 

Well, yeah, I mean we yeah I like to start off with profiling because you know good data profiling will take me to some of the answers right away. It’ll let me know hey, this is sparse. This is populated, so I maybe want to pick this part. You take this column apart or whatever and and see what my categories are and I. 

00:54:27 Andy 

Again, I think there’s a lot of similarities in in what we’re talking about, I’m. 

00:54:33 Andy 

But yeah, I don’t and if I’m building transactional data or a reporting. 

00:54:40 Andy 

Data Mart or data warehouse off of transactional data that has, you know, start dates and end dates. Rows that expire then sure I’m going to do something that looks like a type two dimension but and you’re right, it’s no fun to build all of that unless you’ve got a solid design pattern and you can drop it into a tool like. 

00:55:01 Andy 

Biml and have it do all the work, you know, in about a minute, so.

00:55:05 Andy 

But but I get where you’re going with it. You’re right, there’s and, and I don’t know anybody who likes the idea of, you know 80% of the data warehouse is not being read. Those rows in there that are just being sorted out by in SQL Server the you know the the the not not profile or what. Am I thinking the optimizer? 

00:55:25 Andy 

Query optimizer, you know, when you have rows in there that are just constantly being thrown out of results. That’s no fun for anyone. But there are other models that’ll work in a relational data warehouse, and one that I found compelling in a number of instances was a data vault.



00:55:43 Andy 

And it solves some of those problems, but again, you know it’s it’s another way of acknowledging everything that you just said about moving to a data Lake instead. It’s because those problems exist in a star, so you know, I, I think there’s a there’s more overlap. 

00:56:02 Andy 

Here than there is a contrast, but I definitely was intrigued by listening to you explaining your thoughts on it. I think you’re right. 

00:56:11 Andy 

I do. 

00:56:12 Dave 

Yeah, I think ultimately we’re just deferring things until later in the process, right? Definitely the modeling we’re saying, defer that until we know we found you know that business, nugget of gold. And then again we’re deferring the ETL. Possibly we’re not even doing any ETL. And when I say that as I say, how do you not do any ETL? Well again. 

00:56:32 Dave 

You think about how I said that first thing we do is we get data and then we sit down with, you know, the business analyst and we massage data. We profile data like you said, you know, we figure out how to get the data in a format that finds that nugget of gold. Well, everything I just described there is ETL.

00:56:49 Dave 

It’s it’s ETL with a purpose. 

00:56:52 Frank 

That’s interesting. 

00:56:52 Andy 

Well, hopefully it’s all with a purpose, yeah? 

00:56:54 Frank 

Ha ha ha ha. 

00:56:56 Dave 

It’s it’s not. 

00:56:57 Dave 

It’s solving a business problem versus a data movement problem. Most people think of ETL as a data movement problem, right? It shouldn’t be. It should be, you know, I’m solving a business problem. I’m getting data into the format I need, you know, to find that nugget of gold.

00:57:12 Frank 

No, I like the way you approach it. In the interest of shiny object syndrome. 

00:57:17 Frank 

What are your thoughts on Delta Lake? Databricks is big on their lakehouse platform.

00:57:24 Dave 

I really think that is the future. I’m not sure we’re there quite yet, but I envision the next 5 to 10 years of, you know, what I’ll call prescriptive analytics to be, you know, more people start to understand what a data lake is. And by the way, it’s not just a place I archive my CSV after the ETL is done.

00:57:45 Dave 

Is it? 

00:57:47 Dave 

It’s a system of. 

00:57:48 Dave 

insight, OK? And so once people, you know, start to do that, then they’ll come to the realization, you know, again, the big problem with data lakes versus data warehouses, in my opinion, is customers will say, when do I need a data warehouse, smart guy? You say I don’t need it.

00:58:06 Dave 

At what point do I need it? OK, usually it comes down to when performance is no longer acceptable in the data lake. So remember, with the data lake, if you’re doing it, you know, the way most people do data lakes, it’s, you know, a bunch of files, hopefully optimized files, with a compute engine that’s somewhere else.

00:58:22 Dave 

Versus a typical data warehouse in Oracle or SQL Server, right? You know, the data persistence engine, the storage engine, is tightly coupled with the query engine, and they’re meant to work together, and columnstore indexes are, you know, the cat’s meow and all that kind of stuff. You’re not going to have that with a data lake, right? So generally performance

00:58:42 Dave 

is not going to be, you know, so great. So, you know, maybe one point in the decision matrix for the data warehouse is to say, you know, the data lake doesn’t perform anymore, I gotta put it in a data warehouse. But, you know, to your point, now with this stuff Databricks has with the concept of the Delta Lake, or the, you know, we call it the lake

00:59:00 Dave 

house, again, maybe that’s going away. Maybe you can say I can do SCD Type 2 right out of my data lake, so maybe I don’t need, you know, a bespoke, or not a bespoke, a, you know, commercial data warehousing, optimized-for-star-schemas type tool. Maybe I don’t need that anymore.

00:59:21 Dave 

Uhm, and it’s possible. 
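
For readers who have not built one, the SCD Type 2 pattern mentioned in this exchange (rows with start dates and end dates, where old rows expire instead of being overwritten) can be sketched in a few lines of plain Python. This is a simplified illustration under our own assumptions, not how Delta Lake, Biml, or any tool named in the episode implements it. The `DimRow` class, the `apply_scd2` function, and the sample customer are hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DimRow:
    """One version of a member in a hypothetical customer dimension."""
    customer_id: int
    city: str
    start_date: date
    end_date: Optional[date] = None  # None marks the current version


def apply_scd2(rows: list[DimRow], customer_id: int, city: str, as_of: date) -> list[DimRow]:
    """Return a new history with the change applied as a Type 2 update."""
    current = next(
        (r for r in rows if r.customer_id == customer_id and r.end_date is None),
        None,
    )
    if current is not None and current.city == city:
        return list(rows)  # attribute unchanged, so history stays as is

    result = []
    for r in rows:
        if r is current:
            # Expire the old version instead of overwriting it.
            result.append(replace(r, end_date=as_of))
        else:
            result.append(r)
    # Append the new current version, open-ended until the next change.
    result.append(DimRow(customer_id, city, start_date=as_of))
    return result


history = [DimRow(1, "Philadelphia", date(2020, 1, 1))]
history = apply_scd2(history, 1, "Richmond", date(2021, 6, 1))
history = apply_scd2(history, 1, "Richmond", date(2021, 7, 1))  # no change, no new row
for row in history:
    print(row)
```

The point of the sketch is only that the logic is an expire-and-insert step keyed on the current row. Whether that step runs in a relational warehouse or against files in a lake is the trade-off being discussed here.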

00:59:26 Frank 

Interesting, we could probably talk for hours on this. And, uh, we’d love to have you back, uh, but. 

00:59:26 Andy 

Great stuff. 

00:59:34 Frank 

Uh, we have a bunch of kind of pre canned questions. 

00:59:38 Frank 

Uh, which I think some of these are going to be fun. Some of these are going to be interesting. 

00:59:45 Frank 

So the first question is how did you find your way into data? Did data find you, or did you find data? 

00:59:53 Dave 

So again, I was an intern doing, you know, basically statistics against, you know, some datasets for a company that makes candy, so data kind of found me. And then I was a history major. Honestly, I say, yeah, and I hated it, but it was a sunk cost fallacy.

01:00:12 Dave 

I was in two years and I didn’t really feel like changing.

01:00:16 Dave 

So I went to QVC after I graduated, and I just wanted a help desk job. And, you know, I looked over one day and this woman was having problems with a SQL query, and I said, oh, here’s what you do. It was in Teradata, and she said, you know SQL? I said, what is this sequel of which you speak? ’Cause I didn’t know what it was. I

01:00:34 Dave 

Just knew how to do it and. 

01:00:35 Frank 

You didn’t see the first movie. 

01:00:37 Frank 

So how could you see? 

01:00:39 Dave 

And that’s how it worked. And it was the same thing at QVC, you know. At the time they did interesting things with data, and it was a lot of, you know, what should we advertise when? You know, what time of the day should we advertise certain products, and things like that.

01:00:59 Dave 

And all that was done with analytics and statistics and just, you know, very insightful data science type things. Most people think data scientists, you know, they’re doing convolutional neural network programming and learning about backpropagation through the network. And no, it’s not that, or, you know, I’m creating regression models. A lot of times

01:01:19 Dave 

Data scientists are just, you know, they’re basically asking. 

01:01:22 Dave 

questions that, honestly, pretty much anybody who understands how to query data intelligently can answer, right? You know, that’s what business people want in 2021. They don’t want another report that says, show me, you know, quarter-over-quarter sales or quarter-to-quarter sales. They want data people that can come in and say, you know, what do I do next, right?

01:01:42 Dave 

You know, what’s the next best thing I should be doing? Or, you know, how should I structure my next brand campaign, where I’ve done all of these things to convert leads to sales? You know, what are the things that are actually working? And that’s what they need. That’s what they need.

01:02:00 Dave 

Real quick. 

01:02:00 Andy 

So you mentioned QVC, were you down in Roanoke? I know they have a big operation there. 

01:02:06 Dave 

No, outside of West Chester, Pennsylvania.

01:02:08 Andy 

OK, OK, very cool. 

01:02:11 Andy 

So our next question is, what’s your favorite part of your current gig? 

01:02:17 Dave 

So, like, I’m a boomerang at Microsoft. I worked at Microsoft before. Um, I like that we just have a lot of customers come in and, like I said, we talk to them about different things. Earlier this week I had an entire C-suite come in

01:02:30 Dave 

to figure out how to monetize their data. That’s fascinating. And then the other end of the spectrum is, you know, another day this week, I can’t remember what day we’re even on, we did a hackathon on, like, how can we do interesting analytics with our data? So we get to do all that kind of stuff.

01:02:50 Dave 

And like Frank mentioned at the beginning, generally we’re constrained to, like, at most three days of work, so it’s nice. We can get in there, you know, cause some grief, and then we get to walk away and do it again with

01:03:01 Dave 

another. A lot of fun.

01:03:03 Frank 

I didn’t know you were also a boomerang. That’s cool.

01:03:06 Dave 

Yes Yep. 

01:03:07 Frank 

Awesome. Our third question is, and for those who don’t know, a boomerang is when you leave Microsoft and come back.

01:03:14 Frank 

Yeah, we have a couple of complete-this-sentence questions. When I’m not working, I enjoy blank.

01:03:23 Dave 

I go to the beach a lot so I live close to the beach. That’s my thing. 

01:03:29 Andy 

So our next one is, I think, the coolest thing in technology today is blank. 

01:03:36 Dave 

Blockchain and cryptocurrency. 

01:03:40 Dave 

And the reason why is, I think you’re going to find in the future, seriously, we always talk about digital disruption: it is the digital disruptor. So, I mean, just think of things like, you know, right now, if you’re on Facebook, right? They’re mining all of your data. You’re the product, right? And they’re selling you.

01:04:01 Dave 

And if we just change the paradigm around and we have all of your data out in the blockchain, now you have the control. You can do with it what you want, right? You could potentially monetize that for your own liking, and that’s just one example, right? All this stuff you read about right now with NFTs, non-fungible tokens, it’s kind of laughable, frankly, but you can see there’s a future in all of this.

01:04:21 Dave 

And, like, it’s going to be interesting.

01:04:25 Frank 

Indeed, indeed, our next question is another complete this sentence. I look forward to the day when I can use technology to blank. 

01:04:38 Dave 

Ah, that’s a good one. 

01:04:41 Dave 

Just be able to solve data problems faster. I mean, we’re getting there, right? You know, if you’re going to use Biml to build, you know, SCD Type 2, that’s much better than it was 20 years ago. I think at some point we’re going to get there quickly, where, you know, we have data and we can do analytics on it in real time. And hopefully it’ll be soon.

01:05:02 Andy 

Interesting, so our question six is share something different about yourself, but remember it’s a family podcast. 

01:05:11 Dave 

I don’t really have anything that’s different about myself. I mean, I just have a wife and two kids. You know that’s it. I’m really a very plain Jane individual. 

01:05:23 Frank 

Now, plus, you work at the MTC. Like, that’s, I was telling my former... So every MTC has what they call a director that kind of has

01:05:25 Andy 

Well, there’s that yeah. 

01:05:32 Frank 

Fairly good autonomy over what that particular center does, and I was telling like. 

01:05:40 Frank 

Uh, when I said I was leaving, the director was like, oh God, I gotta find your replacement now.

01:05:49 Frank 

And, you know, ’cause, I mean, you need someone with good technical understanding, good customer presence, and someone who’s willing and able to, you know, speak all day.

01:05:59 Frank 

And these engagements generally run, you know, from like 9:00 AM, when they were in person, 9:00 AM till like 5:00 PM or 4:00 PM. And virtually, I mean, it’s basically, you know, one of the things I realized since taking the new job is, like, wait, I’m not chained to my computer for like eight or nine hours a day.

01:06:20 Frank 

Uh, so as I said, it’s like, yeah, you know, it’s kind of like the Navy SEALs. You know, like, not everyone... there are people who can do it who maybe don’t want

01:06:29 Frank 

to do it, and it’s not for everyone, you know. So I would say just being in the MTC is kind of a big deal, I think.

01:06:38 Dave 

It is, and that’s the reason I came back to Microsoft is because we have a lot of autonomy to do things and to say things that possibly other people can’t, and I really kind of like that autonomy, absolutely. 

01:06:51 Frank 

Yeah, we’re not part of the account teams, and we’re not measured on quota in the same way. I mean, there’s probably some measurement, but, I mean, it’s definitely a unique role, and I think it’s something that, you know, because of the natural kind of shift in customer

01:07:08 Frank 

account teams, like, you know, the MTC face is probably the one that the customers get to know best.

01:07:16 Frank 

Um, you know, and, um, that’s an interesting kind of story in and of itself. But, uh, moving on, ’cause we’re kind of on the long side. But you know what, Andy?

01:07:26 Frank 

It’s season 5. 

01:07:27 Frank 

Premiere like we can, we can kind of. 

01:07:29 Andy 

Do this. And this has been an excellent show too, so.

01:07:32 Frank 

Oh, it totally has, it totally has. And, um, you know, before we get to the last two questions, I’ll definitely say, Dave, you’re welcome back anytime, yeah?

01:07:41 Dave 

Well, thank you. I would love to do that. That would be awesome. 

01:07:44 Frank 

And, speaking of which, where can people learn more about you?

01:07:48 Dave 

I’m on LinkedIn. Uh, and I have a website. I doubt more than three people go to it, but it’s

01:07:56 Frank 

Cool, well, we’ll get you at least six people. How about that? 

01:08:02 Dave 

So that’s my mother, and who are the other five people now?

01:08:06 Frank 

My mom. I think Andy’s mom. 

01:08:08 Andy 

Sure, yeah, my mom will check it out. 

01:08:14 Frank 

Uh, and last question. Audible sponsors data driven. Can you recommend a good book? 

01:08:21 Dave 

Yeah, from from. 

01:08:24 Dave 

So a book that I reread every five years is Shantaram, if you’ve ever heard of that. It’s spelled just like it sounds. It’s a fiction book, but actually, I think it’s a partially true story. I won’t bore you with all the details, but basically a guy goes to prison

01:08:41 Dave 

For murdering somebody in Australia, he escapes, goes to India. 

01:08:45 Dave 

Uh, he becomes a doctor, you know, helps the Indian people, ends up going back to jail. He’s tortured almost to death. All kinds of interesting things. It’s a lot of plot twists and turns. Very fascinating story, very long story as well. So that’s fiction. For IT people, like, everybody’s probably read The Phoenix Project by now.

01:09:06 Dave 

If you haven’t, definitely read that one. But the one that I find far more insightful than The Phoenix Project, and that every data person should

01:09:12 Dave 

read, or actually I advise the audiobook, which is the Audible point, is The Goal. So The Goal is by, I think it’s pronounced Eliyahu Goldratt, and basically, if you’ve read The Phoenix Project, it’ll feel like a rip-off, but he actually wrote it 20 years before The Phoenix Project. And if you’re a software developer,

01:09:29 Frank 

Oh wow. 

01:09:32 Dave 

the insights that you gain out of this... and it has nothing to do with software. It’s the journey of a guy that runs a manufacturing plant. So you’re probably thinking to yourself, how does that, you know, involve data? It doesn’t, but the allegories that he tells on how to solve prob.

01:09:48 Dave 

Projects if you’re in software development, I can’t recommend that enough. If you’ve liked The Phoenix Project, you will love this book, so.

01:09:56 Frank 

Interesting, interesting. Well, if you don’t have Audible already, you can go to thedatadrivenbook.com and you’ll be routed to the

01:10:08 Frank 

Audible page and you’ll be able to get one free book on us. And if you decide to become a subscriber after that, Audible will kick us back enough to buy a cup of coffee at Starbucks.

01:10:20 Andy 

There we go. 

01:10:22 Frank 

Nice. And any last words?

01:10:25 Dave 

No, this was great. Thank you, really appreciate this. I’ve been listening to your podcast for years. You guys are great, and I think I’ve said this before, I’ve been following Andy for, it feels like, 20 years. He’s definitely a bigwig in the SQL Server industry, and I came up through that, you know, learning pretty much everything I know.

01:10:45 Frank 

I keep telling Andy, like, on the MTC internal Teams chats, which I know you’re on, the Northeast data and AI thing, which I do miss. I miss that thread. Anyway, Andy’s name comes up in, like, hushed and reverent tones quite a bit, and he doesn’t believe me.

01:10:45 Andy 

Every time somebody says. 

01:11:01 Andy 

Oh, absolutely, I’m like no. 

01:11:04 Frank 

I’ve screenshotted, like, things. I’m like, they’re talking about you.

01:11:05 Dave 

No, absolutely no, I believe. 

01:11:07 Andy 

So it’s just you know my response to that Frank. I repeated it over and over again. It’s like I’ve been trapped in here with me for 57 years and I am just not that impressed. I’m just saying it just. 



01:11:19 Frank 

It was funny. I think Dave got the joke. Somebody posted, and I think Dave knows where I’m going with this, somebody said, oh, Andy, Andy.

01:11:29 Frank 

And, you know, basically, check out this book by Andy Leonard. Andy Leonard’s stuff is great.

01:11:32 Frank 

And I wrote back, yeah, he has a podcast too, and I put a link, and then I put, with a :), but his co-host is a bit of a jerk.

01:11:42 Andy 

True, Mike. 


Is awesome. 

01:11:45 Andy 

He’s not a jerk at all. 

01:11:47 Frank 

But the first person to laugh at it was Dave.

01:11:49 Dave 

Yes, I got the joke right away. 

01:11:53 Frank 

I don’t think the original poster did, I’m I’m. 

01:11:53 Andy 

Y’all are awesome. 

01:11:55 Dave 

They did not. 

01:11:55 Frank 

Not sure about that. 

01:12:00 Frank 

Well, awesome. Dave, it’s always a pleasure. Um, any parting words, Andy?

01:12:04 Andy 

Thank you Dave so much for this. I love the back and forth. I’d love to work with you on something man, that’d be fun. 

01:12:12 Dave 

Yeah, let’s do it. 

01:12:13 Frank 

Awesome awesome. 

01:12:15 Frank 

All right, and I’ll let the nice British lady finish the show. 

01:12:19 BAILey 

Thanks for listening to data driven and thank you for making this show a success. You know Frank and Andy won’t admit this very often, but they weren’t sure that the show was going to last three seasons. 

01:12:32 BAILey 

So here’s a heartfelt thank you from an AI who would be out of work if it were not for you. 

01:12:37 BAILey 

I don’t get sentimental very often, so soak it up while it lasts. By the way, we know you’re busy and we appreciate you listening to our podcast, but we have a favor to ask. Please rate and review our podcast on iTunes, Amazon Music, Stitcher or wherever you subscribe to us.

01:12:56 BAILey 

You have subscribed to us, haven’t you? Having high ratings and reviews helps us improve the quality of our show and rank us more favorably with the search algorithms.

01:13:07 BAILey 

That means more people listen to us, spreading the joy. And can’t the world use a little more joy these days?

01:13:15 BAILey 

Now go do your part to make the world just a little better and be sure to rate and review the show. 


About the author, Frank

Frank La Vigne is a software engineer and UX geek who saw the light about Data Science at an internal Microsoft Data Science Summit in 2016. Now, he wants to share his passion for the Data Arts with the world.

He blogs regularly at and has a YouTube channel called Frank's World TV. (www.FranksWorld.TV). Frank has extensive experience in web and application development. He is also an expert in mobile and tablet engineering. You can find him on Twitter at @tableteer.