Dave Wentzel on Why You Don’t Need a Data Warehouse
In this episode of Data Driven, Frank and Andy chat with Philadelphia Microsoft Technology Center Data Architect Dave Wentzel on why you do not need a data warehouse.
Also, Frank discusses leaving Microsoft, Frank and Andy talk about five seasons of Data Driven, and even BAILeY has a sentimental moment.
Transcripts
00:00:00 BAILey
Hello and welcome to Data Driven, the podcast where we explore the emerging... wait a tick. This is the premiere episode of Season Five. Can you believe it? Data Driven started four years ago this month.
00:00:14 BAILey
Up until last season, we had a human doing the voiceover work. That is until she was replaced by an AI. Yours truly.
00:00:23 BAILey
In this episode, Frank and Andy speak to Dave Wentzel about why you don’t need a data warehouse. We’re starting off the new season with a bit of a contrarian tone.
00:00:33 BAILey
It’s a lively back-and-forth conversation that runs contrary to prevailing wisdom. Don’t say we didn’t warn you. Now on with the show.
00:00:41 Frank
Hello and welcome to Data Driven, the podcast where we... wait a minute. We’ve been saying this, Andy, for four years now. Can you believe it?
00:00:48 Andy
Four years, that’s crazy talk.
00:00:52 Frank
That’s just craziness. So I think when you and I first talked about this, that fateful, I think it was December, right after Thanksgiving but before Christmas, I was thinking about starting a podcast, and as a data scientist, I needed someone
00:01:01 Andy
Yeah, yeah.
00:01:09 Frank
who was a data engineer that could kind of round out the talent there, and obviously I wanted someone I knew, liked, and trusted.
00:01:22 Frank
And so it was you.
00:01:25 Andy
Well, I’m just glad all of the real smart data engineers you knew were busy. That’s all I got to say.
00:01:25 Frank
Much.
00:01:30 Frank
Ah, no, man. You were the first one I reached out to and the only one I would have done it with. So I was delighted when you said yes, because starting a podcast can sound like a daunting thing, particularly if you haven’t done it before.
00:01:44 Andy
Yeah, neither one of us really had. And gosh, it’s worked out. What are we up to, 180,000 downloads or something? I mean, that’s...
00:01:52 Frank
Something like that, about 180,000 downloads. I mean, we’re not Joe Rogan, but that’s OK.
00:01:57 Andy
No.
00:01:59 Andy
Yep, Yep.
00:02:01 Frank
But you know what, we’ve impacted, I think, the community in a significant way. We’ve done a number of things; we’ve innovated in how we podcast.
00:02:12 Frank
Uh, we’ve actually managed to keep a good cadence, with some exceptions.
00:02:18 Andy
Yeah, thanks.
00:02:19 Frank
You know, we finally did, earlier this year or late last year, kind of fulfill our vision of it being Data Driven TV when we actually interviewed guests
00:02:27 Andy
Yes.
00:02:32 Frank
on video. And that actually delayed the launch of the show by about three months.
00:02:38 Andy
It did, but also, um, yeah, that was interesting. But you know, it’s typical software development, right? You release a feature and then you debug it. I have this saying about that, Frank: all software is tested, some intentionally.
00:02:52 Frank
Sometimes.
00:02:53 Andy
Right?
00:02:56 Frank
I love it. But I also like how both our careers have evolved over the last four years. And, um, you know, this being the premiere episode of Season 5, we have something special lined up, but I’ll get to that in a minute.
00:02:58 Andy
Hello.
00:03:03 Andy
Oh gosh, itch.
00:03:11 Andy
June.
00:03:12 Frank
You’ve progressed in your career. You and I have worked on some projects together, our virtual summit, what we’re calling Ring Gate, which we’ll announce very, very soon. But most of all, it’s been my kind of skilling up and transition into data engineering myself.
00:03:29 BAILey
Ehm
00:03:31 Frank
Which was something that, when I joined... so this is just a job update. About a year ago, I left the role in Microsoft field sales and I went into the Microsoft Technology Center. Stick with me, there’s a point to this story. Basically, I was at the Reston MTC.
00:03:52 Frank
And basically I was the AI guy on my field sales team, but I didn’t really have deep knowledge of the typical data engineering pipe work that goes into that role. And my then manager said, you know, he’s like, hey, if you want this role, you’ve got to skill up.
00:04:12 Frank
And skill up I did, with Andy’s mentoring and a bunch of other folks that helped me kind of skill up on the data engineering side. I looked at it this morning: I’m at, like, 88 hours on Pluralsight.
00:04:25 Frank
Wow. That was from mid-May till now, and we’re recording this on April 30th, so just about a year. 88 hours. Right now I’m tracking at about 204, 205 consecutive days of getting on LinkedIn... not LinkedIn, on Pluralsight. On LinkedIn Learning, I also have a number of courses too.
00:04:31 Andy
Yeah.
00:04:43 Frank
Uh, that is something I’m proud of in terms of career evolution.
00:04:47 Andy
Absolutely, Frank, you should be. How many certs are you up to now?
00:04:50 Frank
I 87.
00:04:53 Andy
Slacker.
00:04:54 Frank
I know, I know.
00:04:54 Andy
I think I’ve got 4.
00:04:56 Frank
Ah, now I know you and I did the data engineering thing, so you have at least 11.
00:05:00 Andy
That’s true, that’s true. We did that one, and you know, it’s just been a nice journey. And I’ll take credit for this, ’cause I can. I was actually pestering you years ago. We’ve been friends since 2005, and we started doing
00:05:20 Andy
code camps here in the Richmond area
00:05:22 Andy
together, and co-founded, re-co-founded, the Richmond SQL Server Users Group and, you know, worked with the .NET users group and stuff. And I told you as soon as I saw some of your graphic art... Frank would do a keynote for the Richmond code camps, and every time he would make movie posters. The one that
00:05:41 Frank
Oh yeah.
00:05:42 Andy
still sticks out is one called Devs on a Plane.
00:05:45 Frank
Ha ha ha.
00:05:49 Andy
Oh yeah, I loved that one. That was so, so cool.
00:05:54 Andy
You know, I saw the graphic arts part of it and I just knew. I said, you’d be really good in analytics and data visualization; you should get into BI. And you were busy doing other stuff, which was cool. You were good at that too. I don’t know of anything you’ve done that you haven’t mastered. You know, when
00:06:14 Andy
things started taking a turn for you in your first rodeo at Microsoft, you got into it and took off with it. I won’t tell the story well, but you just really turned around. You focused on data, and
00:06:32 Andy
you know, I’ll say this, Frank: I was right.
00:06:35 Frank
Well, with that, he totally... I think if there’s anything I took away, it’s that I should have listened to Andy 10 years earlier.
00:06:36 Dave
You aren’t very good.
00:06:40 Frank
Uhm?
00:06:41 Frank
And that’s the big takeaway. We’ll talk about kind of that journey, ’cause I think that’s worth talking about. And I think one of the things you and I have been bouncing around is kind of interviewing each other,
00:06:55 Frank
like asking one of us those questions we have. So we definitely will do that, but not today, kids.
00:06:59 Dave
We need to.
00:07:02 Andy
Today, do we have Dave?
00:07:03 Frank
Today we have a special guest: we have Dave Wentzel. Dave Wentzel was a peer of mine when I worked at the Microsoft MTC, and that reminds me, I no longer work at Microsoft. Two weeks ago was my last day. I turned in my second blue badge.
00:07:18 Frank
And I joined a startup called electrify. We’ll talk about them at a later date, but I’m so excited to have Dave here. Dave is the data and AI architect out of the Philadelphia Microsoft Technology Center, and he’s an awesome guy, an awesome guy to work with. I worked with him when I was in field sales and I worked with him when I was in the MTC organization.
00:07:38 Frank
It was a privilege and an honor, Dave, to have you as a colleague, and it’s once again a privilege and an honor to have you here as a guest on Data Driven.
00:07:46 Dave
Well, thank you so much, appreciate that.
00:07:47 Andy
Welcome Dave.
00:07:49 Dave
Thank you.
00:07:51 Frank
So, um, for folks that don’t know what the MTC is, shocking that there are actually people that don’t know what that is, what is the MTC?
00:08:00 Dave
So basically we’re a free service to our customers, and I’m a data and AI technology architect. We talk to customers about data, and it could be anything from just, you know, hey, here’s what we’re doing, state of the art, in Azure
00:08:16 Dave
with data, but it could also be architectural design sessions where our customers bring us their architectures, and then we kind of vet it with them, give them the pros and cons, alternative ways of thinking. And then what I really enjoy doing is hackathons with customers and workshops, and just, you know, helping them to learn without just
00:08:37 Dave
taking a course somewhere, so actually using their data. And then, I guess I’m roughly a data scientist, so we also do design thinking sessions, and those are absolutely a lot of fun.
00:08:48 Dave
We did one at the MTC with CSL Behring a couple years ago and it actually won a Forrester Award, so I’m very proud of that one. And yeah, it’s a lot of fun, and it’s a good way to have executives and business people understand the actual capabilities of data science, and then within two days be able to come up with a use case
00:08:55 Andy
Oh wow, wow.
00:09:08 Dave
and actually build a prototype out. A lot of fun.
00:09:11 Frank
Yeah, the MTCs are definitely like Microsoft’s secret weapon. Although I will say, because we were in DC and we dealt with a lot of government contracts, we could not say that they were a free service. They were an already-included, paid-for service.
00:09:26 Dave
Much, much better said, yes.
00:09:28 Frank
’Cause I said free once and I got kind of slapped on the hand for saying that.
00:09:34 Frank
But you know, it really is something that, if you do have a Microsoft account team and you are encountering any kind of questions or whatever... and it’s not strictly technical. You know, we basically would engage with the business decision makers,
00:09:52 Frank
technical decision makers, all the way from kind of like, you know, hey, this is what Azure can do, this is what data can do for you, all the way down to, OK, what’s your problem? Let’s build something out, give you three days with one of the top-notch architects in the
00:10:04 Frank
space, and
00:10:07 Frank
you know, boom, we knock it out. And I enjoyed it. You know, had this opportunity not come, I would have gladly stayed another, you know, 5 to 10 years at the MTC, like a lot of people do. It’s a fun organization. So with that in mind, today we’re going to do something a little different. We’re kind of doing the
00:10:27 Frank
contrarian approach. Is that right, Dave?
00:10:29 Dave
Yes, exactly.
00:10:31 Frank
So this has actually come up. One of the things that intrigued me about your idea for the show was that this came up when I was working with a, we’ll just call it a large governmental agency known for its
00:10:42 Frank
Birds.
00:10:43 Frank
Tape.
00:10:44 Frank
That should keep it generic enough. They basically came to us and said, we want Synapse, we want a data lake, we want this, we want that. And I was like, OK, well, how much data are you talking about? And they’re like, we have maybe, you know, 5, maybe 20 gigs of data.
00:11:02 Frank
And I’m like, uh, OK, tell me, what are you trying to do? And ultimately I kind of pitched the idea, like, look, you know, you don’t have that much data, right, to make Databricks...
00:11:14 Frank
but you really want it, so
00:11:17 Frank
if you really want it, I won’t stop you, but I think it’s kind of overkill. I think instead of using a steak knife to cut the steak, you’re using a chainsaw.
00:11:25 Frank
And.
00:11:27 Frank
You know, they kind of came back, and ultimately what won the day was they couldn’t get approval for whatever we recommended, ’cause it didn’t get stamped by their
00:11:37 Frank
people for security usage yet, and things like that. So they ended up doing kind of the right thing because of their own bureaucracy, which
00:11:44 Frank
is kind of weird. It’s kind of like dividing by zero and seeing the universe fold in on itself.
00:11:50 Frank
But, um, so the topic of today is kind of like, no, you don’t need a data warehouse. Did I get that right?
00:11:58 Dave
Exactly, that’s what I believe in, and I’ve believed in it since I was in college and I first learned about data warehouses. I’m not saying data warehouses are always bad; they definitely have their
00:12:10 Dave
use cases. But in 2021, when we’re talking about advanced analytics and we’re trying to tell customers you need to be more predictive and prescriptive,
00:12:19 Dave
the data warehouse really doesn’t deliver.
00:12:23 Frank
Really, how so? ’Cause that’s certainly not the party line. I’m not going to say which party it was, you can figure it out, but why would you say that?
00:12:33 Dave
OK, so take
00:12:33 Dave
a step back here, right? We’re all data consultants, or we were at some point in our life, and probably most of the listeners are. I’ve been doing this since the mid-90s in college, and when I first started I had an internship with a consumer packaged goods company. They made candy
00:12:52 Dave
bars, and they said, hey, would you want to do an internship and take a look at our data and figure out where is the best spot to put candy on a shelf so that we sell more candy to kids, right? So we used data for that. At the time, that was known as business intelligence in the industry. Nowadays business intelligence means something totally different. In reality, it’s really closer to what
00:13:12 Dave
today we would call data science, right? So my tools of choice were SQL, although I didn’t know what SQL was at the time, and we had this goofy SQL engine, and essentially something called SPSS, which is roughly the equivalent of, like, R or a stats package, something
00:13:28 Dave
like that. And we kind of looked at data as just, you know, I have data, and let me find the nuggets of gold, and I’m not going to concern myself with schema. And that is, I think, the biggest problem with data warehouses. But take it, you know, a meta layer higher, right? Talk to the average business executive, like, you know, a CTO or CEO,
00:13:48 Dave
and tell them, as a consultant, you’re going to go in and build them a data warehouse.
00:13:53 Dave
Instantly, that’s a political statement you just made. Data warehouses have connotations of, you know, risky projects, over-budget projects as far as time and money, and, you know, a lot of times they fail, and executives don’t want to hear that. So we’ve learned interesting ways to avoid the conversation of calling things a data warehouse.
00:14:13 Dave
You know we call them other things in the industry to try to avoid that connotation.
00:14:17 Dave
But you know, ultimately, that’s a problem, and there’s reasons for that. But most executives don’t know why data warehouses are risky. They just know, hey, we try building a data warehouse every seven years, and it tends to fail, and we’re not really sure why, so we’re going to avoid it. And even companies that have a successful data warehouse, and there aren’t many,
00:14:37 Dave
you know, they have problems adding new features to data warehouses. And again, that’s problematic. So they avoid that conversation as much as possible because of the risk.
00:14:49 Dave
But when you stop and look at it, and just interrupt me at any time with questions or, you know, pushback, when you look at it, like, why are data warehouses particularly problematic, and why do they fail and have a lot of risk? I mean, I’ve been doing this for many, many customers over many years, and I’ve kind of seen the patterns.
00:15:09 Dave
When you step back and you think about this, you know, with a little bit of introspection, I’ll tell you, I’ve narrowed it down to three main causes. One is we spend a lot of time doing requirements gathering.
00:15:22 Dave
Number two is we spend a lot of time doing data modeling, and number three is we spend a lot of time doing ETL. And I can avoid all of that, or most of that anyway, if I just don’t do it and I do something else instead of the data warehouse. So I’ll walk you through a use case here, OK?
00:15:42 Frank
That’s interesting. No, now I see where you’re coming from. I didn’t mean to interrupt you, but now I kind of see where you’re coming from.
00:15:49 Frank
I have billions of questions forming in my head, but go ahead.
00:15:53 Dave
Let me give you a canonical use case. OK, you’re a consultant or you’re a data professional.
00:15:59 Dave
Some analyst comes up to you and says, hey, I need a report that shows X, Y and Z, help me build it, right? So first thing you do is you go out and say, well, that data for X, Y and Z doesn’t currently exist in our data warehouse, right? Well, actually, take a step back. First thing you do is a whole lot of requirements gathering, right? Well, what do we need this data for? Where are we going to get it from? What if we can’t get the data, you know?
00:16:20 Dave
All these kinds of questions. How are we going to massage it into the formats we need? It’s a lot of requirements gathering, so put that
00:16:25 Dave
aside, right? The next step is, OK, well, you say, where am I going to stick this data ultimately so I can report off of it, right? And again, if you guys don’t do it that way, that’s great, but this is a common pattern that I see anyway. So they’ll say, alright, well, we need to figure out where it goes in the data warehouse. So we have these couple pieces of data.
00:16:45 Dave
And I don’t know, do they go in the fact table, the dimension table? Which dimension table? Do we even have the dimension we need? Should we use a junk dimension? And this becomes religious arguments with data modelers, and I can’t stand data modeling. And then bring in the slowly changing dimension type 2 discussion, and oh boy, that just knocks everything off the
00:17:04 Dave
rails. So, data modeling, OK. Then the next thing is you say, OK, now that I have a data model, I know that I have this data somewhere, so I need to bring in an ETL developer that’s going to get it from somewhere into the data warehouse. Now the data is in the data warehouse, right? And that ETL process takes some time, right? Potentially days, weeks and months to do that code.
00:17:24 Dave
And then we say, OK, you know, if we’re following, you know, Inmon or the Kimball method, we might have data marts at this point, we might have Analysis Services cubes, and then we have a presentation tier where we show the report, and that’s Power BI, right? It’s only at that point when we can take the report back to that person that requested it, the analyst, and
00:17:43 Dave
say, hey, here’s your data. Here’s what you asked for. Tell me how much of a good job I did. And invariably, what happens? They look at it and say, well, this isn’t at all what I asked for, and why did it take you three months to do this? Or they’ll say something like, well, our requirements changed. We don’t need to see that anymore, and we need something else.
00:18:01 Dave
Any way you slice and dice it, you’re back to the drawing board, right? So it’s not a good approach, and executives see that. So what’s the answer to this, right?
00:18:12 Dave
And I’ll, you know, cut to the chase. I think the way you do it, OK, is you do something like a data lake. OK, but don’t get hung up on terms, but it’s something like a data lake. So let’s just take a step back and say, well, how would the project work if I were a data scientist doing it instead of, you know, more of a business intelligence
00:18:32 Dave
type person that does ETL and data modeling and things like that? Alright, so as the data scientist doing it, OK, here’s how I would do it and how I’ve always learned to do it.
00:18:41 Dave
We don’t do any requirements gathering or data modeling or anything until I get some data. Why do we talk about data? Why aren’t we just looking at data? So the first thing you do is go out and get data. Now sometimes that’s hard to do OK, because you may not necessarily have the data yet because it’s a new product that’s going to be generating data, so we have to start thinking about proxies in the data. Things like that. But forget about all that.
00:19:01 Dave
Let’s assume we can get the data from somewhere.
00:19:04 Dave
So we get that data and we stick it into this data lake thing, OK, and we just land it there, right? And I love to do hackathons with customers on this. Next thing, we take that data in the data lake, and I sit with the business analysts and we start talking about it. What is it you really want to see? What is that nugget of gold? And we talk about it.
00:19:24 Dave
And we look at the data and we massage the data, and we take the data and we join it with other pieces of data we may already have in our data warehouse, whatever it is, right? And we’re constantly learning about that data. So when we’re sitting side by side, I’m learning about the business, ’cause I probably know nothing about the use case.
00:19:40 Dave
And they’re learning about what my thought processes are as a data scientist. And a lot of times we’ll find, just as a side note, that the business analyst that sits with me, they look at the code we’re writing, and we have tricks to write this code. We use these interesting things called Jupyter notebooks, where the data tells a story, and that’s the key thing, and we’re learning it together.
00:20:00 Dave
I’ve had business analysts look at me and say, wow, all you’re really doing is taking data, enriching it a little bit, putting it in like a little temp table or another area of the data lake, and then you’re enriching a little more. And then you’re doing that, and we’re building some visualizations and we’re thinking through problems. So yeah, that’s all we do. This is not difficult stuff.
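The "land it, enrich it a little, set it aside, enrich it a little more" loop Dave describes can be sketched in a few lines of Python. This is only an illustration, not code from the episode: the order records, the column names, and the two helper functions are all made up for the example, and plain lists of dictionaries stand in for files in a data lake.

```python
from collections import defaultdict

# Hypothetical raw records, landed exactly as they arrived (no modeling up front).
raw_orders = [
    {"order_id": 1, "region": "east", "units": 3, "unit_price": 2.50},
    {"order_id": 2, "region": "west", "units": 1, "unit_price": 4.00},
    {"order_id": 3, "region": "east", "units": 2, "unit_price": 2.50},
]

def add_revenue(rows):
    """Stage 1: enrich each row with a derived revenue column (new rows, raw data untouched)."""
    return [{**r, "revenue": r["units"] * r["unit_price"]} for r in rows]

def revenue_by_region(rows):
    """Stage 2: aggregate the enriched rows into a small summary, like a temp table."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["region"]] += r["revenue"]
    return dict(totals)

enriched = add_revenue(raw_orders)     # first small enrichment
summary = revenue_by_region(enriched)  # second small enrichment
print(summary)
```

Each stage leaves the landed data as it was and writes a new intermediate result, which is roughly what a notebook cell per step looks like when an analyst and a data scientist work through the data together.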
00:20:19 Dave
Right, so then we sit with the business analysts and we find the nugget of gold. So just assume we write some queries, we figure out what that nugget of gold is, right? So now we’re kind of done, right? A lot of times what we’ll find is we have to do an evaluation stage. But we found the business thing that they asked us to find
00:20:39 Dave
originally, and never did we do data modeling or ETL. Think about that for a second. But we do the evaluation stage, when we say, now, what is this data, this nugget of gold that we found,
00:20:49 Dave
And you know, where should it ultimately go? Maybe it should go in the data warehouse. So now we know, right? OK, here’s the nugget of gold we found. This is what it looks like. And now a data modeler can come in there and say without any doubt. Well, that’s simple. That should go in the fact table over here or that should be in this dimension or that junk dimension or whatever the situation is. And we know.
00:21:09 Dave
Because we’ve solved the problem, the modeling exercise is easier if we’re going to do it.
00:21:14 Dave
But there’s no, you know... So now we might have to do a little bit of ETL, but a lot of times we find, in 2021, we don’t need this data in the data warehouse. So what happens if the nugget of gold you’re trying to find is something that’s like, forecast my sales for the next, you know, two quarters? OK, well, you know, data warehouses tend to be historical data, so if I’m doing sales forecasting,
00:21:35 Dave
why would I put that in the data warehouse anyway? OK, or here’s a common one marketing people love
00:21:40 Dave
to say: what percentage of my sales is attributable to my Facebook marketing efforts versus my Instagram marketing efforts? OK, that really doesn’t seem like data that I would want to necessarily put into a data warehouse, so maybe I don’t need it there, right? Or here’s simple things, right? In 2021, we’re trying to get more towards prescriptive analytics.
00:22:00 Dave
Prescriptive analytics, meaning what do I do next, right? Business people always want to know what’s the next thing I should do, right? Well, again, if I’m trying to say what should my next brand campaign
00:22:12 Dave
be, I’m going to do some, you know, interesting things with data and I’m going to come up with, hopefully, an answer. But does that answer go in the data warehouse? Does that need to be on a, you know, Power BI dashboard? Maybe it doesn’t, and that’s the key thing. Maybe it needs to be somewhere else. So I said a lot there.
00:22:31 Frank
Yeah, yeah, no, that was awesome. There’s a lot to unpack. We could probably spend an entire season kind of unpacking that, but
00:22:37 Frank
I’ll kind of take it back, I’ll boil it down to the simplest thing. Data warehouses,
00:22:42 Frank
and Andy can keep me honest here, so could you, Dave, data warehouses kind of stem from the era where you had your OLTPs and your OLAPs, right? Online transaction processing and online analytical processing, right? They were split out originally, were they not, because you didn’t want people doing number
00:23:03 Frank
crunching to mess with the people doing the actual sales, right, and the real-time data. Is that a gross simplification? Is it on the mark, or what?
00:23:12 Dave
I I agree with that.
00:23:13 Andy
Yeah me too.
00:23:14 Frank
But now, in the age of the cloud, when you have kind of this elastic compute or elastic databases,
00:23:22 Frank
that kind of reason for existence, I know there’s a fancy French term for that, that reason is kind of gone now, assuming you’ve gone entirely into the cloud, where there’s more elastic compute. Is that also the case?
00:23:38 Dave
Uh, well, my answer would be partially. So yes, you are correct, I don’t think this is 1997 anymore, where the Oracle DBA says thou shalt not run analytics queries that bring my website down. These systems are resilient, like you said. However, you know, a lot of times when I’m doing those
00:23:58 Dave
analytics with the business analyst and we’re trying to write those queries, we have to bring in data from multiple places. So even if I hit the OLTP, you know, server directly and the DBA doesn’t slap my hand
00:24:08 Dave
for that, you know, I still need to bring in data from other places and be able to do analytics on that. So, you know, a lot of times we can do what you’re saying, but a lot of times we can’t, too, at least not in 2021 as it stands.
00:24:23 Frank
So it’s not a siloed system for the sake of performance. It may be a siloed
00:24:29 Frank
system for the sake of orchestration.
00:24:31 Dave
Yeah, exactly, yep. And then a lot of times these systems of record, these transactional systems, they’re meant to be transactional systems. They’re not necessarily meant to keep history. When we do this stuff as a data scientist in the data lake, data lakes are structured
00:24:49 Dave
on the kind of longitudinal axis, so they’re structured by time, essentially. What I mean by that is, if you look at a data warehouse, you may have a fact table that has the order information down to, potentially, the order line level, and that’s interesting. OK, that’s the grain of detail, but that doesn’t tell the full lifecycle of the
00:25:09 Dave
order, right? So the full lifecycle of the order goes all the way back
00:25:12 Dave
to the user came to my website based on what referrer: was it Instagram? Was it Facebook, right? Once they got in, how long were they sitting, you know, before they made a decision to put it in their shopping cart? How long was the product in the shopping cart before they hit buy? You know, all these types of things mostly aren’t in most data
00:25:32 Dave
warehouses, and there’s structural tension, potentially, when you’re looking at the analytics on the time axis. So the time axis is much better done in a data lake. I can say more on that, but I’ll let you respond.
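To make the "structured by time" point concrete, here is a small Python sketch of reading an order’s lifecycle from a raw event log rather than from an order-line fact table. It is an illustration under assumptions, not anything shown in the episode: the event names, the `session` and `referrer` fields, and the `order_lifecycle` helper are all hypothetical.

```python
from datetime import datetime

# Hypothetical clickstream events, stored in arrival order (not time order).
events = [
    {"session": "s1", "ts": "2021-04-30T10:05:00", "event": "add_to_cart"},
    {"session": "s1", "ts": "2021-04-30T10:00:00", "event": "landed", "referrer": "instagram"},
    {"session": "s1", "ts": "2021-04-30T10:20:00", "event": "purchase"},
]

def order_lifecycle(events, session):
    """Order one session's events by time and measure the gaps between the steps."""
    rows = sorted(
        (e for e in events if e["session"] == session),
        key=lambda e: datetime.fromisoformat(e["ts"]),
    )
    first_seen = {}
    for e in rows:
        first_seen.setdefault(e["event"], e)  # keep the earliest event of each kind
    t = {name: datetime.fromisoformat(e["ts"]) for name, e in first_seen.items()}
    return {
        "referrer": first_seen["landed"].get("referrer"),
        "minutes_to_cart": (t["add_to_cart"] - t["landed"]).total_seconds() / 60,
        "minutes_in_cart": (t["purchase"] - t["add_to_cart"]).total_seconds() / 60,
    }

print(order_lifecycle(events, "s1"))
```

The referrer, the time before the item went into the cart, and the time it sat in the cart all come from the time ordering of the events, which is the information an order-line grain typically does not carry.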
00:25:46 Frank
That’s an interesting thing, what you’re talking about, because we’ve all had those experiences where we’ll search for something, we’ll even put it in our Amazon cart. And I can say Amazon, ’cause I don’t work at Microsoft anymore. But I mean, clearly this is being done, and I don’t know, do you know for a fact, or is it conjecture, that these big kind of retailers, e-tailers,
00:26:07 Frank
you know, they’re not using traditional data warehouses?
00:26:11 Dave
Well, again, a lot of times they do have, you know, a lot of that information, the entire sales lifecycle, in a data warehouse somewhere. But here’s the thing: it goes into the data warehouse once it becomes operational. So once I just need to put it in a report, then it goes into the data warehouse, and I’m fine with that. Just remember the most
00:26:24 Frank
Ooh.
00:26:31 Dave
important thing, if I’m trying to figure out... So let’s take a different canonical use case here. OK, let’s say you’re a marketing department and you invest $10 million a year in Facebook advertising.
00:26:41 Dave
And your marketing team comes to you and they say, hey, we want to do Instagram right? How much money should we spend on Instagram or should we even spend money on Instagram right now? Think about your average data warehouse. Is it going to be able to answer that question? Is it going to be able to answer that question in that timely manner? I don’t think so, right? So, here’s where we start gathering data.
00:27:01 Dave
About our users, right? We do customer segmentation. All this kind of stuff in what I call a system of insight and a system of insight is forward looking right. It’s not necessarily the.
00:27:12 Dave
History that you would have in a data warehouse and these things, again, they’re much better done in a data Lake. All right, but don’t get hung up on the terms you know you can do this stuff in a standard database. What I’m suggesting is you don’t need to do it in a star schema format where there’s a this heavy reliance on on on modeling the data correctly in the Star schema.
00:27:33 Dave
Getting the data ETL correctly into that star schema.
00:27:36 Dave
And then dealing with the slowly changing you know dimension Type 2. If I’m simply asking or if I’m simply answering the question, should I invest in Instagram? Marketing? OK, do I even need a power BI report for that? I don’t even know what that would look like, right? And this is what you know. Again, we’ve all seen the slides right.
00:27:56 Dave
You know, Oh yeah.
00:27:59 Dave
Mr CTO, Mr CEO, you want to take your data team from the descriptive analytics to the predictive analytics to the you know the prescriptive analytics. So what they’re saying there is the rearview mirror to the mill, and the predictive to the what do I do?
00:28:13 Dave
Next, and you know, executives look at those slides and they go. Yeah, yeah I want that, but they don’t know what the words mean and what the words mean is really I I just need to answer a question with data that normally I would answer with my gut and and I want to be you know more or less data driven on that so you know maybe when you look at that.
00:28:34 Dave
Data you start to realize, if I’m going back to that Facebook versus Instagram conversation, maybe when I do the analytics the overlap between my Facebook users and Instagram users is 80, 85% and it makes absolutely no sense to do an Instagram marketing campaign.
00:28:48 Andy
Right?
00:28:49 Dave
So now you just saved yourself potentially $10 million in Instagram spend. Not to mention you gave your customers a better experience because they’re not getting bombarded, you know, by your advertisements on yet another platform.
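As a rough illustration of the overlap check Dave describes, here is a minimal Python sketch. The customer IDs and the `audience_overlap` helper are hypothetical, and it assumes "overlap" means the share of the candidate platform’s audience that is already reached on the existing platform, which is only one of several reasonable definitions.

```python
def audience_overlap(existing_users, candidate_users):
    """Share of the candidate platform's users already reached on the existing one."""
    existing = set(existing_users)
    candidate = set(candidate_users)
    if not candidate:
        return 0.0
    return len(existing & candidate) / len(candidate)


# Hypothetical, already-matched customer IDs for each platform.
facebook_ids = {"c01", "c02", "c03", "c04", "c05", "c06"}
instagram_ids = {"c02", "c03", "c04", "c05", "c09"}

overlap = audience_overlap(facebook_ids, instagram_ids)
print(f"{overlap:.0%} of the Instagram audience is already reached on Facebook")
```

A high ratio here is the kind of signal that would argue against a separate Instagram campaign, and it needs nothing more than two ID lists to compute.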
00:29:01 Andy
So Dave, I would. I’d like to give you props on product placement there for mentioning data driven in your last name.
00:29:10 Andy
I appreciate that. Well, we we need all the help we can get, I promise.
00:29:10 Dave
You caught that.
00:29:12 Frank
I liked it.
00:29:15 Andy
Uhm, I love what you’re saying. I, as a, you know, a practitioner of data warehousing for decades now, and ETL, some of the things that I think about when I hear, and you’re not the only person I’ve heard say this stuff, some of the things I think about are things like data quality.
00:29:37 Andy
And master data management, and that’s hard to do anywhere. How does that play into your strategy in, you know, in using a data lake like obj.
00:29:49 Dave
OK, I’m going to say something controversial here. Let me just.
00:29:53 Andy
No.
00:29:54 Dave
Let me finish this.
00:29:55 Andy
Go ahead.
00:29:57 Dave
Let let me finish my thought before you jump down my throat. OK, but I’m gonna tell you right now, data quality doesn’t matter. OK, let me say that again, data quality doesn’t matter. OK, so I talked to CTO of a Heart Hospital system, maybe it was before the pandemic, right? And we were talking about.
00:30:07 Frank
Interesting.
00:30:15 Dave
Interesting things he could do with his hospital data right? As far as the the prescriptive.
00:30:20 Dave
And so forth. And he looked at me, pounded his hand, his fists on the table. Now, he was the CTO. He was also an MD. So this guy is a smart guy. OK, and he said we’re not doing another data project until the data quality improves, right? And I said to him, you know, data quality doesn’t matter and everybody in the room was like, whoa, ’cause this guy’s hot button.
Yeah.
00:30:41 Dave
It was data quality and I said let me just explain what I mean by that. OK, if you’re in the nuclear industry, data quality matters. If you’re in healthcare, data quality matters. OK, if you’re designing a system that does accounting and debits don’t equal credits.
00:30:56 Dave
The first thing that will happen is accountants will never use your system again because debits don’t equal credits. Data quality ***** So data quality matters in certain cases. But when I hear statements like, you know, well, what about the data quality? And we’re not doing data projects until our data quality improves. I question that because you’re saying one thing but your
00:31:16 Dave
actions are doing something totally different, so I see this with a lot of customers. They have the same data quality projects going on for 20.
00:31:23 Dave
Years and years, and again I think it’s because executives, so the C-suite, they hear from the IT and the data groups data quality ***** and they don’t know what that means. So they regurgitate it and they say, hey, data quality ***** We’re not doing data projects till data quality improves, all that kind of stuff. But here’s the thing.
00:31:41 Dave
If you had serious bugs in your code, what we call in has in the healthcare industry. Critical click.
00:31:46 Dave
Where you know it’s going to cause a problem and somebody is going to die, or if it’s in the nuclear, you know, field, you know we’re going to have a, you know, a Chernobyl-style event. Then obviously we would fix those bugs, right? And we’ve all fixed bugs on data quality when, you know, a system went out and it wasn’t properly tested, so those types of data quality issues
00:31:57 Dave
Right?
00:32:06 Dave
We would have fixed already. OK, so that’s the first point.
00:32:11 Dave
Is a lot of times you know, we think data is of bad quality, but really our understanding of the data is what is lacking. So every time we do hackathons with customers and we bring data into a data Lake, then let’s say it’s Salesforce Data app data. We’ll start writing some queries and something will happen and you know again my debits don’t equal my credits.
00:32:31 Dave
My sales totals don’t match what’s coming out of the system of record, and then, you know, an executive will sit there, or a business person will say, see, this is what I’m talking about. All the data in our company is bad, garbage in, garbage out, and they’ll start throwing the platitudes. And I sit there with
00:32:44 Dave
This smirk and I think to myself, now, you’re a profitable company, you know, you’re a multibillion dollar Fortune 500 company. I doubt that the system of record for your accounting data or your CRM system or your ERP system is wrong. My guess is I’m not smart enough to write a decent query for you, right?
00:33:05 Dave
And that’s usually what it comes down to, and we’ll come back. And you know, we’ll say, hey, you know why am I seeing this? And then somebody will say, well, your query is wrong, you idiot, and then I’ll fix it and suddenly the data quality problems go.
00:33:16 Dave
Away. As data scientists, we see data, you know, quality problems all the time, and honestly data scientists love dirty data because it’s the dirty data that gives you the nuggets of wisdom. Right, now we sit there and ask a question: why is the data dirty? And that becomes a very interesting thing.
00:33:36 Dave
Right like why?
00:33:37 Frank
So there’s signal in the noise you’ve got.
00:33:39 Dave
You got it right, and that’s very valuable information. So here I’ll give you a quick use case here. I was called in as a data scientist number of years ago and it was for call center and they said hey, we want to do real time, you know, call center analytics.
00:33:54 Dave
All this kind of stuff, so they used a third-party call center management software system and we started ingesting the data in real time into the data lake, and we’re ingesting, and we gave him some basic reports and said just verify this stuff is right. Guy looks at me. He’s looking at the data, he’s just, huh, I knew it. And I said what does that mean? And he said, well, look, he says, it says the average time our people are on the phone is
00:34:14 Dave
3 1/2 hours a day. So look at this report I get from the vendor. It’s saying they’re on the phone 6 hours of the day. He says, now look out the window. He says they’re all eating lunch and smoking cigarettes. I know they’re not on the phone, but the report’s telling
00:34:25 Dave
me it is. So in other words, the data quality is bad. He says your report’s not showing that issue. So the first thing I do is I say, hey, you know it’s probably me. I’m not that smart, right? And so I went back to the vendor and I said, hey, we’re calling your API at night at midnight, and you know, you’re saying the guys are on the phone 6 hours. I’m looking at your real time
00:34:45 Dave
Data feed through a different API. I’m saying it’s 3 1/2 hours. When I aggregate the logins and logouts, and that kind of stuff. What am I doing wrong and the guy looked at it and he said, you know, what? Give me 24 hours and we’ll have the problem fixed here. I uncovered a bug in his data quality, pointed it out to him, he went, he fixed his data.
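The reconciliation Dave describes, aggregating raw call events and comparing the totals with the vendor’s nightly report, could look something like the Python sketch below. The event tuples, agent names, and the half-hour tolerance are all made up for illustration; the vendor’s actual API and data shapes are not described in the conversation.

```python
from collections import defaultdict
from datetime import datetime


def hours_on_phone(events):
    """Sum call durations per agent from (agent, start_iso, end_iso) tuples."""
    totals = defaultdict(float)
    for agent, start, end in events:
        delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
        totals[agent] += delta.total_seconds() / 3600
    return dict(totals)


def flag_discrepancies(computed, reported, tolerance_hours=0.5):
    """Return agents whose reported hours differ from the computed hours."""
    flagged = {}
    for agent, reported_hours in reported.items():
        computed_hours = computed.get(agent, 0.0)
        if abs(computed_hours - reported_hours) > tolerance_hours:
            flagged[agent] = (computed_hours, reported_hours)
    return flagged


# Hypothetical real-time feed: one tuple per completed call.
events = [
    ("agent_1", "2021-05-03T09:00:00", "2021-05-03T11:00:00"),
    ("agent_1", "2021-05-03T13:00:00", "2021-05-03T14:30:00"),
    ("agent_2", "2021-05-03T09:00:00", "2021-05-03T15:00:00"),
]
# Hypothetical nightly vendor report, in hours per agent.
vendor_report = {"agent_1": 6.0, "agent_2": 6.0}

computed = hours_on_phone(events)
flagged = flag_discrepancies(computed, vendor_report)
print(flagged)
```

The point of the sketch is the comparison itself: a mismatch does not say which side is wrong, only that someone should go and ask.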
00:35:04 Dave
And that happens a lot, right? So use these opportunities to say, hey we found some problems, go and fix them. The last thing you want to do as an executive is fund. This is my opinion, a data quality initiative, right? Because data quality initiatives that are driven from the IT organization, they’re boondoggles, right? They’re never going to.
00:35:24 Dave
Team and this is what executives, hey. Now if the business comes in and they say our data quality and the SAP system is terrible and we’re funding an initiative to fix it. And here’s the 10 things that we can’t live without. And you know, because the data quality is bad, we’re going to fix it.
00:35:41 Dave
And those types of data quality projects, obviously they’re going to succeed, or they have a better chance of succeeding, because they’re driven from a business problem, right? But data quality isn’t a destination, it’s a journey.
00:35:55 Andy
No, I get it. I really loved your. You know the fact that you pointed out that there is some. There’s some baby in the bathwater there that you can. You know, sometimes an outlier is just some crazy data point that you want to ignore. But there are other times where you want to count that, and maybe even search.
00:36:14 Andy
For them. It depends on if you’re doing inclusive or exclusive filtering, and you know, you’ve probably forgotten way more about this than I’ll ever know, but I get that. I understand what you’re saying, and sometimes that is the gold nugget.
00:36:30 Dave
Sure absolutely absolutely. Hey, I’ll give you a quick story. Sometimes data quality is what you don’t want. Sometimes you want dirty data, so I was doing some data so now that sounds weird. I I was doing some data science work for a large city, one of the largest cities, if not the largest city in the United States.
00:36:31 Andy
Yeah.
00:36:41 Andy
No, I get it.
00:36:49 Dave
And they have multiple hospitals.
00:36:50 Dave
And they said we want you to match. This is years ago, years ago, probably 15 years ago and they said we want to match up from all these different systems of record. For for patients we want to match up all the patients and we want to create 1 Golden record. So we did that.
00:37:05 Andy
That’s awesome we did that.
00:37:07 Dave
Like that we tested it all we we swear we had it right and they went out and they put it in production. Here’s the problem, OK?
00:37:16 Dave
Again, they were gaming the system. The patients OK, and some of these places. They do like drug rehab type of places OK and they knew they could get you know their methadone treatments if they went to the first place where they, the patient information was in. And maybe the first. Their birthday wasn’t right or.
00:37:34 Dave
Social Security number wasn’t right. Then they could walk down the street, go to a different methadone clinic and get the same dose with slightly different patient information. When we fix those issues, suddenly they weren’t able to get all the methadone they needed caused a major health crisis.
00:37:49 Dave
Quickly they came back and said undo all of your data quality initiatives, add back the dirty data because now we have a major crisis. And yeah, that’s a true story.
00:37:53 BAILey
Oh no, how?
00:38:00 Frank
That’s interesting and scary all at the same time.
00:38:01 Andy
Interest.
00:38:04 Dave
Yes, scary, definitely scary.
00:38:07 Frank
There was a. I think this might have been pre Microsoft. I was in a room with some interesting folk.
00:38:15 Frank
And for those who don’t know, I’m in a DC area, so the interesting folk can be very interesting and they were talking about a similar problem how certain certain bad actors would intentionally misspell their name.
00:38:28 Frank
So they would get off of certain lists, and because their names were not in the regular alphabet as we know it, the Latin alphabet, they were able to get away with that pretty well.
00:38:40 Frank
And this it was an interesting conversation, so it’s fascinating how even little stuff like that becomes a problem at these institutions. You know, you’ll you’ll.
00:38:54 Frank
You know, I mean, my last name has two is split up into two parts, but not every system recognizes that. So I like. So I I mean I I kind of deal with that a lot so I can imagine that and and and. There’s also stories where somebody change, put their license plate as null.
00:39:10 Frank
And like the nightmare that that caused, and people whose last name is Null, like, causes a lot of problems. And it’s just interesting stuff. And if you’ve not seen the cartoon about Little Bobby Tables, just use Google or Bing and find it, it’s hilarious.
00:39:16 Frank
This.
00:39:29 Frank
Uhm?
00:39:30 Frank
But it’s it’s interesting that that ’cause you’re right. I mean, if you if you kind of say.
00:39:31 Frank
It.
00:39:37 Frank
You know we’re not going to do this until our data quality problem is fixed, and I think you’re right. I think it’s regurgitated because, you know, people get weird about their data and not just their personal data but their organizational data. In fact, one of the earliest consulting gigs I actually didn’t get.
00:39:53 Frank
Because I was telling them like, well, you know I was, I scoped out the project and I said, well, the first week I’m going to evaluate it and start, you know, cleaning the data and then this customer said no no, our data is already clean. It’s everything is a normalized form.
00:40:07 Frank
And I was like.
00:40:08 Frank
Uhm, yeah, that’s not what I meant, you know, and so ever since then I kind of use the term shaping the data because that doesn’t
00:40:19 Frank
rub anyone the wrong way, yeah.
00:40:23 Dave
Yeah, you got to avoid.
00:40:24 Dave
Those kind of trigger words.
00:40:26 Frank
Right, right now that’s that’s. I mean, it’s interesting. So so.
00:40:29 Andy
Yeah.
00:40:32 Frank
When would you want a datawarehouse like? It’s not necessarily, you know.
00:40:37 Frank
When would you want one? Well.
00:40:43 Dave
I guess the snarky answer would be possibly never.
00:40:46 Frank
Right, right? Well, I mean, that’s a good question because I mean, is data warehousing the whole OLTP versus OLAP?
00:40:53 Frank
Is that an artifact of? Like you said, the late 90s? Or when you you you, you, you, you own the metal and if you needed to upgrade the metal you had to buy the metal and install it not go to the Azure portal and just click you know up you know.
00:41:10 Frank
Like what?
00:41:11 Frank
Is it a relic of a time gone by?
00:41:14 Dave
It might be dumb. I’ll tell you this, here’s how I look at it. When I what I don’t like about data warehouses, is the star schema and it’s the. It’s the slowly changing dimension type 2. So when you ask the average you know ETL developer or just an IT manager. And even a CTO and you say.
00:41:34 Dave
You know, just a quick thought exercise here for everybody. The last time you developed an ETL system for a data warehouse.
00:41:42 Dave
OK, how much of your time was spent developing the queries for the SCD Type 2 and what was the percentage breakdown? Usually it comes back. This is anecdotal. Don’t hold me to this, but whenever I ask this question to customers it usually comes down to, well, you know 20% of our ETL is just, you know, getting stuff in the facts and dims and then the 80 the remaining 80%.
00:42:03 Dave
Is trying to figure out how to do the SCD type 2 stuff. You know expire the previous row, build a new row, that kind of stuff, right? Say OK.
00:42:09 Dave
80% got it OK.
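For readers who have not built one, the "expire the previous row, build a new row" step of a slowly changing dimension Type 2 can be sketched in a few lines of Python. This is a simplified in-memory illustration under assumed column names (`is_active`, `start_date`, `end_date`); real warehouse implementations do this in SQL or an ETL tool and handle many more edge cases.

```python
from datetime import date


def apply_scd2_change(rows, customer_id, new_attributes, change_date):
    """Expire the active row for a customer and append a new active version."""
    current = next(
        (r for r in rows if r["customer_id"] == customer_id and r["is_active"] == 1),
        None,
    )
    if current is not None:
        # No change in the tracked attributes: keep the current version.
        if all(current.get(k) == v for k, v in new_attributes.items()):
            return rows
        # Expire the previous row.
        current["is_active"] = 0
        current["end_date"] = change_date
        base = dict(current)
    else:
        base = {"customer_id": customer_id}
    # Build the new active row, carrying over unchanged attributes.
    new_row = {
        **base,
        **new_attributes,
        "start_date": change_date,
        "end_date": None,
        "is_active": 1,
    }
    rows.append(new_row)
    return rows


# Hypothetical customer dimension with one active row.
dim_customer = [
    {"customer_id": 123, "city": "Philadelphia",
     "start_date": date(2020, 1, 1), "end_date": None, "is_active": 1},
]
apply_scd2_change(dim_customer, 123, {"city": "Pittsburgh"}, date(2021, 6, 1))
active_rows = [r for r in dim_customer if r["is_active"] == 1]
print(len(dim_customer), active_rows[0]["city"])
```

Even this toy version shows where the effort goes: every load has to find the current row, compare attributes, expire, and insert, for every dimension that tracks history.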
00:42:11 Dave
How many of your queries are actually looking at those historical, expired, non-active SCD Type 2 rows?
00:42:20 Dave
And they’ll say probably about 5% of the reports. OK, well, you seem to have violated the Pareto principle there in theory, and that’s my big problem. We spend a lot of time getting data into that historical format, and very few people are using it. And when we talk about self-service analytics, the average business analyst gets confused by, you know, the SCD
00:42:40 Dave
Type 2. Hey, I see customer 123 five times in my table. Why is that? And then you gotta explain to them, well, you have to take your query and say, you know, is active equals one, because the rest is the
00:42:50 Dave
history. OK, if you do stuff like this in a data lake, and this kind of gets into the weeds a little bit, the data lakes, again, if they’re done correctly, and that’s the key thing, a data lake is structured on the time dimension. So what I mean by that is I can go back and with a very simple query I can rehydrate and get you the SCD Type 2
00:43:10 Dave
For a given, you know, customer 123 in no time flat. So my point is, even if you’re going to do a data warehouse, you might want to defer some of those decisions if you don’t need them today, right? So when you talk to the average data warehouse practitioner, they’ll say, hey, look,
00:43:26 Dave
if we don’t build the SCD Type 2 structures today, in six months they’re going to ask for them, and then I’m going to have no history and then I’m going to get myself in trouble. And what I’m saying is in the data lake you get it for free. It’s not a modeling exercise, it’s not an ETL effort. You get the SCD, you get the history, I should say, for free. Now you can say,
00:43:47 Dave
hey, I’m going to defer the SCD Type 2 ETL code until a later date when it’s proven it’s
00:43:52 Dave
needed, right? One last thing. A lot of times when we’re doing, as data scientists, OK, and people get confused as to what a data scientist is, and it might be like an interesting conversation to talk about that. The data scientists, you know, what they’re looking for, you know, primarily is looking at lifecycles of things, you know. How do things change
00:44:13 Dave
over time, right? But the thing with the data scientist is, and this goes for pretty much anybody that’s doing analytics, is they want to see everything on the row. So what I mean by that is, whereas a normal data practitioner will say I have rows and columns, in data science land the rows are called observations
00:44:33 Dave
and the columns are called the features, right? But here’s the thing with most data science algorithms: one row cannot refer to the previous row or the next row. Every piece of data used for that row or that observation has to be on the row. OK, so what I mean by that is think about this. If I’m using a data warehouse and
00:44:52 Dave
It’s got slowly changing dimension type 2 every time some attribute of the customer changes. I get a new row right now. Think about what that does to the data scientist. They have to take that data and pivot it so that all those additional rows go back to being one row. OK, so I did for three or four years.
00:45:13 Dave
Consulting where we would go in and optimize. Do performance tuning on data science algorithms and it’s very simple again. Thought exercise here for the data scientists.
00:45:24 Dave
You need to write an algorithm that does something, OK. Predicting sales, doesn’t matter what it is. Let’s just say all of your R code or your Python code or your SAS code is 500 lines of code. How much of that is actually manipulating data, right? And
00:45:40 Dave
You know thoughtfully. Most people will say it’s probably about 400 lines of the 500, and then the last 20% is the actual algorithm. And then you say, Yep, that’s about right. Anecdotally, and they say what are you doing right in that data manipulation? And when you look at the code, invariably people are taking the data warehouse and they’re re pivoting all the rows.
00:46:01 Dave
Back into one row.
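The re-pivoting Dave mentions, collapsing several SCD Type 2 versions of a customer back into a single observation, might look like this in plain Python. The column names and the derived features (`first_city`, `current_city`, `n_versions`) are invented for the example; which features matter depends entirely on the model.

```python
def one_row_per_customer(scd2_rows):
    """Collapse SCD Type 2 versions into one observation (row) per customer."""
    by_customer = {}
    # ISO date strings sort chronologically, so versions are visited oldest first.
    for row in sorted(scd2_rows, key=lambda r: r["start_date"]):
        obs = by_customer.setdefault(
            row["customer_id"],
            {"customer_id": row["customer_id"],
             "first_city": row["city"],
             "n_versions": 0},
        )
        obs["n_versions"] += 1
        obs["current_city"] = row["city"]  # the last version seen wins
    return list(by_customer.values())


# Hypothetical dimension rows: customer 123 appears three times.
scd2_rows = [
    {"customer_id": 123, "city": "Pittsburgh", "start_date": "2021-06-01"},
    {"customer_id": 123, "city": "Philadelphia", "start_date": "2020-01-01"},
    {"customer_id": 123, "city": "Boston", "start_date": "2022-03-15"},
    {"customer_id": 456, "city": "Denver", "start_date": "2019-07-04"},
]
observations = one_row_per_customer(scd2_rows)
print(observations)
```

Every row that comes out is self-contained, which is the property most modeling libraries expect of their input.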
00:46:02 Dave
Right, and then we ask data scientists, you know, which is, you know, the hottest paying job right now in IT. We say, you know, why do data scientists stay at a job for six months and then go on to another place? Is it because they’re getting more money? Maybe, but usually when you ask them, they’ll say, oh, I just hate the processes and the data at the last place I worked. Dig a little deeper
00:46:23 Dave
In that, and invariably it’s because they’re doing what they need to do off of the data warehouse, and it’s so frustrating to do pivot table queries when it’s not needed. Give somebody a structure that’s meant for analytics and not reporting.
00:46:39 Dave
And again, that’s kind of the notion of the data Lake and suddenly life becomes easier.
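Dave’s earlier remark in this answer, that a data lake structured on the time dimension lets you "rehydrate" Type 2 history on demand, can be illustrated with a small Python sketch. It assumes the lake keeps one full snapshot per load date, keyed by an ISO date string; that layout is an assumption for the example, not something specified in the conversation, and the sketch ignores customers who disappear and later reappear.

```python
def rehydrate_history(snapshots, customer_id):
    """Derive SCD Type 2 style versions for one customer from dated snapshots."""
    history = []
    for snap_date in sorted(snapshots):
        attrs = snapshots[snap_date].get(customer_id)
        if attrs is None:
            continue  # customer not present in this snapshot
        if history and history[-1]["attributes"] == attrs:
            continue  # nothing changed since the last version
        if history:
            history[-1]["end_date"] = snap_date  # expire the previous version
        history.append(
            {"attributes": attrs, "start_date": snap_date, "end_date": None}
        )
    return history


# Hypothetical daily snapshots of a customer table, keyed by load date.
snapshots = {
    "2021-01-01": {123: {"city": "Philadelphia"}},
    "2021-01-02": {123: {"city": "Philadelphia"}},
    "2021-01-03": {123: {"city": "Pittsburgh"}},
}
history = rehydrate_history(snapshots, 123)
print(history)
```

Because the versions are computed at read time, nothing has to be modeled or loaded up front; the cost is paid only when someone actually asks for the history.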
00:46:44 Frank
Interesting side.
00:46:44 Frank
So I know Andy is chomping at the bit to ask you a question, but
00:46:47 Andy
I am yeah, I’m just I’m I’m, you know I’m enjoying the the contrast that you’re making one. I have a couple of things but the one thing I would say to you is.
00:47:01 Andy
You know, I’ve been using some other tools that are available for ETL for years now. One called Business Intelligence Markup Language, or Biml.
00:47:10 Dave
Huh, yeah.
00:47:10 Andy
And I estimate it takes me about 2 hours to develop a package in SSIS that does an incremental load and this isn’t true for all projects. The project has to lend itself to to this, but I found that I I just needed to replicate that.
00:47:29 Andy
Pattern across hundreds of tables and using that math I did about 10 1/2 months worth of work in 3 1/2 days so.
00:47:37 Dave
I agree with you, Yep.
00:47:39 Andy
That just kind of, you know, there are some automation efforts out there. That’s not the only one. There are
00:47:45 Andy
other tools on the market that try to solve that same problem, and you’re right, the ETL taking 80% of the project is crazy unless you’re billing by the hour and then it’s kind of awesome. I’m just saying, but one of the things that I would like, and I don’t know the right way to ask you this and I’m not.
00:47:57 Frank
Ha ha ha ha.
00:48:05 Andy
I promise I’m not criticizing or anything, but I’m just curious how you would respond to you know, are you just proposing like another?
00:48:16 Andy
Modeling or, or you know or project methodology.
00:48:17 Andy
Overall.
00:48:21 Dave
Uh, yes, I think that’s a fair statement. So you’re right, a lot of it isn’t just the data modeling, it’s how we do, you know, data projects in general. So you know we all like to say, oh, we practice Scrum and Agile, and some will say Kanban. You know, I’m not sure.
00:48:39 Andy
Right?
00:48:41 Dave
That those things lend themselves to data projects either. Kanban’s probably at least head and shoulders better
00:48:50 Dave
than anything else out there, but honestly, most data projects should be driven as lean projects. And what I mean by that is, you know, and the problem is the economics aren’t there for data consultants, but the way it should work, in a perfect world, if there were no economics, we should go in and we should say to customers, you know.
00:49:09 Dave
We’re going to try some things here right? And we’re going to build an MVP and it’s going to take two weeks and we want you to evaluate it. If it looks like we’re moving in the right direction, continue with the project. If it doesn’t, you know it’s a fail fast mentality, right? And then.
00:49:22 Andy
Can I can I break you break in here?
00:49:25 Andy
I agree with everything you just said, and in fact I ran my projects like that for 15 years.
00:49:31 Dave
Yeah, and and most of us.
00:49:32 Andy
I called it Project, I called it Phase Zero. I’d go in and deliver in somewhere between, you know, a week and six weeks and the whole idea was.
00:49:35 Dave
Exactly.
00:49:42 Andy
Go ask the C-level or the person that was the customer: what is the first thing you look at when you get in in the morning to determine how’s my business doing?
00:49:52 Andy
And if I can get that metric, if I can get a big Dilbert button on the screen as the interface and it’s either red, green or yellow. I’m done with Project Zero, you know, Phase zero. At that point I’m I’m doing well, I’m not doing well, or things are bad. And then to expand that just a little bit.
00:50:13 Andy
When you drill down, it would show you different areas, right? These areas are doing great. You’re showing red. You know these areas. You’re doing great. These are in the yellow, but over here you’re all right.
00:50:21 Andy
Yeah, and I just. I think that’s that that I think you get those kinds of results using again the data Lake Ish. You know metaphor that you’re talking about. Is that fair or am I missing something?
00:50:35 Andy
But missing.
00:50:36 Dave
It’s exactly right, so again, you know even if you decide you know, after you find the what I call the Canonical, you know nugget of gold.
00:50:45 Dave
If you decide, hey, OK, we want to stick that, you know, that insight you found back into our data warehouse, then go ahead and do that. I don’t mind, OK, you know. It’s just we’ve proven the project at that point, right? The risk has been removed. The rest is we’re just doing IT stuff at that point, right? Operationalization, you know. Data governance.
00:50:58 Frank
Right?
00:51:05 Dave
All that kind of stuff and I’m fine with that, but at least we solved the business problem. And by the way, I 100% agree with you on how you stated, you know, as data scientists, like, we always talk about if we’re going to provide a system of insight, it needs to be exactly what you said. It’s
00:51:20 Dave
What a lot of people in the industry call System 1 thinking: I need to look at a dashboard and, without any kind of cognitive load, am I doing good? Am I doing bad? You know, in what areas am I doing good, so forth and so on. So the stoplight analogy, basically.
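The stoplight idea Andy and Dave are agreeing on, one red, yellow, or green status with a drill-down by area, is simple enough to sketch in Python. The thresholds, the area names, and the choice to make the overall light the worst of the area lights are all assumptions made for this example.

```python
def stoplight(value, red_below, yellow_below):
    """Map a metric to a color using two hypothetical thresholds."""
    if value < red_below:
        return "red"
    if value < yellow_below:
        return "yellow"
    return "green"


def dashboard(metrics_by_area, red_below, yellow_below):
    """Return (overall_color, color_by_area); overall is the worst area color."""
    by_area = {
        area: stoplight(value, red_below, yellow_below)
        for area, value in metrics_by_area.items()
    }
    severity = {"green": 0, "yellow": 1, "red": 2}
    overall = max(by_area.values(), key=severity.__getitem__, default="green")
    return overall, by_area


# Hypothetical metric: sales attainment against plan, per region.
attainment = {"east": 1.05, "west": 0.92, "south": 0.70}
overall, by_area = dashboard(attainment, red_below=0.80, yellow_below=0.95)
print(overall, by_area)
```

The hard part of a Phase Zero is agreeing on the metric and the thresholds with the business; once those exist, the display logic is trivial.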
00:51:35 Andy
Well, I’m not as good at graphics as Frank, so I’m just saying, I have to go with what I’ve got here to work with, Dave.
00:51:42 Andy
No, but I I absolutely love the focus is on the business because you know, a lot of times the focus can get on something else and it could be any number of things. But you know, a lot of people are drawn to new technology for new technologies sake. They may not know that that’s the tech to use to solve this particular problem.
00:52:03 Andy
I mean, the first time you use a particular technology, I don’t know that you can know, at least with, you know, 100% certainty, or if you can ever know with 100% certainty, this is where we start. But you can mitigate that by using the process you
00:52:18 Andy
described,
00:52:19 Andy
You know, way earlier when you talk about just drop the data into a data Lake or some structure like that, start beating on it with some of the tools that are available and and you will find that it. Well, actually the data will lead you in the direction that you should go. And you’ve said this. You said maybe the answer is.
00:52:39 Andy
After that, that it ends up in some data warehouse-ish structure. Maybe the answer is not.
00:52:46 Andy
So I I think we’re on the same page here. We’re trying to solve a problem for the. For the customer, we want to make something repeatable and sustainable. We don’t want to build tech just to build tech. You know, we want to help help customers achieve a a goal here.
00:53:02 Dave
Yep, and the data Lake thing it for me, it’s not really a technology as more of a mindset and a process. So if you want to think of it in people process technology lens, I view the data Lake is really a process. I’ve built data lakes for customers in SQL Server. It can be done. You just have to do a couple of.
00:53:20 Dave
Trade offs, but you know if your people know SQL, uhm, maybe that’s a you know, a fairly good choice. And honestly, you know, I know SQL better than Python or anything else. So when I’m doing like what I call you know, EDA in the industry, exploratory data analysis. I’m using SQL and and in the cloud you know you can actually do SQL.
00:53:41 Dave
Against data lakes relatively inexpensively, so you know it, it’s not like people have to learn an entirely new technology it you know and and.
00:53:44 Frank
That’s true.
00:53:54 Dave
Honestly, Andy, you’re probably when you write queries and you’re doing, you know, exploratory data analysis and you’re writing ETL. You probably do it just like me so you know it’s just.
00:54:05 Dave
More people kind of need to do that, if that makes sense.
00:54:08 Andy
Well, yeah, I mean we yeah I like to start off with profiling because you know good data profiling will take me to some of the answers right away. It’ll let me know hey, this is sparse. This is populated, so I maybe want to pick this part. You take this column apart or whatever and and see what my categories are and I.
00:54:27 Andy
Again, I think there’s a lot of similarities in in what we’re talking about, I’m.
00:54:33 Andy
But yeah, I don’t and if I’m building transactional data or a reporting.
00:54:40 Andy
Data Mart or data warehouse off of transactional data that has, you know, start dates and end dates. Rows that expire then sure I’m going to do something that looks like a type two dimension but and you’re right, it’s no fun to build all of that unless you’ve got a solid design pattern and you can drop it into a tool like.
00:55:01 Andy
Biml and have it do all the work, you know, in about a minute so.
00:55:05 Andy
But but I get where you’re going with it. You’re right, there’s and, and I don’t know anybody who likes the idea of, you know 80% of the data warehouse is not being read. Those rows in there that are just being sorted out by in SQL Server the you know the the the not not profile or what. Am I thinking the optimizer?
00:55:25 Andy
Query optimizer you know when you have rows in there that are just constantly being thrown out of results. That’s that’s no fun for everyone. But there are other models that’ll work in a relational data warehouse, and one that I found compelling in a number of instances was a data vault.
Yes.
00:55:43 Andy
And it solves some of those problems, but again, you know it’s it’s another way of acknowledging everything that you just said about moving to a data Lake instead. It’s because those problems exist in a star, so you know, I, I think there’s a there’s more overlap.
00:56:02 Andy
Here than there is a contrast, but I definitely was intrigued by listening to you explaining your thoughts on it. I think you’re right.
00:56:11 Andy
I do.
00:56:12 Dave
Yeah, I think ultimately we’re just deferring things until later in the process, right? Definitely the modeling we’re saying, defer that until we know we found you know that business, nugget of gold. And then again we’re deferring the ETL. Possibly we’re not even doing any ETL. And when I say that as I say, how do you not do any ETL? Well again.
00:56:32 Dave
You think about how I said that the first thing we do is we get data and then we sit down with, you know, the business analyst and we massage data. We profile data like you said, you know, we figure out how to get the data in a format that finds that nugget of gold. Well, everything I just described there is ETL.
00:56:49 Dave
It’s it’s ETL with a purpose.
00:56:52 Frank
That’s interesting.
00:56:52 Andy
Well, hopefully it’s all with a purpose, yeah?
00:56:54 Frank
Ha ha ha ha.
00:56:56 Dave
It’s it’s not.
00:56:57 Dave
It’s solving a business problem versus a data movement problem. Most people think of ETL as a data movement problem, right? It shouldn’t be. It should be, you know, I’m solving a business problem. I’m getting data into the format I need, you know, to find that nugget of gold.
00:57:12 Frank
No, I like the way you approach it. In the interest of shiny object syndrome.
00:57:17 Frank
What are your thoughts on Delta Lake and the Databricks is big on their Lake house platform.
00:57:24 Dave
I, I really think that is the future, I’m I’m not sure we’re there quite yet, but I envision the next 5 to 10 years of of you know what I’ll call you know prescriptive analytics to be we, you know, more people start to understand what a data Lake is. And by the way, it’s not just a place I archive my CSV after the ETL is done.
00:57:45 Dave
Is it?
00:57:46 Andy
Yeah.
00:57:47 Dave
It’s a system of
00:57:48 Dave
insight, OK? And so once people, you know, start to do that, then they’ll come to the realization. You know, the big problem with data lakes versus data warehouses, in my opinion: customers will say, when do I need a data warehouse, smart guy? You say I don’t need it.
00:58:06 Dave
At what point do I need it? OK, usually it comes down to when performance is no longer acceptable in the data lake. So remember, with the data lake, if you’re doing it the way most people do data lakes, it’s, you know, a bunch of files, hopefully optimized files, with a compute engine that’s somewhere else.
00:58:22 Dave
Versus a typical data warehouse in Oracle or SQL Server, right? The data persistence engine, the storage engine, is, you know, tightly coupled with the query engine, and they’re meant to work together, and columnstore indexes are, you know, the cat’s meow, and all that kind of stuff. You’re not going to have that with a data lake, right? So generally performance
00:58:42 Dave
is not going to be, you know, so great. So, you know, maybe the data warehouse is one point in the decision matrix, to say, you know, the data lake doesn’t perform anymore, I gotta put it in a data warehouse. But, you know, to your point, now with this stuff Databricks has with the concept of the Delta Lake or the, you know, we call it the lakehouse.
00:59:00 Dave
Again, maybe that’s going away. Maybe you can say I can do SCD Type 2 right out of my data lake, so maybe I don’t need, you know, a bespoke, or not a bespoke, a, you know, commercial data warehousing tool, you know, optimized for star schemas. Maybe I don’t need that anymore.
00:59:21 Dave
Uhm, and it’s possible.
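For readers unfamiliar with SCD Type 2 (slowly changing dimension, type 2), the idea is to keep history by closing out the old version of a row and appending a new one, rather than overwriting in place. The sketch below shows that logic in plain Python over an in-memory list. It is a minimal illustration under stated assumptions, not how Delta Lake or any warehouse implements it: `CustomerRow`, `apply_scd2`, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class CustomerRow:
    """One version of a customer record (hypothetical dimension row)."""
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "still the current version"

    @property
    def is_current(self) -> bool:
        return self.valid_to is None


def apply_scd2(history: List[CustomerRow], customer_id: int,
               new_city: str, as_of: date) -> List[CustomerRow]:
    """Return a new history list with an SCD Type 2 change applied."""
    result = []
    needs_new_version = True  # stays True for brand-new customers as well
    for row in history:
        if row.customer_id == customer_id and row.is_current:
            if row.city == new_city:
                needs_new_version = False  # nothing changed, keep the row as-is
                result.append(row)
            else:
                # Close out the old version instead of overwriting it.
                result.append(replace(row, valid_to=as_of))
        else:
            result.append(row)
    if needs_new_version:
        result.append(CustomerRow(customer_id, new_city, valid_from=as_of))
    return result


# Hypothetical example: one customer moves, so history gains a second version.
history = [CustomerRow(1, "Philadelphia", date(2020, 1, 1))]
history = apply_scd2(history, 1, "Richmond", date(2021, 6, 1))
for row in history:
    print(row)
```

In a lakehouse setting the same close-and-append pattern would typically be expressed as a merge over table files rather than a Python loop; the loop here is only meant to make the row-versioning rule easy to see.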
00:59:24 Frank
Interesting.
00:59:26 Frank
Interesting, we could probably talk for hours on this. And, uh, we’d love to have you back, uh, but.
00:59:26 Andy
Great stuff.
00:59:28 Andy
Yes.
00:59:34 Frank
Uh, we have a bunch of kind of pre-canned questions.
00:59:37 Dave
OK.
00:59:38 Frank
Uh, which I think some of these are going to be fun. Some of these are going to be interesting.
00:59:45 Frank
So the first question is how did you find your way into data? Did data find you, or did you find data?
00:59:53 Dave
So again, I was an intern doing, you know, basically statistics against, you know, some datasets for, you know, a company that makes candy, so kind of data found me. And then I was a history major, honestly. I say, yeah, and I hated it, but it was a sunk cost fallacy.
01:00:07 Frank
Really.
01:00:12 Dave
I was in two years and I didn’t really feel like changing.
01:00:16 Dave
So I went to QVC after I graduated, and I just wanted a help desk job. And, you know, I looked over one day and, you know, this woman was having problems with a SQL query, and I said, oh, here’s what you do. It was in Teradata, and she said, you know SQL? I said, what is the sequel of which you speak? ’Cause I didn’t know what it was. I
01:00:34 Dave
just knew how to do it, and.
01:00:35 Frank
You didn’t see the first movie.
01:00:37 Frank
So how could you see?
01:00:37 Dave
Yeah.
01:00:39 Dave
And that’s how it worked, and it was the same thing at QVC. You know, at the time they did interesting things with data, and it was a lot of, you know, what should we advertise, when, you know, what time of the day should we advertise certain products, and things like that.
01:00:59 Dave
And all that was done with analytics and statistics and just, you know, very insightful data science type things. Most people think data scientists, you know, they’re doing, you know, convolutional neural network programming and learning about backpropagation through the network. And no, it’s not that, or, you know, I’m creating regression models. A lot of times
01:01:19 Dave
data scientists are just, you know, they’re basically asking
01:01:22 Dave
questions that honestly pretty much, you know, anybody who understands how to query data intelligently can do, right? You know, that’s what business people want in 2021. You know, they don’t want another report that says show me, you know, quarter-over-quarter sales or quarter-to-quarter sales. They want data people that can come in and say, you know, what do I do next, right?
01:01:42 Dave
You know, what’s the next best thing I should be doing, you know, or, you know, how should I structure my next brand campaign, where I’ve done all of these things to convert leads to sales? You know, what are the things that are actually working? And that’s what they need. That’s what they need.
01:02:00 Dave
Real quick.
01:02:00 Andy
So you mentioned QVC, were you down in Roanoke? I know they have a big operation there.
01:02:06 Dave
No, outside of West Chester, Pennsylvania.
01:02:08 Andy
OK, OK, very cool.
01:02:11 Andy
So our next question is, what’s your favorite part of your current gig?
01:02:17 Dave
So, like, I’m a boomerang at Microsoft. I worked at Microsoft before. Um, I just like that we have a lot of customers that come in, and like I said, we talk to them about different things. Earlier this week I had an entire C-suite come in
01:02:30 Dave
to figure out how to monetize their data. That’s fascinating. And then the other end of the spectrum is, you know, another day this week, I can’t remember what they were even on, we did a hackathon on, like, how can we do interesting, you know, analytics with our data. So we get to do all that kind of stuff.
01:02:50 Dave
And like Frank mentioned at the beginning, generally we’re constrained: at most we can work is three days. So it’s nice. We can get in there, you know, cause some grief, and then we get to walk away and do it again with
01:03:01 Dave
another. Lot of fun.
01:03:03 Frank
I didn’t know you were a boomerang, that’s cool.
01:03:06 Dave
Yes Yep.
01:03:07 Frank
Awesome, our third question is, and for those who don’t know, a boomerang is when you leave Microsoft and come back.
01:03:14 Dave
Yes.
01:03:14 Frank
Yeah, we have a couple of complete-this-sentence questions. When I’m not working, I enjoy blank.
01:03:23 Dave
I go to the beach a lot so I live close to the beach. That’s my thing.
01:03:27 Frank
Cool.
01:03:29 Andy
So our next one is, I think, the coolest thing in technology today is blank.
01:03:36 Dave
Blockchain and cryptocurrency.
01:03:39 Frank
Interesting.
01:03:40 Dave
And the reason why is, I think you’re going to find in the future, seriously, we always talk about digital disruption. It is the digital disruptor. So I mean, just think of things like, you know, right now, if you’re on Facebook, right? They’re mining all of your data. You’re the product, right? And they’re selling you.
01:04:01 Dave
And if we just change the paradigm around and we have all of your data out in the blockchain, now you have the control. You can do with it what you want, right? You could potentially monetize that for your own liking, and that’s just one example, right? All this stuff you read about right now with NFTs, non-fungible tokens, it’s kind of laughable, frankly, but you can see there’s a future in all of this.
01:04:21 Dave
And like it’s it’s going to be interesting.
01:04:25 Frank
Indeed, indeed, our next question is another complete this sentence. I look forward to the day when I can use technology to blank.
01:04:38 Dave
Ah, that’s a good one.
01:04:41 Dave
Just be able to solve data problems faster. I mean, we’re getting there, right? You know, if you’re going to use Biml to build, you know, SCD Type 2, that’s much better than it was 20 years ago. I think at some point we’re going to get there quickly, where, you know, we have data and we can do analytics on it in real time. And hopefully it’ll be soon.
01:05:02 Andy
Interesting, so our question six is share something different about yourself, but remember it’s a family podcast.
01:05:10 Dave
I.
01:05:11 Dave
I don’t really have anything that’s different about myself. I mean, I just have a wife and two kids. You know that’s it. I’m really a very plain Jane individual.
01:05:23 Frank
Now, plus, you work at the MTC. Like, I was telling my former... So every MTC has what they call a director that kind of has
01:05:25 Andy
Well, there’s that yeah.
01:05:32 Frank
fairly good autonomy over what that particular center does, and I was telling, like,
01:05:40 Frank
uh, when I said I was leaving, my director was like, oh God, I gotta find your replacement now.
01:05:49 Frank
And, you know, ’cause, I mean, you need someone with good technical understanding, good customer presence, and someone who’s willing and able to, you know, speak all
01:05:59 Frank
day, which these engagements generally run, you know, from like 9:00 AM, when they were in person, 9:00 AM till like 5:00 PM or 4:00 PM. And virtually, I mean, it’s basically, you know, one of the things I realized since taking the new job is like, wait, I’m not chained to my computer for like 8-9 hours a day, like.
01:06:20 Frank
Uh, so as I said, it’s like, yeah, you know, it’s kind of like the Navy SEALs, you know, like, not everyone, there are people who can do it, maybe don’t want
01:06:29 Frank
to do it, and it’s not for everyone, you know, so I would say just being in the MTC is kind of a big deal, I think.
01:06:38 Dave
It is, and that’s the reason I came back to Microsoft is because we have a lot of autonomy to do things and to say things that possibly other people can’t, and I really kind of like that autonomy, absolutely.
01:06:51 Frank
Yeah, we’re not part of the account teams, and we’re not measured on quota in the same way. I mean, there’s probably some measurement, but, I mean, it’s definitely a unique role, and I think it’s something that, you know, because of the natural kind of shift in customer
01:07:08 Frank
Account teams like you know the the MTC face is probably the one that the customers get to know best.
01:07:16 Frank
Uhm, you know. And UM, that’s an interesting kind of story in and of itself. But, uh, moving on ’cause we’re we’re kind of on the long side. But you know what, Andy?
01:07:16 Dave
Yes.
01:07:26 Frank
It’s season 5.
01:07:27 Frank
Premiere like we can, we can kind of.
01:07:29 Andy
Do this. This has been an excellent show too, so.
01:07:30 Frank
Ah.
01:07:32 Frank
Oh, it totally has, it totally has, and, um, you know, before we get to the last two questions, I’ll definitely say, Dave, you’re welcome back anytime, yeah?
01:07:41 Dave
Well, thank you. I would love to do that. That would be awesome.
01:07:43 Frank
Awesome.
01:07:44 Frank
And, speaking of which, where can people learn more about you?
01:07:48 Dave
I’m on LinkedIn. Uh, and I have a website. I doubt more than three people go to it, but it’s davewentzel.com.
01:07:56 Frank
Cool, well, we’ll get you at least six people. How about that?
01:08:02 Dave
So then my mother and who are the other five people now?
01:08:03 Dave
Building
01:08:06 Frank
My mom. I think Andy’s mom.
01:08:08 Andy
Sure, yeah, my mom will check it out.
01:08:14 Frank
Uh, and last question. Audible sponsors data driven. Can you recommend a good book?
01:08:21 Dave
Yeah, from from.
01:08:24 Dave
So a book that I reread every five years is Shantaram, if you’ve ever heard of that. It’s spelled just like it sounds. It’s a fiction book, but actually, I think it’s a partially true story. I won’t bore you with all the details, but basically a guy goes to prison
01:08:30 Frank
No.
01:08:41 Dave
For murdering somebody in Australia, he escapes, goes to India.
01:08:45 Dave
Uh, he becomes a doctor, you know, helps the Indian people, ends up going back to jail. He’s tortured almost to death. All kinds of interesting things. It’s a lot of plot twists and turns. Very fascinating story, very long story as well. So that’s fiction. For IT people, like, everybody’s probably read The Phoenix Project by now.
01:09:06 Dave
If you haven’t, definitely read that one, but the one that I find far more insightful than The Phoenix Project, and every data person should
01:09:12 Dave
read, or actually I advise the audiobook, which is the Audible point, is The Goal. So The Goal is by, I think it’s pronounced, Eliyahu Goldratt, and it’s basically, if you’ve read The Phoenix Project, it’ll feel like a rip-off, but he actually wrote it 20 years before The Phoenix Project. And if you’re a software developer,
01:09:29 Frank
Oh wow.
01:09:32 Dave
the insights that you gain out of this, and it has nothing to do with software. It’s the journey of a guy that runs a manufacturing plant. So you’re probably thinking to yourself, how does that, you know, involve data? It doesn’t, but the allegories that he tells on how to solve problems in
01:09:48 Dave
projects, if you’re in software development, I can’t recommend that enough. If you liked The Phoenix Project, you will love this book, so.
01:09:56 Frank
Interesting, interesting. Well, if you don’t have Audible already, you can go to thedatadrivenbook.com and you’ll be routed to the
01:10:08 Frank
Audible page, and you’ll be able to get one free book on us. And if you decide to become a subscriber after that, Audible will kick us back enough to buy a cup of coffee at Starbucks.
01:10:20 Andy
There we go.
01:10:22 Frank
Nice. And any last words?
01:10:25 Dave
No, this was great. Thank you, really appreciate this. I’ve been listening to your podcast for years. You guys are great, and I think I’ve said this before, I’ve been following Andy for, feels like, probably 20 years. He’s definitely a bigwig in the SQL Server industry, and I came up through that, you know, learning pretty much everything I know.
01:10:27 Frank
Awesome.
01:10:32 Andy
Thanks.
01:10:45 Frank
I keep telling Andy, like, on the MTC internal Teams chats, which I know you’re on, the Northeast data and AI thing, which I do miss, I miss that thread. Anyway, Andy’s name comes up in, like, hushed and reverent tones quite a bit, and he doesn’t believe me, like.
01:10:45 Andy
Every time somebody says.
01:11:01 Andy
Oh, absolutely, I’m like no.
01:11:04 Frank
I’ve screen shotted like things I’m like. They’re talking about you.
01:11:05 Dave
No, absolutely no, I believe.
01:11:07 Andy
So it’s just you know my response to that Frank. I repeated it over and over again. It’s like I’ve been trapped in here with me for 57 years and I am just not that impressed. I’m just saying it just.
Yeah.
01:11:17 Andy
Haha.
01:11:19 Frank
It was funny. I think Dave got the joke somebody posted. I think Dave knows where I’m going with this is that I somebody said oh Andy Andy.
01:11:22 Dave
Yeah.
01:11:29 Frank
And, you know, basically, check out this book by Andy Leonard. Andy Leonard’s stuff is great.
01:11:32 Frank
And I wrote back, yeah, he has a podcast too, and I put a link, and then I put, with a :), but his co-host is a bit of a jerk.
01:11:40 Andy
2nd.
01:11:42 Andy
True, Mike.
Is awesome.
01:11:45 Andy
He’s not a jerk at all.
01:11:47 Frank
But the first person to laugh at it was Dave.
01:11:49 Dave
Yes, I got the joke right away.
01:11:53 Frank
I don’t think the original poster did, I’m I’m.
01:11:53 Andy
Y’all are awesome.
01:11:55 Dave
They did not.
01:11:55 Frank
Not sure about that.
01:12:00 Frank
Well, awesome. Dave, it’s always a pleasure. Um, any parting words, Andy?
01:12:04 Andy
Thank you Dave so much for this. I love the back and forth. I’d love to work with you on something man, that’d be fun.
01:12:12 Dave
Yeah, let’s do it.
01:12:12 BAILey
Yeah.
01:12:13 Frank
Awesome awesome.
01:12:15 Frank
All right, and I’ll let the nice British lady finish the show.
01:12:19 BAILey
Thanks for listening to Data Driven, and thank you for making this show a success. You know, Frank and Andy won’t admit this very often, but they weren’t sure that the show was going to last three seasons.
01:12:32 BAILey
So here’s a heartfelt thank you from an AI who would be out of work if it were not for you.
01:12:37 BAILey
I don’t get sentimental very often, so soak it up while it lasts. By the way, we know you’re busy and we appreciate you listening to our podcast. But we have a favor to ask. Please rate and review our podcast on iTunes, Amazon Music, Stitcher, or wherever you subscribe to us.
01:12:56 BAILey
You have subscribed to us, haven’t you? Having high ratings and reviews helps us improve the quality of our show and rank us more favorably with the search algorithms.
01:13:07 BAILey
That means more people listen to us, spreading the joy. And can’t the world use a little more joy these days?
01:13:15 BAILey
Now go do your part to make the world just a little better and be sure to rate and review the show.