Skip to content
Exploring Machine Learning, AI, and Data Science

Data Warehousing Deep Dive

In this Deep Dive, Frank and Andy delve into the world of Data Warehousing, what is it and do they know things? Let’s find out!

Frank also shares that he has a new role at Microsoft.

AI Generated Transcription

Hello and welcome to data driven,

the podcast where we explore the emerging field of data

science.

We bring the best minds in data,

software, engineering, machine learning and artificial intelligence.

Now hear your hosts Frank Lavigna and Andy Leonard.

Hello and welcome back to data driven.

The podcast where we explore the emerging fields of data

science machine learning an artificial intelligence.

If you like to think of data as the new

oil then you could consider us like Car Talk.

However, we can’t go on a road trip because of

the Corona virus lock down.

So it’s just Andy and I kind of stuck at

home respectively.

And thanks to the Magic of Technology we can be

on the show at the same time.

And, uh, how’s it going?

Andy? It’s going

well, Frank, how are you doing?

Good, good, uh, you’ll

probably hear my kids in the background.

We

will, and you know what Frank,

I think it’s fine. You know I’m going to.

I understand why you said the word stuck with you

and I work remotely an awful lot.

We usually record like this.

There’s there’s less in the background.

It’s your place most of the time,

but you have couple of young boys there and you

need to be in the room with them when mom

who’s also working from home is you know is doing

some of her work so kudos to you to both

of you for finding a way to manage this.

Everybody’s going through these sorts of things and I’m sure

that none of our listeners will mine here in your

sons play in

the background or hopefully won’t start fighting so that’s Well,

I asked, I asked if they do I think a

lot of folks

can relate though. Yeah, oh absolutely,

absolutely

so. We’re recording us on April 16th.

We Speaking of kids, we had your son on which

if the order of recording goes the way I planted

in my head.

That would have been released last week.

And Uh, which I thought was a pretty good,

uh, discussion on. How stem is taught?

How stemmers perceived by quota quote policymakers?

And how the actuality of it is?

And some of the interesting stuff your son is doing

with Raspberry Pi and stuff like that.

Yeah,

I was a I was first I was very proud

of him.

You know the work that he’s doing and he’s he’s

had his his hands in machine learning for really a

couple of three years.

Now I want to say he was 14 and I

came into his room.

You know just checking on say something or something I

saw.

A Mario Brothers playing in the background.

Like what do you think you know he was?

He he had done his school work?

He was home schooled at the time he done his

school work.

So you know what he wants.

But um, later talking to him about it,

he said he actually came and got me and he

said,

OK, dad, it took, you know with I think it

was like 6.

You know neural nodes. Here he was able to,

Mario was able to figure this out and something like

4 hours or something you know later he said I

wonder what it would be if I added a note.

I wonder what that would do to it and I’m

kind of sitting there with my mouth hanging open.

Going show dad more about that nice,

but he’s been doing it for awhile.

I know your kids are interested in the same thing.

They’re younger Stevie 17 now and you know.

and I know that your sons are coming up in

this.

In this age as well,

they are mentioned Mark Tapatio in that show as he

referred to digital natives.

They are digital natives and yeah,

that comes with some pretty interesting stuff.

So I’m just glad we were able to record that

show as he gets ready for his first sequel Saturday

presentation here on that topic.

So and that’s all assuming that we were able to

overcome the technical glitch.

We we learned something, Frank,

I’d learn something. Yeah,

it’s not a glitch. If you learn something.

So if if for some reason.

The you know what hit the fan then that episode

will be recorded at a future date,

so we’ll see it will,

but we’ve got. We’ve got

a great topic today. You and I’ve been bad this

around I want.

I know it’s been several weeks.

It may have been a couple of months.

We’ve been talking about doing this.

Right absolutely, and part of what motivates this?

An based on the release schedule that I anticipate this

will have already happened.

I’m changing jobs at Microsoft Woo.

At your new job. I will be the data and

the AI technology architect at the Reston MTC or Microsoft

Technology Center,

so congratulations. Thank you very much.

It’s an honor to join such a prestigious team.

If you’re not familiar with what the MTC is.

MTC is a Microsoft Technology Center.

There is about 80 of them around the world,

and they basically are meant to provide specific experiences.

Ends well as architecture design guidance for customers around the

world and it’s an honor to be kind of in

that team.

It’s very rarely does an opening happen in an empty,

so when one opened up in my neck of the

Woods is like I have to take it.

I have to at least try.

Right right? So Fortunately I am super excited.

And Uhm, ’cause That’s what we say at Microsoft were

super excited and it’s a great team.

Great stuff that they do.

They do a lot of work with the community.

They do a lot of work with customers.

It’s just an awesome gig.

I’m really looking forward to it and.

Yeah, I’m really excited about

it. Congratulations brother. That’s a great thing and I think

you’re perfect for that job.

I know, I know, someone else in that job at

an MTC in the northeast.

And it’s it’s kind of a rare breed of person

that has to walk into that role because.

It optimally you have a smattering of exposure to all

whole slew of enterprise architecture,

an both both you and this other individual that I

know fit that mold.

You’ve got programming experience, software development experience,

and you also have data experience,

and it’s just rare to be good at both of

those things I know,

but I know you’re good at it,

and I know my other friend is good at this

as well,

so I just I just think it’s going to be

a great fit for you,

Frank. I’m I’m excited, you got

it. Thank you. Thank you very much.

So with that, one of the things that I’ve been

ramping up on in anticipation for this job or whatever

opportunity I was going to go to next.

I was learning more about the quote Unquote traditional side

of the data world,

which let me move kind of explain my little worldview,

which is twisted and as weird as it may be,

it might actually be right.

I see this alot in my current current or old

roll current as of April 16th.

Role is that we have data in the I cloud

solution architects,

but there’s a very clear line of demarcation between the

data scientist.

Part of the data in the icy essays and the

sequel veterans side of things.

So I actually had a call this morning where it

was.

It was very, very much laid bare ’cause we were

talking about that and that there’s essentially kind of two

types of data in AI folks at Microsoft for sure,

probably everywhere else, to you have the RDBMS folks.

These folks have been doing sequel since it was aside

based joint venture,

right, right? That’s their world.

Ann, you have kind of the big data open source

kind of tooling world,

right? The folks that are more comfortable in spark or

Hadoop or with the crazy statistics and math around machine

learning and AI,

right? You kind of have those two.

Rarely do

the two. Rarely do you

have a person who’s. Comfort,

Rible and happy in both.

I am aiming to be happy and comfortable in both.

Obviously I’m more in the data science kind of world.

And part of my part of what I see is

the opportunity in this new role is to grow into

the kind of the sequel.

RDBMS traditional database world. That makes

sense. They are no. It makes perfect sense.

and I mean coming at coming at this from,

you know, we as we shared in each show the

past few days that we’ve recorded.

We’ve known each other for like 15 years.

And most of that time you were a professional software

developer.

You are a Microsoft MVP in.

I forget which discipline it was.

Frank, I know it was software development related.

I

think

the world has forgotten that this discipline never existed.

tablet

PC. Tablet PC right? OK and you did an awful

lot in there and I know there’s a lot of

people out there working in what that evolved into mobile.

That benefited from the blog post you shared,

solutions. You shared an all of that,

but yeah, that whole mobile thing turned out not to

be such a,

you know, it was a trend it and it evolved.

To what it is now,

and having that experience, I think you’re going to find

that that plays well into kind of backfilling like you

said,

or filling this other bucket that you want to go

after,

which is traditional T SQL an.

I know, I know, from experience and dabbling in machine

learning and AI.

I’m on the opposite side of the fence,

although I’m not really that good at,

you know. Let’s say like DBA level T SQL,

but I you know I can hold my own in

there,

but if we are. If we’re selling tuning performance tuning

to a client.

I may be involved in the project,

but rarely am I the person actually performing the tuning.

There are lots of people out there that we subcontract

as a as a consulting firm,

enterprise data and Analytics. We bring others in who are

better at that much better at that than I am,

and we have people on the team who are much

better than I am as well,

but it’s. I think your experiences has his set you

up really well.

To make this transition and it will like everything else

right?

We talked about this in the other shows.

It takes time. And it’s frustrating,

but I think you’re well positioned to pick up this

skill as fast or faster than almost anyone else I

know just well,

thank you now you know part of it,

I’m not. I’m not completely like naive to the ways

of RDBMS.

I took sequel in college,

database design and college and my professor worked with card

in date.

So you know, like. You know,

I’m only two degrees of Kevin Bacon away.

From the founders of the theory,

so you know that’s going for me,

but I never really got into just kind of the

nuts and bolts of it,

and I’m not. I’m not concerned about that.

I’m actually fascinated about it,

because it’s just another way to solve the same problem.

Absolutely. Ultimately, at the end of the day,

you’re moving bits around, and it’s a question.

What’s your philosophy? Or obviously,

RDBMS has a philosophy and it you know I’m not

knocking.

I mean, it worked well for 5060 years.

But now we live in a world where there’s a

lot more unstructured data.

And how do you deal with that?

And how do you deal with it now that you’re

not making assumptions about spinning

disks, right? Right there’s a whole.

Kevin hazard.

’cause we haven’t hazard.

Yeah who talked about that on our show that yeah

there still leaves it’s 2020 and I would say still

most of our code is designed for that age of

the heads picking up seeking a sector an reading data

and then picking up again.

So there’s there’s a whole new opportunity where obviously relational

databases are going to still matter,

but it’s just one of many tool sets.

In fact, one of the things that I learned when

I was doing start up with angelism for Microsoft was.

You know, having debates with startup founders who UR?

I will say I put them in a hipster category,

right? I worked with when you work with startups runs

the gamut between really like I mean like that this

person is going to be the next Steve Jobs to

this person is kind of like I think they’re living

in their parents basement,

but rather than seeing unemployed they haven’t so somewhere in

the middle you kind of what I have.

The hipster ones where they learned code because of make

the startup.

Now that’s not nothing wrong with that,

but do you think that you’re an expert in all

things technology because you learn to code?

Right, you know, and then you go to a person

that is supposed to help you take your stuff to

the next level and kind of talk down to them.

So right context this conversation.

So they were basically lamenting the fact that they wanted

to.

They wanted to have the reliability of.

Up an RDBMS, but they wanted to do it in

a note SQL type of environment.

An I was like that’s

a fair. You know that’s a fair thing to want.

I’m just all cards on the table.

Approaching that, architecturally, that’s that’s not an unreasonable request.

But unless and until you get into the engineering part

of it.

And that’s where you start to see that you just

can’t have everything that you want.

I mean, there’s no single do it all type application,

everything, every software application ever.

And I’m going to maintain,

probably forever. They’re going to be applications.

There’s going to be some spot that I define as

a corner.

It’s something that the application or server or what have

you doesn’t do well.

And what you’ll often find is there’s some other application

out there that’s available,

or some other platform, and it will do that part

well.

But again, that also has its corners,

and So what you’re trading is pain.

The nicest way possible. You’re picking your picking your poison,

picking your pain. What is it that you want to

fight?

And it depends on. You know.

Relational databases have their pain points.

No sequel. It turns out a lot of companies have

learned this over the past few years.

Also has its pain points as well so.

You can’t always get what you want,

but if you try sometimes you might get get what

you need.

Awesome. So, so I

mean part of it is,

you know, sometimes whether it’s technology,

anything else, you have, kind of these dueling philosophes an

there is a point where they just won’t meet just

because of.

They’re they’re kind of philosophically opposed an you’re right,

you have to kind of pick which one you want

to have over the other.

And there’s cause and effect to that.

So with that kind of deep philosophical you were data

philosopher,

so that’s good. So I wanted to talk to you

about.

We want to do a deep dive.

It’s not officially a deep dive until I have fun

with my soundboard there.

Into data warehousing, what is data warehousing?

Where did it start? I’ll channel A little bit of

bojack horseman.

What is data warehousing? What do they know?

Do they know things? Let’s find out.

Well, yeah, data

warehousing in my opinion in my experience is really this

idea of of collecting data from all over different places

and placing it into a centralized location.

Now there’s some distinctions and there’s other scientific answers to

that question,

and you can actually build something that today is not

considered a technically a data warehouse.

You can gather all of the information that spread across

the enterprise in different places.

Into what’s now called an operational data store.

Ann, it’s not totally unlike a data warehouse.

In fact, I think the Euler diagrams have quite a

bit of overlap for that,

at least if we if we kind of improve or

add to the word data warehouse or the term data

warehouse with relational data warehousing,

there’s a lot of overlap between relational data warehousing and

operational data store.

Wanna confuse that really with our listeners?

But I just want to make you aware if you

hear oh DS or DW or EW.

It could be that they’re talking about largely the same

thing.

And when you think about like you think about supply

chain management,

which is a topic on everyone’s mind these days as

we’re talking about the economic impact of the pandemic.

Supply chains are where really where really way more important

than we realize and it’s kind of like oxygen or

money.

You don’t recognize how important it is until you don’t

have enough.

An supply chains are like this and you could think

of a data warehouse.

In that terminology. The analogy holds for quite a bit,

and I’m going. I’m just going to use Walmart and

Amazon as you know,

is kind of examples of this.

They both have these distribution centers and they have these

network set up all over the United States,

probably all over the world and its places where the

goods come from the source and they’re trucked into.

You know, they may be collected at other points along

the way.

But they’re trucked into these large,

physically large warehouses and then stocked.

And then from there there actually shipped out to in

the case of Amazon.

Usually there handed off to some delivery service.

In the case of Walmart,

they’re placed on other Walmart trucks that are shipped to

the stores.

The actual brick and mortar stores and that warehouse in

the middle.

That distribution center. That’s what I think of when I

think of data warehouses.

I think of the the electronic equivalent of that because

you know,

there’s all of these. You’ll see especially at what I

consider an EDW enterprise data warehouse.

You’ve got a collection of companies that have been acquired

in mergers and acquisitions,

and they’re looking at. I want to get all of

their data.

But they have and want to bring that into this

one location,

and that I want it there for a number of

reasons.

But one of the big reasons is so I can

query that data and I can learn how my entire

enterprises performing.

How’s it working? And. And in that,

and now if I apply that Walmart Amazon analogy to

that to the data there,

they reports that come out of querying a data warehouse.

Those are like the end user customers for Amazon,

say and like the Walmart stores in that other analogy.

So it’s and it really turns out Frank.

It’s always good to to answer the question what’s the

problem we’re trying to solve?

And here’s the problem. Most people are trying to solve

with an enterprise data warehouse.

They want to know how’s my business doing today.

And if they can answer that question,

that’s going to be answered by a report of some

kind or an analytics dashboard.

Still a report. In my mind,

it’s in the best analogy is something we’re used to.

It’s a stoplight. All we green everything is good or

good enough.

Are we yellow? Some things are in trouble,

and most enterprise data warehouses during the pandemic are going

to be showing yellow or some red.

We’re in trouble. It’s going to be very few showing

green and but that’s the kind of thing that you

want.

You want to know at the very highest level are

we good?

Are we bad or are we somewhere in the middle?

And of course, you want to always be able to

drill in.

Especially things are in the warning state or in the

bad state.

You definitely want to drill in.

See what the problem is.

Pick up more information about that,

but at the very tippy top.

You want to be able to answer that question.

How are we doing

today? So I have two questions.

Uhm one is you mentioned that with mergers and acquisitions

you want to have one place for all the data

to the live?

Sure how is how is it data warehouse different than

a data Lake?

That’s

a That’s a really good question,

so you can think about data lakes being a similar

collection point.

So from a purely functional standpoint,

my experience with data lakes I’ll share is limited.

So that you know to take this with a grain

of salt,

but I consider data lakes to be more of just

a a collection of the data.

A copy of the data in its raw or raw

ish state,

and when I draw a line between what I call

a data warehouse and what I call an operational data

store there more on that side of the operational data

store. There more of a copy of the data from

wherever it’s come from.

Now I know for a fact working with people who

work with data lakes that they see that a little

bit differently.

An I’m not going to speak for them.

But I just know that my definition is there definition

differs a little bit because you can achieve an awful

lot of what an enterprise data where warehouse does for

you in delivering those results that answer the question.

How are we doing today?

You can achieve a lot of that just going by

querying directly querying the data in a data Lake and

the same can be said for an 0 DS so

don’t. Don’t don’t you know,

trying to mix and match here and draw some distinctions.

I wish I could share with you the picture in

my mind.

But maybe like to show notes,

so do our best. So what is that?

I’m one

of the things is that someone on my side where

I kind of see.

I see things for more.

The data Lake POV just want to place the for

stuff to land weather that’s coming,

streaming data, iot or from various types of data stores.

You just want to place to put this stuff absolutely

so one of the things I’ve seen in architectures is

that at least kind of in more modern architectures,

because historically I think data warehouses have been used like

data lakes,

but one of the things that at least data warehouses.

Half they still assume a schema,

right? Like you still have to have a schema right

for it to be in a data warehouse so,

but so if you have completely unstructured data.

Uh, you still dealing with the primacy of the schema

rise at that.

Failing, it

is an you know what I was going to share

too,

is just thinking through this a little bit more.

You know, one of the distinctions that I’ll make from

certainly from a relational data warehouse.

Between that in an operational data store is,

you know, there’s a little bit of hybrid here,

and let’s just kind of.

I’m going to take Cody S out of the picture

for just a minute.

I’m going to talk about staging from an extract transform

and load or ETL.

Or data engineering data integration,

whatever you want to call it,

our perspective. Whatever the format that date is in,

when it starts its track into into whatever we’re going

to end up with with an EDW.

When it begins that the very first thing I like

to do,

and that is an area where I specialize is going

to get a copy of that data into some central

location.

Now it could be that that central location is is

a data Lake Ann,

often in modern data warehouses they are.

It goes into a data Lake,

but it could also be a relational database,

or it could be a just a centralized collection of

files.

And there’s a lot of things that you can define

here,

but there’s this concept of it at the very first

step,

collecting and staging. It’s all here now.

And I still do that when I’m designing ETL solutions

for customers.

I get it to that first step that first place

first and I want to get it all there and

I mostly deal still with relational data warehouses that go

to tabular models or cubes.

You know, some sort of analytics solution and you know

when I do that,

Frank if I’m, let’s say I want to load data

from files into a SQL Server.

I load that data, it’s just as as constraint free

as possible,

so if it exists in the file I want to

read whatever is in that field and bring it into

a table.

What we would call a heap in.

In Relational Database management that if there’s any constraint on

it,

it’s some sort of identity column.

That’s just, you know, counting the rows.

Basically enumerating the Rosen that’s there just to make sure

nothing.

No row is identical to any other Rd.

And that would be the only constraint I would place

on that.

And I would do the same for a data Lake.

You know we were pumping data into a data Lake.

I want to get all of the data from all

of the places into this stage first.

So that’s that’s kind of the very first step that

I do.

After that, we then began applying what you just said.

We format it into a schema,

we make it make sense the very first thing I

would do if I was staging it into usaia SQL

Server table a heap.

The very next step would be read the heap and

take these these text columns that contain dates an numeric

values and try to fit them into date an numeric

fields. Right, I would make that attempt an if something

happens and it won’t fit.

Maybe one of the date columns has February 30th in

it.

You know, maybe somebody fat Fingered something and put the

13th month in there.

Anything could happen. I’m going to redirect that row and

try to store that data,

but I won’t want to human to look at it

and that could happen in a data Lake.

It could happen in loading heaps tables from anything else,

but the idea is I want to pick up whatever’s

out there,

get it into this engin that I can then apply

these rules to like in my very first rule is

as strong typing.

Want to make sure that I get all of the.

Values that are valid out of there and I also

want to mark the invalid values for later.

So and eventually we flow down this to the end

and we get to a place where you know we’ve

got a collection of all of the values we’ve ever

loaded, and we then fit our new data that are

new load.

We fit that new data into that prospect,

but not before going through that staging process.

Then some cleansing, and I would first step in cleansing

for Maine is strong typing using.

There’s a second step that minimally minimal of another step

that.

Apply soft rules so if I’m loading claims data for

an insurance company and I’ve gotta clean initiated date.

I may have a claim closing date or settle date.

I want to check a software would be the claim

settled,

8 is greater than the claim initialize date.

You know stuff like that.

That makes sense. So kind of like common sense rules.

So so I guess you said you want to remove

the constraints.

So that probably plays into the fact that you hear

the term denormalization being used a lot is that.

Is that what they’re

referring to somewhat? Usually I applied the normalization when we

get to the end of a relation of funnel here,

where I’m loading ETL stuff and what I’m after there

is is flattening out the data so that it’s ready

for the consumption.

In these reports that tell me how’s my enterprise doing

today.

And you get various various and sundry opinions about D

normalizing data.

If you think about third normal form something you mentioned

earlier,

the premise behind cards rules of.

Normal forms at normalization is that one of the principles

is you only want to have data live in one

place so this value you want in a single location

you want to copy it in 14 different places or

even even 2 you want to be able to keep

it in one location because if that data needs to

be changed or if we make a change to it

we want to go hunt it down.

Here and also also one

of the Canonical examples is an address right so I

think this is really to me anyway this underscores kind

of what’s why the 2 philosophies on that.

Are important and why they’re both compatible in a sense

so if you have an address if I if I’m

more like an online transaction?

Platform right OLTP. Or processing that’s what the TV stands

for not toilet paper although people supporting it.

But but but if you’re dealing with like a real

time transaction you want to know the customers current address

exactly right

but if you want people kind of.

Now go ahead

but if you want to know historically if you’re doing

reports on where customers have sent stuff or what their

dresses have been then you would want to buy definition

you kind of have to store those address in more

than one place that this order was placed On this

date and will send to that address as opposed to

this is where customer ex lives today they live in.

This address.

And it’s a great example an there’s yet another use

case some people own multiple homes and so they have

multiple current address is and this is where you start

getting into things kind of getting interesting I’ll say it

that way in data warehousing but that’s if you look

at the percentage of all of the people who have

an address in the world most of them by far

have a single address at a time and they go

through this but you do definitely have to account.

For those customers who may have something shipped to say

one location and kind of like going back to Amazon

this is happened to me I own you know I

have one address only one home well me in a

mortgage company but I have been on vacation and ordered

from Amazon and had it shipped to the address where

I was you know overnight.

There as well so it’s while it look this is

one of the things that looks simple that you know

it’s not so simple and a friend of mine and

coauthor tell Mitchell he’s ETL specialist San Tim actually blog

he wrote a really good blog post about you know

how fuzzy some of these concepts can be and I

believe the title was what is a day.

You think about if I called yeah well you know

at tenmitchell.net if you want to check that out and

search for it he brought up some really good use

cases you know what is a dangerous thing sure and

we think that simple no it’s not.

And part of what happens and I kind of skipped

right over this is when I’m collecting data from these

different locations.

But I’m doing ETL it could be somebody has a

warehouse in Liverpool england and they’ve got a bunch of

warehouse locations in Canada.

And the US in Mexico and maybe even in East

East Asia wherever they store dates in different formats.

Yeah so at I could have what looks like a

13th month right I could have 13 dash 03 dash

2020 you and I know because we work with us

so much that that’s referring to the 13th of March

but in the United States we would reverse the first

2 we would have 03 dash 13 dash 2020 and

one of the one of the goals of data warehousing

or even to DS is to collect that data and

then pick a format it doesn’t matter which just pick

one. And make that perform at your going to use

for dates the same goes for other measurements that you’re

using and and it can go beyond that the weights

in some locations are going if I set up a

something that includes how much something weighs in Germany I’m

probably going to get kilograms but if I do that

in wiscconsin I’m going to get pounds that’s against Wisconsin

nothing against Germany it’s just different.

So we have to convert so that when we do

comparisons on how much things weigh it doesn’t matter which

measure we picked by the way we did it just

needs to be consistent so that we can compare apples

to apples and we’re trying to tell which is heavier

something that weighs 2 pounds or something to weighs 2

kilograms going to get different answers.

No that’s a good point that’s a good point and

there’s all those sorts of things and and you right

when you when it comes to you kind of narrowing

down what do you mean by this and I think

one example that’s not necessarily data related was I was

on a mapping project it was basically for a hotel

chain in they wanted to map out local points of

interest right in San Francisco.

And this was all done in Silverlight Maps and all

that very cool 3 D work although doesn’t work anymore

but take your word for it was awesome.

But uh one of the questions that I had and

I worked with the work.

I work with the manager of the hotel is you

know they wanted to be able to put pins on.

You know various points of interest now for some things

that’s easy but when it came to like a park.

Well where’s the interest of the park where is the

park really like right I think was Golden Gate State

Park or whatever it was like well where is Clark

where do you want that pin to be do you

want to be at the center of the park do

you want to be at the gate or do you

want to be the part that was ultimately they went

with the part that was closest to the hotel so

there was not was a little bit less but I

mean but I mean like you know how you define

where something is precisely.

That’s that’s a you know when that ultimately I think

it should come down to what’s the problem you’re trying

to solve which in this case was you know how

do you get there and sort things so it’s an

interesting problem that data modeling and as well as geospatial

stuff kind of

does

get into sure yeah it really is and you know

when you start looking at that people could say well

your fudge Ng you know the answer because maybe there’s

only 1 entrance to the park and you want to

show who’s closest and you happen to be closest to.

You know the complete opposite end of the park and

yet if your hotel you may want to advertise Hayward

closer to the park than anyone else even though you

can’t get in that way.

You know it’s not in

this case I am working with them they wanted to

make sure that that it was the entrance closest to

them was like what they

were going for. Yeah totally fair

remember. So go ahead drink now go ahead.

Oh real estate people play this game all the time

like you know they say their their minutes in New

York city.

OK 90 minutes 10 minutes.

Yeah you know I once looked at a property back

when I lived up there they said that they were

10 minutes of New York City and it was in

Bayonne and I’m like that’s not possible turns out remember

Staten Island is technically part in New York city so

you were 10 minutes in Staten Island no knock on

Staten Island I was born there so you know respect

to shaolin Wu Tang would say.

Uh if you have no idea what I’m talking about

that’s fine I’ll explain it another day.

But uhm no I mean it wasn’t true yes was

it accurate now yeah but wasn’t meaningful now ’cause most

people wouldn’t say New York City they you assume Manhattan

and right. Now at that was closer to 60 minutes

or 65 minutes which again is minutes away.

It’s just yeah. Now a personal pet peeve

sorry I don’t know it’s it’s accurate and it’s like

you said yeah I play this game all the time

with data the difference between accuracy and truth an and

really in the middle there’s this whole other thing where

it’s the art of communication right an the fact that

the book how to lie with statistics was written in

the Forties. Is a clue not new yeah people have

been playing these games with statistics anan our language and

the exactitude of our language a lot of people throw

off on King James English but it was very accurate

so way more precise than what we use now and

you know there’s all of this science that goes into

literally and into tricking someone into believing what you want

them to believe while presenting non truth using accurate terms?

And one just one of them went Rick is anchoring.

And if you haven’t read about anchoring it is a

you know it’s you can just suggest any number it

has nothing to do with anything else but if that

number is. Is a high number and then you ask

someone to estimate some other thing that it’s been shown

it’s been proven over and over again they will guess

higher? On that estimate and if you then anchor them

with a very low number and do the same trick

you know different group of people may be different day

they’ll guess consistently lower the estimation is is anchored to

the number they previously heard.

And it’s just. It’s a scientific fact about the way

the human mind works,

so. You know you can.

You can see this, you know we see this in

data is probably.

If these days more more so maybe than than before,

is we’re looking at statistics dealing with the pandemic an

A friend.

Andrea Benedetti, who is as a Microsoft employee power BI

specialist,

pretty sure he was an MVP before he joined Microsoft.

He lives in Italy and he’s been posting at AKA

Dot Ms Slash Kovid report and he’s been updating this

almost daily with numbers from around the world about that

he he posted about a week or so back that.

When he stops and thinks about the numbers and realises

these our lives,

especially the lives that are lossed It,

yeah, it’s almost what he was saying was like you

almost can’t do this kind of work and keep that

in your mind at the same time you as.

I don’t know the right word for this,

but in my opinion it’s a Noble thing to do.

It’s like. When I hear about jaded doctors and nurses,

my first thought these days is what did it take

to get you to Jay?

Did you know? It’s like if your options are functioning

jaded or not functioning in that role?

Especially in these days Frank?

I’ll take jaded. It’s you don’t understand,

you know it’s really easy to set back and arm

arm chair quarterback and say now you know you should

not disengage your emotions.

You should just do this job and my ex wife

my ex-wife said neonatal intensive care unit nurse.

And you know, I’ve had some insight into what this

can do when someone loses a patient,

especially in infant. It’s crazy hard and to ask people

to just now.

I want you to keep feeling everything that’s like.

I don’t think you understand really what you’re asking folks

to do.

And Andrea mentioned something like that.

You know these numbers are lives.

These numbers are people with

two things, one on the numbers aspect.

We’ve addressed this before. And the story is you can

go back and listen to it.

We’ve been recording almost three years so crazy that we

have that much content,

but you have done nothing about collecting data about the

hurricane and all that.

And you know, I kind of set offline.

These are people’s lives, and yeah,

kind of story behind this was when I am a

survivor of the attack on the World Trade Center on

9/11.

Don’t want to go down that rabbit hole right now.

Right, but I was sharing my pictures that I took

with the with researchers and kind of people that were

doing the fire research over at Nest.

No. There was this moment where I didn’t respond to

the guy in awhile and.

You know he was he he I basically said sorry,

I’ve been to like maybe about 7:00 or 8 funerals

this week because you know what happened and right.

And then he kind of wrote back because,

you know, it’s really hard.

I forget sometimes that when I look at the pictures

of the burning building,

I’m looking at peoples, basically death scene.

And you know he was kind of like the funeral

thing,

kind of. Carlo brought him back to that like these

are people’s lives like and you’re right.

I mean, one of the things I did when I

was in college.

I got halfway through college.

I originally went to college to be an engineer chemical

engineer,

then switch to computer science.

We talked about that in a previous episode of the

humorous aspect of that,

but midway through I thought,

well, when I was, maybe I want to go to

pre Med and I became an EMT and stuff like

that and it’s a hard job because you really have

to. Kind of mentally insulate yourself from what’s happening around

you.

Absolutely the hardest part of that job was really emotional

toll because you can see somebody injured and kind of

think about it.

Almost like a car, like not a living thing like

you have to repair this.

You have to Patch that up.

That kind of helps you do it,

but then like what kind of sticks with you is,

why would one human being being do that to another

person,

and that psychological aspect? So I’ve always had respect for

the medical community because they can kind of do that,

but I think if you know that separation of kind

of what you’re seeing from what it means,

I think is a core survival instinct that I think

all people have to a certain degree.

But the medical people have really perfected it.

Well

and you know the whole concept there.

I think we’re dancing around empathy and being able to

not turn it off,

but maybe turn it down.

So that you could continue to function logically and walk

through the steps,

especially in the case of an EMT and Frank.

I say this being the type of person who cannot

do that.

It is not possible for me to do that consciously,

but I had this one experience went through.

Of course I worked in manufacturing back in the 90s.

I was a plant electrician.

We had to go through first aid just in case

and late one night I’m on 3rd shift and we

had a medical emergency occur now.

Keep in mind that at the end of the medical

training,

the nurse registered nurse had delivered this training to us

first aid.

In such she says to Maine Andy,

if somebody gets hurt, you take a flashlight and go

outside and wave down the ambulance.

Make sure they know how to get into the plant

parking lot because you’re not going to be very helpful

and it they weren’t wrong.

You know just listening to my responses to this,

but here’s this did happen.

Somebody got hurt and they were potentially in a life

threatening situation and.

I guess it was my old army basic training.

First aid kicked in or something,

but as soon as I realize the first thing I

just saw the person I was the first one to

see him come out of the place where they had

been hurt. I won’t explain in detail,

but I could tell they were hurt.

It was very easy to glance at this person to

realize they were hurt bad.

The. I remember thinking Frank,

I’m the only one here who can help.

Right right they rest of the people were working on

the other side of the building.

It was a large facility.

And when once I had that thought,

it was like the entropy entropy.

It was, it was like the you know the emotional

connections.

They all, then empathy. They just turned off and I

just went into gear.

And realizing that I was able to help him,

I would say I saved it live.

I don’t think so, but I definitely helped you know,

by getting to the first calling 911 and doing first

aid.

So it you know I I’ve had that experience,

so I can kind of relate to just let you

know a thousandth of a percent of what it must

be like.

Being in Madison on a normal day.

And these are not normal days.

Now it’s true. It’s true,

it’s uh. Now, God bless people who can do that

and who are doing agreed,

agreed. So, uhm, so one last question on the data

warehouse deep dive.

I think we’ve kind of danced around about the normalization,

but star schema versus Snowflake Schema.

Yeah, so I consider snowflakes to be actually a component

of star.

It’s a way to represent hierarchies in a star schema.

You can have a Snowflake Demention.

A great example of that is found in the adventure

works database,

where they have products that are have sub categories and

sub categories are related to categories.

When you load that date,

of course you want to start at the top of

the hierarchy,

load categories first because there could be a brand new

category that has a brand new sub category that has

a brand new product.

If you try to load up from the bottom which

you want to do is you want to kind of

grab these keys that are artificial,

we call them sorted keys.

You want to grab that from the next level up

on the hierarchy,

so you have to load it from the top down

to get that in an.

And adding confusion to this these days is there is

a data database engine called snowflake

and it’s going to say totally

online engin and I am not an expert in it

but I do know just from working closely with someone

who is an expert in it that it’s really fast

for loading large amounts of data as long as you’re

not trying to maintain like updates to that data it

gets painfully slow when you try to update data in

that platform again. There are advantages and disadvantages and the

truth is Frank there’s some data that you never update

like you never hear the weather person say yeah I

know we said yesterday that I was 55 Fahrenheit but

it was really 56 that that never happens.

That data is set and it’s recorded that way whether

it’s right or wrong it’s recorded that way for that

time financial currency exchange rates same sort of thing they

pick a daytime some agency runs the algorithms and bam

you don’t ever hear them changing that that’s not updated

so if you’re loading that kind of forward only type

data. Any kind of platform that’s good at inserts is

going to be really good at managing that type of

data so and there’s other ways to?

Attack updates as well. You can do these transactional things

where you have a from date.

Anna two date appended to the data and you can

convert.

Then you’re you know maybe data that is being updated

the way you update it is you put and it’s

from this state.

Time to this other day time.

Maybe it is updated from the last time,

but you’re inserting a neuro every time that there’s an

update and you see this in accounting generally accepted accounting

practices.

Double entry. You will entry ledgers there done that way

all the time,

every time. Even if there’s a mistake.

Even somebody goes to the bank and takes out,

you know $200 an. You know they only get 100.

Don’t realize it till later and they have to adjust

that what you’ll see you won’t see them erase the

200 and put 100.

They will add another entry that says there’s an extra

$100 in that account.

That’s the way that accounting works on paper.

And again it’s an update.

But it’s a transactional kind of list or log if

you will.

That says no, we just add a new transaction to

fix the mistake.

So it’s kind of like H base.

That way where one state is in the Ledger,

it’s immutable. Absolutely

yeah, that’s a great way.

Much better way to say that then I did.

Jesuit education, I guess. I got I got a bag

full employee

dollar words that’s funny. But a lot of ways to

attack it.

I’m sorry, go ahead. Go ahead.

I was just going to say there’s a lot of

ways to solve some of the same problems that we’re

doing,

and we can end up playing to the strengths of

the software that we’re using all the platform that we’re

using.

We always want to do that because speed is King,

especially these days or Queen.

It’s true. That’s true. So what are your thoughts?

Is a light as a longtime kind of data warehouse

data engineer?

What are your thoughts on synapse Azure synapses?

This meant to be kind of the next generation of

SQL data warehouse.

I have seen presentations where I’ve kind of seen the

bigger picture and it’s fascinating,

but what are your thoughts about that?

So I’ve not yet used Azure Synapse and.

Can’t even get access to it and I like to

think I’m one of the cool kids.

I’ve to my to my discredit.

At this point. I haven’t yet attempted to gain access

to it,

so I’m looking at the looking at it purely based

on marketing information and we know marketing information is usually

accurate.

But we had that discussion earlier

about accuracy versus true. There’s

a caveat hanging off that so having not.

They’re sort of hanging on the wall and throwing the

switch.

I don’t know enough to really give you experience based

feedback on that,

but what I can tell you from just kind of

a general architectural overview,

I have seen enough information not only from marketing people

buy something from some practitioners,

and the way they are approaching solving the problem they’re

trying to solve it very much appears to be a

next generation.

Type thing and Frank. I’ve written a couple of books

with design patterns in the in the title,

but one two editions of the same book.

SSI S design patterns and I think about that a

lot when I’m when I’m speaking architecturally,

Ann. I use this analogy to describe just the concept

of design patterns.

You can think about it as like Legos,

right? Everybody, maybe everybody has played with Legos.

I did, and you build that you’re going to build

a Lego wall.

So you put a few legos.

You know, bottom up against each other,

and that’s your base. And then you build your next

layer on top of them an you know another layer

on top of those.

And before you done, you know you’ve got 3,

maybe 8 Legos High. You’ve got this fall.

I picture design patterns that way,

and elegant patterns lend themselves to that sign.

That kind of of. His self similarity just at different

scales and you know,

the more you know, the more you’re able to do

that.

Once you’re able to take these basic functions like storing

data,

reporting data. You know data acquisition.

You know all of these pieces once you put these

together?

That’s kind of like your base bottom part.

But then if you can add a little bit of

automation to each of those,

now you’re building on that second layer,

an maybe on that. Thirdly,

and I think that’s where synapses.

I think it’s. I think it’s at least #2,

maybe third layer where they’re starting to group these together

under a common interface and kind of mask some of

the complexity that happens beneath that.

And I’m totally all for that.

I’m an automation freak, you know this.

And if that’s what it looks like to me,

it looks like they’re trying to simplify this stuff that

could be challenging.

Does that mean challenges won’t exist in it?

Gonna snow? They will, and you’ll still need expert help

in these other areas here.

And you know, in specific areas,

but you may not need it.

You may not need as much as you did before,

So what? what I like about this division of Labor

is you replying expertise only where you need it.

And In other words, another way to look at it

is.

You’re not hiring an expert to come in and do

this very basic stuff for you.

You’ve automated that. So you don’t need to pay an

expert expert wages to do that.

Pretty, you know what we consider kind of menial for

the expert.

If you automate those types of things away and there’s

a slew of that available in tons of examples,

business intelligence, markup language is one for generating SSI S

packages.

Staging is a pretty simple pattern.

Read the data and write it into this heap as

I talked about earlier,

that’s a pattern that doesn’t.

You know you don’t want to pay somebody a Virginian

Dollars an hour to do that.

You get a junior person to do that type of

work.

Even better, you pay someone who has expertise in Bemol

Business Intelligence Markup language and they just build it all

for you through automation.

and I got some wild stories about that.

My biggest anecdote is idea.

10 1/2 months of work in 3 1/2 days.

So it’s that kind of game changing stuff,

but again, as long as you can get,

you know, as long as you create an interface that

masked the drudgery of repetitive work away,

then you only need experts for the places where you

need expertise and that can change the game.

It changes economically and it changes it from a technological

standpoint as well so.

That’s what appeals to me about.

The tidbits I’ve gleaned from listening to people talk about

synapse and looking at a couple of,

you know, a couple of demos.

I’m as an MVP. I get very honored to be

able to do these product group interactions with Microsoft and

attended A at least one of those.

I want to say maybe 2 and we get to

see a little behind the scenes.

Not much. It’s not as much as you think,

but you know we’re talking to the team.

Not and nothing against the marketing people.

They’re doing their thing and God bless him.

We need him. But I’m talking to the engineers.

You know,

I’m watching the music, making those decisions about what’s possible.

What’s practical, exactly. You definitely get better or clearer picture

of what’s going on.

See yeah, I

would say different. It’s nothing again.

Marketing is doing its thing and I’m not a marketing

person.

There we go. Yeah, I agree with you and it

is definitely clearer and it makes more sense to me

’cause I’m an engineer so I get it.

But that’s options would say the main nerve.

There we go. And so you know,

in a nutshell, I would say Azure Synapse based on

again my limited exposure to it appears to be one

of these next level jumps and I refer to these

often is tectonic. They’re laying down a whole new layer

here,

you know, and they’re building on like on my my

Lego example.

They’re building the next layer of the wall,

if that’s what it looks like those are.

You know they used to be every decade or so.

We have that sort of stuff,

but it’s accelerating, so now I’m seeing a couple of

three times a decade where we get this new tectonic

shift in what we can do,

and automation that’s built on top of the previous layer

of automation.

That’s how I see it.

Don’t know if that helps or hurts.

Not definitely helps. Uh, so with that were at the

hour?

Uh, just want to point out.

I also have a meeting.

I gotta go onto. OK,

yeah? We had a great session here,

hopefully this answer some of your data warehousing questions and

if you want we could do another deep dive.

And let us know what you think of the show.

If you’re interested in iBooks or not iBooks audiobooks,

audible is a sponsor, so go check it out.

I should probably see if there’s any good data warehousing

books on audible.

Yeah, I don’t know a lot of them have to

be written,

I think because

the example I don’t know it would be interesting to

see if there are all.

I’ll do a search if.

You are not already inaudible subscriber and you would like

to get one free audio book on our dime,

or actually it’s audibles time and you go to the

data driven book com routes you to inaudible page if

you get a subscription.

After that they toss us a dime or two and

help support the show.

Uh, particularly given that my monetization strategy has been primarily

driven through.

Amazon merch, which has been shut down?

Yeah, every little bit helps.

It’ll be back. You spring is still shipping.

Sorry true, very true that design that.

I got in trouble with.

I actually have one printed version on the way so.

It’ll be fun anyway. I’m not going to post it

to Instagram,

I’ll just put it that way,

but with that in mind,

um, have a great day and stay safe out there.

Any any other parting thoughts?

Now that’s perfect. Awesome and will let the nice British

Lady and the show.

Thanks for listening to data driven.

Don’t just listen. Become a data driver by going to

datadriven.tv to sign up to join the community.

Access to special events, tips and tricks and more.

Sign up today

at datadriven.tv.

About the author, Frank

Frank La Vigne is a software engineer and UX geek who saw the light about Data Science at an internal Microsoft Data Science Summit in 2016. Now, he wants to share his passion for the Data Arts with the world.

He blogs regularly at FranksWorld.com and has a YouTube channel called Frank's World TV. (www.FranksWorld.TV). Frank has extensive experience in web and application development. He is also an expert in mobile and tablet engineering. You can find him on Twitter at @tableteer.