Data Warehousing Deep Dive
In this Deep Dive, Frank and Andy delve into the world of Data Warehousing, what is it and do they know things? Let’s find out!
Frank also shares that he has a new role at Microsoft.
AI Generated Transcription
Hello and welcome to data driven,
the podcast where we explore the emerging field of data
science.
We bring the best minds in data,
software, engineering, machine learning and artificial intelligence.
Now hear your hosts Frank Lavigna and Andy Leonard.
Hello and welcome back to data driven.
The podcast where we explore the emerging fields of data
science machine learning an artificial intelligence.
If you like to think of data as the new
oil then you could consider us like Car Talk.
However, we can’t go on a road trip because of
the Corona virus lock down.
So it’s just Andy and I kind of stuck at
home respectively.
And thanks to the Magic of Technology we can be
on the show at the same time.
And, uh, how’s it going?
Andy? It’s going
well, Frank, how are you doing?
Good, good, uh, you’ll
probably hear my kids in the background.
We
will, and you know what Frank,
I think it’s fine. You know I’m going to.
I understand why you said the word stuck with you
and I work remotely an awful lot.
We usually record like this.
There’s there’s less in the background.
It’s your place most of the time,
but you have couple of young boys there and you
need to be in the room with them when mom
who’s also working from home is you know is doing
some of her work so kudos to you to both
of you for finding a way to manage this.
Everybody’s going through these sorts of things and I’m sure
that none of our listeners will mine here in your
sons play in
the background or hopefully won’t start fighting so that’s Well,
I asked, I asked if they do I think a
lot of folks
can relate though. Yeah, oh absolutely,
absolutely
so. We’re recording us on April 16th.
We Speaking of kids, we had your son on which
if the order of recording goes the way I planted
in my head.
That would have been released last week.
And Uh, which I thought was a pretty good,
uh, discussion on. How stem is taught?
How stemmers perceived by quota quote policymakers?
And how the actuality of it is?
And some of the interesting stuff your son is doing
with Raspberry Pi and stuff like that.
Yeah,
I was a I was first I was very proud
of him.
You know the work that he’s doing and he’s he’s
had his his hands in machine learning for really a
couple of three years.
Now I want to say he was 14 and I
came into his room.
You know just checking on say something or something I
saw.
A Mario Brothers playing in the background.
Like what do you think you know he was?
He he had done his school work?
He was home schooled at the time he done his
school work.
So you know what he wants.
But um, later talking to him about it,
he said he actually came and got me and he
said,
OK, dad, it took, you know with I think it
was like 6.
You know neural nodes. Here he was able to,
Mario was able to figure this out and something like
4 hours or something you know later he said I
wonder what it would be if I added a note.
I wonder what that would do to it and I’m
kind of sitting there with my mouth hanging open.
Going show dad more about that nice,
but he’s been doing it for awhile.
I know your kids are interested in the same thing.
They’re younger Stevie 17 now and you know.
and I know that your sons are coming up in
this.
In this age as well,
they are mentioned Mark Tapatio in that show as he
referred to digital natives.
They are digital natives and yeah,
that comes with some pretty interesting stuff.
So I’m just glad we were able to record that
show as he gets ready for his first sequel Saturday
presentation here on that topic.
So and that’s all assuming that we were able to
overcome the technical glitch.
We we learned something, Frank,
I’d learn something. Yeah,
it’s not a glitch. If you learn something.
So if if for some reason.
The you know what hit the fan then that episode
will be recorded at a future date,
so we’ll see it will,
but we’ve got. We’ve got
a great topic today. You and I’ve been bad this
around I want.
I know it’s been several weeks.
It may have been a couple of months.
We’ve been talking about doing this.
Right absolutely, and part of what motivates this?
An based on the release schedule that I anticipate this
will have already happened.
I’m changing jobs at Microsoft Woo.
At your new job. I will be the data and
the AI technology architect at the Reston MTC or Microsoft
Technology Center,
so congratulations. Thank you very much.
It’s an honor to join such a prestigious team.
If you’re not familiar with what the MTC is.
MTC is a Microsoft Technology Center.
There is about 80 of them around the world,
and they basically are meant to provide specific experiences.
Ends well as architecture design guidance for customers around the
world and it’s an honor to be kind of in
that team.
It’s very rarely does an opening happen in an empty,
so when one opened up in my neck of the
Woods is like I have to take it.
I have to at least try.
Right right? So Fortunately I am super excited.
And Uhm, ’cause That’s what we say at Microsoft were
super excited and it’s a great team.
Great stuff that they do.
They do a lot of work with the community.
They do a lot of work with customers.
It’s just an awesome gig.
I’m really looking forward to it and.
Yeah, I’m really excited about
it. Congratulations brother. That’s a great thing and I think
you’re perfect for that job.
I know, I know, someone else in that job at
an MTC in the northeast.
And it’s it’s kind of a rare breed of person
that has to walk into that role because.
It optimally you have a smattering of exposure to all
whole slew of enterprise architecture,
an both both you and this other individual that I
know fit that mold.
You’ve got programming experience, software development experience,
and you also have data experience,
and it’s just rare to be good at both of
those things I know,
but I know you’re good at it,
and I know my other friend is good at this
as well,
so I just I just think it’s going to be
a great fit for you,
Frank. I’m I’m excited, you got
it. Thank you. Thank you very much.
So with that, one of the things that I’ve been
ramping up on in anticipation for this job or whatever
opportunity I was going to go to next.
I was learning more about the quote Unquote traditional side
of the data world,
which let me move kind of explain my little worldview,
which is twisted and as weird as it may be,
it might actually be right.
I see this alot in my current current or old
roll current as of April 16th.
Role is that we have data in the I cloud
solution architects,
but there’s a very clear line of demarcation between the
data scientist.
Part of the data in the icy essays and the
sequel veterans side of things.
So I actually had a call this morning where it
was.
It was very, very much laid bare ’cause we were
talking about that and that there’s essentially kind of two
types of data in AI folks at Microsoft for sure,
probably everywhere else, to you have the RDBMS folks.
These folks have been doing sequel since it was aside
based joint venture,
right, right? That’s their world.
Ann, you have kind of the big data open source
kind of tooling world,
right? The folks that are more comfortable in spark or
Hadoop or with the crazy statistics and math around machine
learning and AI,
right? You kind of have those two.
Rarely do
the two. Rarely do you
have a person who’s. Comfort,
Rible and happy in both.
I am aiming to be happy and comfortable in both.
Obviously I’m more in the data science kind of world.
And part of my part of what I see is
the opportunity in this new role is to grow into
the kind of the sequel.
RDBMS traditional database world. That makes
sense. They are no. It makes perfect sense.
and I mean coming at coming at this from,
you know, we as we shared in each show the
past few days that we’ve recorded.
We’ve known each other for like 15 years.
And most of that time you were a professional software
developer.
You are a Microsoft MVP in.
I forget which discipline it was.
Frank, I know it was software development related.
I
think
the world has forgotten that this discipline never existed.
tablet
PC. Tablet PC right? OK and you did an awful
lot in there and I know there’s a lot of
people out there working in what that evolved into mobile.
That benefited from the blog post you shared,
solutions. You shared an all of that,
but yeah, that whole mobile thing turned out not to
be such a,
you know, it was a trend it and it evolved.
To what it is now,
and having that experience, I think you’re going to find
that that plays well into kind of backfilling like you
said,
or filling this other bucket that you want to go
after,
which is traditional T SQL an.
I know, I know, from experience and dabbling in machine
learning and AI.
I’m on the opposite side of the fence,
although I’m not really that good at,
you know. Let’s say like DBA level T SQL,
but I you know I can hold my own in
there,
but if we are. If we’re selling tuning performance tuning
to a client.
I may be involved in the project,
but rarely am I the person actually performing the tuning.
There are lots of people out there that we subcontract
as a as a consulting firm,
enterprise data and Analytics. We bring others in who are
better at that much better at that than I am,
and we have people on the team who are much
better than I am as well,
but it’s. I think your experiences has his set you
up really well.
To make this transition and it will like everything else
right?
We talked about this in the other shows.
It takes time. And it’s frustrating,
but I think you’re well positioned to pick up this
skill as fast or faster than almost anyone else I
know just well,
thank you now you know part of it,
I’m not. I’m not completely like naive to the ways
of RDBMS.
I took sequel in college,
database design and college and my professor worked with card
in date.
So you know, like. You know,
I’m only two degrees of Kevin Bacon away.
From the founders of the theory,
so you know that’s going for me,
but I never really got into just kind of the
nuts and bolts of it,
and I’m not. I’m not concerned about that.
I’m actually fascinated about it,
because it’s just another way to solve the same problem.
Absolutely. Ultimately, at the end of the day,
you’re moving bits around, and it’s a question.
What’s your philosophy? Or obviously,
RDBMS has a philosophy and it you know I’m not
knocking.
I mean, it worked well for 5060 years.
But now we live in a world where there’s a
lot more unstructured data.
And how do you deal with that?
And how do you deal with it now that you’re
not making assumptions about spinning
disks, right? Right there’s a whole.
Kevin hazard.
’cause we haven’t hazard.
Yeah who talked about that on our show that yeah
there still leaves it’s 2020 and I would say still
most of our code is designed for that age of
the heads picking up seeking a sector an reading data
and then picking up again.
So there’s there’s a whole new opportunity where obviously relational
databases are going to still matter,
but it’s just one of many tool sets.
In fact, one of the things that I learned when
I was doing start up with angelism for Microsoft was.
You know, having debates with startup founders who UR?
I will say I put them in a hipster category,
right? I worked with when you work with startups runs
the gamut between really like I mean like that this
person is going to be the next Steve Jobs to
this person is kind of like I think they’re living
in their parents basement,
but rather than seeing unemployed they haven’t so somewhere in
the middle you kind of what I have.
The hipster ones where they learned code because of make
the startup.
Now that’s not nothing wrong with that,
but do you think that you’re an expert in all
things technology because you learn to code?
Right, you know, and then you go to a person
that is supposed to help you take your stuff to
the next level and kind of talk down to them.
So right context this conversation.
So they were basically lamenting the fact that they wanted
to.
They wanted to have the reliability of.
Up an RDBMS, but they wanted to do it in
a note SQL type of environment.
An I was like that’s
a fair. You know that’s a fair thing to want.
I’m just all cards on the table.
Approaching that, architecturally, that’s that’s not an unreasonable request.
But unless and until you get into the engineering part
of it.
And that’s where you start to see that you just
can’t have everything that you want.
I mean, there’s no single do it all type application,
everything, every software application ever.
And I’m going to maintain,
probably forever. They’re going to be applications.
There’s going to be some spot that I define as
a corner.
It’s something that the application or server or what have
you doesn’t do well.
And what you’ll often find is there’s some other application
out there that’s available,
or some other platform, and it will do that part
well.
But again, that also has its corners,
and So what you’re trading is pain.
The nicest way possible. You’re picking your picking your poison,
picking your pain. What is it that you want to
fight?
And it depends on. You know.
Relational databases have their pain points.
No sequel. It turns out a lot of companies have
learned this over the past few years.
Also has its pain points as well so.
You can’t always get what you want,
but if you try sometimes you might get get what
you need.
Awesome. So, so I
mean part of it is,
you know, sometimes whether it’s technology,
anything else, you have, kind of these dueling philosophes an
there is a point where they just won’t meet just
because of.
They’re they’re kind of philosophically opposed an you’re right,
you have to kind of pick which one you want
to have over the other.
And there’s cause and effect to that.
So with that kind of deep philosophical you were data
philosopher,
so that’s good. So I wanted to talk to you
about.
We want to do a deep dive.
It’s not officially a deep dive until I have fun
with my soundboard there.
Into data warehousing, what is data warehousing?
Where did it start? I’ll channel A little bit of
bojack horseman.
What is data warehousing? What do they know?
Do they know things? Let’s find out.
Well, yeah, data
warehousing in my opinion in my experience is really this
idea of of collecting data from all over different places
and placing it into a centralized location.
Now there’s some distinctions and there’s other scientific answers to
that question,
and you can actually build something that today is not
considered a technically a data warehouse.
You can gather all of the information that spread across
the enterprise in different places.
Into what’s now called an operational data store.
Ann, it’s not totally unlike a data warehouse.
In fact, I think the Euler diagrams have quite a
bit of overlap for that,
at least if we if we kind of improve or
add to the word data warehouse or the term data
warehouse with relational data warehousing,
there’s a lot of overlap between relational data warehousing and
operational data store.
Wanna confuse that really with our listeners?
But I just want to make you aware if you
hear oh DS or DW or EW.
It could be that they’re talking about largely the same
thing.
And when you think about like you think about supply
chain management,
which is a topic on everyone’s mind these days as
we’re talking about the economic impact of the pandemic.
Supply chains are where really where really way more important
than we realize and it’s kind of like oxygen or
money.
You don’t recognize how important it is until you don’t
have enough.
An supply chains are like this and you could think
of a data warehouse.
In that terminology. The analogy holds for quite a bit,
and I’m going. I’m just going to use Walmart and
Amazon as you know,
is kind of examples of this.
They both have these distribution centers and they have these
network set up all over the United States,
probably all over the world and its places where the
goods come from the source and they’re trucked into.
You know, they may be collected at other points along
the way.
But they’re trucked into these large,
physically large warehouses and then stocked.
And then from there there actually shipped out to in
the case of Amazon.
Usually there handed off to some delivery service.
In the case of Walmart,
they’re placed on other Walmart trucks that are shipped to
the stores.
The actual brick and mortar stores and that warehouse in
the middle.
That distribution center. That’s what I think of when I
think of data warehouses.
I think of the the electronic equivalent of that because
you know,
there’s all of these. You’ll see especially at what I
consider an EDW enterprise data warehouse.
You’ve got a collection of companies that have been acquired
in mergers and acquisitions,
and they’re looking at. I want to get all of
their data.
But they have and want to bring that into this
one location,
and that I want it there for a number of
reasons.
But one of the big reasons is so I can
query that data and I can learn how my entire
enterprises performing.
How’s it working? And. And in that,
and now if I apply that Walmart Amazon analogy to
that to the data there,
they reports that come out of querying a data warehouse.
Those are like the end user customers for Amazon,
say and like the Walmart stores in that other analogy.
So it’s and it really turns out Frank.
It’s always good to to answer the question what’s the
problem we’re trying to solve?
And here’s the problem. Most people are trying to solve
with an enterprise data warehouse.
They want to know how’s my business doing today.
And if they can answer that question,
that’s going to be answered by a report of some
kind or an analytics dashboard.
Still a report. In my mind,
it’s in the best analogy is something we’re used to.
It’s a stoplight. All we green everything is good or
good enough.
Are we yellow? Some things are in trouble,
and most enterprise data warehouses during the pandemic are going
to be showing yellow or some red.
We’re in trouble. It’s going to be very few showing
green and but that’s the kind of thing that you
want.
You want to know at the very highest level are
we good?
Are we bad or are we somewhere in the middle?
And of course, you want to always be able to
drill in.
Especially things are in the warning state or in the
bad state.
You definitely want to drill in.
See what the problem is.
Pick up more information about that,
but at the very tippy top.
You want to be able to answer that question.
How are we doing
today? So I have two questions.
Uhm one is you mentioned that with mergers and acquisitions
you want to have one place for all the data
to the live?
Sure how is how is it data warehouse different than
a data Lake?
That’s
a That’s a really good question,
so you can think about data lakes being a similar
collection point.
So from a purely functional standpoint,
my experience with data lakes I’ll share is limited.
So that you know to take this with a grain
of salt,
but I consider data lakes to be more of just
a a collection of the data.
A copy of the data in its raw or raw
ish state,
and when I draw a line between what I call
a data warehouse and what I call an operational data
store there more on that side of the operational data
store. There more of a copy of the data from
wherever it’s come from.
Now I know for a fact working with people who
work with data lakes that they see that a little
bit differently.
An I’m not going to speak for them.
But I just know that my definition is there definition
differs a little bit because you can achieve an awful
lot of what an enterprise data where warehouse does for
you in delivering those results that answer the question.
How are we doing today?
You can achieve a lot of that just going by
querying directly querying the data in a data Lake and
the same can be said for an 0 DS so
don’t. Don’t don’t you know,
trying to mix and match here and draw some distinctions.
I wish I could share with you the picture in
my mind.
But maybe like to show notes,
so do our best. So what is that?
I’m one
of the things is that someone on my side where
I kind of see.
I see things for more.
The data Lake POV just want to place the for
stuff to land weather that’s coming,
streaming data, iot or from various types of data stores.
You just want to place to put this stuff absolutely
so one of the things I’ve seen in architectures is
that at least kind of in more modern architectures,
because historically I think data warehouses have been used like
data lakes,
but one of the things that at least data warehouses.
Half they still assume a schema,
right? Like you still have to have a schema right
for it to be in a data warehouse so,
but so if you have completely unstructured data.
Uh, you still dealing with the primacy of the schema
rise at that.
Failing, it
is an you know what I was going to share
too,
is just thinking through this a little bit more.
You know, one of the distinctions that I’ll make from
certainly from a relational data warehouse.
Between that in an operational data store is,
you know, there’s a little bit of hybrid here,
and let’s just kind of.
I’m going to take Cody S out of the picture
for just a minute.
I’m going to talk about staging from an extract transform
and load or ETL.
Or data engineering data integration,
whatever you want to call it,
our perspective. Whatever the format that date is in,
when it starts its track into into whatever we’re going
to end up with with an EDW.
When it begins that the very first thing I like
to do,
and that is an area where I specialize is going
to get a copy of that data into some central
location.
Now it could be that that central location is is
a data Lake Ann,
often in modern data warehouses they are.
It goes into a data Lake,
but it could also be a relational database,
or it could be a just a centralized collection of
files.
And there’s a lot of things that you can define
here,
but there’s this concept of it at the very first
step,
collecting and staging. It’s all here now.
And I still do that when I’m designing ETL solutions
for customers.
I get it to that first step that first place
first and I want to get it all there and
I mostly deal still with relational data warehouses that go
to tabular models or cubes.
You know, some sort of analytics solution and you know
when I do that,
Frank if I’m, let’s say I want to load data
from files into a SQL Server.
I load that data, it’s just as as constraint free
as possible,
so if it exists in the file I want to
read whatever is in that field and bring it into
a table.
What we would call a heap in.
In Relational Database management that if there’s any constraint on
it,
it’s some sort of identity column.
That’s just, you know, counting the rows.
Basically enumerating the Rosen that’s there just to make sure
nothing.
No row is identical to any other Rd.
And that would be the only constraint I would place
on that.
And I would do the same for a data Lake.
You know we were pumping data into a data Lake.
I want to get all of the data from all
of the places into this stage first.
So that’s that’s kind of the very first step that
I do.
After that, we then began applying what you just said.
We format it into a schema,
we make it make sense the very first thing I
would do if I was staging it into usaia SQL
Server table a heap.
The very next step would be read the heap and
take these these text columns that contain dates an numeric
values and try to fit them into date an numeric
fields. Right, I would make that attempt an if something
happens and it won’t fit.
Maybe one of the date columns has February 30th in
it.
You know, maybe somebody fat Fingered something and put the
13th month in there.
Anything could happen. I’m going to redirect that row and
try to store that data,
but I won’t want to human to look at it
and that could happen in a data Lake.
It could happen in loading heaps tables from anything else,
but the idea is I want to pick up whatever’s
out there,
get it into this engin that I can then apply
these rules to like in my very first rule is
as strong typing.
Want to make sure that I get all of the.
Values that are valid out of there and I also
want to mark the invalid values for later.
So and eventually we flow down this to the end
and we get to a place where you know we’ve
got a collection of all of the values we’ve ever
loaded, and we then fit our new data that are
new load.
We fit that new data into that prospect,
but not before going through that staging process.
Then some cleansing, and I would first step in cleansing
for Maine is strong typing using.
There’s a second step that minimally minimal of another step
that.
Apply soft rules so if I’m loading claims data for
an insurance company and I’ve gotta clean initiated date.
I may have a claim closing date or settle date.
I want to check a software would be the claim
settled,
8 is greater than the claim initialize date.
You know stuff like that.
That makes sense. So kind of like common sense rules.
So so I guess you said you want to remove
the constraints.
So that probably plays into the fact that you hear
the term denormalization being used a lot is that.
Is that what they’re
referring to somewhat? Usually I applied the normalization when we
get to the end of a relation of funnel here,
where I’m loading ETL stuff and what I’m after there
is is flattening out the data so that it’s ready
for the consumption.
In these reports that tell me how’s my enterprise doing
today.
And you get various various and sundry opinions about D
normalizing data.
If you think about third normal form something you mentioned
earlier,
the premise behind cards rules of.
Normal forms at normalization is that one of the principles
is you only want to have data live in one
place so this value you want in a single location
you want to copy it in 14 different places or
even even 2 you want to be able to keep
it in one location because if that data needs to
be changed or if we make a change to it
we want to go hunt it down.
Here and also also one
of the Canonical examples is an address right so I
think this is really to me anyway this underscores kind
of what’s why the 2 philosophies on that.
Are important and why they’re both compatible in a sense
so if you have an address if I if I’m
more like an online transaction?
Platform right OLTP. Or processing that’s what the TV stands
for not toilet paper although people supporting it.
But but but if you’re dealing with like a real
time transaction you want to know the customers current address
exactly right
but if you want people kind of.
Now go ahead
but if you want to know historically if you’re doing
reports on where customers have sent stuff or what their
dresses have been then you would want to buy definition
you kind of have to store those address in more
than one place that this order was placed On this
date and will send to that address as opposed to
this is where customer ex lives today they live in.
This address.
And it’s a great example an there’s yet another use
case some people own multiple homes and so they have
multiple current address is and this is where you start
getting into things kind of getting interesting I’ll say it
that way in data warehousing but that’s if you look
at the percentage of all of the people who have
an address in the world most of them by far
have a single address at a time and they go
through this but you do definitely have to account.
For those customers who may have something shipped to say
one location and kind of like going back to Amazon
this is happened to me I own you know I
have one address only one home well me in a
mortgage company but I have been on vacation and ordered
from Amazon and had it shipped to the address where
I was you know overnight.
There as well so it’s while it look this is
one of the things that looks simple that you know
it’s not so simple and a friend of mine and
coauthor tell Mitchell he’s ETL specialist San Tim actually blog
he wrote a really good blog post about you know
how fuzzy some of these concepts can be and I
believe the title was what is a day.
You think about if I called yeah well you know
at tenmitchell.net if you want to check that out and
search for it he brought up some really good use
cases you know what is a dangerous thing sure and
we think that simple no it’s not.
And part of what happens and I kind of skipped
right over this is when I’m collecting data from these
different locations.
But I’m doing ETL it could be somebody has a
warehouse in Liverpool england and they’ve got a bunch of
warehouse locations in Canada.
And the US in Mexico and maybe even in East
East Asia wherever they store dates in different formats.
Yeah so at I could have what looks like a
13th month right I could have 13 dash 03 dash
2020 you and I know because we work with us
so much that that’s referring to the 13th of March
but in the United States we would reverse the first
2 we would have 03 dash 13 dash 2020 and
one of the one of the goals of data warehousing
or even to DS is to collect that data and
then pick a format it doesn’t matter which just pick
one. And make that perform at your going to use
for dates the same goes for other measurements that you’re
using and and it can go beyond that the weights
in some locations are going if I set up a
something that includes how much something weighs in Germany I’m
probably going to get kilograms but if I do that
in wiscconsin I’m going to get pounds that’s against Wisconsin
nothing against Germany it’s just different.
So we have to convert so that when we do
comparisons on how much things weigh it doesn’t matter which
measure we picked by the way we did it just
needs to be consistent so that we can compare apples
to apples and we’re trying to tell which is heavier
something that weighs 2 pounds or something to weighs 2
kilograms going to get different answers.
No that’s a good point that’s a good point and
there’s all those sorts of things and and you right
when you when it comes to you kind of narrowing
down what do you mean by this and I think
one example that’s not necessarily data related was I was
on a mapping project it was basically for a hotel
chain in they wanted to map out local points of
interest right in San Francisco.
And this was all done in Silverlight Maps and all
that very cool 3 D work although doesn’t work anymore
but take your word for it was awesome.
But uh one of the questions that I had and
I worked with the work.
I work with the manager of the hotel is you
know they wanted to be able to put pins on.
You know various points of interest now for some things
that’s easy but when it came to like a park.
Well where’s the interest of the park where is the
park really like right I think was Golden Gate State
Park or whatever it was like well where is Clark
where do you want that pin to be do you
want to be at the center of the park do
you want to be at the gate or do you
want to be the part that was ultimately they went
with the part that was closest to the hotel so
there was not was a little bit less but I
mean but I mean like you know how you define
where something is precisely.
That’s that’s a you know when that ultimately I think
it should come down to what’s the problem you’re trying
to solve which in this case was you know how
do you get there and sort things so it’s an
interesting problem that data modeling and as well as geospatial
stuff kind of
does
get into sure yeah it really is and you know
when you start looking at that people could say well
your fudge Ng you know the answer because maybe there’s
only 1 entrance to the park and you want to
show who’s closest and you happen to be closest to.
You know the complete opposite end of the park and
yet if your hotel you may want to advertise Hayward
closer to the park than anyone else even though you
can’t get in that way.
You know it’s not in
this case I am working with them they wanted to
make sure that that it was the entrance closest to
them was like what they
were going for. Yeah totally fair
remember. So go ahead drink now go ahead.
Oh real estate people play this game all the time
like you know they say their their minutes in New
York city.
OK 90 minutes 10 minutes.
Yeah you know I once looked at a property back
when I lived up there they said that they were
10 minutes of New York City and it was in
Bayonne and I’m like that’s not possible turns out remember
Staten Island is technically part in New York city so
you were 10 minutes in Staten Island no knock on
Staten Island I was born there so you know respect
to shaolin Wu Tang would say.
Uh if you have no idea what I’m talking about
that’s fine I’ll explain it another day.
But uhm no I mean it wasn’t true yes was
it accurate now yeah but wasn’t meaningful now ’cause most
people wouldn’t say New York City they you assume Manhattan
and right. Now at that was closer to 60 minutes
or 65 minutes which again is minutes away.
It’s just yeah. Now a personal pet peeve
sorry I don’t know it’s it’s accurate and it’s like
you said yeah I play this game all the time
with data the difference between accuracy and truth an and
really in the middle there’s this whole other thing where
it’s the art of communication right an the fact that
the book how to lie with statistics was written in
the Forties. Is a clue not new yeah people have
been playing these games with statistics anan our language and
the exactitude of our language a lot of people throw
off on King James English but it was very accurate
so way more precise than what we use now and
you know there’s all of this science that goes into
literally and into tricking someone into believing what you want
them to believe while presenting non truth using accurate terms?
And one just one of them went Rick is anchoring.
And if you haven’t read about anchoring it is a
you know it’s you can just suggest any number it
has nothing to do with anything else but if that
number is. Is a high number and then you ask
someone to estimate some other thing that it’s been shown
it’s been proven over and over again they will guess
higher? On that estimate and if you then anchor them
with a very low number and do the same trick
you know different group of people may be different day
they’ll guess consistently lower the estimation is is anchored to
the number they previously heard.
And it’s just. It’s a scientific fact about the way
the human mind works,
so. You know you can.
You can see this, you know we see this in
data is probably.
If these days more more so maybe than than before,
is we’re looking at statistics dealing with the pandemic an
A friend.
Andrea Benedetti, who is as a Microsoft employee power BI
specialist,
pretty sure he was an MVP before he joined Microsoft.
He lives in Italy and he’s been posting at AKA
Dot Ms Slash Kovid report and he’s been updating this
almost daily with numbers from around the world about that
he he posted about a week or so back that.
When he stops and thinks about the numbers and realises
these our lives,
especially the lives that are lossed It,
yeah, it’s almost what he was saying was like you
almost can’t do this kind of work and keep that
in your mind at the same time you as.
I don’t know the right word for this,
but in my opinion it’s a Noble thing to do.
It’s like. When I hear about jaded doctors and nurses,
my first thought these days is what did it take
to get you to Jay?
Did you know? It’s like if your options are functioning
jaded or not functioning in that role?
Especially in these days Frank?
I’ll take jaded. It’s you don’t understand,
you know it’s really easy to set back and arm
arm chair quarterback and say now you know you should
not disengage your emotions.
You should just do this job and my ex wife
my ex-wife said neonatal intensive care unit nurse.
And you know, I’ve had some insight into what this
can do when someone loses a patient,
especially in infant. It’s crazy hard and to ask people
to just now.
I want you to keep feeling everything that’s like.
I don’t think you understand really what you’re asking folks
to do.
And Andrea mentioned something like that.
You know these numbers are lives.
These numbers are people with
two things, one on the numbers aspect.
We’ve addressed this before. And the story is you can
go back and listen to it.
We’ve been recording almost three years so crazy that we
have that much content,
but you have done nothing about collecting data about the
hurricane and all that.
And you know, I kind of set offline.
These are people’s lives, and yeah,
kind of story behind this was when I am a
survivor of the attack on the World Trade Center on
9/11.
Don’t want to go down that rabbit hole right now.
Right, but I was sharing my pictures that I took
with the with researchers and kind of people that were
doing the fire research over at Nest.
No. There was this moment where I didn’t respond to
the guy in awhile and.
You know he was he he I basically said sorry,
I’ve been to like maybe about 7:00 or 8 funerals
this week because you know what happened and right.
And then he kind of wrote back because,
you know, it’s really hard.
I forget sometimes that when I look at the pictures
of the burning building,
I’m looking at peoples, basically death scene.
And you know he was kind of like the funeral
thing,
kind of. Carlo brought him back to that like these
are people’s lives like and you’re right.
I mean, one of the things I did when I
was in college.
I got halfway through college.
I originally went to college to be an engineer chemical
engineer,
then switch to computer science.
We talked about that in a previous episode of the
humorous aspect of that,
but midway through I thought,
well, when I was, maybe I want to go to
pre Med and I became an EMT and stuff like
that and it’s a hard job because you really have
to. Kind of mentally insulate yourself from what’s happening around
you.
Absolutely the hardest part of that job was really emotional
toll because you can see somebody injured and kind of
think about it.
Almost like a car, like not a living thing like
you have to repair this.
You have to Patch that up.
That kind of helps you do it,
but then like what kind of sticks with you is,
why would one human being being do that to another
person,
and that psychological aspect? So I’ve always had respect for
the medical community because they can kind of do that,
but I think if you know that separation of kind
of what you’re seeing from what it means,
I think is a core survival instinct that I think
all people have to a certain degree.
But the medical people have really perfected it.
Well
and you know the whole concept there.
I think we’re dancing around empathy and being able to
not turn it off,
but maybe turn it down.
So that you could continue to function logically and walk
through the steps,
especially in the case of an EMT and Frank.
I say this being the type of person who cannot
do that.
It is not possible for me to do that consciously,
but I had this one experience went through.
Of course I worked in manufacturing back in the 90s.
I was a plant electrician.
We had to go through first aid just in case
and late one night I’m on 3rd shift and we
had a medical emergency occur now.
Keep in mind that at the end of the medical
training,
the nurse registered nurse had delivered this training to us
first aid.
In such she says to Maine Andy,
if somebody gets hurt, you take a flashlight and go
outside and wave down the ambulance.
Make sure they know how to get into the plant
parking lot because you’re not going to be very helpful
and it they weren’t wrong.
You know just listening to my responses to this,
but here’s this did happen.
Somebody got hurt and they were potentially in a life
threatening situation and.
I guess it was my old army basic training.
First aid kicked in or something,
but as soon as I realize the first thing I
just saw the person I was the first one to
see him come out of the place where they had
been hurt. I won’t explain in detail,
but I could tell they were hurt.
It was very easy to glance at this person to
realize they were hurt bad.
The. I remember thinking Frank,
I’m the only one here who can help.
Right right they rest of the people were working on
the other side of the building.
It was a large facility.
And when once I had that thought,
it was like the entropy entropy.
It was, it was like the you know the emotional
connections.
They all, then empathy. They just turned off and I
just went into gear.
And realizing that I was able to help him,
I would say I saved it live.
I don’t think so, but I definitely helped you know,
by getting to the first calling 911 and doing first
aid.
So it you know I I’ve had that experience,
so I can kind of relate to just let you
know a thousandth of a percent of what it must
be like.
Being in Madison on a normal day.
And these are not normal days.
Now it’s true. It’s true,
it’s uh. Now, God bless people who can do that
and who are doing agreed,
agreed. So, uhm, so one last question on the data
warehouse deep dive.
I think we’ve kind of danced around about the normalization,
but star schema versus Snowflake Schema.
Yeah, so I consider snowflakes to be actually a component
of star.
It’s a way to represent hierarchies in a star schema.
You can have a Snowflake Demention.
A great example of that is found in the adventure
works database,
where they have products that are have sub categories and
sub categories are related to categories.
When you load that date,
of course you want to start at the top of
the hierarchy,
load categories first because there could be a brand new
category that has a brand new sub category that has
a brand new product.
If you try to load up from the bottom which
you want to do is you want to kind of
grab these keys that are artificial,
we call them sorted keys.
You want to grab that from the next level up
on the hierarchy,
so you have to load it from the top down
to get that in an.
And adding confusion to this these days is there is
a data database engine called snowflake
and it’s going to say totally
online engin and I am not an expert in it
but I do know just from working closely with someone
who is an expert in it that it’s really fast
for loading large amounts of data as long as you’re
not trying to maintain like updates to that data it
gets painfully slow when you try to update data in
that platform again. There are advantages and disadvantages and the
truth is Frank there’s some data that you never update
like you never hear the weather person say yeah I
know we said yesterday that I was 55 Fahrenheit but
it was really 56 that that never happens.
That data is set and it’s recorded that way whether
it’s right or wrong it’s recorded that way for that
time financial currency exchange rates same sort of thing they
pick a daytime some agency runs the algorithms and bam
you don’t ever hear them changing that that’s not updated
so if you’re loading that kind of forward only type
data. Any kind of platform that’s good at inserts is
going to be really good at managing that type of
data so and there’s other ways to?
Attack updates as well. You can do these transactional things
where you have a from date.
Anna two date appended to the data and you can
convert.
Then you’re you know maybe data that is being updated
the way you update it is you put and it’s
from this state.
Time to this other day time.
Maybe it is updated from the last time,
but you’re inserting a neuro every time that there’s an
update and you see this in accounting generally accepted accounting
practices.
Double entry. You will entry ledgers there done that way
all the time,
every time. Even if there’s a mistake.
Even somebody goes to the bank and takes out,
you know $200 an. You know they only get 100.
Don’t realize it till later and they have to adjust
that what you’ll see you won’t see them erase the
200 and put 100.
They will add another entry that says there’s an extra
$100 in that account.
That’s the way that accounting works on paper.
And again it’s an update.
But it’s a transactional kind of list or log if
you will.
That says no, we just add a new transaction to
fix the mistake.
So it’s kind of like H base.
That way where one state is in the Ledger,
it’s immutable. Absolutely
yeah, that’s a great way.
Much better way to say that then I did.
Jesuit education, I guess. I got I got a bag
full employee
dollar words that’s funny. But a lot of ways to
attack it.
I’m sorry, go ahead. Go ahead.
I was just going to say there’s a lot of
ways to solve some of the same problems that we’re
doing,
and we can end up playing to the strengths of
the software that we’re using all the platform that we’re
using.
We always want to do that because speed is King,
especially these days or Queen.
It’s true. That’s true. So what are your thoughts?
Is a light as a longtime kind of data warehouse
data engineer?
What are your thoughts on synapse Azure synapses?
This meant to be kind of the next generation of
SQL data warehouse.
I have seen presentations where I’ve kind of seen the
bigger picture and it’s fascinating,
but what are your thoughts about that?
So I’ve not yet used Azure Synapse and.
Can’t even get access to it and I like to
think I’m one of the cool kids.
I’ve to my to my discredit.
At this point. I haven’t yet attempted to gain access
to it,
so I’m looking at the looking at it purely based
on marketing information and we know marketing information is usually
accurate.
But we had that discussion earlier
about accuracy versus true. There’s
a caveat hanging off that so having not.
They’re sort of hanging on the wall and throwing the
switch.
I don’t know enough to really give you experience based
feedback on that,
but what I can tell you from just kind of
a general architectural overview,
I have seen enough information not only from marketing people
buy something from some practitioners,
and the way they are approaching solving the problem they’re
trying to solve it very much appears to be a
next generation.
Type thing and Frank. I’ve written a couple of books
with design patterns in the in the title,
but one two editions of the same book.
SSI S design patterns and I think about that a
lot when I’m when I’m speaking architecturally,
Ann. I use this analogy to describe just the concept
of design patterns.
You can think about it as like Legos,
right? Everybody, maybe everybody has played with Legos.
I did, and you build that you’re going to build
a Lego wall.
So you put a few legos.
You know, bottom up against each other,
and that’s your base. And then you build your next
layer on top of them an you know another layer
on top of those.
And before you done, you know you’ve got 3,
maybe 8 Legos High. You’ve got this fall.
I picture design patterns that way,
and elegant patterns lend themselves to that sign.
That kind of of. His self similarity just at different
scales and you know,
the more you know, the more you’re able to do
that.
Once you’re able to take these basic functions like storing
data,
reporting data. You know data acquisition.
You know all of these pieces once you put these
together?
That’s kind of like your base bottom part.
But then if you can add a little bit of
automation to each of those,
now you’re building on that second layer,
an maybe on that. Thirdly,
and I think that’s where synapses.
I think it’s. I think it’s at least #2,
maybe third layer where they’re starting to group these together
under a common interface and kind of mask some of
the complexity that happens beneath that.
And I’m totally all for that.
I’m an automation freak, you know this.
And if that’s what it looks like to me,
it looks like they’re trying to simplify this stuff that
could be challenging.
Does that mean challenges won’t exist in it?
Gonna snow? They will, and you’ll still need expert help
in these other areas here.
And you know, in specific areas,
but you may not need it.
You may not need as much as you did before,
So what? what I like about this division of Labor
is you replying expertise only where you need it.
And In other words, another way to look at it
is.
You’re not hiring an expert to come in and do
this very basic stuff for you.
You’ve automated that. So you don’t need to pay an
expert expert wages to do that.
Pretty, you know what we consider kind of menial for
the expert.
If you automate those types of things away and there’s
a slew of that available in tons of examples,
business intelligence, markup language is one for generating SSI S
packages.
Staging is a pretty simple pattern.
Read the data and write it into this heap as
I talked about earlier,
that’s a pattern that doesn’t.
You know you don’t want to pay somebody a Virginian
Dollars an hour to do that.
You get a junior person to do that type of
work.
Even better, you pay someone who has expertise in Bemol
Business Intelligence Markup language and they just build it all
for you through automation.
and I got some wild stories about that.
My biggest anecdote is idea.
10 1/2 months of work in 3 1/2 days.
So it’s that kind of game changing stuff,
but again, as long as you can get,
you know, as long as you create an interface that
masked the drudgery of repetitive work away,
then you only need experts for the places where you
need expertise and that can change the game.
It changes economically and it changes it from a technological
standpoint as well so.
That’s what appeals to me about.
The tidbits I’ve gleaned from listening to people talk about
synapse and looking at a couple of,
you know, a couple of demos.
I’m as an MVP. I get very honored to be
able to do these product group interactions with Microsoft and
attended A at least one of those.
I want to say maybe 2 and we get to
see a little behind the scenes.
Not much. It’s not as much as you think,
but you know we’re talking to the team.
Not and nothing against the marketing people.
They’re doing their thing and God bless him.
We need him. But I’m talking to the engineers.
You know,
I’m watching the music, making those decisions about what’s possible.
What’s practical, exactly. You definitely get better or clearer picture
of what’s going on.
See yeah, I
would say different. It’s nothing again.
Marketing is doing its thing and I’m not a marketing
person.
There we go. Yeah, I agree with you and it
is definitely clearer and it makes more sense to me
’cause I’m an engineer so I get it.
But that’s options would say the main nerve.
There we go. And so you know,
in a nutshell, I would say Azure Synapse based on
again my limited exposure to it appears to be one
of these next level jumps and I refer to these
often is tectonic. They’re laying down a whole new layer
here,
you know, and they’re building on like on my my
Lego example.
They’re building the next layer of the wall,
if that’s what it looks like those are.
You know they used to be every decade or so.
We have that sort of stuff,
but it’s accelerating, so now I’m seeing a couple of
three times a decade where we get this new tectonic
shift in what we can do,
and automation that’s built on top of the previous layer
of automation.
That’s how I see it.
Don’t know if that helps or hurts.
Not definitely helps. Uh, so with that were at the
hour?
Uh, just want to point out.
I also have a meeting.
I gotta go onto. OK,
yeah? We had a great session here,
hopefully this answer some of your data warehousing questions and
if you want we could do another deep dive.
And let us know what you think of the show.
If you’re interested in iBooks or not iBooks audiobooks,
audible is a sponsor, so go check it out.
I should probably see if there’s any good data warehousing
books on audible.
Yeah, I don’t know a lot of them have to
be written,
I think because
the example I don’t know it would be interesting to
see if there are all.
I’ll do a search if.
You are not already inaudible subscriber and you would like
to get one free audio book on our dime,
or actually it’s audibles time and you go to the
data driven book com routes you to inaudible page if
you get a subscription.
After that they toss us a dime or two and
help support the show.
Uh, particularly given that my monetization strategy has been primarily
driven through.
Amazon merch, which has been shut down?
Yeah, every little bit helps.
It’ll be back. You spring is still shipping.
Sorry true, very true that design that.
I got in trouble with.
I actually have one printed version on the way so.
It’ll be fun anyway. I’m not going to post it
to Instagram,
I’ll just put it that way,
but with that in mind,
um, have a great day and stay safe out there.
Any any other parting thoughts?
Now that’s perfect. Awesome and will let the nice British
Lady and the show.
Thanks for listening to data driven.
Don’t just listen. Become a data driver by going to
datadriven.tv to sign up to join the community.
Access to special events, tips and tricks and more.
Sign up today
at datadriven.tv.