How Can Someone Become a Data Scientist?
The real data scientists, the high-end data scientists,
are mostly PhDs.
They often come out of physics, out of statistics,
they have to have a computer science background,
they have to have a math background,
they have to know about databases and statistics
and probability and all that stuff.
I think the first skill you need is
to know how to program,
at least have some computational thinking,
so having taken a programming course.
You need to know some algebra, at least up to analytic
geometry, and hopefully some calculus,
some basic probability, some basic statistics,
I mean, you really have to understand the differences
between statistical distributions, and databases.
I mean, one of the easiest places to start
is relational databases, which store lots and lots
of our data, so people can walk before they run
by at least understanding computers and databases
and how we store things. And if you understand
relational databases, nowadays you can still,
just with that understanding, use big data clusters
as if they were just a big relational database.
You don't really have to understand the whole
MapReduce programming model.
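
The point above can be illustrated with a minimal sketch. It assumes an in-memory SQLite database as a stand-in for a cluster's SQL layer, and the table name `rides`, its columns, and the sample rows are all hypothetical, invented only for illustration. The same per-city total is computed twice: once as a declarative SQL query, and once as hand-written map, shuffle, and reduce steps.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (city, fare) pairs invented for illustration.
ROWS = [
    ("Austin", 12.0),
    ("Boston", 8.5),
    ("Austin", 7.5),
    ("Boston", 11.5),
    ("Chicago", 20.0),
]


def totals_with_sql(rows):
    """Relational view: state the result you want and let the engine plan it."""
    conn = sqlite3.connect(":memory:")  # SQLite stands in for a cluster SQL layer
    conn.execute("CREATE TABLE rides (city TEXT, fare REAL)")
    conn.executemany("INSERT INTO rides VALUES (?, ?)", rows)
    cursor = conn.execute("SELECT city, SUM(fare) FROM rides GROUP BY city")
    result = dict(cursor)  # each row is a (city, total) pair
    conn.close()
    return result


def totals_with_mapreduce(rows):
    """The same aggregate written as explicit map, shuffle, and reduce steps."""
    # Map: emit one (key, value) pair per input record.
    mapped = [(city, fare) for city, fare in rows]
    # Shuffle: group all values that share a key.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    # Reduce: collapse each group of values to a single total.
    return {key: sum(values) for key, values in grouped.items()}


if __name__ == "__main__":
    print(totals_with_sql(ROWS))
    print(totals_with_mapreduce(ROWS))
```

Both functions return the same totals. The SQL version only needs relational knowledge, while the second version spells out the steps a MapReduce-style job would distribute across machines.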
But then, as you go further up in the field,
you have to know a lot of computer science theory
and statistics and probability.
It's really the intersection of them
that the high-end data scientists,
the PhD data scientists, work with.
(music)
I do a lot of self-learning.
I think everybody does these days.
I mean, I learned about Hadoop all by myself,
I read some articles, I watched some videos,
I thought, I played, although I'm a builder,
I'm a tinkerer, so if I wanna figure out
how to do something, I build it.
I mean, my first HPC cluster:
I heard about this term, a Beowulf cluster,
and I thought, what the hell's that?
So I looked it up and said, oh,
it's just a bunch of computers hooked together
with a TCP/IP network, that's pretty easy,
so we got a grant from Citibank
and we built a five-computer cluster and I said,
oh, well, that's HPC.
I said, I had one of the first HPC clusters
at the university, it was tiny but a lot of our
researchers loved it because they could run stuff
40 or 50 times faster.
So I think one of the ways you learn things is you do them,
you have to do them, and these online learning platforms,
especially now that we have things like IPython
and Jupyter Notebooks and, I guess, Zeppelin,
mean that you can actually go in
and take some of these courses
and you can do things right then
and you can see them and feel them and play with them
and, at that point, you know, you'll start to get your
head around what is actually happening.
Motivation is the key problem in all of these:
how to keep people motivated.
And I think the badge system that, what was it,
Big Data University has is one of the ways
to get people to keep going through.
But if they want to, they can.
It's up to the individual.
So they have to understand what the goal is.
(music)
The place data science can't sit
is probably under the CIO, the Chief Information Officer.
The current chief information officers in many companies
got there from an accounting background
or a finance background; they're clueless.
Sorry.
But really, it has to come out of the research side.
So you'll find data scientists primarily in companies
that have some research agenda, pharmaceuticals,
finance, any technology company.
We can't keep some of our
PhD data scientists in our program;
they are now at Facebook,
they're at LinkedIn, they're at Uber, they're at Lyft,
because the demand out there for PhD-level
data scientists is just unbelievable.
They make large amounts of money
and they're playing with problems
that are really, really neat.
How do you schedule the Uber cars?
You have enormous amounts of data.