Kenny Darrell
The Project Sponsor - Represents the business interests; champions the project.
The Client - Represents end users' interests; domain expert.
The Data Scientist - Sets and executes analytic strategy; communicates with Sponsor and Client.
The Data Architect - Manages data and data storage; sometimes manages data collection.
Operations - Manages infrastructure; deploys final project results.
When we speak about data, our language is too ambiguous
I argue that we should add some precision
What would this take and what value would it provide?
A Data Scientist, a Project Manager and a Client walk into a bar meeting
An analyst on the phone says, "Just got the data and currently cleaning it."
The client hears, "I'll have my results by COB!"
The PM hears, "Great, we can finally start working."
The data scientist hears, "Making sure this is even the data we wanted!"
There is no punchline, that was not a joke.
Could we talk like a recipe about data?
Somebody with very little knowledge could recreate this
A method to do data science in any way, shape, or form
Vocabulary (noun)
Words used on a particular occasion or in a particular sphere.
The body of words known to an individual person.
Grammar (noun)
the study of the way the sentences of a language are constructed; morphology and syntax
an account of these features; a set of rules accounting for these constructions
knowledge or usage of the preferred or prescribed forms in speaking or writing
the elements of any science, art, or subject
Syntax (noun)
the arrangement of words and phrases to create well-formed sentences in a language
a set of rules for or an analysis of this
Language (noun)
a body of words and the systems for their use common to a people who are of the same community or nation
any set or system of such symbols as used in a more or less uniform fashion by a number of people, who are thus enabled to communicate intelligibly with one another
We really need building blocks or rules.
A part of speech is a category of words (or, more generally, of lexical items) which have similar grammatical properties.
Noun - a word or lexical item denoting any abstract or concrete entity; a person, place, thing, idea, or quality
Adjective - a qualifier of a noun or pronoun (big, brave)
Verb - a word denoting an action (walk), occurrence (happen), or state of being (be)
Adverb - a qualifier of an adjective, verb, clause, sentence, or other adverb (very, quite)
Preposition - an establisher of relation and syntactic context (in, of)
Conjunction - a syntactic connector (and, but)
Article - a grammatical marker of definiteness (the) or indefiniteness (a, an). Not always listed among the parts of speech. Sometimes determiner (a broader class) is used instead
Interjection - words used to express emotional states
We have no need for such nonsense; interjections will not be mentioned again
Chenopodium album is a fast-growing weedy annual plant in the genus Chenopodium.
Our current surroundings, as we have gotten into the weeds
This is not English class; this should be technical
Data is a set of values of qualitative or quantitative variables; restated, pieces of data are individual pieces of information.
Data as an abstract concept can be viewed as the lowest level of abstraction, from which information and then knowledge are derived.
The word "data" used to be considered as the plural of "datum", but now is generally used in the singular, as a mass noun.
I am sure there are more nouns, but they are less interesting
The most prominent adjective, Big
As in Big Data
Buzzword - a word or phrase, often an item of jargon, that is fashionable at a particular time or in a particular context.
Raw Data - This is the data as it was naturally collected.
Processed Data - This is data that has been modified from its original form in any way.
All data is messy!
A dataset is said to be tidy if it satisfies the following conditions
observations are in rows
variables are in columns
contained in a single dataset
Tidy data makes it easy to carry out data analysis
There are various features of messy data that one can observe in practice. Here are some of the more commonly observed patterns.
Column headers are values, not variable names
Multiple variables are stored in one column
Variables are stored in both rows and columns
Multiple types of experimental unit stored in the same table
One type of experimental unit stored in multiple tables
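As a minimal sketch of the first pattern above (column headers are values, not variable names), the base R snippet below builds a small messy table and its tidy equivalent by hand. The data frame names, the country labels and the numbers are all made up for illustration.

```r
# Hypothetical messy data: the headers "2014" and "2015" are values of a
# year variable, not variable names.
messy <- data.frame(
  country = c("A", "B"),
  `2014` = c(10, 20),
  `2015` = c(11, 21),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# The tidy version, built by hand: one row per (country, year) observation
# and one column per variable.
tidy <- data.frame(
  country = rep(messy$country, times = 2),
  year = rep(c(2014, 2015), each = nrow(messy)),
  cases = c(messy[["2014"]], messy[["2015"]]),
  stringsAsFactors = FALSE
)

print(messy)
print(tidy)
```

Both tables hold the same four measurements; only the tidy one exposes year as a variable that can be filtered, grouped or plotted.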
The most interesting part of speech, verb, moves data from messy to tidy
Hadley Wickham is a prolific programmer; he has recreated big chunks of R’s standard library, making his versions so much better that people call it ‘the Hadleyverse’. His paper The split-apply-combine strategy for data analysis argues that much of what we do day-to-day is reshaping data. His packages help in that direction, and the Hadleyverse is a collection of R packages (dplyr, lubridate, ggplot2, etc). If you are starting with R, it’s advisable to forget about the standard lib and use these.
In the beginning there was reshape and plyr
These gave us access to the tidy method and the split apply combine philosophy.
This was all too much. Just as reshape2 did less than reshape, tidyr does less than reshape2.
tidyr is a reframing of reshape2 designed to accompany the tidy data framework, and to work hand-in-hand with magrittr and dplyr to build a solid pipeline for data analysis.
Hadley Wickham
tidyr is a new package that makes it easy to “tidy” your data. Tidy data is data that’s easy to work with: it’s easy to munge (with dplyr), visualise (with ggplot2 or ggvis) and model (with R’s hundreds of modeling packages).
Hadley Wickham
gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer.
Sometimes two variables are clumped together in one column. separate() allows you to tease them apart.
spread() takes two columns (a key-value pair) and spreads them into multiple columns, making “long” data wider. spread() is used when you have variables that form rows instead of columns. You need spread() less frequently than gather() or separate()
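The three verbs above can be sketched on a toy table as follows. The data frame and its column names are hypothetical, and the example assumes the tidyr package is installed. Newer tidyr releases offer pivot_longer() and pivot_wider() as successors to gather() and spread(), but the older verbs discussed here are still available.

```r
library(tidyr)

# Hypothetical wide data: one column per (measure, year) combination.
wide <- data.frame(
  person = c("ann", "bob"),
  height_2014 = c(150, 160),
  height_2015 = c(152, 161),
  stringsAsFactors = FALSE
)

# gather(): collapse the two height columns into key-value pairs (wide -> long).
long <- gather(wide, key = "measure_year", value = "value",
               height_2014, height_2015)

# separate(): split the clumped "measure_year" column into two variables.
long <- separate(long, measure_year, into = c("measure", "year"), sep = "_")

# spread(): turn the year key back into columns (long -> wide).
wide_again <- spread(long, key = year, value = value)

print(long)
print(wide_again)
```

After gather() and separate(), each row is one measurement of one person in one year; spread() reverses the direction when a wide layout is wanted for display.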
These verbs have a number of synonyms:
tidyr | gather | spread |
---|---|---|
reshape(2) | melt | cast |
spreadsheets | unpivot | pivot |
databases | fold | unfold |
dplyr is the next iteration of plyr, focussed on tools for working with data frames. It has three main goals:
Identify the most important data manipulation tools needed for data analysis and make them easy to use from R.
Provide blazing fast performance for in-memory data by writing key pieces in C++.
Use the same interface to work with data no matter where it's stored, whether in a data frame, a data table or database.
dplyr aims to provide a function for each basic verb of data manipulating:
select takes certain columns from a table; you can also rename them in the process
filter keeps the observations that meet some condition, similar to SQL's where
slice is similar but selects rows by position, for example keeping every second row (the even rows)
arrange lets you order the rows in some specific order
distinct removes duplicate rows
mutate lets you add new variables based on some expression applied to existing variables
transmute is similar but keeps only the new variables and drops the existing ones
sample_n and sample_frac give ways to randomly downsize the rows of your data
I am sure there are many more, let me know if you know some common data processing step that was missed.
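A minimal sketch of the single-table verbs listed above, assuming the dplyr package is installed. The grades data frame and its columns are made up for illustration.

```r
library(dplyr)

# Hypothetical data frame of student grades (the last two rows are duplicates).
grades <- data.frame(
  person = c("ann", "ann", "bob", "bob", "cat", "cat"),
  course = c("math", "art", "math", "art", "math", "math"),
  grade = c(90, 85, 70, 95, 88, 88),
  stringsAsFactors = FALSE
)

picked <- select(grades, who = person, grade)     # keep and rename columns
high <- filter(grades, grade >= 88)               # keep rows meeting a condition
evens <- slice(grades, seq(2, n(), by = 2))       # keep rows by position
ordered <- arrange(grades, desc(grade))           # order rows, highest grade first
unique_rows <- distinct(grades)                   # drop duplicate rows
scaled <- mutate(grades, pct = grade / 100)       # add a variable, keep the rest
only_pct <- transmute(grades, pct = grade / 100)  # keep only the new variable
two_rows <- sample_n(grades, 2)                   # random subset of two rows

print(ordered)
```

Each verb takes a data frame as its first argument and returns a new data frame, which is what lets them chain together with %>%.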
I believe the road to hell is paved with adverbs, and I will shout it from the rooftops
Stephen King, On Writing
I am open to suggestions here, they may not exist
describe where something happens
here, nowhere
Maybe the machine, VM, server IP?
describe why something happens
because, accidentally
Since it comes from Excel, etc
describe how often something happens
always, seldom
real time, every 30 days
describe when something happens
after, during
after we get new payment data we can inner_join()
describe how something happens
carefully, eagerly
what you do when a client is watching, carefully separate columns
Does this qualify?
group_by(person) %>% summarise(gpa = mean(grades))
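To make the grouped summary above concrete, here is a small runnable sketch, assuming dplyr is installed. The transcript data frame and its values are hypothetical; only the person and grades column names come from the slide.

```r
library(dplyr)

# Hypothetical transcript: one row per person per course grade.
transcript <- data.frame(
  person = c("ann", "ann", "bob", "bob"),
  grades = c(4.0, 3.0, 2.0, 3.0),
  stringsAsFactors = FALSE
)

# Split by person, apply mean() to each group, combine into one row per person.
gpa <- transcript %>%
  group_by(person) %>%
  summarise(gpa = mean(grades))

print(gpa)
```

The group_by() step changes how the following verb behaves rather than changing the data itself, which is the sense in which it modifies a verb.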
dplyr calls joins and exclusions two-table verbs
Calling them conjunctions seems more correct here, though
Add new variables to one table from matching rows in another
inner_join
left_join
right_join
full_join
Support for non-equi joins is planned for dplyr 0.5.0.
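The four mutating joins listed above differ only in which unmatched rows they keep. A minimal sketch, assuming dplyr is installed; the people and payments tables are made up for illustration.

```r
library(dplyr)

# Hypothetical tables sharing a "person" key; "cat" has no payment and
# "dan" has no department.
people <- data.frame(
  person = c("ann", "bob", "cat"),
  dept = c("x", "y", "z"),
  stringsAsFactors = FALSE
)
payments <- data.frame(
  person = c("ann", "bob", "dan"),
  amount = c(10, 20, 30),
  stringsAsFactors = FALSE
)

inner <- inner_join(people, payments, by = "person")  # only matched rows
left <- left_join(people, payments, by = "person")    # all rows of people
right <- right_join(people, payments, by = "person")  # all rows of payments
full <- full_join(people, payments, by = "person")    # all rows of both

print(full)
```

Unmatched rows that are kept get NA in the columns that come from the other table.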
Filter observations from one table based on whether or not they match an observation in the other table
semi_join
anti_join
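A minimal sketch of filtering joins, assuming dplyr is installed. It uses anti_join() together with its dplyr counterpart semi_join(); the two tables are hypothetical.

```r
library(dplyr)

# Hypothetical tables sharing a "person" key.
people <- data.frame(
  person = c("ann", "bob", "cat"),
  dept = c("x", "y", "z"),
  stringsAsFactors = FALSE
)
payments <- data.frame(
  person = c("ann", "bob", "dan"),
  amount = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# semi_join(): rows of people that have a match in payments.
paid <- semi_join(people, payments, by = "person")

# anti_join(): rows of people that have no match in payments.
unpaid <- anti_join(people, payments, by = "person")

print(paid)
print(unpaid)
```

Unlike the mutating joins, neither result gains columns from payments; the second table is used only to decide which rows of the first to keep.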
Combine the observations in two data sets as if they were set elements
intersect
union
setdiff
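The set operations above treat each row as a set element and expect both tables to have the same columns. A minimal sketch, assuming dplyr is installed; the two single-column tables are made up for illustration, and the functions are called with the dplyr:: prefix to make clear that the data frame versions are meant.

```r
library(dplyr)

# Hypothetical tables with identical columns.
a <- data.frame(id = c(1, 2, 3))
b <- data.frame(id = c(2, 3, 4))

both <- dplyr::intersect(a, b)  # rows present in both a and b
either <- dplyr::union(a, b)    # rows present in a or b, duplicates removed
only_a <- dplyr::setdiff(a, b)  # rows present in a but not in b

print(both)
print(either)
print(only_a)
```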
expand, extract, unite, unnest