Do You Speak Data?

March 13, 2015

Kenny Darrell

To Whom Does This Apply?

  • The Project Sponsor - Represents the business interests; champions the project.

  • The Client - Represents end users' interests; domain expert.

  • The Data Scientist - Sets and executes analytic strategy; communicates with Sponsor and Client.

  • The Data Architect - Manages data and data storage; sometimes manages data collection.

  • Operations - Manages infrastructure; deploys final project results.

From Practical Data Science with R

Goal

When we speak of data, we are too ambiguous

I argue that we should add some precision

What would this take and what value would it provide?

A Data Scientist, A Project Manager and Client walk into a bar meeting

An analyst on the phone says, "Just got the data and I'm currently cleaning it."

The client hears, "I'll have my results by COB!"

The PM hears, "Great, we can finally start working."

The data scientist hears, "Making sure this is even the data we wanted!"

There is no punchline; that was not a joke.

Could we talk like a recipe about data?

example

Somebody with very little knowledge could recreate this

What This Talk is Not

A method to do data science in any way, shape, or form

English 101

Vocabulary

noun

Words used on a particular occasion or in a particular sphere.

The body of words known to an individual person.

Grammar

noun

the study of the way the sentences of a language are constructed; morphology and syntax

an account of these features; a set of rules accounting for these constructions

knowledge or usage of the preferred or prescribed forms in speaking or writing

the elements of any science, art, or subject

Syntax

noun

the arrangement of words and phrases to create well-formed sentences in a language

a set of rules for or an analysis of this

Language

noun

a body of words and the systems for their use common to a people who are of the same community or nation

any set or system of such symbols as used in a more or less uniform fashion by a number of people, who are thus enabled to communicate intelligibly with one another

None of those get us very far

We really need building blocks or rules.

Parts of Speech

A part of speech is a category of words (or, more generally, of lexical items) which have similar grammatical properties.

Noun

a word or lexical item denoting any abstract or concrete entity; a person, place, thing, idea, or quality

Adjective

a qualifier of a noun or pronoun (big, brave)

Verb

a word denoting an action (walk), occurrence (happen), or state of being (be)

Adverb

a qualifier of an adjective, verb, clause, sentence, or other adverb (very, quite)

Preposition

an establisher of relation and syntactic context (in, of)

Conjunction

a syntactic connector (and, but)

Article

a grammatical marker of definiteness (the) or indefiniteness (a, an). Not always listed among the parts of speech. Sometimes determiner (a broader class) is used instead

Interjections

words used to express emotional states

We have no need for such nonsense; they will not be mentioned again

Chenopodium album

Chenopodium album is a fast-growing weedy annual plant in the genus Chenopodium.

Our current surroundings: we have gotten into the weeds

This is not English class; this should be technical

A Comical Attempt

Nouns

Data

Data is a set of values of qualitative or quantitative variables; restated, pieces of data are individual pieces of information.

Data as an abstract concept can be viewed as the lowest level of abstraction, from which information and then knowledge are derived.

The word "data" used to be considered as the plural of "datum", but now is generally used in the singular, as a mass noun.

Analytical base table

I am sure there are more nouns, but they are less interesting

Adjective

The most prominent adjective, Big

As in Big Data

Buzzword - a word or phrase, often an item of jargon, that is fashionable at a particular time or in a particular context.

Tall Data vs Wide Data

Raw Data vs Processed Data

Raw Data - This is the data as it was naturally collected.

Processed Data - It has been modified from its original form in any way.

Tidy Data vs Messy Data

All data is messy!

Stricter definition

What is Tidy Data?

A dataset is said to be tidy if it satisfies the following conditions

observations are in rows

variables are in columns

contained in a single dataset

Tidy data makes it easy to carry out data analysis

Causes of Messiness

There are various features of messy data that one can observe in practice. Here are some of the more commonly observed patterns.

Column headers are values, not variable names

Multiple variables are stored in one column

Variables are stored in both rows and columns

Multiple types of experimental unit stored in the same table

One type of experimental unit stored in multiple tables

The most interesting part of speech, verb, moves data from messy to tidy

Hadleyverse

Hadley Wickham is a prolific programmer; he has recreated big chunks of R’s standard library, making his versions so much better that people call it ‘the Hadleyverse’. His paper The split-apply-combine strategy for data analysis argues that much of what we do day-to-day is reshaping data. His packages help in that direction, and the Hadleyverse is a collection of R packages (dplyr, lubridate, ggplot2, etc). If you are starting with R, it’s advisable to forget about the standard lib and use these.

History

In the beginning there was reshape and plyr

These gave us access to the tidy method and the split-apply-combine philosophy.

This was all too much. Just as reshape2 did less than reshape, tidyr does less than reshape2.

tidyr is a reframing of reshape2 designed to accompany the tidy data framework, and to work hand-in-hand with magrittr and dplyr to build a solid pipeline for data analysis.

Hadley Wickham

Verbs

tidyr

tidyr is a new package that makes it easy to “tidy” your data. Tidy data is data that’s easy to work with: it’s easy to munge (with dplyr), visualise (with ggplot2 or ggvis) and model (with R’s hundreds of modeling packages).
Hadley Wickham

Gather

gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer.
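As a rough illustration of gather(), here is a minimal sketch. The grades_wide table and its column names are made up for this example; only gather() itself comes from tidyr.

```r
library(tidyr)

# Hypothetical wide table: the headers math and art are really values
# of a "course" variable, not variable names.
grades_wide <- data.frame(
  person = c("ann", "bob"),
  math   = c(90, 70),
  art    = c(85, 95),
  stringsAsFactors = FALSE
)

# Gather the two course columns into key-value pairs (course, grade).
grades_long <- gather(grades_wide, course, grade, math, art)
```

The result has one row per person and course. Newer tidyr releases also offer pivot_longer() for the same job.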

Separate

Sometimes two variables are clumped together in one column. separate() allows you to tease them apart.
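A small sketch of separate(), assuming a hypothetical enrollments table whose term column clumps a year and a season together with an underscore.

```r
library(tidyr)

# Hypothetical table: term holds two variables in one column.
enrollments <- data.frame(
  person = c("ann", "bob"),
  term   = c("2014_fall", "2015_spring"),
  stringsAsFactors = FALSE
)

# Split term into year and season at the underscore.
enrollments_tidy <- separate(enrollments, term,
                             into = c("year", "season"), sep = "_")
```

By default the original term column is dropped and the new columns are character vectors.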

Spread

spread() takes two columns (a key-value pair) and spreads them into multiple columns, making “long” data wider. spread() is used when you have variables that form rows instead of columns. You need spread() less frequently than gather() or separate()
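A minimal sketch of spread(), assuming a hypothetical measurements table in which height and weight are stored as rows rather than columns.

```r
library(tidyr)

# Hypothetical long table: the variables height and weight form rows.
measurements <- data.frame(
  person  = c("ann", "ann", "bob", "bob"),
  measure = c("height", "weight", "height", "weight"),
  value   = c(170, 60, 180, 80),
  stringsAsFactors = FALSE
)

# Spread the key-value pair (measure, value) into one column per measure.
measurements_wide <- spread(measurements, measure, value)
```

Newer tidyr releases also offer pivot_wider() for the same job.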

These verbs have a number of synonyms:

tidyr          gather     spread
reshape(2)     melt       cast
spreadsheets   unpivot    pivot
databases      fold       unfold

dplyr

dplyr is the next iteration of plyr, focussed on tools for working with data frames. It has three main goals:

  • Identify the most important data manipulation tools needed for data analysis and make them easy to use from R.

  • Provide blazing fast performance for in-memory data by writing key pieces in C++.

  • Use the same interface to work with data no matter where it's stored, whether in a data frame, a data table or database.

Intro

Single table verbs

dplyr aims to provide a function for each basic verb of data manipulating:

select

select takes certain columns from a table; you can also rename them in the process
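A short sketch of select() with a rename, using a made-up grades table.

```r
library(dplyr)

# Hypothetical table for illustration.
grades <- data.frame(
  person = c("ann", "bob"),
  grade  = c(90, 70),
  room   = c("a1", "b2"),
  stringsAsFactors = FALSE
)

# Keep person and grade, renaming grade to score on the way.
scores <- select(grades, person, score = grade)
```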

filter

filter keeps the observations that satisfy some condition, similar to WHERE in SQL

slice is similar but selects rows by position, for example keeping every second row (the even rows)
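A minimal sketch contrasting filter() and slice(), again on a made-up grades table. The threshold of 80 is an arbitrary choice for the example.

```r
library(dplyr)

# Hypothetical table for illustration.
grades <- data.frame(
  person = c("ann", "bob", "cat", "dan"),
  grade  = c(90, 70, 85, 60),
  stringsAsFactors = FALSE
)

# filter() picks rows by condition.
passing <- filter(grades, grade >= 80)

# slice() picks rows by position: here every second row (rows 2 and 4).
even_rows <- slice(grades, seq(2, n(), by = 2))
```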

arrange

arrange lets you order rows by one or more variables

distinct

distinct removes duplicate rows
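A quick sketch of distinct() followed by arrange(), assuming a hypothetical grades table that contains one duplicated row.

```r
library(dplyr)

# Hypothetical table with a duplicated row for bob.
grades <- data.frame(
  person = c("ann", "bob", "bob", "cat"),
  grade  = c(90, 70, 70, 85),
  stringsAsFactors = FALSE
)

# Drop the duplicate row, then order by grade, highest first.
unique_rows <- distinct(grades)
ranked      <- arrange(unique_rows, desc(grade))
```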

mutate

mutate lets you add new variables based on some expression applied to existing variables

transmute is similar but keeps only the new variables, dropping the existing ones
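A small sketch of the difference between mutate() and transmute(). The table and the pct column are invented for the example.

```r
library(dplyr)

# Hypothetical table for illustration.
grades <- data.frame(
  person = c("ann", "bob"),
  grade  = c(90, 70),
  stringsAsFactors = FALSE
)

# mutate() adds pct and keeps every existing column.
with_pct <- mutate(grades, pct = grade / 100)

# transmute() keeps only the columns named in the call.
only_pct <- transmute(grades, person, pct = grade / 100)
```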

sample

sample_n() and sample_frac() give ways to randomly downsize the rows of your data
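A minimal sketch of row sampling with dplyr's sample_n() and sample_frac(). The table and the seed are arbitrary; which rows come back depends on the random draw.

```r
library(dplyr)

# Hypothetical table for illustration.
grades <- data.frame(
  person = c("ann", "bob", "cat", "dan"),
  grade  = c(90, 70, 85, 60),
  stringsAsFactors = FALSE
)

set.seed(1)  # arbitrary seed so the draw is repeatable

# A fixed number of rows.
two_rows <- sample_n(grades, 2)

# A fixed fraction of rows (half of four rows).
half_rows <- sample_frac(grades, 0.5)
```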

others

I am sure there are many more; let me know if you know of a common data processing step that was missed.

Adverbs

I believe the road to hell is paved with adverbs, and I will shout it from the rooftops
Stephen King, On Writing

I am open to suggestions here; they may not exist

Adverbs of place

describe where something happens

here, nowhere

Maybe the machine, VM, server IP?

Adverbs of purpose

describe why something happens

because, accidentally

Since it comes from Excel, etc

Adverbs of frequency

describe how often something happens

always, seldom

real time, every 30 days

Adverbs of time

describe when something happens

after, during

after we get new payment data we can inner_join()

Adverbs of manner

describe how something happens

carefully, eagerly

what you do when a client is watching, carefully separate columns

Preposition

Does this qualify?

group_by(person) %>% summarise(gpa = mean(grades))
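To make the pipeline above runnable, here is a sketch that assumes a hypothetical transcript table with the person and grades columns the slide refers to.

```r
library(dplyr)

# Hypothetical input: one row per person per course, grades on a 4-point scale.
transcript <- data.frame(
  person = c("ann", "ann", "bob", "bob"),
  grades = c(4, 3, 2, 3),
  stringsAsFactors = FALSE
)

# "By person": group the rows, then collapse each group to its mean grade.
gpa_by_person <- transcript %>%
  group_by(person) %>%
  summarise(gpa = mean(grades))
```

Read aloud, the pipeline is close to a sentence: take the transcript, by person, summarise the mean of grades.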

Conjunction

dplyr calls joins and exclusions "two-table verbs"

They seem more correct here though

Mutating joins

Add new variables to one table from matching rows in another

inner_join

left_join

right_join

full_join

Support for non-equi joins is planned for dplyr 0.5.0.
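A rough sketch of the four mutating joins on two made-up tables, people and payments, matched on a shared person column.

```r
library(dplyr)

# Hypothetical tables: cat has no payment, dan has no people record.
people <- data.frame(
  person = c("ann", "bob", "cat"),
  major  = c("math", "art", "bio"),
  stringsAsFactors = FALSE
)
payments <- data.frame(
  person = c("ann", "bob", "dan"),
  amount = c(100, 250, 75),
  stringsAsFactors = FALSE
)

inner <- inner_join(people, payments, by = "person")  # only matches: ann, bob
left  <- left_join(people, payments, by = "person")   # all people; cat gets NA amount
right <- right_join(people, payments, by = "person")  # all payments; dan gets NA major
full  <- full_join(people, payments, by = "person")   # everyone from both tables
```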

Filtering joins

Filter observations from one table based on whether or not they match an observation in the other table

semi_join

anti_join
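A minimal sketch of the two filtering joins, reusing the same kind of made-up people and payments tables. Unlike the mutating joins, no columns from the second table are added.

```r
library(dplyr)

# Hypothetical tables: cat has no payment on record.
people <- data.frame(
  person = c("ann", "bob", "cat"),
  major  = c("math", "art", "bio"),
  stringsAsFactors = FALSE
)
payments <- data.frame(
  person = c("ann", "bob", "dan"),
  amount = c(100, 250, 75),
  stringsAsFactors = FALSE
)

# People who have a matching payment.
paid <- semi_join(people, payments, by = "person")

# People who have no matching payment.
unpaid <- anti_join(people, payments, by = "person")
```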

Set operations

Combine the observations in two data sets as if they were set elements

intersect

union

setdiff
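A short sketch of the set operations, assuming two hypothetical tables with identical columns. Loading dplyr lets intersect(), union() and setdiff() work on data frames.

```r
library(dplyr)

# Hypothetical tables with the same single column.
last_month <- data.frame(person = c("ann", "bob"), stringsAsFactors = FALSE)
this_month <- data.frame(person = c("bob", "cat"), stringsAsFactors = FALSE)

in_both   <- intersect(last_month, this_month)  # rows present in both
in_either <- union(last_month, this_month)      # rows in either, duplicates removed
only_last <- setdiff(last_month, this_month)    # rows in the first but not the second
```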

More Verbs!

expand, extract, unite, unnest
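As one example from this list, here is a minimal sketch of unite(), which is roughly the inverse of separate(). The terms table is made up for illustration.

```r
library(tidyr)

# Hypothetical table with year and season in separate columns.
terms <- data.frame(
  year   = c("2014", "2015"),
  season = c("fall", "spring"),
  stringsAsFactors = FALSE
)

# Paste year and season together into a single term column.
terms_united <- unite(terms, term, year, season, sep = "_")
```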