datascience

for economics

Resources for the Data Science course designed and taught by Richard Davies with Denes Csala, Charlie Meyrick and Emilien Valat at the University of Bristol.

Examples of projects produced by the 2021-22 cohort are here.

Week 1

data portfolio

In this course we extract data from web sites, often by building our own scrapers. To prepare for that, it is helpful to understand the basics of how the web works. In the first lecture and lab we introduce HTML, CSS and JavaScript as you build and style your first web site. You will also embed your first automated and interactive charts.

Skills and concepts: Text editors, HTML, CSS, GitHub.

Week 2

live data

This week we will build two live charts, embedding them in the web site you built in week 1. The first will run directly from data provided by an API, updating itself automatically each day. The second will run from your GitHub repository. We will discuss the strengths and weaknesses of these two approaches, and how you can use them in your project.

Skills and concepts: APIs, JavaScript, JSON, Vega-Lite, Chart.js.
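As a flavour of the second approach, the sketch below fetches data from an API with Python and saves it as a JSON file inside the repository, so a chart on your GitHub Pages site can load it as a static file. The URL and the output path are placeholders for illustration, not part of the course material.

```python
# Minimal sketch: fetch data from an API and store it in the repo,
# so a chart hosted on GitHub Pages can read the saved JSON file.
# The endpoint and file path below are hypothetical.
import json

import requests

API_URL = "https://api.example.com/latest"  # placeholder endpoint

response = requests.get(API_URL, timeout=30)
response.raise_for_status()           # stop early if the request failed
data = response.json()                # parse the JSON body into Python objects

# Write the data where the chart expects to find it (path is an example).
with open("docs/data/latest.json", "w") as f:
    json.dump(data, f, indent=2)
```

Re-running a script like this on a schedule (for example with a GitHub Actions cron job) and committing the updated file is one way to keep a repository-based chart current.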

Week 3

api data

Many analysts access data by clicking download icons to get Excel or CSV files. As data scientists we want to access data programmatically, without any manual pointing and clicking, since this avoids errors and repetitive tasks, makes our work transparent and verifiable, and opens the door to time-saving automation. In our third class we access data from APIs, discussing the benefits, the pitfalls, and how to debug when things go wrong.

Skills and concepts: APIs, CORS, JSON.
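To make "programmatic access" concrete, here is a small, hedged example: request a resource, check the status code, and inspect the JSON before trusting it. The endpoint and query parameters are invented for illustration. (Note that CORS restrictions apply to requests made from JavaScript in the browser; a script like this one, run outside the browser, is not subject to them.)

```python
# Request a hypothetical API endpoint, check the response, peek at the JSON.
import requests

url = "https://api.example.org/v1/observations"   # placeholder API
params = {"series": "gdp", "format": "json"}      # placeholder query parameters

r = requests.get(url, params=params, timeout=30)
print(r.status_code)      # 200 means success; 4xx/5xx is where debugging starts
r.raise_for_status()

payload = r.json()
print(type(payload))      # usually a dict or a list
print(list(payload)[:5])  # peek at the top-level keys or first few items
```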

Week 4

scraping data

Lots of interesting and useful data is not provided by an API but is embedded in a website. This week we will build our first scrapers to extract data from websites and assemble our own datasets. We learn the art of inspecting a website to find the data within it, and use this new skill to extract data from three different websites, comparing the results we get.

Skills and concepts: Python, BeautifulSoup, HTML, Stata.
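A first scraper can be very short. The hedged sketch below downloads a page, parses the HTML with BeautifulSoup, and pulls values out of a table; the URL and the selectors are placeholders, and the right selectors come from inspecting the page you actually target.

```python
# Minimal scraper sketch: fetch a page, parse it, extract table rows.
# The URL and CSS selector are hypothetical.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/prices"          # placeholder page
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.select("table tr")[1:]:      # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

print(rows[:3])                             # check the first few scraped rows
```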

Week 5

automating data

Repetition is dull, slow and a source of error. This is a problem, since much of what we have learned so far (fetching data from an API, scraping a web site) is work you will want to repeat many times. This week is devoted to loops, one of the most powerful tools in any coder’s arsenal. Our class will show how a loop takes you from a small dataset to the world of big data.

Skills and concepts: Loops, layers, Python, JavaScript.
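The sketch below hints at how a loop scales up data collection: the same request is repeated for each year and the results are stacked into one dataset. The endpoint, parameter names, and the assumption that each response is a list of records are all illustrative.

```python
# Loop over years, repeating the same API call, and pool the results.
# The endpoint and parameters are hypothetical.
import requests

all_records = []
for year in range(2000, 2023):
    r = requests.get(
        "https://api.example.org/v1/observations",   # placeholder API
        params={"series": "gdp", "year": year},
        timeout=30,
    )
    r.raise_for_status()
    all_records.extend(r.json())    # assumes each response is a list of records

print(len(all_records), "records collected")
```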

Week 6

reading week

There are no classes or office hours this week.

Relax.

Then work on your project!

Week 7

cleaning data

Data is only helpful when it is in a clean and usable form. This week we discuss how to get your dataset into shape for analysis. We focus on three unglamorous skills that are the foundation of data science: cleaning data, matching and merging datasets, and reshaping data.

Skills and concepts: Data manipulation functions, Python, Stata, JavaScript.
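Here is a hedged sketch of the three skills using pandas: clean, merge, reshape. The files and column names are invented for illustration.

```python
# Clean, merge and reshape two small datasets with pandas.
# File names and columns are placeholders.
import pandas as pd

gdp = pd.read_csv("gdp.csv")            # assumed columns: country, year, gdp
pop = pd.read_csv("population.csv")     # assumed columns: country, year, population

# Clean: fix types and drop rows that cannot be used.
gdp["year"] = gdp["year"].astype(int)
gdp = gdp.dropna(subset=["gdp"])

# Match and merge: combine the two datasets on shared keys.
merged = gdp.merge(pop, on=["country", "year"], how="inner")

# Reshape: from long format to a wide table with one column per year.
wide = merged.pivot(index="country", columns="year", values="gdp")
print(wide.head())
```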

Week 8

data learning

By this stage you have the tools to access data, and the loops that let you automate and repeat the process. The result is large and interesting datasets. We now build tools to learn from this data. This week we begin to discuss Machine Learning (ML), the difference between “supervised” and “unsupervised” learning, and the use of labelled and unlabelled data.

Skills and concepts: Machine learning, Python (PyTorch, TensorFlow).
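As a taste of supervised learning, the minimal PyTorch sketch below trains a model on labelled pairs (x, y) to predict y from x. The data is synthetic and the whole example is illustrative rather than part of the course material.

```python
# Supervised learning in miniature: labelled data and a model trained
# to predict the label. Data here is synthetic.
import torch
from torch import nn

# Synthetic labelled data: y depends on x plus noise.
x = torch.randn(200, 1)
y = 3 * x + 0.5 + 0.1 * torch.randn(200, 1)

model = nn.Linear(1, 1)                     # one input, one output
loss_fn = nn.MSELoss()
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    optimiser.zero_grad()
    loss = loss_fn(model(x), y)             # compare predictions with the labels
    loss.backward()
    optimiser.step()

print(model.weight.item(), model.bias.item())   # should end up near 3 and 0.5
```

Unsupervised learning, by contrast, works on the inputs alone, without labels; an example follows in week 9.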

Week 9

data patterns

We continue building our ML skills by discussing four common tasks these tools are used for: classification and regression (supervised learning), and clustering and association (unsupervised learning). We apply these tools to example datasets and discuss how you could use them.

Skills and concepts: Machine learning, Python (PyTorch, TensorFlow).
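The hedged sketch below contrasts one supervised task (classification) with one unsupervised task (clustering). It uses scikit-learn for brevity rather than the PyTorch/TensorFlow stack named above, and the data is synthetic.

```python
# Classification (supervised) versus clustering (unsupervised) on synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # labels for the supervised task

# Classification: learn to predict the label from the features.
clf = LogisticRegression().fit(X, y)
print("training accuracy:", clf.score(X, y))

# Clustering: group the same points without ever seeing the labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```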

Week 10

data stories

In our final week of analysis we discuss how data can be used to prove or disprove a point. We recap how to use the moments of a distribution (the spread and range of the data), correlation, and the steps needed to establish causation. We discuss, with examples, ways to calculate and visualise the results of this in-depth analysis.

Skills and concepts: Machine learning, Python (PyTorch, TensorFlow).
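A minimal pandas sketch of the descriptive steps is below: moments of a distribution and a correlation. The file and column names are placeholders, and correlation is of course only a starting point, not proof of causation.

```python
# Moments and correlation with pandas. File and column names are placeholders.
import pandas as pd

df = pd.read_csv("analysis.csv")            # assumed columns: minwage, employment

print(df["employment"].describe())          # count, mean, std, quartiles: the spread of the data
print(df["employment"].skew(), df["employment"].kurt())   # higher moments

# Correlation between two series; it does not by itself establish causation.
print(df["minwage"].corr(df["employment"]))
```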

Week 11

interactive data

As the course draws to a close we have the tools to define a research question, build and clean a complex dataset, and analyse it. In our final session we discuss how to make charts interactive in ways that help users find their own stories and draw their own conclusions. We use large datasets to demonstrate this.

Skills and concepts: Interactives (filters, toggles, sliders), colour, tone and opacity, Vega-Lite.
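Vega-Lite specs can also be generated from Python via the Altair library. The hedged sketch below, which assumes Altair 5, links two charts with an interval selection: brushing the scatter filters the bar chart. The dataframe and field names are placeholders.

```python
# Interactive, linked charts via Altair (which emits Vega-Lite). Assumes Altair 5.
import altair as alt
import pandas as pd

df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "gdp": [1.0, 2.5, 3.2, 0.8],
    "inflation": [2.1, 4.0, 1.5, 6.3],
})

brush = alt.selection_interval()            # drag to select a region of the scatter

points = (
    alt.Chart(df)
    .mark_point()
    .encode(x="gdp", y="inflation")
    .add_params(brush)
)

bars = (
    alt.Chart(df)
    .mark_bar()
    .encode(x="country", y="gdp")
    .transform_filter(brush)                # show only the brushed countries
)

(points & bars).save("interactive_chart.html")   # embed the HTML in your site
```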

Deadline day: Monday 9th January 2023.

Your project

We will discuss project ideas each week, and help you with code and data.

A reminder that the deadline for your DS project is Monday 9th January 2023.

Good luck!


data resources

Guidance on coursework, readings, further material and office hours.

build

Coursework

Your project will present between 3 and 8 charts. These must be embedded in your site, hosted on GitHub Pages. You must also briefly discuss four topics: (1) the aims of your project; (2) the data you used and how you accessed it, including notes on automation/replication; (3) challenges in data cleaning and/or analysis, and the tools you used to overcome them; (4) your conclusions. Each section must not exceed 200 words.

Get help

office hours

There are four office hours each week, at the following times:

  • RD: Mon, 14:00-15:00
  • DC: Thu, 15:00-16:00
  • CM: Wed, 09:00-10:00
  • EV: Thu, 14:00-15:00

There are no office hours during reading week (W6). The final slots are held in week 11.

Read

books, papers and sites

Some useful books and papers:

Watch

videos

Links to videos that will help you cover the material and complete your project: