Applied Statistics (MET 2010)
Fall 2023
NTNU Business School
Johannes Mauritzen
About the course
This course is about how to do a complete data analysis with modern tools and methods. Statistical models, methods and tests are important components of this course, but the course also introduces data management, transformation, visualisation and descriptive analysis.
This is a Do Course
The idea behind the organisation of this course is that the best way to learn how to do data analysis is to do data analysis. From the first class period you will be actively working with data. The evaluation in this course will also be by a term project where you will complete all steps of a data analysis. You will need to work actively throughout the semester in order to do well in the course.
Python and co.
In this course we use Python and associated data analysis libraries. This is why:
- Python (together with R) is one of the most widely used tools in business and industry. Students who can combine technical competence in these tools with a solid understanding of business and economics are in demand in the job market.
- Many other courses at NTNU (as well as at other universities) in statistics, machine learning and business analytics make use of Python.
- Python and its associated libraries are open-source, transparent and have a large network of developers and users.
- It is easy to find help when you have problems in discussion forums like https://stackoverflow.com.
- As an NTNU bachelor student, you should already have had an introduction to the basics of using Python.
The course is difficult and useful
Student feedback about the course is that it is one of the most difficult courses they have taken during their bachelor's degree. You will almost certainly experience some confusion and frustration while taking this course. This is a normal part of learning something difficult.
Another common piece of feedback is that students learned a lot. You won't learn everything about statistics and data analysis in this course. But you will learn the tools and skills necessary to get a good start, and know how to learn and pick up other tools later. In other words, you will learn how to learn.
Course setup
Labs
The course is built around digital "labs". Each lab consists of a mix of explanation, short videos and code that you should actively replicate. At the end of each lab there are problems. You should work through the labs in small groups and then do the problems at the end of the lab.
Lecture and extended office hours
We will not have traditional lectures in this course. All of the information you need will be in the labs or the text book.
Other than the first week, we will not make use of any fixed class period. You can disregard the schedule that is on your NTNU calendar.
Instead, I will have extended office hours Monday to Thursday from 9 to 13, where you are welcome to come and ask questions and get help. You can find my office on the fourth floor, on the same side as the main entrance, in the corner by the staircase.
There will also be a student assistant who will hold a help session once per week. More information will be posted on Blackboard.
Important: In the course schedule below, you may notice that we will be working on a compressed schedule, with all the material covered before November (disregard what you see in the NTNU calendar). I have two reasons for this setup: (1) To give you plenty of time to work on your term project. Working on your own project is where a lot, if not most, of the learning takes place, so it makes sense to prioritize this. (2) From 1 November I will no longer work at NTNU and will be moving to a job in industry. This means you should put in a solid effort before 1 November and make use of this time to get help from me.
Assignments and colloquium groups
In order to get course approval, you will need to turn in an obligatory assignment. This assignment consists of a subset of the weekly problems at the end of each lab for a core portion of the course. You should work individually or in small groups to complete the assignment.
However, you will also organize yourselves into colloquium groups of between 4 and 8 people, registered in Blackboard. The point of the colloquium groups is to create a group that you can discuss problems with and get help from. There is only one instructor and potentially more than 100 students, so your main source of help and support will necessarily be other students. For the obligatory assignment, while you should be working in small groups or alone, you will only turn in one assignment per colloquium group.
Evaluation and term project
The course culminates in a term project which you can work on individually or together with one other person (no exceptions will be given for more people per group). You should start working on the term project as soon as possible. Details are provided below.
Policy on ChatGPT and related "generative" technologies.
The purpose of this course is to prepare you to do a real-life data analysis, and the evaluation is therefore also a real-life data analysis. In learning the necessary coding and data analysis techniques, you are encouraged to make use of ChatGPT and related technologies. My experience is that these can be effective tools for coding more efficiently and getting help with coding problems. Potentially, these tools could make learning data analysis easier and more enjoyable. In other words, I see ChatGPT and similar tools as a help in learning data analysis, but not as a substitute for learning data analysis.
I will not set any explicit limitations on using ChatGPT or other technologies to generate text in your project. But be warned: these technologies have a tendency to produce text that sounds plausible but is inherently meaningless. In other words, bullshit, and bullshit in your term project (whether generated by AI or not) will result in a poor grade.
ChatGPT and other such technologies are tools, sources and references, and you should carefully cite, document and explain how you have used them in your term project. If you are unsure how to correctly cite ChatGPT or other sources, the library can assist you. If you, for example, generate text with ChatGPT and directly copy it into your term project without citing your source, this is plagiarism, and this can result in a failing grade.
Reading
We use readings from three books in this course. All three are freely available in digital formats for personal use. I also recommend two additional texts.
- Vanderplas. Python Data Science Handbook (2016). [PDS]
- Gelman, Hill and Vehtari. Regression and Other Stories. [ROS]
- Hyndman and Athanasopoulos. Forecasting: Principles and Practice (3rd ed.). [FPP3]
- Optional: Downey. Think Stats.
- Optional: Spiegelhalter. The Art of Statistics.
Course plan
Week | Topic | Lab | Sol. Sketch | Reading |
---|---|---|---|---|
Week 1 | Introduction and review, Python and Numpy | Prelab Lab 1 | Sketch, lab 1 | PDS Ch 1, 2 |
Week 2 | Data management with Pandas, Split-Apply-Combine and Merge | Lab 2 Lab 3 | Sketch, lab 2 Sketch, lab 3 | PDS Ch 3 |
Week 3 | Visualisation and Transformation | Lab 4 | Sketch, lab 4 | ROS Ch 2, PDS Ch 4 |
Week 4 | Probability, simulation and inference | Lab 5 Lab 6 | Sketch, lab 5 Sketch, lab 6 | ROS Ch 3 - Ch 5 |
Week 5 | Simple regression | Lab 7 | Sketch, lab 7 | ROS 6-8 |
Week 6 | Multiple regression | Lab 8 | Sketch, Lab 8 | ROS 10, ROS 12.1-12.4 |
Week 7 | Evaluating models and diagnostics | Lab 9 | Sketch, Lab 9 | ROS 11 |
Week 8 | GLM and logistic regression | Lab 10 | Sketch, lab 10 | ROS 13-14 |
Week 9 | Bayesian Statistics/Identification and causal modelling | Lab 11 Lab 12 | Sketch, lab 11 Sketch, lab 12 | ROS 19-21 |
Week 10 | Time series statistics | Lab 13 Lab 14 | Sketch, lab 13 Sketch, lab 14 | FPP3 Ch. 9 |
How-to notes
Some short notebooks demonstrating some common tasks.
Obligatory assignment
Deadline: Friday October 13th, 1200 (noon).
- Turn in via Blackboard.
- In the form of a Jupyter Notebook
- Turn in one Jupyter notebook file (.ipynb)
- Show all your work
- Make sure to comment and explain your work.
- You will turn in one assignment per colloquium group
Components of obligatory assignment
- Lab 1: Assignment 3
- Lab 2: Assignment 3
- Lab 4: Assignment 3
- Lab 6: Assignment 4
- Lab 7: Assignment 5
- Lab 8: Assignment 2
- Lab 9: Assignment 4
In the assignment I am first and foremost looking for evidence of a substantial effort to understand and make use of the models and tools. Some mistakes, coding problems and misinterpretations are natural, and a submission with these will likely still be accepted. I will provide feedback on these types of issues. If an assignment is judged not to meet the minimum requirements, you will get feedback on what you need to improve, and you then have one additional attempt to submit the assignment.
You will notice that all the component assignments are open-ended assignments. You are encouraged to use these assignments as a way to explore possible topics and analysis for your term project. It is ok to reuse parts of these assignments in your term project.
Term Project
Deadline: Tentatively set for 15. December, but you should check the NTNU Exam Calendar as the end of the semester approaches.
Delivered as a PDF of a Jupyter notebook
(In Jupyter Notebook you can choose Download as from the menu, and then PDF via LaTeX. If that does not work, you can choose to download as an HTML file. Open the file in a browser, and then print it as a PDF.)
The course culminates in a term project.
- You can work individually or with one other person. No exceptions will be given to this rule. If you choose to write with one other person, you will need to write a short description of how you cooperated and split up the work. As a rule, both group members should be actively involved in the technical work, and the contributions to the project should be roughly equal. In cases where there is doubt about the contribution of a group member, I can call you in for an oral defense of the project.
- You have both the freedom and responsibility to choose your own topic and dataset. Finding an interesting topic and related data is an important component of the work, and I will not provide you with a topic or data.
- You also have a good deal of freedom in deciding the structure and presentation of the project. I will not provide example projects or templates. Instead, I want to encourage you to make independent and creative choices on the form of the term project. I provide a detailed description of what should be included below. Beyond that, you are free to organize your project as you see fit.
- You are free to write your term project in a Scandinavian language or in English.
- The project should be completed independently (in the sense of an individual or a group of 2), but you can and should consult other students for help on technical issues and feedback. Your colloquium group is a good forum for getting help and feedback.
- You can and should begin working on your project as soon as possible (Hint: Several of the problems in the lab are "open", where you can choose your own dataset. This is an opportunity to start exploring some ideas and different data).
- If you are unsure of where to begin, you can start with the list of data sources that I provide below (not all links are up-to-date). Alternatively, you can also consider the ready-made datasets that Statistics Norway (SSB) provides on a range of topics. You are not limited to these data sources if you have your own data. Neither are you limited to just one dataset, and you are encouraged to combine datasets.
- In most situations, you should probably try to find cross-sectional data (many individuals or other units at one time) or panel data (many individuals or other units at several time points). Time series data (one unit, such as a stock price or the GDP of a country, observed over many time periods) can be included in an analysis, but preferably in an analysis that also makes use of other data types.
- There are no strict length requirements, but a project should normally be between 2,000 and 5,000 words (not including code). If you have more than this, that's ok, but longer projects will generally not lead to a higher grade.
- The project should be a mix of a traditional term paper and a lab report:
- You should have a well-written introduction where you motivate the topic, give some background and describe the data. In the introduction you should also give a summary of your findings.
- Otherwise, the requirements for structure and format are much looser for this project. The project is intended to be loosely organized: instead of just showing end results, you should also show the analysis that led to your results (exploratory graphs, summary statistics, statistical tests and various hypothesis exploration). This project could be seen as complementary to writing a full article or thesis, serving as a well-structured "lab notebook" detailing the research and analysis process.
- You should include your code for data management, transformation, visualization, modelling, model evaluation and diagnostics, etc.
- You do not need a section for methodology but you should of course discuss methodology. Copying some general formulas from a text book has about zero value added. Talk about why a methodology is appropriate for your topic and data (this is good general advice for written assignments).
- You should describe and interpret your results.
- You should have a concluding section where you summarize your results and discuss implications.
- All projects should have the following elements (but you should not organize your project according to the list below):
- Data management and transformation (labs 1-4)
- An exploratory and descriptive analysis in the form of visualisation, test statistics and tables (labs 4 and after)
- A simple regression and appropriate visualisation of uncertainty (lab 7).
- Multiple regression and associated diagnostics and visualisation (lab 8). You should probably not use time series data for this or the previous point.
- "Fake" data simulation (labs 6-9) to explore model validity. (If you are unsure of where to begin here, start by testing one or more of the assumptions of a regression analysis (ROS 11.1)).
- Diagnostics and model evaluation (lab 9).
- Discussion of identification and causality (lab 12). Optional use of identification method:
- Regression discontinuity
- Matching
- Difference-in-difference
- Instrumental variables
- Others
- In addition you must make use of at least one of the following (can combine this requirement with requirement of using multiple regression)
- Logistic regression or other GLM model (lab 10)
- Bayesian analysis, decision analysis (lab 11)
- Time series analysis (lab 13-14)
- You are welcome to also use other methods not included in the labs, though this cannot replace any of the requirements above.
- The labs are probably the best source to give you an idea of how the projects should generally look: A combination of description, methodology, results and interpretation.
- The period between the last class period and the due date for the term project should be considered an exam period. Since I will be starting a new full-time position on 1 November, I will not be available to answer questions during this period. In other words, get started early!
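To give a rough idea of what the "fake" data simulation requirement above can look like, here is a minimal sketch using only NumPy. It is an illustration under stated assumptions, not the method from the labs: the function names (`simulate_and_fit`, `coverage`), the parameter values and the choice to check interval coverage are all made up for this example. The idea is to simulate data from a regression model where you know the true parameters, fit the model, and check whether the estimates and uncertainty intervals behave as expected.

```python
import numpy as np


def simulate_and_fit(n=200, a=1.0, b=2.0, sigma=1.5, seed=0):
    """Simulate y = a + b*x + noise and estimate (a, b) by least squares.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 10, size=n)
    y = a + b * x + rng.normal(0, sigma, size=n)

    # Design matrix with an intercept column
    X = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Residual standard deviation with n - 2 degrees of freedom
    resid = y - X @ coef
    sigma_hat = np.sqrt(resid @ resid / (n - 2))

    # Standard errors from sigma_hat^2 * (X'X)^{-1}
    se = sigma_hat * np.sqrt(np.diag(np.linalg.inv(X.T @ X)))
    return coef, se


def coverage(n_sims=500, b=2.0):
    """Share of simulations where the approximate 95% interval
    for the slope contains the true slope b."""
    hits = 0
    for s in range(n_sims):
        coef, se = simulate_and_fit(b=b, seed=s)
        if abs(coef[1] - b) <= 1.96 * se[1]:
            hits += 1
    return hits / n_sims


coef, se = simulate_and_fit()
cov = coverage()
print("estimated intercept and slope:", coef)
print("standard errors:", se)
print("coverage of the slope interval:", cov)
```

If the model assumptions hold, the coverage should land close to 0.95. A natural next step is to break an assumption on purpose in the simulation (for example, let the noise grow with `x`) and see how the estimates and the coverage change.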
Evaluation
Compared to a standard exam, you learn a lot more by completing a term project, and that is why we have a term project in this course. On the other hand, grading necessarily becomes holistic. The internal and external graders will try to evaluate the project in a fair and thoughtful manner. That said, it can be difficult to say ahead of time what grade a project will get, or to give a breakdown of what you need to get a certain grade. Therefore I cannot give any indication ahead of time when it comes to grades. My best advice: Choose a topic you think is interesting and write a project that you are proud of. If you follow the instructions, this will generally correlate with a good grade.
To set reasonable expectations about grades, you should consult NTNU's guidelines for grades. Here an A-grade is reserved for exceptional projects. You can have a very good project where nothing is necessarily wrong, and still not get an A.
Finally, a little reminder. Everyone starts with an F. No matter what grade you get, turning in and getting a passing grade on a term project represents a substantial amount of learning.
Data sources
- Here is a list of data sources I have encountered over the years. Some links have not been updated.
- I also recommend looking at Statistics Norway's ready-made datasets as a good resource if you are looking for a place to start with an analysis.
- You have access to several large databases through the library. For example Eikon for financial data. Contact the library for assistance.
- I also encourage you to find your own datasets. The internet is bursting with data!
Sources for learning more Python and R
- Here is a list of resources I have encountered for learning more Python.
- Here is a list of resources for learning more R.
- MIT Open: Introduction to computer science and programming in Python.
- MIT Open: Intro to programming and data science with Python
Sources for learning more statistics and probability
- Here is a list of resources for learning more Bayesian analysis.
- Here is a list of resources for learning more statistics in general.
- Richard McElreath course in Bayesian Analysis
- Aki Vehtari course in Bayesian Analysis