Linguistic Data: Quantitative Analysis and Visualisation: computational linguistics

  • Instructors: Olga Lyashevskaya and Ivan Pozdnyakov
  • Assistant: Lidia Ostyakova
  • HSE Course syllabus: Link
  • Group in Telegram


Date | Topics | Links | Video
Jan 11 | Introduction to R. R and RStudio. R basics: functions, variables, types | html, practice |
Jan 18 | Data analysis in linguistics. Research design. Types of variables | pdf, data to practice with, read Gries Chapter 1.3 |
Jan 21 | R: vectors, implicit and explicit coercion, recycling rule, missing values | html | video
Feb 4 | R: matrices, arrays, lists, data.frames. Packages. Data import and export | html | [video link]
Feb 18 | R: Conditions and loops. Functions. Apply family | html, html | [video link]
Mar 4 | R: Tidyverse | html | video
Mar 18 | R: Advanced Tidyverse | html |
Apr 6 | Visualizations: ggplot2 | html, Rmd | [video link]
Apr 15 | Introduction to inferential statistics. Central limit theorem. Null hypothesis significance testing. P-value. Confidence interval | html |
Apr 20 | Tests of difference for two-sample designs: t-test, independent and paired. Nonparametric equivalents: Mann-Whitney and Wilcoxon tests. Tests for nominal data: chi-squared test, Fisher exact test. Effect size tests for big data | html | [video link]
Apr 29 | Covariance and correlation. Pearson's correlation, Spearman's and Kendall's non-parametric correlation tests. Multiple comparisons problem | html, Playing with students' data zip |
May 11 | Linear regression. Multiple linear regression. Family of Generalized and General linear models | Rmd, html |
May 18 | Logistic regression. Model selection | Rmd |
May 20 | Analysis of variance (ANOVA) | html |
May 25 | Mixed-effect linear models | Rmd, html |
Jun 3 | Dimensionality reduction. PCA. CA. MCA | Rmd, Rmd template, hint: how to (not) interpret CA, see also MDS |
Jun 8 | CART: decision trees and random forest. Cluster analysis | Rmd: CART, Rmd: Clustering |
Jun 10 | Bayesian statistics | |
Jun 17 | | |

Assignment #1

Research hypothesis: formulate your pilot research hypothesis and fill in the form. Due date: 2021-01-24 23:00 MSK.

Assignment #2

Data description. Create a repository on GitHub and put the link in the same form. Due date: 2021-03-17 23:59 MSK.

Homework: Advanced Tidyverse

Homework is here: [1]. Upload your work to a repository on GitHub and put the link here: [2]. Due date: 2021-04-05 23:59 MSK.

Homework: From t-tests to data modeling

Homework is here: [3] html. Upload your work to a repository on GitHub and put the link here: [4]. Due date: 2021-06-23 18:00 MSK.

Online course assignments

Complete the following chapters of the Coursera course [5]:

  • Week 1
  • Week 2
  • Week 3
  • Week 4

Final project

The project description and a link to some examples can be found here. Important dates:

  • January 24: research hypothesis
  • March 17: dataset description in Rmd, toy dataset (min. 20 observations)
  • April 14: draft dataset
  • June 10: final version of your dataset, draft final paper
  • 24 hours before the exam starts: paper submission

Course Policy

Score policy: the Final Score is computed by the following formula: Final Score = 0.6 × (Homework Score) + 0.4 × (Exam Score). The student is expected to prepare the final project in written form as an electronic document. The exam is conducted as an oral defense of the final project. The Exam Score measures the overall quality of the final project and is an integer from 0 to 10. Parts of the final project data should be prepared in advance and can be used in regular homework assignments.
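
As a quick illustration of the weighting in the score policy, here is a minimal Python sketch. The function name `final_score` is hypothetical. The sketch assumes the Homework Score is on the same 0 to 10 scale as the Exam Score, which the policy does not state explicitly, and it applies no rounding because the policy does not specify a rounding rule.

```python
def final_score(homework_score: float, exam_score: int) -> float:
    """Weighted final score: 0.6 * homework + 0.4 * exam.

    Assumption: homework_score is on a 0-10 scale (not stated in the policy).
    """
    # The policy states that the Exam Score is an integer from 0 to 10.
    if not isinstance(exam_score, int) or not 0 <= exam_score <= 10:
        raise ValueError("exam_score must be an integer from 0 to 10")
    return 0.6 * homework_score + 0.4 * exam_score


# Example: homework 8.0 and exam 9 give 0.6 * 8.0 + 0.4 * 9.
print(final_score(8.0, 9))
```

With these weights, one point on the homework moves the final score by 0.6, and one point on the exam moves it by 0.4.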

Academic ethics policy: you have to do your homework by yourself. In case of academic cheating (e.g. if you copy someone else's work), your work will receive a grade of 0 and the program supervisor will be notified. If you feel that you are stuck with the homework, ask the instructors for advice and hints.

Late penalties: in case of late submission, your grade will be multiplied by exp(-t / 86400), where t is the number of seconds elapsed since the due date. For example, if you delay the submission by one day, your grade will be multiplied by exp(-1) ≈ 0.3678794412.
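
The late-penalty rule can be sketched in Python as follows. The names `late_multiplier` and `penalized_grade` are hypothetical. The sketch assumes that both timestamps are expressed in the same time zone (MSK) and that a submission made on or before the due date keeps its full grade.

```python
import math
from datetime import datetime, timedelta

SECONDS_PER_DAY = 86400


def late_multiplier(due: datetime, submitted: datetime) -> float:
    """Return exp(-t / 86400), where t is the number of seconds past the due date."""
    t = (submitted - due).total_seconds()
    if t <= 0:
        # Assumption: on-time submissions are not penalized.
        return 1.0
    return math.exp(-t / SECONDS_PER_DAY)


def penalized_grade(grade: float, due: datetime, submitted: datetime) -> float:
    """Apply the late multiplier to a raw grade."""
    return grade * late_multiplier(due, submitted)


# Example: a submission exactly one day after a 2021-04-05 23:59 deadline.
due = datetime(2021, 4, 5, 23, 59)
one_day_late = due + timedelta(days=1)
print(late_multiplier(due, one_day_late))       # equals exp(-1)
print(penalized_grade(10.0, due, one_day_late))
```

Because the decay is exponential, most of the grade is lost within the first day or two, so a short delay already has a noticeable cost.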

Extensions: you can ask for up to two extensions of homework due dates during the course. Each extension is one week. Extensions due to valid excuses (e.g. illness) do not count toward this limit.


During this course we will use R as a programming language and RStudio as a GUI.

How to install R and RStudio?

1. Download R (you can choose another mirror here if you wish) and install it on your computer. Make sure you do this before installing RStudio.

2. Download RStudio (you need the RStudio Desktop version with the Open Source License) and install it on your computer. It is recommended to create a shortcut for RStudio during installation.

It is also possible to avoid installing anything on your PC by using an online version of RStudio.

For successful submission of assignments you should be able to create and save R code files (.R) and RMarkdown files (.Rmd).

Online course

Some parts of a MOOC (online course) are included in the program.


  • Gries, Stefan (2013). Statistics for Linguistics with R: A Practical Introduction (2nd revised edition). Berlin: De Gruyter Mouton. HSE library link
  • Levshina, Natalia (2015). How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam: John Benjamins Publishing Company. HSE library link
  • Baayen, Harald (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics. Cambridge: Cambridge University Press. link