Linguistic Data: Quantitative Analysis and Visualisation: computational linguistics
- Instructors: Ilya Schurov and Olga Lyashevskaya
Materials
Date | Topics | Links |
---|---|---|
Jan 18 | Introduction. Quantitative linguistic research and data types. R basics | Intro Slides Lab 01: intro to R |
Jan 25 | Hypothesis testing. Binomial test. R: dataframes, tidyverse | Lab 02 |
Feb 1 | Central limit theorem. Variance. Student's t-test. R: simulating data, boxplots, density plots, binomial test, t-test | |
Feb 8 | Two-sample t-test. ANOVA. Confidence intervals. Non-parametric tests | |
| | Tests for categorical data. Chi-squared test. Fisher's exact test. Effect size | |
| | Correlations. Linear regression | |
| | Multivariate linear regression. Dummy variables | |
| | Dimensionality reduction. PCA. MDS. t-SNE | |
| | CA, MCA. Clustering | |
| | Logistic regression. Model selection | |
| | Fixed and random effects. Linear mixed-effects models | |
| | Bootstrap. Decision trees. Decision forests | |
| | Bayesian statistics | |
| | Bayesian statistics II | |
Software
During this course we will use R as a programming language and RStudio as a GUI.
How to install R and RStudio?
1. Download R (you can choose another mirror here if you wish) and install it on your computer. Make sure you do this before installing RStudio.
2. Download RStudio (you need RStudio Desktop, Open Source License) and install it on your computer. It is recommended to create a shortcut for RStudio during installation.
It is possible to avoid installing anything on your PC by using rstudio.cloud (an online version of RStudio).
For successful submission of assignments you should be able to create and save R code files (.R) and RMarkdown files (.Rmd).
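After installation, a quick check along the following lines can confirm that R runs and show whether some commonly needed packages are available. This is only a sketch: the package list (`rmarkdown`, `tidyverse`) is an assumption based on the topics listed above, not an official list of course requirements.

```r
# Report the installed R version.
r_version <- R.version.string
print(r_version)

# Packages assumed to be useful for this course (an assumption, not an official list).
needed <- c("rmarkdown", "tidyverse")

# Check which of them are already installed, without installing anything.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# To install the missing ones, uncomment the next line (requires internet access):
# install.packages(needed[!installed])
```

Saving these lines in a file with the `.R` extension and running it from RStudio is also a simple way to practise creating an R code file.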
Homeworks
- Homework 1 (deadline: February 16, 23:59), Chapters 1, 2, 3, and 5 of the DataCamp course "Introduction to R". Please fill in this form.
- Homework 2 (deadline: February 23, 23:59), Chapters 4 and 6 of the DataCamp course "Introduction to R".
After completing the course, please provide either the Statement of Accomplishment or a screenshot of your learning progress via [link TBA]. Deadlines for Homework 1 and 2 are cancelled because the free version of the DataCamp online course is unavailable. Stay tuned!
- Homework 3 (deadline: February 9, 12:00), Hypothesis testing, binomial test, t-test. HW3 pdf html Rmd template
Final project
- Projects description link
- Projects pre-registration: link to submit your file TBA
- Final versions of project papers: link to submit your files TBA
References
- Gries, Stefan (2013). Statistics for Linguistics with R: A Practical Introduction (2nd revised edition). Berlin: De Gruyter Mouton. HSE library link
- Levshina, Natalia (2015). How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam: John Benjamins Publishing Company. HSE library link
- Baayen, Harald (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics. Cambridge University Press. pdf
- Gries, Stefan (2017). Quantitative Corpus Linguistics with R: A Practical Introduction (2nd edition). Milton Park, Abingdon, Oxon: Routledge. eBook
- Empirical Bayes
- Harney, H. L. (2016). Bayesian Inference: Data Evaluation and Decisions (2nd edition). Springer. eBook
- McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. eBook
- ggplot2
- Wickham, Hadley (2016). ggplot2: Elegant Graphics for Data Analysis. Springer. eBook
- R Markdown: Rmd Cheat Sheet (https://rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf)
Course Info
This page contains the materials of the course "Linguistic Data: Quantitative Analysis and Visualisation", taught in the HSE Master's program "Computational Linguistics" in the 2019-2020 academic year. Modules: 3-4.