Linguistic Data: Quantitative Analysis and Visualisation: computational linguistics: difference between revisions

From MathINFO
(46 intermediate revisions by 2 users not shown)
Line 8: Line 8:
 
| Jan 18 || Introduction. Quantitative linguistic research and data types. R basics || [https://docs.google.com/presentation/d/1VUIUa3Db5n4dsD_HeA3e-mz55zK8uPrko3yu207pKUk/edit?usp=sharing Intro Slides] [https://github.com/LingData2019/LingData2020/tree/master/seminars/2020-01-18 Lab 01: intro to R]
 
 
|-
 
|-
| Jan 25 || Hypothesis testing. Binomial test. R: dataframes, tidyverse || [https://github.com/LingData2019/LingData2020/tree/master/seminars/2020-01-18 Lab 02]
+
| Jan 25 || Hypothesis testing. Binomial test. R: dataframes, tidyverse || [https://github.com/LingData2019/LingData2020/tree/master/seminars/2020-01-25 Lab 02] [https://datacamp-community-prod.s3.amazonaws.com/e63a8f6b-2aa3-4006-89e0-badc294b179c tidyverse cheat sheet]
 +
|-
 +
| Feb 1 || Central limit theorem. Variance. Student's t-test. R: simulating data, boxplots, density plots, binomial test, t-test ||
 +
[https://github.com/LingData2019/LingData2020/tree/master/seminars/2020-02-01 Lab 03: ]
 +
[https://raw.githubusercontent.com/LingData2019/LingData2020/master/seminars/2020-02-01/Lab3-ttest-binom-matrices.Rmd Rmd] [https://htmlpreview.github.io/?https://github.com/LingData2019/LingData2020/blob/master/seminars/2020-02-01/Lab3-ttest-binom-matrices.html html] [https://rforpublichealth.blogspot.com/2014/02/ggplot2-cheatsheet-for-visualizing.html Viz. distributions]
 +
|-
 +
| Feb 8 || Two-sample t-test. Paired t-test. Confidence intervals. <!-- TODO: Non-parametric tests --> || [https://github.com/LingData2019/LingData2020/tree/master/seminars/2020-02-08 Lab 04: ] [https://raw.githubusercontent.com/LingData2019/LingData2020/master/seminars/2020-02-08/Lab4-confint-pairedttest-anova.Rmd Rmd] [https://github.com/LingData2019/LingData2020/raw/master/seminars/2020-02-08/Lab4-confint-pairedttest-anova.pdf pdf] [https://agricolamz.github.io/2018-MAG_R_course/Lec_4_stats.html CI slides] [https://istats.shinyapps.io/ExploreCoverage/ CI demo]
 +
|-
 +
| Feb 15 || ANOVA. Correlations || [https://github.com/LingData2019/LingData2020/tree/master/seminars/2020-02-15 Lab 05:] [https://raw.githubusercontent.com/LingData2019/LingData2020/master/seminars/2020-02-15/Lab5_Corr.Rmd Rmd] [https://github.com/LingData2019/LingData2020/blob/master/seminars/2020-02-15/Lab5_Corr.pdf pdf]
 +
|-
 +
| Feb 22 || Tests for categorical data. Chi-squared test. Fisher exact test. Effect size || [https://github.com/LingData2019/LingData2020/tree/master/seminars/2020-02-22 Lab 06:] [https://raw.githubusercontent.com/LingData2019/LingData2020/master/seminars/2020-02-22/Lab6-chisq-Fischer-effectsize.Rmd Rmd] [https://github.com/LingData2019/LingData2020/blob/master/seminars/2020-02-22/Lab6-chisq-Fischer-effectsize.pdf pdf] [https://www.datacamp.com/community/tutorials/contingency-tables-r DataCamp: contingency tables]
 +
|-
 +
| Feb 29 || Linear regression. Multivariate linear regression. Dummy variables ||
 +
[https://github.com/LingData2019/LingData2020/tree/master/seminars/2020-02-29 Lab 07:] [https://raw.githubusercontent.com/LingData2019/LingData2020/master/seminars/2020-02-29/Lab7-lm.Rmd Rmd] [https://github.com/LingData2019/LingData2020/blob/master/seminars/2020-02-29/Lab7-lm.pdf pdf]
 +
[https://htmlpreview.github.io/?https://github.com/LingData2019/LingData2020/blob/master/seminars/2020-02-29/Lab7-lm.html html]
 +
[https://lindeloev.github.io/tests-as-linear/linear_tests_cheat_sheet.pdf Common statistical tests & linear models ]
 +
|-
 +
| Mar 7 || Fixed and random effects. Linear mixed-effects models ||
 +
[https://agricolamz.github.io/2018-MAG_R_course/Lab_9 Lab 08. Part 1]
 +
[https://raw.githubusercontent.com/LingData2019/LingData2020/master/seminars/2020-03-07/Lab8-lmer.Rmd Lab 08. Part 2. Rmd template]
 +
[http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#model-specification LME models cheat sheet]
 +
|-
 +
| Mar 21 || Logistic regression. Model selection || [https://raw.githubusercontent.com/LingData2019/LingData2020/master/seminars/2020-03-21/Lab9-logit.Rmd Lab 09 .Rmd]  [https://htmlpreview.github.io/?https://github.com/LingData2019/LingData2020/master/seminars/2020-03-21/Lab9-logit.html html]
 +
[https://github.com/LingData2019/LingData2020/blob/master/seminars/2020-03-21/Lab9-logit.pdf pdf] [https://www.youtube.com/watch?v=MkPASI_zAsg video]
 +
|-
 +
| Apr 11 || Dimensionality reduction. PCA. MDS. t-SNE || [https://raw.githubusercontent.com/LingData2019/LingData2020/master/seminars/2020-04-11/Lab10-PCA.template.Rmd Lab 10 .Rmd template] [https://raw.githubusercontent.com/LingData2019/LingData2020/master/seminars/2020-04-11/Lab10-PCA.Rmd .Rmd code]
 +
[https://github.com/LingData2019/LingData2020/blob/master/seminars/2020-04-11/Lab10-PCA.pdf pdf]
 +
[https://htmlpreview.github.io/?https://github.com/LingData2019/LingData2020/master/seminars/2020-04-11/Lab10-PCA.html html]
 +
[https://agricolamz.github.io/2018-MAG_R_course/Lec_13_PCA_3D.html 3D example] [https://agricolamz.github.io/2018-MAG_R_course/Lec_13_PCA.html supplementary slides] [https://youtu.be/xbZzkf6Di10 video]
 +
|-
 +
|  Apr 13 || Correspondence analysis: CA, MCA || [https://raw.githubusercontent.com/LingData2019/LingData2020/master/seminars/2020-04-13/Lab11-CA.Rmd Lab 11 Rmd] [https://github.com/LingData2019/LingData2020/blob/master/seminars/2020-04-13/Lab11-CA.pdf pdf]
 +
[https://htmlpreview.github.io/?https://github.com/LingData2019/LingData2020/master/seminars/2020-04-13/Lab11-CA.html html] [https://www.displayr.com/interpret-correspondence-analysis-plots-probably-isnt-way-think/ hints how to interpret CA] [https://www.displayr.com/how-correspondence-analysis-works/ more hints] [https://youtu.be/H3tsGRKGIMg video] Book: M. Greenacre, Correspondence Analysis In Practice.
 +
|-
 +
|  Apr 27 || Decision trees. Decision forests || [https://raw.githubusercontent.com/LingData2019/LingData2020/master/seminars/2020-04-27/Lab12-CART-template.Rmd Lab 12 Rmd template] [https://raw.githubusercontent.com/LingData2019/LingData2020/master/seminars/2020-04-27/Lab12-CART.Rmd solution Rmd]
 +
[https://github.com/LingData2019/LingData2020/blob/master/seminars/2020-04-27/Lab12-CART.pdf pdf]
 +
[https://htmlpreview.github.io/?https://github.com/LingData2019/LingData2020/blob/master/seminars/2020-04-27/Lab12-CART-template.html html]
 +
[https://www.r-bloggers.com/be-aware-of-bias-in-rf-variable-importance-metrics/ hints on varimp]
 +
|-
 +
| May 16 || Cluster Analysis || [https://raw.githubusercontent.com/LingData2019/LingData2020/master/seminars/2020-05-16/Lab12-Clusterization.Rmd Lab 12 Rmd] [https://rpubs.com/AllaT/clust_aes More on aesthetics] [https://rpubs.com/AllaT/clust1 Supplementary material 1] [https://rpubs.com/AllaT/clust3 More on cluster evaluation] [https://raw.githubusercontent.com/agricolamz/2019_data_analysis_for_linguists/master/data/gospel_freq_words.csv gospels dataset]
 +
|-
 +
| May 25 || Bayesian statistics ||
 +
|-
 +
| June 1 || Bayesian statistics II ||
 +
|-
 
|}
 
|}
 
  
 
== Software ==
 
Line 27: Line 69:
  
 
== Homeworks ==
 
* Homework 1 (deadline: January 25, 12:00), Chapters 1, 2, 3, and 5 of the [https://www.datacamp.com/courses/free-introduction-to-r DataCamp] course "Introduction to R". Please fill in this [https://docs.google.com/forms/d/e/1FAIpQLSdjgKBM5JSo6D6ajhrWWfFG1ktcKgDfbdK_jQ_ZbW9GwNLzpQ/viewform form].   
+
* Homework 1 (deadline: February 16, 23:59), Chapters 1, 2, 3, and 5 of the [https://www.datacamp.com/courses/free-introduction-to-r DataCamp] course "Introduction to R". Please fill in this [https://docs.google.com/forms/d/e/1FAIpQLSdjgKBM5JSo6D6ajhrWWfFG1ktcKgDfbdK_jQ_ZbW9GwNLzpQ/viewform form].   
* Homework 2 (deadline: February 1, 12:00), Chapters 4 and 6 of the [https://www.datacamp.com/courses/free-introduction-to-r DataCamp] course "Introduction to R".   
+
* Homework 2 (deadline: February 23, 23:59), Chapters 4 and 6 of the [https://www.datacamp.com/courses/free-introduction-to-r DataCamp] course "Introduction to R".   
After completing the course please provide either the [https://support.datacamp.com/hc/en-us/articles/360001548814-How-can-I-share-my-certificate-Statement-of-Accomplishment- Statement of Accomplishment] or a screenshot of your learning progress via [link TBA].
+
* Homework 3 (deadline: February 9, 12:00), Hypothesis testing, binomial test, t-test. [https://github.com/LingData2019/LingData2020/blob/master/hw/hw-pdf/LingData-HW3-comp.pdf HW3 pdf] [https://htmlpreview.github.io/?https://github.com/LingData2019/LingData2020/blob/master/hw/LingData-HW3-comp.html html] [https://github.com/LingData2019/LingData2020/blob/master/hw/LingData-HW3-comp.Rmd Rmd template]
 
+
* Homework 4 (deadline: February 29, 12:00), T-test and ANOVA, reproducing some results from Leivada & Westergaard 2019 [https://github.com/LingData2019/LingData2020/blob/master/hw/hw-pdf/LingData-HW4-comp.pdf HW4 pdf] [https://htmlpreview.github.io/?https://github.com/LingData2019/LingData2020/blob/master/hw/hw-html/LingData-HW4-comp.html html] [https://github.com/LingData2019/LingData2020/blob/master/hw/LingData-HW4-comp.Rmd Rmd template] [https://classroom.github.com/a/mvhbEgzD link] to submit your .Rmd file
 +
* Homework 5 (deadline: March 09, 23:59), Contingency tables and tests, linear models  [https://github.com/LingData2019/LingData2020/blob/master/hw/hw-pdf/LingData-HW5-comp.pdf HW5 pdf] [https://htmlpreview.github.io/?https://github.com/LingData2019/LingData2020/blob/master/hw/hw-html/LingData-HW5-comp.html html] [https://github.com/LingData2019/LingData2020/blob/master/hw/LingData-HW5-comp.Rmd Rmd template] [https://classroom.github.com/a/jbzxDLJ4 link] to submit your .Rmd file
 +
* Homework 6 (due: March 28, 12:10), Mixed-effects models [https://github.com/LingData2019/LingData2020/blob/master/hw/hw-pdf/LingData-HW6-lmer.pdf HW6 pdf] [https://htmlpreview.github.io/?https://github.com/LingData2019/LingData2020/blob/master/hw/hw-html/LingData-HW6-lmer-comp.html html] [https://github.com/LingData2019/LingData2020/blob/master/hw/LingData-HW6-lmer-comp.Rmd Rmd template] [https://classroom.github.com/a/gostziql link] to submit your .Rmd file
  
 
== Final project ==
 
Projects description TBA  
+
* Projects description [https://github.com/LingData2019/LingData2020/blob/master/projects.pdf link]  
Projects pre-registration: link to submit your file TBA  
+
* Projects pre-registration: Due April 27, 2020. Please create the folder Project in your GitHub repository and put a pdf file there. Optionally, you can add a csv file with the preliminary version of your dataset and an Rmd file, if needed.  
Final versions of project papers: link to submit your files TBA
+
* Final versions of project papers: link to submit your files TBA
 
 
  
 
== References ==  
 

Revision as of 09:41, 25 May 2020

  • Instructors: Ilya Schurov and Olga Lyashevskaya

Materials

Date Topics Links
Jan 18 Introduction. Quantitative linguistic research and data types. R basics Intro Slides Lab 01: intro to R
Jan 25 Hypothesis testing. Binomial test. R: dataframes, tidyverse Lab 02 tidyverse cheat sheet
Feb 1 Central limit theorem. Variance. Student's t-test. R: simulating data, boxplots, density plots, binomial test, t-test

Lab 03: Rmd html Viz. distributions

Feb 8 Two-sample t-test. Paired t-test. Confidence intervals. Lab 04: Rmd pdf CI slides CI demo
Feb 15 ANOVA. Correlations Lab 05: Rmd pdf
Feb 22 Tests for categorical data. Chi-squared test. Fisher exact test. Effect size Lab 06: Rmd pdf DataCamp: contingency tables
Feb 29 Linear regression. Multivariate linear regression. Dummy variables

Lab 07: Rmd pdf html Common statistical tests & linear models

Mar 7 Fixed and random effects. Linear mixed-effects models

Lab 08. Part 1 Lab 08. Part 2. Rmd template LME models cheat sheet

Mar 21 Logistic regression. Model selection Lab 09 .Rmd html

pdf video

Apr 11 Dimensionality reduction. PCA. MDS. t-SNE Lab 10 .Rmd template .Rmd code

pdf html 3D example supplementary slides video

Apr 13 Correspondence analysis: CA, MCA Lab 11 Rmd pdf

html hints how to interpret CA more hints video Book: M. Greenacre, Correspondence Analysis In Practice.

Apr 27 Decision trees. Decision forests Lab 12 Rmd template solution Rmd

pdf html hints on varimp

May 16 Cluster Analysis Lab 12 Rmd More on aesthetics Supplementary material 1 More on cluster evaluation gospels dataset
May 25 Bayesian statistics
June 1 Bayesian statistics II

Software

During this course we will use R as a programming language and RStudio as a GUI.

How to install R and RStudio?

1. Download R (you can choose another mirror here if you wish) and install it on your computer. Make sure you do this before installing RStudio.

2. Download RStudio (you need RStudio Desktop with the Open Source License) and install it on your computer. Creating a shortcut for RStudio during installation is recommended.

It is possible to avoid installing anything on your PC by using rstudio.cloud (an online version of RStudio).

For successful submission of assignments you should be able to create and save R code files (.R) and RMarkdown files (.Rmd).
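
As a quick self-check of the requirement above, the following sketch creates one .R file and one .Rmd file from the R console. It is only an illustration: the file names (hw_example.R, hw_example.Rmd) and their contents are hypothetical and are not part of any course assignment, and the files are written to a temporary folder rather than to your project directory.

```r
# Print the installed R version to confirm that R itself works.
cat(R.version.string, "\n")

# Write a minimal R script (.R). The file name is a hypothetical example.
script_path <- file.path(tempdir(), "hw_example.R")
writeLines(c(
  "# Example script",
  "x <- c(1, 2, 3)",
  "print(mean(x))"
), script_path)

# Write a minimal R Markdown file (.Rmd): a YAML header plus one R chunk.
# The chunk delimiter (three backticks) is built with strrep() for readability.
fence <- strrep("`", 3)
rmd_path <- file.path(tempdir(), "hw_example.Rmd")
writeLines(c(
  "---",
  "title: \"Homework example\"",
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r}"),
  "x <- c(1, 2, 3)",
  "mean(x)",
  fence
), rmd_path)

# Both files should now exist; the script can be run with source().
print(file.exists(c(script_path, rmd_path)))
source(script_path)

# Knitting the .Rmd to HTML additionally assumes the rmarkdown package
# (and pandoc, which RStudio bundles) is available:
# rmarkdown::render(rmd_path)
```

In RStudio the same result is normally reached through File > New File > R Script or R Markdown and then saving the file; the code version is just a way to confirm that the installation can write and run both file types.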


Homeworks

  • Homework 1 (deadline: February 16, 23:59), Chapters 1, 2, 3, and 5 of the DataCamp course "Introduction to R". Please fill in this form.
  • Homework 2 (deadline: February 23, 23:59), Chapters 4 and 6 of the DataCamp course "Introduction to R".
  • Homework 3 (deadline: February 9, 12:00), Hypothesis testing, binomial test, t-test. HW3 pdf html Rmd template
  • Homework 4 (deadline: February 29, 12:00), T-test and ANOVA, reproducing some results from Leivada & Westergaard 2019 HW4 pdf html Rmd template link to submit your .Rmd file
  • Homework 5 (deadline: March 09, 23:59), Contingency tables and tests, linear models HW5 pdf html Rmd template link to submit your .Rmd file
  • Homework 6 (due: March 28, 12:10), Mixed-effects models HW6 pdf html Rmd template link to submit your .Rmd file

Final project

  • Projects description link
  • Projects pre-registration: Due April 27, 2020. Please create the folder Project in your GitHub repository and put a pdf file there. Optionally, you can add a csv file with the preliminary version of your dataset and an Rmd file, if needed.
  • Final versions of project papers: link to submit your files TBA

References

  • Gries, Stefan (2013). Statistics for Linguistics with R: A Practical Introduction (2nd revised edition). Berlin: De Gruyter Mouton. HSE library link
  • Levshina, Natalia (2015). How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam: John Benjamins Publishing Company. HSE library link
  • Baayen, Harald (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics. Cambridge UP. pdf
  • Gries, Stefan (2017). Quantitative Corpus Linguistics with R: A Practical Introduction (2nd edition). Milton Park, Abingdon, Oxon: Routledge. eBook
  • Empirical Bayes
  • Harney, H. L. (2016). Bayesian Inference: Data Evaluation and Decisions (2nd ed.). Springer. eBook
  • McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. eBook
  • ggplot2
  • Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer. eBook
  • R Markdown [https://rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf Rmd Cheat Sheet]

Course Info

This page contains the materials of the course "Linguistic Data: Quantitative Analysis and Visualisation", taught in the HSE Master's program "Computational Linguistics" in the 2019-2020 academic year. Modules: 3-4.