Linguistic Data: Quantitative Analysis and Visualisation for computer linguists: difference between revisions

From MathINFO
(New page: «==Course info== Dear students, Here will be published the materials of the course '''"Linguistic Data: Quantitative Analysis and Visualisation"''', taught at t...»)
 
 
Line 1: Line 1:
 
==Course info==
 
==Course info==
 +
Dear students,
  
Dear students,  
+
The materials of the course '''"Linguistic Data: Quantitative Analysis and Visualisation"''', taught in the Master's programme '''"Computational Linguistics"''' in the 2018-2019 academic year, will be published here.
  
The materials of the course '''"Linguistic Data: Quantitative Analysis and Visualisation"''', taught in the Master's programme '''"Computational Linguistics"''' in the 2018-2019 academic year, will be published here.
+
* Instructors: Olga Lyashevskaya, George Moroz, Alla Tambovtseva and Ilya Schurov.
  
* Instructors: Olga Lyashevskaya, George Moroz, Alla Tambovtseva and Ilya Schurov.
+
* Modules: 3-4
 
 
* Modules: 3-4  
 
  
 
==Software==
 
==Software==
 +
During this course we will use R as a programming language and RStudio as a GUI.
  
During this course we will use R as a programming language and RStudio as a GUI.
+
'''How to install R and RStudio?'''
 
 
'''How to install R and RStudio?'''  
 
  
1. Download [https://ftp.acc.umu.se/mirror/CRAN/ R] (you can choose another mirror here if you wish) and install it on your computer. Make sure you do this before installing RStudio.
+
1. Download [https://ftp.acc.umu.se/mirror/CRAN/ R] (you can choose another mirror here if you wish) and install it on your computer. Make sure you do this before installing RStudio.
  
2. Download [https://www.rstudio.com/products/rstudio/download/ RStudio] (you need RStudio Desktop with the Open Source License) and install it on your computer. It is recommended to create a shortcut for RStudio during installation.
+
2. Download [https://www.rstudio.com/products/rstudio/download/ RStudio] (you need RStudio Desktop with the Open Source License) and install it on your computer. It is recommended to create a shortcut for RStudio during installation.
  
You can avoid installing anything on your PC by using the online version of [https://rstudio.cloud/ RStudio].
+
You can avoid installing anything on your PC by using the online version of [https://rstudio.cloud/ RStudio].
  
'''How to use RStudio?'''  
+
'''How to use RStudio?'''
  
Read the instructions [http://math-info.hse.ru/f/2018-19/pep/rstudio-instruction-en.pdf here].
+
Read the instructions [http://math-info.hse.ru/f/2018-19/pep/rstudio-instruction-en.pdf here].
  
To submit assignments successfully, you should be able to create and save R code files (.R). However, learning how to create RMarkdown files (.Rmd) would be helpful for your own research projects.
+
To submit assignments successfully, you should be able to create and save R code files (.R) and RMarkdown files (.Rmd).
  
 
==Materials==
 
==Materials==
Line 37: Line 35:
 
| 12.01
 
| 12.01
 
| Something about data: population vs sample, descriptive statistics
 
| Something about data: population vs sample, descriptive statistics
| [http://math-info.hse.ru/f/2018-19/ling-data/seminar1.pdf problems1] [http://rpubs.com/AllaT/ldat-rbasics r-basics]
+
| [http://math-info.hse.ru/f/2018-19/ling-data/seminar1.pdf problems1] [http://rpubs.com/AllaT/ldat-rbasics R-basics]
| RMarkdown: official [https://rmarkdown.rstudio.com/ page], [https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf cheatsheet]
+
| RMarkdown: official [https://rmarkdown.rstudio.com/ page], [https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf cheatsheet]<br>
 
 
 
|-
 
|-
 
| 19.01
 
| 19.01
| [https://raw.githubusercontent.com/LingData2019/LingData/master/2019.01.19_R_class_2.R Population and samples. Working with data in R]
+
| [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/19-01/comp/2019.01.19_R_class_2.R Population and samples. Working with data in R]
| [http://math-info.hse.ru/f/2018-19/ling-data/artists-sizes.txt artisits] [http://math-info.hse.ru/f/2018-19/ling-data/Chi.kuk.2007.csv Chi.kuk]
+
| [http://math-info.hse.ru/f/2018-19/ling-data/artists-sizes.txt artists.txt] [https://raw.githubusercontent.com/agricolamz/r_on_line_course_data/master/orientation.csv orientation.csv]
|
+
| [http://rpubs.com/AllaT/ldat-rplots_1 More] on basic graphs in R<br>
 +
|-
 +
| 26.01
 +
| [https://github.com/LingData2019/LingData/blob/master/seminars/26-01/26-01.Rmd Hypothesis testing]<br>
 +
| [http://rpubs.com/AllaT/ldat-rbinom Binomial-test] [https://raw.githubusercontent.com/LingData2019/LingData/master/data/poetry_last_in_lines.csv poetry.csv] [https://raw.githubusercontent.com/agricolamz/r_on_line_course_data/master/orientation.csv orientation.csv]
 +
| [https://raw.githubusercontent.com/LingData2019/LingData/master/data/freq_rnc_ranked.csv RNC frequency list]<br>
 +
|-
 +
| 02.02
 +
| Student's t-test. Central limit theorem: recall
 +
| [http://rpubs.com/AllaT/ldat-ttest T-test] [http://rpubs.com/AllaT/ldat-lab4 dplyr-ggplot] [http://math-info.hse.ru/f/2018-19/ling-data/icelandic.csv icelandic.csv]<br>
 +
| [http://math-info.hse.ru/f/2018-19/ling-data/dissertation.pdf asp-paper] (Coretta, 2017)<br>
 +
|-
 +
| 09.02
 +
| Confidence Interval. ANOVA
 +
| [http://rpubs.com/AllaT/ldat-conf_ints Conf-intervals] [https://raw.githubusercontent.com/LingData2019/LingData/master/data/poetry_last_in_lines.csv poetry.csv] [http://math-info.hse.ru/f/2018-19/ling-data/icelandic.csv icelandic.csv]<br>[http://rpubs.com/AllaT/ldat-anova Anova]<br><br>
 +
| an interactive [https://rpsychologist.com/d3/CI/ visualization] of CI by K.Magnusson<br>[https://www.cscu.cornell.edu/news/statnews/stnews73.pdf more] on overlapping CI's (by A.Knezevic)<br><br>
 +
|-
 +
| 16.02
 +
| Data manipulation with tidyverse. Visualisation with ggplot2
 +
| [https://lingdata2019.github.io/LingData/Lec_6_tidyverse.html class materials]<br>
 +
| <br>
 +
|-
 +
| 02.03
 +
| Chi-squared and Fisher's exact tests
 +
| [http://rpubs.com/AllaT/ling-chisq Chi-squared-test] [http://math-info.hse.ru/f/2018-19/pep/socling.csv socling.csv] [https://raw.githubusercontent.com/LingData2019/LingData/master/data/elision.csv elision.csv]<br>
 +
| <br>
 +
|-
 +
| 16.03
 +
| Correlation coefficients and a simple linear regression
 +
| [http://rpubs.com/AllaT/ling-corr Corr-regression] [https://raw.githubusercontent.com/LingData2019/LingData/master/data/education.csv education.csv] [https://raw.githubusercontent.com/LingData2019/LingData/master/data/chekhov.csv chekhov.csv]<br>
 +
| [http://guessthecorrelation.com/ guess correlation game] [http://www.sthda.com/english/wiki/visualize-correlation-matrix-using-correlogram correlograms]<br>
 +
|-
 +
| 23.03
 +
| Multiple linear regression
 +
| <br>
 +
| <br>
 +
|-
 +
| 06.04
 +
| Logistic regression
 +
| [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-04-06/Lab10-practice.Rmd Lab10]<br>
 +
| [https://cran.r-project.org/web/packages/jtools/vignettes/summ.html more] on visualising coefficients, [https://www.princeton.edu/~otorres/Regression101R.pdf more] tests<br>
 +
|-
 +
| 16.04
 +
| Linear mixed-effect models
 +
| [https://github.com/LingData2019/LingData/blob/master/seminars/2019-04-13/Lab11.pdf Lab11] [https://github.com/LingData2019/LingData/blob/master/seminars/2019-04-13/Lab11.Rmd Lab11-solutions]<br>
 +
| [http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#model-specification LME models cheat sheet]<br>
 +
|-
 +
| 27.04
 +
| Nested effects. Decision trees and random forest.
 +
| [https://github.com/LingData2019/LingData/blob/master/seminars/2019-04-27/nested-effects.Rmd nested effects] [https://github.com/LingData2019/LingData/blob/master/seminars/2019-04-27/Lab12_class.Rmd Lab 12. Trees and forests] [https://github.com/LingData2019/LingData/blob/master/seminars/2019-04-27/Lab12.Rmd Lab12-solutions]<br>
 +
| <br>
 +
|-
 +
| 18.05
 +
| Dimension reduction. PCA, CA, MCA
 +
| [https://github.com/LingData2019/LingData/blob/master/seminars/2019-05-18/Lab13.Rmd Lab 13. PCA and MCA]<br>
 +
| [http://math-info.hse.ru/f/2015-16/ling-mag-quant/lecture-pca.html#Трёхмерный%20пример 3D example]<br>
 +
|-
 +
| 25.05
 +
| Cluster analysis
 +
| [cluster-analysis] [https://raw.githubusercontent.com/agricolamz/2019_data_analysis_for_linguists/master/data/gospel_freq_words.csv gospels.csv]<br>
 +
| [http://rpubs.com/AllaT/clust_aes more dendrograms] [http://rpubs.com/AllaT/clust1 more CA] [http://rpubs.com/AllaT/clust3 CA quality]<br>
 +
|-
 +
| 01.06
 +
| [https://lingdata2019.github.io/LingData/Lec_15_Comp_Bayes.html Bayesian statistics]
 +
| [https://docs.google.com/forms/d/e/1FAIpQLSf8lFqFkAMn01y7pWlISJXcuAC2IYMD-BIWoTRzjHYtacVEgg/viewform Lab 15]
 +
| <br>
 +
|-
 +
| 08.06
 +
| Simulation statistics
 +
| [http://math-info.hse.ru/f/2017-18/py-prog/scores2.csv scores2.csv]<br>
 +
| <br>
 +
|}
 +
===R seminars in pdf===
 +
12 January: [http://math-info.hse.ru/f/2018-19/ling-data/Rbasics_TEO-pdf.pdf R-basics], 26 January: [http://math-info.hse.ru/f/2018-19/ling-data/binom-test-pdf.pdf Binomial-test]
 +
 
 +
2 February: [http://math-info.hse.ru/f/2018-19/ling-data/Lab4-comp.pdf T-test], 9 February: [http://math-info.hse.ru/f/2018-19/ling-data/conf-ints.pdf Conf-intervals], [http://math-info.hse.ru/f/2018-19/ling-data/anova.pdf Anova]
 +
 
 +
2 March: [http://math-info.hse.ru/f/2018-19/ling-data/chisq-test.pdf Chi-squared-test], 16 March: [http://math-info.hse.ru/f/2018-19/ling-data/CorrLab.pdf Corr-regression]
 +
 
 +
===R seminars in .R and .Rmd===
 +
12 January: [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-01-12/r-basics.R R-basics.R], [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-01-12/r-basics.Rmd R-basics.Rmd],
 +
26 January: [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-01-26/26-01.Rmd Binomial-test.Rmd]
  
|}
+
2 February: [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-02-02/LingData-Lab4-comp.Rmd T-test.Rmd], [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-02-02/t-test.R T-test.R], 9 February: [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-02-09/09-02.Rmd Conf-intervals.Rmd], [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-02-09/conf-ints.R Conf-intervals.R], [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-02-09/anova.Rmd Anova.Rmd], [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-02-09/anova.R Anova.R]
==R seminars in pdf==
 
  
12 January: [http://math-info.hse.ru/f/2018-19/ling-data/Rbasics_TEO-pdf.pdf r-basics]  
+
2 March: [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-03-02/chisq-02-03.Rmd Chi-squared-test.Rmd], [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-03-02/chisq-test.R Chi-squared-test.R], 16 March: [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-03-16/corr-regression.R Corr-regression.R], [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-03-16/corr-regression.Rmd Corr-regression.Rmd]
  
 
==Homeworks==
 
==Homeworks==
* [http://math-info.hse.ru/f/2018-19/ling-data/LingData-HW1.pdf Homework 1] (deadline: 27 January, 23:59)
+
* [http://math-info.hse.ru/f/2018-19/ling-data/LingData-HW1.pdf Homework 1] (deadline: 27 January, 23:59), [https://docs.google.com/forms/d/e/1FAIpQLSehhy-j0Y2LIIfen6kqlz2Za5QUYvcZQ_7m3L5PAUrQbMDXwA/viewform link] to submit
  
 
* [http://math-info.hse.ru/f/2018-19/ling-data/LingData-HW2-comp.pdf Homework 2] (deadline: 03 February, 23:59)
 
* [http://math-info.hse.ru/f/2018-19/ling-data/LingData-HW2-comp.pdf Homework 2] (deadline: 03 February, 23:59)
 +
 +
* [http://math-info.hse.ru/f/2018-19/ling-data/LingData-HW3.pdf Homework 3] (deadline: 10 February, 23:59), [https://raw.githubusercontent.com/LingData2019/LingData/master/hw/LingData-HW3.Rmd Rmd-file] to fill in, [https://www.dropbox.com/request/pRLK5I77npkKvYs3GJuH link] to submit your .Rmd file
 +
 +
* [http://math-info.hse.ru/f/2018-19/ling-data/LingData-HW4-comp-final.pdf Homework 4] (deadline: 22 February, 23:59), [https://raw.githubusercontent.com/LingData2019/LingData/master/hw/LingData-HW4-comp-final.Rmd Rmd-file] to fill in, [https://www.dropbox.com/request/jF9012PmjXbovPwVgfO2 link] to submit your .Rmd file
 +
 +
* [http://math-info.hse.ru/f/2018-19/ling-data/LingData-HW5.pdf Homework 5] (deadline: 3 March, 23:59), [https://raw.githubusercontent.com/LingData2019/LingData/master/hw/rmd-templates/HW5-template.Rmd Rmd-file] to fill in, [https://www.dropbox.com/request/dU7Gv3M8K1acA84JXvg3 link] to submit your .Rmd file
 +
 +
* [http://math-info.hse.ru/f/2018-19/ling-data/LingData-HW6.pdf Homework 6] (deadline: 15 May, 23:59), [https://raw.githubusercontent.com/LingData2019/LingData/master/hw/rmd-templates/HW6-template.Rmd Rmd-file] to fill in, [https://www.dropbox.com/request/xhfz7jCzkUp31EPOszp4 link] to submit your .Rmd file
 +
 +
==Final project==
 +
* [http://math-info.hse.ru/f/2018-19/ling-data/projects.pdf Projects description]
 +
 +
* Project topics: [https://docs.google.com/spreadsheets/d/1QxLq2JTO9p7xJFo-KP3XyrRbexYwxQhNqTElDAGJEls/edit?usp=sharing link] to the table to fill in
 +
 +
* Projects pre-registration (deadline: 28 April, 23:59): [https://www.dropbox.com/request/I6XC3W9GkiAB3aQisxJq link] to submit your file
 +
 +
* Final versions of projects: [https://www.dropbox.com/request/Ds4JI7vs9rAhLAG3tI6o link] to submit your files

Current revision as of 04:09, 7 February 2020

Course info

Dear students,

The materials of the course "Linguistic Data: Quantitative Analysis and Visualisation", taught in the Master's programme "Computational Linguistics" in the 2018-2019 academic year, will be published here.

  • Instructors: Olga Lyashevskaya, George Moroz, Alla Tambovtseva and Ilya Schurov.
  • Modules: 3-4

Software

During this course we will use R as a programming language and RStudio as a GUI.

How to install R and RStudio?

1. Download R (you can choose another mirror here if you wish) and install it on your computer. Make sure you do this before installing RStudio.

2. Download RStudio (you need RStudio Desktop with the Open Source License) and install it on your computer. It is recommended to create a shortcut for RStudio during installation.

You can avoid installing anything on your PC by using the online version of RStudio.

How to use RStudio?

Read the instructions here.

To submit assignments successfully, you should be able to create and save R code files (.R) and RMarkdown files (.Rmd).
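
As a quick check after installation, something like the following could be run in the RStudio console or saved as an .R file. This is only a sketch: the file name check_setup.R and the variable names are illustrative assumptions, not part of the course materials.

```r
# check_setup.R (hypothetical file name): a quick sanity check after installing R and RStudio

# Report the version of R that is currently running
r_version <- R.version.string
print(r_version)

# Check whether the rmarkdown package, which is needed to knit .Rmd files, is available
has_rmarkdown <- requireNamespace("rmarkdown", quietly = TRUE)
if (!has_rmarkdown) {
  message("rmarkdown is not installed; install.packages(\"rmarkdown\") would add it")
}
```

If the version string prints and no message about rmarkdown appears, both .R and .Rmd files should be workable on that machine.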

Materials

{|
! Date !! Topic of the lecture !! Seminar !! Optional
|-
| 12.01 || Something about data: population vs sample, descriptive statistics || problems1, R-basics || RMarkdown: official page, cheatsheet
|-
| 19.01 || Population and samples. Working with data in R || artists.txt, orientation.csv || More on basic graphs in R
|-
| 26.01 || Hypothesis testing || Binomial-test, poetry.csv, orientation.csv || RNC frequency list
|-
| 02.02 || Student's t-test. Central limit theorem: recall || T-test, dplyr-ggplot, icelandic.csv || asp-paper (Coretta, 2017)
|-
| 09.02 || Confidence Interval. ANOVA || Conf-intervals, poetry.csv, icelandic.csv<br>Anova || an interactive visualization of CI by K.Magnusson<br>more on overlapping CI's (by A.Knezevic)
|-
| 16.02 || Data manipulation with tidyverse. Visualisation with ggplot2 || class materials ||
|-
| 02.03 || Chi-squared and Fisher's exact tests || Chi-squared-test, socling.csv, elision.csv ||
|-
| 16.03 || Correlation coefficients and a simple linear regression || Corr-regression, education.csv, chekhov.csv || guess correlation game, correlograms
|-
| 23.03 || Multiple linear regression || ||
|-
| 06.04 || Logistic regression || Lab10 || more on visualising coefficients, more tests
|-
| 16.04 || Linear mixed-effect models || Lab11, Lab11-solutions || LME models cheat sheet
|-
| 27.04 || Nested effects. Decision trees and random forest. || nested effects, Lab 12. Trees and forests, Lab12-solutions ||
|-
| 18.05 || Dimension reduction. PCA, CA, MCA || Lab 13. PCA and MCA || 3D example
|-
| 25.05 || Cluster analysis || [cluster-analysis], gospels.csv || more dendrograms, more CA, CA quality
|-
| 01.06 || Bayesian statistics || Lab 15 ||
|-
| 08.06 || Simulation statistics || scores2.csv ||
|}

R seminars in pdf

12 January: R-basics, 26 January: Binomial-test

2 February: T-test, 9 February: Conf-intervals, Anova

2 March: Chi-squared-test, 16 March: Corr-regression

R seminars in .R and .Rmd

12 January: R-basics.R, R-basics.Rmd, 26 January: Binomial-test.Rmd

2 February: T-test.Rmd, T-test.R, 9 February: Conf-intervals.Rmd, Conf-intervals.R, Anova.Rmd, Anova.R

2 March: Chi-squared-test.Rmd, Chi-squared-test.R, 16 March: Corr-regression.R, Corr-regression.Rmd

Homeworks

Final project

  • Project topics: link to the table to fill in
  • Projects pre-registration (deadline: 28 April, 23:59): link to submit your file
  • Final versions of projects: link to submit your files