Linguistic Data: Quantitative Analysis and Visualisation for theoretical linguists: differences between revisions

From MathINFO
(New page: «==Course info== Dear students, Here will be published the materials of the course '''"Linguistic Data: Quantitative Analysis and Visualisation"''', taught at t...»)
 
 
Line 1: Line 1:
 
==Course info==
 
==Course info==
 +
Dear students,
  
Dear students,  
+
The materials of the course '''"Linguistic Data: Quantitative Analysis and Visualisation"''' will be published here. The course is taught in the Master's programme '''"Linguistic Theory and Language Description"''' in the 2018-2019 academic year.
  
The materials of the course '''"Linguistic Data: Quantitative Analysis and Visualisation"''' will be published here. The course is taught in the Master's programme '''"Linguistic Theory and Language Description"''' in the 2018-2019 academic year.  
+
* Instructors: Olga Lyashevskaya, George Moroz, Alla Tambovtseva and Ilya Schurov.
  
* Instructors: Olga Lyashevskaya, George Moroz, Alla Tambovtseva and Ilya Schurov.
+
* Modules: 3-4
 
 
* Modules: 3-4  
 
  
 
==Software==
 
==Software==
 +
During this course we will use R as a programming language and RStudio as a GUI.
  
During this course we will use R as a programming language and RStudio as a GUI.
+
'''How to install R and RStudio?'''
 
 
'''How to install R and RStudio?'''  
 
  
1. Download [https://ftp.acc.umu.se/mirror/CRAN/ R] (you can choose another mirror here if you wish) and install it on your computer. Make sure you do this before installing RStudio.  
+
1. Download [https://ftp.acc.umu.se/mirror/CRAN/ R] (you can choose another mirror here if you wish) and install it on your computer. Make sure you do this before installing RStudio.
  
2. Download [https://www.rstudio.com/products/rstudio/download/ RStudio] (you need RStudio Desktop Open Source License) and install it on your computer. It is recommended to create a shortcut for RStudio during installation.  
+
2. Download [https://www.rstudio.com/products/rstudio/download/ RStudio] (you need RStudio Desktop Open Source License) and install it on your computer. It is recommended to create a shortcut for RStudio during installation.
  
It is possible to avoid installing anything on your PC by using the online version of [https://rstudio.cloud/ RStudio].  
+
It is possible to avoid installing anything on your PC by using the online version of [https://rstudio.cloud/ RStudio].
  
'''How to use RStudio?'''  
+
'''How to use RStudio?'''
  
Read the instructions [http://math-info.hse.ru/f/2018-19/pep/rstudio-instruction-en.pdf here].  
+
Read the instructions [http://math-info.hse.ru/f/2018-19/pep/rstudio-instruction-en.pdf here].
  
For successful submission of assignments you should be able to create and save R code files (.R). However, it would be helpful for your own research projects to learn how to create RMarkdown files (.Rmd).  
+
For successful submission of assignments you should be able to create and save R code files (.R) and RMarkdown files (.Rmd).
  
 
==Materials==
 
==Materials==
Line 36: Line 34:
 
|-
 
|-
 
| 12.01
 
| 12.01
| Something about data: population vs sample, descriptive statistics
+
| Something about data: population vs sample<br>Descriptive statistics <br><br>
| [http://math-info.hse.ru/f/2018-19/ling-data/seminar1.pdf problems1] [http://rpubs.com/AllaT/ldat-rbasics r-basics]
+
| [http://math-info.hse.ru/f/2018-19/ling-data/seminar1.pdf problems1] [http://rpubs.com/AllaT/ldat-rbasics R-basics]<br>
| RMarkdown: official [https://rmarkdown.rstudio.com/ page], [https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf cheatsheet]
+
| RMarkdown: official [https://rmarkdown.rstudio.com/ page],<br>[https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf cheatsheet]<br><br>
 
 
 
|-
 
|-
 
| 19.01
 
| 19.01
| Population and samples. Working with data in R
+
| Population and samples. Working with data in R<br>
| [http://math-info.hse.ru/f/2018-19/ling-data/seminar2.pdf problems2] [http://rpubs.com/AllaT/ldat-samples samples] [http://math-info.hse.ru/f/2018-19/ling-data/artists-sizes.txt artisits]
+
| [http://math-info.hse.ru/f/2018-19/ling-data/seminar2.pdf problems2] [http://rpubs.com/AllaT/ldat-samples R-samples] [http://math-info.hse.ru/f/2018-19/ling-data/artists-sizes.txt artists.txt]<br>[http://rpubs.com/AllaT/ldat-rvectors R-vectors] [http://rpubs.com/AllaT/ldat-dataframes R-dataframes] [http://math-info.hse.ru/f/2018-19/ling-data/Chi.kuk.2007.csv orientation.csv] <br><br>
[http://rpubs.com/AllaT/ldat-rvectors r-vectors] [http://rpubs.com/AllaT/ldat-dataframes r-dataframes] [http://math-info.hse.ru/f/2018-19/ling-data/Chi.kuk.2007.csv Chi.kuk]  
+
| [http://rpubs.com/AllaT/ldat-rplots_1 more] on basic graphs in R<br>
 
 
 
 
| [more] on graphs
 
 
 
 
|-
 
|-
 
| 26.01
 
| 26.01
 
| Statistical hypotheses testing
 
| Statistical hypotheses testing
| [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/26-01/poetry_last_in_lines.csv poetry]
+
| [http://rpubs.com/AllaT/ldat-rbinom Binomial-test] [https://raw.githubusercontent.com/LingData2019/LingData/master/data/poetry_last_in_lines.csv poetry.csv]
 +
| <br>
 +
|-
 +
| 02.02
 +
| Student's t-test. Central limit theorem<br>
 +
| [http://rpubs.com/AllaT/ldat-ttest T-test] [http://math-info.hse.ru/f/2018-19/ling-data/icelandic.csv icelandic.csv]<br>
 +
| [http://math-info.hse.ru/f/2018-19/ling-data/dissertation.pdf asp-paper] (Coretta, 2017)<br>
 +
|-
 +
| 09.02
 +
| Confidence Intervals
 +
| [http://rpubs.com/AllaT/ldat-conf_ints Conf-intervals] [https://raw.githubusercontent.com/LingData2019/LingData/master/data/poetry_last_in_lines.csv poetry.csv] [http://math-info.hse.ru/f/2018-19/ling-data/icelandic.csv icelandic.csv]<br>
 +
| an interactive [https://rpsychologist.com/d3/CI/ visualization] of CI by K.Magnusson<br>[https://www.cscu.cornell.edu/news/statnews/stnews73.pdf more] on overlapping CI's (by A.Knezevic)<br><br>
 +
|-
 +
| 16.02
 +
| Data manipulation with tidyverse. Visualisation with ggplot2<br>
 +
| [https://lingdata2019.github.io/LingData/Lec_6_tidyverse.html class materials]<br>
 +
| <br>
 +
|-
 +
| 02.03
 +
| Chi-squared and Fisher's exact tests<br>
 +
| [http://rpubs.com/AllaT/ling-chisq Chi-squared-test] [https://raw.githubusercontent.com/LingData2019/LingData/master/data/elision.csv elision.csv] [http://math-info.hse.ru/f/2018-19/pep/socling.csv socling.csv]<br>
 +
| <br>
 +
|-
 +
| 16.03
 +
| Correlation coefficients and simple linear regression<br>
 +
| [http://rpubs.com/AllaT/ling-corr Corr-regression] [https://raw.githubusercontent.com/LingData2019/LingData/master/data/education.csv education.csv] [https://raw.githubusercontent.com/LingData2019/LingData/master/data/chekhov.csv chekhov.csv]<br>
 +
| [http://guessthecorrelation.com/ guess correlation game]<br>
 +
|-
 +
| 23.03
 +
| Multiple comparisons. ANOVA
 +
| [http://rpubs.com/AllaT/lingdat-anova-mc Anova] [http://math-info.hse.ru/f/2018-19/ling-data/icelandic.csv icelandic.csv]<br>
 +
| [http://www.sthda.com/english/wiki/visualize-correlation-matrix-using-correlogram correlograms] [http://tylervigen.com/page?page=1 spurious correlations]<br>
 +
|-
 +
| 06.04
 +
| Multiple linear regression<br>
 +
| [http://rpubs.com/AllaT/lingdat-multreg Multiple-regression] [http://math-info.hse.ru/f/2018-19/ling-data/english.csv english.csv]<br>
 +
| [https://cran.r-project.org/web/packages/jtools/vignettes/summ.html more] on visualising coefficients, [https://www.princeton.edu/~otorres/Regression101R.pdf more] tests<br>
 +
|-
 +
| 13.04
 +
| Logistic regression
 +
| [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-04-06/Lab10-practice.Rmd Lab10] [Lab10-solutions]<br>
 +
| [https://cran.r-project.org/web/packages/jtools/vignettes/summ.html more] on visualising coefficients<br>
 +
|-
 +
| 27.04
 +
| More on model diagnostics. Mixed-effects models
 +
| [http://rpubs.com/AllaT/lingdat-me Mixed-effects] [https://raw.githubusercontent.com/LingData2019/LingData/master/data/duryagin_ReductionRussian.txt ReductionRussian.txt]<br>
 +
| [http://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#model-specification LME in R]<br>
 +
|-
 +
| 18.05
 +
| Decision trees and random forest.
 +
| [https://github.com/LingData2019/LingData/blob/master/seminars/2019-04-27/Lab12_class.Rmd Lab 12. Trees and forests] [https://github.com/LingData2019/LingData/blob/master/seminars/2019-04-27/Lab12.Rmd Code]
 +
| <br>
 +
|-
 +
| 25.05
 +
| PCA<br>
 +
| [https://lingdata2019.github.io/LingData/Lec_14_PCA.html class materials]<br>
 +
| <br>
 +
|-
 +
| 01.06
 +
| Clustering
 +
| [https://raw.githubusercontent.com/agricolamz/2018-MAG_R_course/master/data/baltic.csv swadesh.csv]
 +
| <br>
 +
|-
 +
| 08.06
 +
| NeighborNet. Simulation statistics
 +
| [http://math-info.hse.ru/f/2018-19/ling-data/prefixes.txt prefixes.txt] [http://math-info.hse.ru/f/2018-19/ling-data/08-06.R R code] [http://math-info.hse.ru/f/2017-18/py-prog/scores2.csv scores2.csv]
 +
| <br>
 +
|}
 +
===R seminars in pdf===
 +
12 January: [http://math-info.hse.ru/f/2018-19/ling-data/Rbasics_TEO-pdf.pdf R-basics], 19 January: [http://math-info.hse.ru/f/2018-19/ling-data/r-more-vectors-pdf.pdf R-vectors], [http://math-info.hse.ru/f/2018-19/ling-data/r-dataframes-pdf.pdf R-dataframes], [http://math-info.hse.ru/f/2018-19/ling-data/r-samples-pdf.pdf R-samples], 26 January: [http://math-info.hse.ru/f/2018-19/ling-data/binom-test-pdf.pdf Binomial-test]
 +
 
 +
2 February: [http://math-info.hse.ru/f/2018-19/ling-data/t-test.pdf T-test], 9 February: [http://math-info.hse.ru/f/2018-19/ling-data/conf-ints.pdf Conf-intervals]
 +
 
 +
02 March: [http://math-info.hse.ru/f/2018-19/ling-data/chisq-test.pdf Chi-squared-test], 16 March: [http://math-info.hse.ru/f/2018-19/ling-data/CorrLab.pdf Corr-regression], 23 March: [http://math-info.hse.ru/f/2018-19/ling-data/anova-theo.pdf Anova]
  
|
+
6 April: [http://math-info.hse.ru/f/2018-19/ling-data/mult-reg-pdf.pdf Multiple-regression], 27 April: [http://math-info.hse.ru/f/2018-19/ling-data/mixed-effects.pdf Mixed-effects]
  
|}
+
===R seminars in .R and .Rmd===
==R seminars in pdf==
+
12 January: [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-01-12/r-basics.R R-basics.R],  [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-01-12/r-basics.Rmd R-basics.Rmd], 19 January: [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-01-19/teo/r-more-vectors.R R-vectors.R], 
 +
[https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-01-19/teo/r-more-vectors.Rmd R-vectors.Rmd]
 +
[https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-01-19/teo/r-dataframes.R R-dataframes.R], [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-01-19/teo/r-dataframes.Rmd R-dataframes.Rmd],
 +
[https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-01-19/teo/r-samples.Rmd R-samples.Rmd],
 +
26 January: [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-01-26/26-01.Rmd Binomial-test.Rmd]
  
12 January: [http://math-info.hse.ru/f/2018-19/ling-data/Rbasics_TEO-pdf.pdf r-basics], 19 January: [http://math-info.hse.ru/f/2018-19/ling-data/r-more-vectors-pdf.pdf r-vectors], [http://math-info.hse.ru/f/2018-19/ling-data/r-dataframes-pdf.pdf r-dataframes], [http://math-info.hse.ru/f/2018-19/ling-data/r-samples-pdf.pdf r-samples]  
+
2 February: [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-02-02/t-test.R T-test.R], [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-02-02/02-02.Rmd T-test.Rmd], 9 February: [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-02-09/09-02.Rmd Conf-intervals.Rmd]
 +
[https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-02-09/conf-ints.R Conf-intervals.R]
  
==R seminars in .R and .Rmd==
+
2 March: [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-03-02/chisq-02-03.Rmd Chi-squared-test.Rmd], [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-03-02/chisq-test.R Chi-squared-test.R],
 +
16 March: [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-03-16/corr-regression.R Corr-regression.R], [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-03-16/corr-regression.Rmd Corr-regression.Rmd],
 +
23 March: [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-03-23/anova.R Anova.R], [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-03-23/anova.Rmd Anova.Rmd]
  
12 January: [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/12-01/r-basics.R r-basics.R], [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/12-01/Rbasics-TEO.Rmd r-basics.Rmd], 19 January: [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/19-01/teo/r-more-vectors.R r-vectors.R], 
+
6 April: [Multiple-regression.R], [Multiple-regression.Rmd], 27 April: [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-04-27/mixed-effects.R Mixed-models.R], [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/2019-04-27/mixed-effects.Rmd Mixed-models.Rmd]
[https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/19-01/teo/r-more-vectors.Rmd r-vectors.Rmd]
 
[https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/19-01/teo/r-dataframes.R r-dataframes.R], [https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/19-01/teo/r-dataframes.Rmd r-dataframes.Rmd],
 
[https://raw.githubusercontent.com/LingData2019/LingData/master/seminars/19-01/teo/r-samples.Rmd r-samples.Rmd]  
 
  
 
==Homeworks==
 
==Homeworks==
* [http://math-info.hse.ru/f/2018-19/ling-data/LingData-HW1.pdf Homework 1] (deadline: 27 January, 23:59)  
+
* [http://math-info.hse.ru/f/2018-19/ling-data/LingData-HW1.pdf Homework 1] (deadline: 27 January, 23:59), [https://docs.google.com/forms/d/e/1FAIpQLSehhy-j0Y2LIIfen6kqlz2Za5QUYvcZQ_7m3L5PAUrQbMDXwA/viewform link] to submit
  
 
* [http://math-info.hse.ru/f/2018-19/ling-data/LingData-HW2.pdf Homework 2] (deadline: 03 February, 23:59)
 
* [http://math-info.hse.ru/f/2018-19/ling-data/LingData-HW2.pdf Homework 2] (deadline: 03 February, 23:59)
 +
 +
* [http://math-info.hse.ru/f/2018-19/ling-data/LingData-HW3.pdf Homework 3] (deadline: 10 February, 23:59), [https://raw.githubusercontent.com/LingData2019/LingData/master/hw/LingData-HW3.Rmd Rmd-file] to fill in, [https://www.dropbox.com/request/1e7CcztPAO3WklIsN0fU link] to submit your .Rmd file
 +
 +
* [http://math-info.hse.ru/f/2018-19/ling-data/LingData-HW4-teo.pdf Homework 4] (deadline: 19 February, 23:59), [https://raw.githubusercontent.com/LingData2019/LingData/master/hw/LingData-HW4-teo.Rmd Rmd-file] to fill in, [https://www.dropbox.com/request/LbUBzdF19dcwX9nnMXpk link] to submit your .Rmd file
 +
 +
* [http://math-info.hse.ru/f/2018-19/ling-data/LingData-HW5.pdf Homework 5] (deadline: 3 March, 23:59), [https://raw.githubusercontent.com/LingData2019/LingData/master/hw/rmd-templates/HW5-template.Rmd Rmd-file] to fill in, [https://www.dropbox.com/request/BY9JbVrYFDwXkRVBS2ci link] to submit your .Rmd file
 +
 +
* [http://math-info.hse.ru/f/2018-19/ling-data/LingData-HW6.pdf Homework 6] (deadline: 15 May, 23:59), [https://raw.githubusercontent.com/LingData2019/LingData/master/hw/rmd-templates/HW6-template.Rmd Rmd-file] to fill in, [https://www.dropbox.com/request/rBTOCpEsNXy6hkzO2f26 link] to submit your .Rmd file
 +
 +
==Final project==
 +
* [http://math-info.hse.ru/f/2018-19/ling-data/projects.pdf Projects description]
 +
 +
* Project topics: [https://docs.google.com/spreadsheets/d/1QxLq2JTO9p7xJFo-KP3XyrRbexYwxQhNqTElDAGJEls/edit?usp=sharing link] to the table to fill in
 +
 +
* Projects pre-registration (deadline: 28 April, 23:59): [https://www.dropbox.com/request/I6XC3W9GkiAB3aQisxJq link] to submit your file
 +
 +
* Final versions of projects:  [https://www.dropbox.com/request/Ds4JI7vs9rAhLAG3tI6o link] to submit your files

Current revision as of 04:11, 7 February 2020

Course info

Dear students,

The materials of the course "Linguistic Data: Quantitative Analysis and Visualisation" will be published here. The course is taught in the Master's programme "Linguistic Theory and Language Description" in the 2018-2019 academic year.

  • Instructors: Olga Lyashevskaya, George Moroz, Alla Tambovtseva and Ilya Schurov.
  • Modules: 3-4

Software

During this course we will use R as a programming language and RStudio as a GUI.

How to install R and RStudio?

1. Download R (you can choose another mirror here if you wish) and install it on your computer. Make sure you do this before installing RStudio.

2. Download RStudio (you need RStudio Desktop Open Source License) and install it on your computer. It is recommended to create a shortcut for RStudio during installation.

It is possible to avoid installing anything on your PC by using the online version of RStudio.
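
After installation, a short check in the RStudio console can confirm that R runs correctly. The lines below are a minimal sketch using base R only; they are an illustration and not part of the course materials.

```r
# Show which version of R the current session is using
print(R.version.string)

# Run a trivial computation to confirm that code executes
x <- c(2, 4, 6)
mean_x <- mean(x)
print(mean_x)
```

If both lines print without an error message, R is available to RStudio.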

How to use RStudio?

Read the instructions here.

For successful submission of assignments you should be able to create and save R code files (.R) and RMarkdown files (.Rmd).
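
In RStudio both file types are normally created through the File > New File menu. To show what such files contain, here is a minimal sketch that writes one of each from code. The file names are hypothetical and the files go to a temporary folder; the content is an illustration, not a course template.

```r
# Hypothetical file names, written to a temporary folder for illustration
script_path <- file.path(tempdir(), "example.R")
rmd_path <- file.path(tempdir(), "example.Rmd")

# A .R file is plain text that contains only R code and comments
writeLines(c(
  "# example.R: compute a mean",
  "x <- c(1, 2, 3)",
  "mean(x)"
), script_path)

# A .Rmd file combines a YAML header, ordinary text, and R code chunks.
# A chunk opens with three backticks followed by {r} and closes with
# three backticks; the marker is built here with strrep().
fence <- strrep("`", 3)
writeLines(c(
  "---",
  "title: \"Example\"",
  "output: html_document",
  "---",
  "",
  "Some text describing the analysis.",
  "",
  paste0(fence, "{r}"),
  "x <- c(1, 2, 3)",
  "mean(x)",
  fence
), rmd_path)

# Check that both files were created
print(file.exists(c(script_path, rmd_path)))
```

The .R file can be run line by line in the console, while the .Rmd file is meant to be rendered ("knitted") into a report that shows text, code, and output together.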

Materials

Date | Topic of the lecture | Seminar | Optional
12.01 | Something about data: population vs sample. Descriptive statistics | problems1, R-basics | RMarkdown: official page, cheatsheet
19.01 | Population and samples. Working with data in R | problems2, R-samples, artists.txt, R-vectors, R-dataframes, orientation.csv | more on basic graphs in R
26.01 | Statistical hypotheses testing | Binomial-test, poetry.csv |
02.02 | Student's t-test. Central limit theorem | T-test, icelandic.csv | asp-paper (Coretta, 2017)
09.02 | Confidence Intervals | Conf-intervals, poetry.csv, icelandic.csv | an interactive visualization of CI by K.Magnusson; more on overlapping CI's (by A.Knezevic)
16.02 | Data manipulation with tidyverse. Visualisation with ggplot2 | class materials |
02.03 | Chi-squared and Fisher's exact tests | Chi-squared-test, elision.csv, socling.csv |
16.03 | Correlation coefficients and simple linear regression | Corr-regression, education.csv, chekhov.csv | guess correlation game
23.03 | Multiple comparisons. ANOVA | Anova, icelandic.csv | correlograms, spurious correlations
06.04 | Multiple linear regression | Multiple-regression, english.csv | more on visualising coefficients, more tests
13.04 | Logistic regression | Lab10, [Lab10-solutions] | more on visualising coefficients
27.04 | More on model diagnostics. Mixed-effects models | Mixed-effects, ReductionRussian.txt | LME in R
18.05 | Decision trees and random forest | Lab 12. Trees and forests, Code |
25.05 | PCA | class materials |
01.06 | Clustering | swadesh.csv |
08.06 | NeighborNet. Simulation statistics | prefixes.txt, R code, scores2.csv |

R seminars in pdf

12 January: R-basics, 19 January: R-vectors, R-dataframes, R-samples, 26 January: Binomial-test

2 February: T-test, 9 February: Conf-intervals

2 March: Chi-squared-test, 16 March: Corr-regression, 23 March: Anova

6 April: Multiple-regression, 27 April: Mixed-effects

R seminars in .R and .Rmd

12 January: R-basics.R, R-basics.Rmd, 19 January: R-vectors.R, R-vectors.Rmd, R-dataframes.R, R-dataframes.Rmd, R-samples.Rmd, 26 January: Binomial-test.Rmd

2 February: T-test.R, T-test.Rmd, 9 February: Conf-intervals.Rmd, Conf-intervals.R

2 March: Chi-squared-test.Rmd, Chi-squared-test.R, 16 March: Corr-regression.R, Corr-regression.Rmd, 23 March: Anova.R, Anova.Rmd

6 April: [Multiple-regression.R], [Multiple-regression.Rmd], 27 April: Mixed-models.R, Mixed-models.Rmd

Homeworks

Final project

  • Project topics: link to the table to fill in
  • Projects pre-registration (deadline: 28 April, 23:59): link to submit your file
  • Final versions of projects: link to submit your files