Linguistic Data: Quantitative Analysis and Visualisation: computational linguistics

 

Current revision as of 17:46, 13 June 2021

  • Instructors: Olga Lyashevskaya and Ivan Pozdnyakov
  • Assistant: Lidia Ostyakova
  • HSE Course syllabus: Link
  • Group in Telegram

Materials

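Date | Topics | Links | Video
Jan 11 | Introduction to R. R and RStudio. R basics: functions, variables, types | html practice |
Jan 18 | Data analysis in linguistics. Research design. Types of variables | pdf; data to practice with; read Gries Chapter 1.3 |
Jan 21 | R: vectors, implicit and explicit coercion, recycling rule, missing values | html | video
Feb 4 | R: matrices, arrays, lists, data.frames. Packages. Data import and export | html | [video link]
Feb 18 | R: Conditions and loops. Functions. Apply family. | html html | [video link]
Mar 4 | R: Tidyverse | [https://pozdniakov.github.io/tidy_stats/tidy-intro.html html] | [https://zoom.us/rec/share/Rkd8Lu7Ln2_52VLkwptJ215qB7DJutFGK_d0LluZ_BgRzNCu6Bls6kCIl-D_5G3I.Yo8jdsCoO2qAHoCU?startTime=1614876153000 video]
Mar 18 | R: Advanced Tidyverse | [https://pozdniakov.github.io/tidy_stats/tidyverse-advanced.html html] |
Apr 6 | Visualizations: ggplot2 | [https://pozdniakov.github.io/tidy_stats/vdesc.html#grammar_of_graphics html] [https://raw.githubusercontent.com/LingData2019/LingData2021/main/2021Lab2_ggplot2.Rmd Rmd] | [video link]
Apr 15 | Introduction to inferential statistics. Central limit theorem. Null hypothesis significance testing. P-value. Confidence interval | [https://pozdniakov.github.io/tidy_stats/infer-stats.html html] |
Apr 20 | Tests of difference for two-sample designs: t-test, independent and paired. Nonparametric equivalents: Mann-Whitney and Wilcoxon tests. Tests for nominal data: chi-squared test, Fisher exact test. Effect size tests for big data. | [https://pozdniakov.github.io/tidy_stats/ttest.html html] | [video link]
Apr 29 | Covariance and correlation. Pearson's correlation, Spearman's and Kendall's non-parametric correlation tests. Multiple comparisons problem. | [https://pozdniakov.github.io/tidy_stats/cov-cor.html html] | Playing with students' data [https://github.com/LingData2019/LingData2021/blob/main/cz_daria.zip zip]
May 11 | Linear regression. Multiple linear regression. Family of Generalized and General linear models | [https://raw.githubusercontent.com/LingData2019/LingData2021/main/2021Lab6_LinearRegression.Rmd Rmd] | [https://pozdniakov.github.io/tidy_stats/glm-general.html#general_linear_model html]
May 18 | Logistic regression. Model selection. | [https://raw.githubusercontent.com/LingData2019/LingData2020/master/seminars/2020-03-21/Lab9-logit.Rmd Rmd] |
May 20 | Analysis of variance (ANOVA) | [https://pozdniakov.github.io/tidy_stats/anova.html html] |
May 25 | Mixed-effect linear models. | [https://raw.githubusercontent.com/LingData2019/LingData2021/main/2021Lab8_MixedEffect.Rmd Rmd] | [https://pozdniakov.github.io/tidy_stats/glm-general.html#lme_model html]
Jun 3 | Dimensionality reduction. PCA. CA. MCA | [https://raw.githubusercontent.com/LingData2019/LingData2021/main/Lab10_PCA_CA_.Rmd Rmd] [https://raw.githubusercontent.com/LingData2019/LingData2020/master/seminars/2020-04-11/Lab10-PCA.template.Rmd Rmd template] | [https://www.displayr.com/interpret-correspondence-analysis-plots-probably-isnt-way-think/ hint: how to (not) interpret CA] [https://raw.githubusercontent.com/LingData2019/LingData2021/main/Lab10-MDS.Rmd see also MDS]
Jun 8 | CART: decision trees and random forest. Cluster analysis | [https://raw.githubusercontent.com/LingData2019/LingData2021/main/Lab11-CART.Rmd Rmd: CART] [https://raw.githubusercontent.com/LingData2019/LingData2021/main/Lab12-Clustering.Rmd Rmd: Clustering] |
Jun 10 | Bayesian statistics | |
Jun 17 | Exam | |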

Assignment #1

Research hypothesis: formulate your pilot research hypothesis and fill in the form.
Due date: 2021-01-24 23:00 MSK.

Assignment #2

Data description. Create a repository on GitHub and put the link in the same table form.
Due date: 2021-03-17 23:59 MSK.

Homework: Advanced Tidyverse

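Homework is here: [https://raw.githubusercontent.com/lnpetrova/LingDataHW/main/hw_tidyverse1.Rmd]. Upload your work to a repository on GitHub and put the link here: [https://forms.gle/WM558VEhNZVdbzFu5].
Due date: 2021-04-05 23:59 MSK.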

Homework: From t-tests to data modeling

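Homework is here: [https://raw.githubusercontent.com/LingData2019/LingData2021/main/HW2.Rmd] [http://htmlpreview.github.io/?https://github.com/LingData2019/LingData2021/blob/main/HW2.html html]. Upload your work to a repository on GitHub and put the link here: [https://docs.google.com/forms/d/e/1FAIpQLSeqgeizO9AF2Y9wzvXSz8RzeOUE_uHwj0hvOdWp-ej7vVfbsA/viewform].
Due date: 2021-06-23 18:00 MSK.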

Online course assignments

Complete the following chapters of the Coursera course [5]:

  • Week 1
  • Week 2
  • Week 3
  • Week 4

Final project

The project description and a link to some examples can be found here. Important dates:

  • January 24: research hypothesis
  • March 17: dataset description in Rmd, toy dataset (min. 20 observations)
  • April 14: draft dataset
  • June 10: final version of your dataset, draft final paper
  • 24 hours before the exam starts: paper submission

Course Policy

Score policy:
The Final Score is obtained from the following formula: Final Score = 0.6 × (Homework Score) + 0.4 × (Exam Score). The student is expected to prepare the final project in written form as an electronic document. The exam is conducted as an oral defense of the final project. The Exam Score measures the overall quality of the final project and is an integer from 0 to 10. Parts of the final exam data should be prepared in advance and can be used in regular homework assignments.
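As a worked illustration of the formula, here is a minimal R sketch. The function name `final_score` and the sample scores are hypothetical; only the weights 0.6 and 0.4 come from the score policy. No rounding is applied, because the policy does not state a rounding rule.

```r
# Hypothetical helper: weighted final score from the course formula.
# The weights (0.6 for homework, 0.4 for the exam) are taken from the policy.
final_score <- function(homework, exam) {
  0.6 * homework + 0.4 * exam
}

# Example with made-up scores: homework 8, exam 7
final_score(8, 7)  # 0.6 * 8 + 0.4 * 7 = 7.6
```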

Academic ethics policy: you must do your homework yourself. In case of academic cheating (e.g. if you copy someone else's work), your work will receive a grade of 0 and the program supervisor will be notified. If you feel stuck with the homework, ask the instructors for advice and hints.

Late penalties: if you submit late, your grade is multiplied by exp(-t / 86400), where t is the number of seconds elapsed since the due date. For example, a delay of exactly one day (86,400 seconds) multiplies your grade by exp(-1) ≈ 0.3678794412.
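The late-penalty rule can be sketched in R as follows. The function name `late_multiplier` and the example timestamps are hypothetical, and returning 1 for on-time submissions is an assumption (the policy only describes the late case).

```r
# Hypothetical helper: multiplier applied to a grade for a late submission.
# `due` and `submitted` are POSIXct date-times.
late_multiplier <- function(due, submitted) {
  # Seconds elapsed since the due date (negative if submitted early)
  t <- as.numeric(difftime(submitted, due, units = "secs"))
  if (t <= 0) {
    return(1)  # assumption: no penalty when the work is on time
  }
  exp(-t / 86400)  # 86400 seconds = one day
}

# Made-up example: submitted exactly one day after a 23:59 MSK deadline
due <- as.POSIXct("2021-04-05 23:59:00", tz = "Europe/Moscow")
submitted <- as.POSIXct("2021-04-06 23:59:00", tz = "Europe/Moscow")
late_multiplier(due, submitted)  # equals exp(-1), about 0.368
```

Because the multiplier decays continuously, half a day of delay gives exp(-0.5), about 0.607, rather than a fixed per-day deduction.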

Extensions: you can ask for up to two extensions of homework due dates during the course. Each extension is one week. Extensions due to valid excuses (e.g. illness) do not count toward this limit.


Software

During this course we will use R as a programming language and RStudio as a GUI.

How to install R and RStudio?

1. Download R (you can choose another mirror here if you wish) and install it on your computer. Make sure you do this before installing RStudio.

2. Download RStudio (you need the RStudio Desktop Open Source License version) and install it on your computer. It is recommended to create a shortcut for RStudio during installation.

It is possible to avoid installing anything on your PC by using rstudio.cloud (an online version of RStudio).

To submit assignments successfully, you should be able to create and save R code files (.R) and RMarkdown files (.Rmd).
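As a quick check that the installation works, the following minimal sketch creates one file of each type from the R console. The file names and contents are made up for illustration, and the files go to a temporary folder; in practice you would normally create them through the RStudio menus.

```r
# Write a minimal R script and a minimal RMarkdown file to a temporary folder.
out_dir <- tempdir()
r_file <- file.path(out_dir, "example.R")
rmd_file <- file.path(out_dir, "example.Rmd")

# A two-line R script
writeLines(c("x <- c(1, 2, 3)", "mean(x)"), r_file)

# Three backticks open and close a code chunk in an .Rmd file
fence <- strrep("`", 3)
writeLines(c(
  "---",
  "title: \"Example\"",
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r}"),
  "x <- c(1, 2, 3)",
  "mean(x)",
  fence
), rmd_file)

# Both files should now exist
file.exists(c(r_file, rmd_file))

# Run the script to confirm that R itself works
source(r_file)
```

Opening the .Rmd file in RStudio and pressing Knit should then produce an HTML report, provided the rmarkdown package is installed.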

Online course

Some parts of a MOOC (online course) are included in the program.

References

  • Gries, Stefan (2013). Statistics for Linguistics with R: A Practical Introduction (2nd revised edition). Berlin: De Gruyter Mouton. HSE library link
  • Levshina, Natalia (2015). How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam: John Benjamins Publishing Company. HSE library link
  • Baayen, Harald (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press. link