Difference between the pages «Наука о данных» (Data Science) and «Linguistic Data: Quantitative Analysis and Visualisation: computational linguistics»

From MathINFO
(Difference between pages)
 
 
Line 1: Line 1:
* The course is taught by Ilya Schurov.
+
* Instructors: Ilya Schurov and Olga Lyashevskaya
  
== Materials ==
+
== Materials ==
{|class='wikitable'
+
{| class="wikitable"
 
|-
 
|-
! date !! topic !! lecture notes !! video !! additional materials !! homework
+
! Date !! Topics !! Links
 
|-
 
|-
| January 10
+
| Jan 18 || Introduction. Quantitative linguistic research and data types. R basics || [https://docs.google.com/presentation/d/1VUIUa3Db5n4dsD_HeA3e-mz55zK8uPrko3yu207pKUk/edit?usp=sharing Intro Slides] [https://github.com/LingData2019/LingData2020/tree/master/seminars/2020-01-18 Lab 01: intro to R]
| Getting started. Python as a calculator
 
| [https://nbviewer.jupyter.org/github/ischurov/pythonhse/blob/master/Lecture%201.ipynb annotated lecture notes], [https://gist.github.com/ischurov/2740543f1a6d6afd8ab02ced90789ecf raw notebook from the class] (without comments)
 
| [https://www.youtube.com/watch?v=5Y5tKPKhurA video]
 
| {{PT}} [http://pythontutor.ru/lessons/int_and_float/ calculations], [http://pythontutor.com Python visualizer]
 
|rowspan=2| [http://nbviewer.jupyter.org/url/python.math-hse.info/static/assignments_release/nes-datascience2020/ps01/ps01.ipynb HW 1]
 
 
|-
 
|-
| January 14
+
| Jan 25 || Hypothesis testing. Binomial test. R: dataframes, tidyverse || [https://github.com/LingData2019/LingData2020/tree/master/seminars/2020-01-25 Lab 02] [https://datacamp-community-prod.s3.amazonaws.com/e63a8f6b-2aa3-4006-89e0-badc294b179c tidyverse cheat sheet]
| Lists
 
| [https://nbviewer.jupyter.org/github/ischurov/pythonhse/blob/master/Lecture%202.ipynb annotated lecture notes] (we covered the material up to, but not including, the section «Assigning and copying lists»), [https://nbviewer.jupyter.org/github/ischurov/pythonhse/blob/master/Lecture%203.ipynb#Ввод-вывод-списков split and join], [https://gist.github.com/65ca76fd47c32f4f2520060149a97574 raw notebook from the class]
 
| [https://www.youtube.com/watch?v=kBu3g-ITjY4 video]
 
| {{PT}} [http://pythontutor.ru/lessons/lists/ lists]
 
 
|-
 
|-
| January 21
+
| Feb 1 || Central limit theorem. Variance. Student's t-test. R: simulating data, boxplots, density plots, binomial test, t-test ||
| Lists and the <code>for</code> loop
+
[https://github.com/LingData2019/LingData2020/tree/master/seminars/2020-02-01 Lab 03: ]
| [https://nbviewer.jupyter.org/github/ischurov/pythonhse/blob/master/Lecture%202.ipynb annotated lecture notes] (starting from the section «Assigning and copying lists»), [https://nbviewer.jupyter.org/github/ischurov/pythonhse/blob/master/Lecture%203.ipynb#Нумерация-элементов-списка enumerate], [https://nbviewer.jupyter.org/github/ischurov/pythonhse/blob/master/Lecture%205.ipynb#Создание-словарей-и-функция-zip() zip] (the part about dictionaries can be skipped), [https://gist.github.com/9cbca8b16c7744dcd94113a52676f260 raw notebook from the class].
+
[https://raw.githubusercontent.com/LingData2019/LingData2020/master/seminars/2020-02-01/Lab3-ttest-binom-matrices.Rmd Rmd] [https://htmlpreview.github.io/?https://github.com/LingData2019/LingData2020/blob/master/seminars/2020-02-01/Lab3-ttest-binom-matrices.html html] [https://rforpublichealth.blogspot.com/2014/02/ggplot2-cheatsheet-for-visualizing.html Viz. distributions]
| [https://youtu.be/kBu3g-ITjY4?t=2301 video]
 
| {{PT}} [http://pythontutor.ru/lessons/for_loop/ for loop]
 
| [https://nbviewer.jupyter.org/url/python.math-hse.info/static/assignments_release/nes-datascience2020/ps02/ps02.ipynb HW 2]
 
 
|-
 
|-
| January 24
+
| Feb 8 || Two-sample t-test. Paired t-test. Confidence intervals. <!-- TODO: Non-parametric tests --> || [https://github.com/LingData2019/LingData2020/tree/master/seminars/2020-02-08 Lab 04: ] [https://raw.githubusercontent.com/LingData2019/LingData2020/master/seminars/2020-02-08/Lab4-confint-pairedttest-anova.Rmd Rmd] [https://github.com/LingData2019/LingData2020/raw/master/seminars/2020-02-08/Lab4-confint-pairedttest-anova.pdf pdf] [https://agricolamz.github.io/2018-MAG_R_course/Lec_4_stats.html CI slides] [https://istats.shinyapps.io/ExploreCoverage/ CI demo]
| Conditionals. The while loop.
 
| [https://nbviewer.jupyter.org/github/ischurov/pythonhse/blob/master/Lecture%203.ipynb#Проверка-условий conditionals], [https://gist.github.com/ischurov/8d6309c42b91b269dc2fa7de3bd0b558 raw notebook from the class]
 
| [https://youtu.be/uzgaCV8KZA0?t=1353 conditionals]
 
| {{PT}}: [http://pythontutor.ru/lessons/ifelse/ conditionals], [http://pythontutor.ru/lessons/while/ while loop]
 
|
 
 
|-
 
|-
| January 28
+
| Feb 15 || ANOVA. Correlations || [https://github.com/LingData2019/LingData2020/tree/master/seminars/2020-02-15 Lab 05:] [https://raw.githubusercontent.com/LingData2019/LingData2020/master/seminars/2020-02-15/Lab5_Corr.Rmd Rmd] [https://github.com/LingData2019/LingData2020/blob/master/seminars/2020-02-15/Lab5_Corr.pdf pdf]
| Functions. Dictionaries
 
| [https://nbviewer.jupyter.org/github/ischurov/pythonhse/blob/master/Lecture%204.ipynb functions], [https://nbviewer.jupyter.org/github/ischurov/pythonhse/blob/master/Lecture%205.ipynb#Словари dictionaries], [https://gist.github.com/861c593afb172ac4926bec5f758d544c raw notebook from the class]
 
| [https://www.youtube.com/watch?v=NYrYSFyCg4w functions], [https://www.youtube.com/watch?v=z8bu_b5BboI dictionaries]
 
| {{PT}}: [http://pythontutor.ru/lessons/functions/ functions], [http://pythontutor.ru/lessons/dicts/ dictionaries]
 
| [https://nbviewer.jupyter.org/url/python.math-hse.info/static/assignments_release/nes-datascience2020/ps03/ps03.ipynb HW 3]
 
 
|-
 
|-
| January 31
+
| Feb 22 || Tests for categorical data. Chi-squared test. Fisher exact test. Effect size || [https://github.com/LingData2019/LingData2020/tree/master/seminars/2020-02-22 Lab 06:] [https://raw.githubusercontent.com/LingData2019/LingData2020/master/seminars/2020-02-22/Lab6-chisq-Fischer-effectsize.Rmd Rmd] [https://github.com/LingData2019/LingData2020/blob/master/seminars/2020-02-22/Lab6-chisq-Fischer-effectsize.pdf pdf] [https://www.datacamp.com/community/tutorials/contingency-tables-r DataCamp: contingency tables]
| More on dictionaries. Sets. List comprehensions (and more). Sorting
 
| [https://nbviewer.jupyter.org/github/ischurov/pythonhse/blob/master/Lecture%205.ipynb#Словари dictionaries and list comprehensions], [http://nbviewer.math-hse.info/github/ischurov/pythonhse/blob/master/Lecture%207.ipynb#Множества sets], [http://nbviewer.math-hse.info/github/ischurov/pythonhse/blob/master/Lecture%206.ipynb#Сортировка sorting], [https://nbviewer.jupyter.org/gist/ischurov/5ecb2b35d13edd2582fa5ad4bf056b46 raw notebook from the class]
 
| [https://www.youtube.com/watch?v=z8bu_b5BboI dictionaries], [https://www.youtube.com/watch?v=1w0NG-pfcsg&feature=youtu.be&t=9m17s sorting]
 
| [https://docs.python.org/3/howto/sorting.html Sorting howto] (in English)
 
|
 
 
|-
 
|-
| February 4
+
| Feb 29 || Linear regression. Multivariate linear regression. Dummy variables || [https://lindeloev.github.io/tests-as-linear/linear_tests_cheat_sheet.pdf Common statistical tests & linear models ]
| More on sorting. <code>kwargs</code>. <code>lambda</code> functions. Reading files
 
| [http://nbviewer.jupyter.org/github/ischurov/pythonhse/blob/master/Lecture%207.ipynb#Файловый-ввод-вывод working with files], [https://nbviewer.jupyter.org/gist/ischurov/1e332fcfd1c082ecf77497e1cbb8f0d0 raw notebook from the class]
 
| [https://youtu.be/KaWGNPgUOHo?t=2816 files]
 
|
 
|
 
 
|-
 
|-
| February 7
+
| || Dimensionality reduction. PCA. MDS. t-SNE ||  
| Writing files. Object-oriented programming
 
| [http://nbviewer.jupyter.org/github/ischurov/pythonhse/blob/master/Lecture%207.ipynb#Файловый-ввод-вывод working with files], [https://nbviewer.jupyter.org/gist/3095c3884df7129e014dd4134d26d121 raw notebook from the class]
 
| [https://youtu.be/KaWGNPgUOHo?t=2816 files]
 
| [https://docs.python.org/3/tutorial/classes.html classes in Python] (in English, official documentation)
 
| [http://nbviewer.jupyter.org/url/python.math-hse.info/static/assignments_release/nes-datascience2020/ps04/ps04.ipynb HW 4]
 
 
|-
 
|-
| February 11
+
| || CA, MCA. Clustering ||
| Inheritance. Iterators and generators
+
|-
| [https://nbviewer.jupyter.org/gist/ischurov/8fe367fcfc38cd9d4fd4e546165fdeea raw notes]
+
| || Logistic regression. Model selection ||
|  
+
|-
| [https://docs.python.org/3/tutorial/classes.html classes in Python] (in English, official documentation), [https://twitter.com/ilyaschurov/status/945727980688625665 Twitter thread about Python] (the beginning is about iterators)
+
| || Fixed and random effects. Linear mixed-effects models ||
|
+
|-
 +
| || Bootstrap. Decision trees. Decision forests ||
 +
|-
 +
|  || Bayesian statistics ||
 +
|-
 +
|  || Bayesian statistics II ||  
 
|-
 
|-
| February 14
 
| The <code>numpy</code> library and a little <code>matplotlib</code>
 
| [http://nbviewer.jupyter.org/github/ischurov/pythonhse/blob/master/Lecture%2011.ipynb annotated lecture notes on numpy], [https://nbviewer.jupyter.org/gist/ischurov/54aa0a96c12e3e39c2f930e1754e0c60 raw notebook]
 
| [http://www.youtube.com/watch?v=A84rlgoVnMY numpy]
 
| [https://docs.scipy.org/doc/numpy-dev/user/quickstart.html numpy quickstart], [http://matplotlib.org/users/pyplot_tutorial.html pyplot tutorial], [http://matplotlib.org/gallery.html matplotlib gallery]
 
| [http://nbviewer.jupyter.org/url/python.math-hse.info/static/assignments_release/nes-datascience2020/ps05/ps05.ipynb HW 5]
 
 
|}
 
|}
  
== Software ==
+
== Software ==
* [https://www.anaconda.com/distribution/ Anaconda]: you need the version with Python 3.7.
+
During this course we will use R as a programming language and RStudio as a GUI.
* To open an ipynb file in Jupyter Notebook, the easiest way is to upload it to the working directory using Jupyter Notebook's own ''upload'' function. Similarly, to get a file out of Jupyter Notebook, you can use ''Download → ipynb''.
+
 
 +
How to install R and RStudio?
 +
 
 +
1. Download [https://cran.r-project.org/ R] (you can choose another mirror here if you wish) and install it on your computer. Make sure you do this before installing RStudio.
 +
 
 +
2. Download [https://rstudio.com/products/rstudio/ RStudio] (you need the RStudio Desktop edition with the Open Source License) and install it on your computer. It is recommended to create a shortcut for RStudio during installation.
 +
 
 +
It is possible to avoid installing anything on your PC by using [https://rstudio.cloud rstudio.cloud] (an online version of RStudio).
 +
 
 +
For successful submission of assignments you should be able to create and save R code files (.R) and RMarkdown files (.Rmd).
 +
 
 +
 
 +
== Homeworks ==
 +
* Homework 1 (deadline: February 16, 23:59), Chapters 1, 2, 3, and 5 of the [https://www.datacamp.com/courses/free-introduction-to-r DataCamp] course "Introduction to R". Please fill in this [https://docs.google.com/forms/d/e/1FAIpQLSdjgKBM5JSo6D6ajhrWWfFG1ktcKgDfbdK_jQ_ZbW9GwNLzpQ/viewform form]. 
 +
* Homework 2 (deadline: February 23, 23:59), Chapters 4 and 6 of the [https://www.datacamp.com/courses/free-introduction-to-r DataCamp] course "Introduction to R". 
 +
After completing the course please provide either the [https://support.datacamp.com/hc/en-us/articles/360001548814-How-can-I-share-my-certificate-Statement-of-Accomplishment- Statement of Accomplishment] or a screenshot of your learning progress via [link TBA]. 
 +
The deadlines for Homework 1 and 2 are cancelled because the free version of the DataCamp online course is unavailable. Stay tuned!
 +
* Homework 3 (deadline: February 9, 12:00), Hypothesis testing, binomial test, t-test. [https://github.com/LingData2019/LingData2020/blob/master/hw/hw-pdf/LingData-HW3-comp.pdf HW3 pdf] [https://htmlpreview.github.io/?https://github.com/LingData2019/LingData2020/blob/master/hw/LingData-HW3-comp.html html] [https://github.com/LingData2019/LingData2020/blob/master/hw/LingData-HW3-comp.Rmd Rmd template]
 +
* Homework 4 (deadline: February 29, 12:00), T-test and ANOVA, reproducing some results from Leivada & Westergaard 2019 [https://github.com/LingData2019/LingData2020/blob/master/hw/hw-pdf/LingData-HW4-comp.pdf HW4 pdf] [https://htmlpreview.github.io/?https://github.com/LingData2019/LingData2020/blob/master/hw/hw-html/LingData-HW4-comp.html html] [https://github.com/LingData2019/LingData2020/blob/master/hw/LingData-HW4-comp.Rmd Rmd template]
 +
* Homework 5
 +
 
 +
== Final project ==
 +
* Projects description [https://github.com/LingData2019/LingData2020/blob/master/projects.pdf link] 
 +
* Projects pre-registration: link to submit your file TBA 
 +
* Final versions of project papers: link to submit your files TBA 
 +
 
 +
 
 +
== References ==
 +
* Gries, Stefan (2013). Statistics for Linguistics with R : A Practical Introduction (Vol. 2nd revised edition). Berlin: De Gruyter Mouton. [http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsebk&AN=604318 HSE library link]
 +
* Levshina, Natalia (2015). How to Do Linguistics with R : Data Exploration and Statistical Analysis. Amsterdam: John Benjamins Publishing Company. [http://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=nlebk&AN=1093048 HSE library link]
 +
* Baayen, Harald (2008). Analyzing Linguistic Data: A practical introduction to statistics. Cambridge UP. [http://www.sfs.uni-tuebingen.de/~hbaayen/publications/baayenCUPstats.pdf pdf]
 +
 
 +
* Gries, Stefan (2017). Quantitative Corpus Linguistics with R : A Practical Introduction (Vol. Second edition). Milton Park, Abingdon, Oxon: Routledge. eBook
 +
* Empirical Bayes
 +
* Harney, H. L. (2016). Bayesian Inference : Data Evaluation and Decisions (Vol. 2nd ed). Springer. eBook 
 +
* McElreath, R. (2016). Statistical Rethinking : A Bayesian Course with Examples in R and Stan. eBook
 +
* ggplot2
 +
* Wickham, H. (2016). ggplot2 : Elegant Graphics for Data Analysis. Springer. eBook
 +
* R markdown [https://rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf Rmd Cheat Sheet]
 +
 
 +
== Course Info ==
 +
 
 +
This page contains the materials of the course "Linguistic Data: Quantitative Analysis and Visualisation", taught in the HSE Master's program "Computational Linguistics" in the 2019-2020 academic year. Modules: 3-4.

Revision as of 20:10, 22 February 2020

  • Instructors: Ilya Schurov and Olga Lyashevskaya

Materials

Date Topics Links
Jan 18 Introduction. Quantitative linguistic research and data types. R basics Intro Slides Lab 01: intro to R
Jan 25 Hypothesis testing. Binomial test. R: dataframes, tidyverse Lab 02 tidyverse cheat sheet
Feb 1 Central limit theorem. Variance. Student's t-test. R: simulating data, boxplots, density plots, binomial test, t-test Lab 03: Rmd html Viz. distributions
Feb 8 Two-sample t-test. Paired t-test. Confidence intervals. Lab 04: Rmd pdf CI slides CI demo
Feb 15 ANOVA. Correlations Lab 05: Rmd pdf
Feb 22 Tests for categorical data. Chi-squared test. Fisher exact test. Effect size Lab 06: Rmd pdf DataCamp: contingency tables
Feb 29 Linear regression. Multivariate linear regression. Dummy variables Common statistical tests & linear models
Dimensionality reduction. PCA. MDS. t-SNE
CA, MCA. Clustering
Logistic regression. Model selection
Fixed and random effects. Linear mixed-effects models
Bootstrap. Decision trees. Decision forests
Bayesian statistics
Bayesian statistics II

Software

During this course we will use R as a programming language and RStudio as a GUI.

How to install R and RStudio?

1. Download R (you can choose another mirror here if you wish) and install it on your computer. Make sure you do this before installing RStudio.

2. Download RStudio (you need the RStudio Desktop edition with the Open Source License) and install it on your computer. It is recommended to create a shortcut for RStudio during installation.

It is possible to avoid installing anything on your PC by using rstudio.cloud (an online version of RStudio).

For successful submission of assignments you should be able to create and save R code files (.R) and RMarkdown files (.Rmd).


Homeworks

  • Homework 1 (deadline: February 16, 23:59), Chapters 1, 2, 3, and 5 of the DataCamp course "Introduction to R". Please fill in this form.
  • Homework 2 (deadline: February 23, 23:59), Chapters 4 and 6 of the DataCamp course "Introduction to R".

After completing the course please provide either the Statement of Accomplishment or a screenshot of your learning progress via [link TBA]. The deadlines for Homework 1 and 2 are cancelled because the free version of the DataCamp online course is unavailable. Stay tuned!

  • Homework 3 (deadline: February 9, 12:00), Hypothesis testing, binomial test, t-test. HW3 pdf html Rmd template
  • Homework 4 (deadline: February 29, 12:00), T-test and ANOVA, reproducing some results from Leivada & Westergaard 2019 HW4 pdf html Rmd template
  • Homework 5

Final project

  • Projects description link
  • Projects pre-registration: link to submit your file TBA
  • Final versions of project papers: link to submit your files TBA


References

  • Gries, Stefan (2013). Statistics for Linguistics with R : A Practical Introduction (Vol. 2nd revised edition). Berlin: De Gruyter Mouton. HSE library link
  • Levshina, Natalia (2015). How to Do Linguistics with R : Data Exploration and Statistical Analysis. Amsterdam: John Benjamins Publishing Company. HSE library link
  • Baayen, Harald (2008). Analyzing Linguistic Data: A practical introduction to statistics. Cambridge UP. pdf
  • Gries, Stefan (2017). Quantitative Corpus Linguistics with R : A Practical Introduction (Vol. Second edition). Milton Park, Abingdon, Oxon: Routledge. eBook
  • Empirical Bayes
  • Harney, H. L. (2016). Bayesian Inference : Data Evaluation and Decisions (Vol. 2nd ed). Springer. eBook
  • McElreath, R. (2016). Statistical Rethinking : A Bayesian Course with Examples in R and Stan. eBook
  • ggplot2
  • Wickham, H. (2016). ggplot2 : Elegant Graphics for Data Analysis. Springer. eBook
  • R markdown [https://rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf Rmd Cheat Sheet]

Course Info

This page contains the materials of the course "Linguistic Data: Quantitative Analysis and Visualisation", taught in the HSE Master's program "Computational Linguistics" in the 2019-2020 academic year. Modules: 3-4.