From: Benjamin Mako Hill
Date: Wed, 2 Jul 2014 18:22:48 +0000 (-0700)
Subject: initial version of homepage
X-Git-Url: https://projects.mako.cc/source/redirect-tools/commitdiff_plain/9d33a112f7d061be0f582dfe80b07ffb804952ee

initial version of homepage
---

diff --git a/docs/edits_over_pages.png b/docs/edits_over_pages.png
new file mode 100644
index 0000000..4da8d50
Binary files /dev/null and b/docs/edits_over_pages.png differ
diff --git a/docs/homepage.rst b/docs/homepage.rst
new file mode 100644
index 0000000..206afa8
--- /dev/null
+++ b/docs/homepage.rst
@@ -0,0 +1,139 @@
+Consider the Redirect
+=======================
+
+__ https://en.wikipedia.org/wiki/Main_Page
+
+.. figure:: seattle_redirect.png
+   :align: right
+   :figwidth: 614px
+
+   Example of a redirect displayed in the `Seattle, Washington page on
+   English Wikipedia`__ which is a redirect to `Seattle`__. Note the
+   small redirect notice below the title.
+
+__ https://en.wikipedia.org/wiki/Seattle,_Washington
+__ https://en.wikipedia.org/wiki/Seattle
+
+In wikis, **redirects** are special pages in that silently take readers
+from the page they are visiting to another page in in the wiki. In the
+`English Wikipedia`__, redirects make up more than half of all article
+pages.
+
+Different data sources of handle redirects differently. For example,
+`the MediaWiki API`__ will automatically "follow" redirects but the `XML
+database dumps`__ treat redirects like normal articles. In both cases,
+redirects are often invisible to reseachers.
+
+__ https://www.mediawiki.org/wiki/API:Main_page
+__ https://meta.wikimedia.org/wiki/Data_dumps
+
+Because redirects constitute a majority of all pages and see a large
+portion of all traffic, Wikipedia reseachers need to take redirects into
+account or their findings may be incomplete or incorrect. For example,
+the histogram on this page shows the distribution fo edits across pages
+in Wikipedia for every page, and for non-redirects only. Because
+redirects are almost never edited, the distributions are very different.
+Similarly, because redirects are viewed but almost never edited, any
+study of views over articles should also take redirects into account.
+
+.. figure:: edits_over_pages.png
+   :align: right
+   :figwidth: 614px
+
+   Histograms of pages in the article namespace based on the number of
+   edits on a log scale. The red histogram includes all pages in the
+   article namespace (N = 9,729,989). The blue graphs includes only
+   pages that were not redirects at the point of data collection (N =
+   5,327,561).
+
+Because redirects can change over time, the snapshots of redirects
+stored by Wikimedia and published by Wikimedia Foundation are
+incomplete. Taking redirects into account fully involves looking at the
+content of every single revision of every article to determine both
+when and where pages redirect.
+
+Much more detail can be found in `Consider the Redirect: A Missing
+Dimension of Wikipedia Research`__ — a short paper that we have written
+to acccompany this dataset and these tools. If you use this software or
+these data, we would appreciate if you cite the paper:
+
+  *Hill, Benjamin Mako and Aaron Shaw. "Consider the Redirect: A Missing
+  Dimension of Wikipedia Research." In Proceedings of the 10th
+  International Symposium on Open Collaboration (OpenSym 2014). ACM
+  Press, 2014.*
+
+__ http://mako.cc/academic/hill_shaw-consider_the_redirect.pdf
+
+Generating Redirect Spells
+=============================
+
+Generating redirect spells from an MediaWiki XML dump involves two steps:
+
+1. Searching the full text of every revision of every page in a dump to
+   determine if any given revision is a redirect.
+
+2. Using the results of (1) to generate a list of "spells" that describe
+   periods of time that articles in a wiki redirect to other articles.
+
+We have `publicly released software in Python and R to do these two
+steps`__ under the `GNU GPL version 3`__. The software is designed for
+people already comfortable with working with MediaWiki XML dumps and the
+tools and software necessary to do this. We have provided
+`documentation`__ on how to use these tools.
+
+__ http://projects.mako.cc/source/?p=redirect-tools
+__ http://www.gnu.org/licenses/gpl-3.0.html
+__ README.html
+
+You can download the software from our git repository like::
+
+    git clone git://projects.mako.cc/redirect-tools
+
+Detailed documentation on how to use the software is in available in our
+README file.
+
+Redirect Spell Data
+=========================
+
+In our paper `Consider the Redirect`__, we present an analysis of
+redirect data from English Wikipedia in the dump created in October
+2012. You can download `the dump files we used`__ from `the Wikimedia
+Foundation dataset archive`__. Because generating these dumps can be
+computationally intense, we have published the output of the software
+above run on the this dump. This includes 9,277,563 redirect spells that
+our software identified and is the dataset used in the paper.
+
+You can download the dataset in the following formats:
+
+- `RData (240MB)`__ — Suitable for use in GNU R
+- `bzip2 compressed tab seperated values (178MB)`__ — Suitable for use
+  in other languages and statistical packages.
+
+__ http://mako.cc/academic/hill_shaw-consider_the_redirect.pdf
+__ http://dumps.wikimedia.org/enwiki/20121001/
+__ http://dumps.wikimedia.org/
+__ enwiki_201210-redirect_spells-v1.RData
+__ enwiki_201210-redirect_spells-v1.tsv.bz2
+
+More Information
+==================
+
+For details about the dataset, why it is important, and for examples on
+how it can be used to come to better findings in Wikipedia research,
+please read `the companion paper`__.
+
+__ http://mako.cc/academic/hill_shaw-consider_the_redirect.pdf
+
+If you notice issues or bugs in our data or `code`__, contact `Benjamin
+Mako Hill`__ or `Aaron Shaw`__.
+
+__ http://projects.mako.cc/source/?p=redirect-tools
+__ http://mako.cc/contact/
+__ http://aaronshaw.org/
+
+Patches and improvements are welcome! Details on `how to produce and send
+a patch using git are online`__.
+
+__ http://projects.mako.cc/source/
+
+
diff --git a/docs/seattle_redirect.png b/docs/seattle_redirect.png
new file mode 100644
index 0000000..3f27066
Binary files /dev/null and b/docs/seattle_redirect.png differ
diff --git a/homepage.rst b/homepage.rst
deleted file mode 100644
index 5e532b1..0000000
--- a/homepage.rst
+++ /dev/null
@@ -1,107 +0,0 @@
-Consider the Redirect
-=======================
-
-In wikis, **redirects** are special pages in that silently take readers
-from the page they are visiting to another page in in the wiki. In the
-`English Wikipedia`__, redirects make up more than half of all article
-pages.
-
-__ https://en.wikipedia.org/wiki/Main_Page
-
-.. image:: example_image
-
-Different data sources of handle redirects differently. `The MediaWiki
-API`__ will automatically "follow" redirects but the `XML database
-dumps`__ treat redirects like normal articles. In both cases, redirects
-are often invisible to reseachers.
-
-__ https://www.mediawiki.org/wiki/API:Main_page
-__ https://meta.wikimedia.org/wiki/Data_dumps
-
-Because redirects constitute a majority of all pages and see a large
-portion of all traffic, Wikipedia reseachers need to take redirects into
-account or their findings may be incomplete or incorrect. For example,
-the following image shows the distribution fo edits across pages in
-Wikipedia for every page, and for non-redirects only.
-
-.. image:: redirects_whatever.png
-
-Because redirects are almost never edited, the distributions are
-very different. Because redirects are viewed but almost never edited,
-any study of views over articles should also take redirects into
-account.
-
-Because redirects can change over time, the snapshots of redirects
-stored by Wikimedia and published by Wikimedia Foundation are
-incomplete. Taking redirects into account fully involves looking at the
-content of every single revision of every article to determine both
-when and where pages redirect.
-
-Much more detail can be found in `Consider the Redirect: A Missing
-Dimension of Wikipedia Research`__ — a short paper that we have written
-to acccompany this dataset and these tools. If you use this software or
-these data, we would appreciate if you cite the paper:
-
-  *Hill, Benjamin Mako and Aaron Shaw. "Consider the Redirect: A Missing
-  Dimension of Wikipedia Research." In Proceedings of the 10th
-  International Symposium on Open Collaboration (OpenSym 2014). ACM
-  Press, 2014.*
-
-__ hill_shaw-consider_the_redirect.pdf
-
-Generating Redirect Spells
-=============================
-
-Generating redirect spells from an MediaWiki XML dump involves two steps:
-
-1. Searching the full text of every revision of every page in a dump to
-   determine if any given revision is a redirect.
-
-2. Using the results of (1) to generate a list of "spells" that describe
-   periods of time that articles in a wiki redirect to other articles.
-
-We have software in Python and R to do these two steps under the `GNU GPL
-version 3`__. The software is designed for people already comfortable
-with working with MediaWiki XML dumps and the tools and software
-necessary to do this.
-
-__ gpl
-
-You can download the software from our git repository like::
-
-    git clone WHATEVER
-
-Detailed documentation on how to use the software is in available in our
-README file.
-
-Redirect Spell Data
-=========================
-
-In Consider the Redirect, we present an analysis of redirect data from
-English Wikipedia in the dump created on DATE. You can download the dump
-files from HERE. Because generating these dumps can be computationally
-intense, we have published the output of the software above run on the
-this dump. This includes 9,277,563 redirect spells that our software
-identified and is the dataset used in the paper.
-
-You can download the dataset in the following formats:
-
-- RData (240MB) — Suitable for use in GNU R
-- bzip2 compressed tab seperated values — Suitable
-
-More Information
-==================
-
-For details about the dataset, why it is important, and for examples on
-how it can be used to come to better findings in Wikipedia research,
-read the paper.
-
-If you notice issues or bugs in the data or script, contact `Benjamin
-Mako Hill`__ or `Aaron Shaw`__.
-
-__ http://mako.cc/contact/
-__ aarono
-
-Patches and improvements are welcome! Details on how to produce and send
-a patch using git are online here.
-
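
The two steps described under "Generating Redirect Spells" in the homepage text can be sketched in Python roughly as follows. This is only an illustration and is not the released redirect-tools code: the names `REDIRECT_RE`, `redirect_target`, `build_spells`, and `Spell` are hypothetical, the regular expression recognizes only the English `#REDIRECT [[Target]]` form, and revisions are assumed to arrive as (ISO 8601 timestamp, wikitext) pairs already extracted from a dump.

```python
import re
from typing import Iterable, List, NamedTuple, Optional, Tuple

# Simplified pattern (an assumption): matches "#REDIRECT [[Target]]" at the
# start of the revision text, case-insensitively, and captures the target
# title up to any "|", "#", or "]]".
REDIRECT_RE = re.compile(r"^\s*#redirect\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


class Spell(NamedTuple):
    """One period during which `title` redirected to `target` (hypothetical type)."""
    title: str
    target: str
    start: str            # timestamp of the first revision in the spell
    end: Optional[str]    # timestamp of the revision that ended it, or None


def redirect_target(text: str) -> Optional[str]:
    """Step 1: return the redirect target of one revision, or None."""
    match = REDIRECT_RE.match(text)
    return match.group(1).strip() if match else None


def build_spells(title: str,
                 revisions: Iterable[Tuple[str, str]]) -> List[Spell]:
    """Step 2: collapse per-revision results into redirect spells.

    `revisions` holds (timestamp, wikitext) pairs for a single page.
    ISO 8601 timestamps sort correctly as plain strings.
    """
    spells: List[Spell] = []
    current: Optional[Tuple[str, str]] = None  # (target, start timestamp)

    for timestamp, text in sorted(revisions):
        target = redirect_target(text)
        # Close the open spell when the target changes or the redirect ends.
        if current is not None and current[0] != target:
            spells.append(Spell(title, current[0], current[1], timestamp))
            current = None
        # Open a new spell when this revision redirects and none is open.
        if target is not None and current is None:
            current = (target, timestamp)

    # A spell still open at the last revision has no end timestamp.
    if current is not None:
        spells.append(Spell(title, current[0], current[1], None))
    return spells
```

Calling `build_spells` once per page and concatenating the results would give a table with one row per spell. The column layout of the published dataset may differ from the `Spell` fields assumed here.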