From: Benjamin Mako Hill
Date: Mon, 30 Jun 2014 20:11:02 +0000 (-0700)
Subject: added homepage and readme
X-Git-Url: https://projects.mako.cc/source/redirect-tools/commitdiff_plain/14977b34053a82232ac5aa1323e3ec894af627a0

added homepage and readme
---

diff --git a/README.rst b/README.rst
index e0c456a..f1dcc81 100644
--- a/README.rst
+++ b/README.rst
@@ -1,24 +1,25 @@
 MediaWiki Redirect Tools
 =======================================================================
-| Author: Benjamin Mako Hill
-| Homepage: http://networkcollectiv.es/wiki-redirects/
-| License: GNU GPLv3 or any later version (see COPYING)
+| **Author:** `Benjamin Mako Hill`__
+| **Homepage:** http://networkcollectiv.es/wiki-redirects/
+| **License:** `GNU GPLv3 or any later version`__ (see COPYING)
+| **Description:** Tools to generate a redirect spells dataset from
+  "raw" MediaWiki XML dumps like those published by the Wikimedia
+  Foundation.

-If you use this software for research, please cite the following paper in any
-resulting publication:
+__ http://mako.cc/
+__ http://www.gnu.org/copyleft/gpl.html

-  Hill, Benjamin Mako and Aaron Shaw. "Consider the Redirect: A Missing
-  Dimension of Wikipedia Research." In Proceedings of the 10th International
-  Symposium on Open Collaboration (OpenSym 2014). ACM Press, 2014.
+If you use this software for research, please **cite the following
+paper** in any resulting publication:

-Overview:
+  *Hill, Benjamin Mako and Aaron Shaw. "Consider the Redirect: A Missing
+  Dimension of Wikipedia Research." In Proceedings of the 10th International
+  Symposium on Open Collaboration (OpenSym 2014). ACM Press, 2014.*

-  These tools help you generate a redirect spells dataset from "raw" MediaWiki
-  XML dumps like those published by the Wikimedia foundation.
+**Overview:**

-To uese these tools, you will need to a MediaWiki dump file. For Wikimedia
-Foundation projects, you can download them all from:
+To use these tools, you will need to start with a MediaWiki dump
+file. For Wikimedia Foundation projects, you can download them all from:
 http://dumps.wikimedia.org/

 Wikis from Wikia.com and other Wikimedia projects all use the same XML format
@@ -31,14 +32,13 @@ Wikipedia`__ that I downloaded with the following command::

 __ https://simple.wikipedia.org/

-Before you start, you may also want to change the default directories for
-writing intermediate output files.
-
-The default directories for writing and reading files are at the top of the
-file `redirect_tools.R` and can be changed by editing that file. By default,
-all files will be written to the subdirectory "./output" in the local
-directory. If you want to use the default directories, you will still need to
-create them with a command like this::
+Before you start, you may also want to change the default directories
+for writing intermediate output files. The default directories for
+writing and reading files are at the top of the file `redirect_tools.R`
+and can be changed by editing that file. By default, all files will be
+written to the subdirectory "./output" in the local directory. If you
+want to use the default directories, you will still need to create them
+with a command like this::

   mkdir output/redir output/spells

@@ -62,11 +62,15 @@ Run the file `01-extract_redirects.py` to build a dataset of revisions or edits
 that marks every revision as either containing a redirect, or not.
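+
+As a rough example (the dump file name below is only an illustration,
+and this assumes a bzip2-compressed dump), a shell pipeline like the
+following will write the per-revision redirect data into the output
+directory created above::
+
+  # decompress the dump and pipe it through the extraction script
+  bzcat simplewiki-pages-meta-history.xml.bz2 | \
+      python 01-extract_redirects.py > output/redir/simple.tsv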

 The script `01-extract_redirects.py` takes a MediaWiki dump file on STDIN and
-outputs a TSV file on STDOUT of the following form:
+outputs a TSV file on STDOUT of the following form:
+
++---------+-------------+--------------------------------+------------+---------+----------+--------------------+
+| page.id | revision.id | page.title                     | timestamp  | deleted | redirect | target             |
++=========+=============+================================+============+=========+==========+====================+
+| 1935456 | 17563584    | Mikhail Alekseevich Lavrentiev | 1116962833 | FALSE   | FALSE    | NA                 |
++---------+-------------+--------------------------------+------------+---------+----------+--------------------+
+| 1935456 | 22034930    | Mikhail Alekseevich Lavrentiev | 1125245577 | FALSE   | TRUE     | Mikhail Lavrentyev |
++---------+-------------+--------------------------------+------------+---------+----------+--------------------+

-> page.id revision.id page.title timestamp deleted redirect target
-> 1935456 17563584 Mikhail Alekseevich Lavrentiev 1116962833 FALSE FALSE NA
-> 1935456 22034930 Mikhail Alekseevich Lavrentiev 1125245577 FALSE TRUE Mikhail Lavrentyev

 In this case, the first revision of the article "Mikhail Alekseevich
 Lavrentiev" was not a redirect but the second is a redirect to "Mikhail
diff --git a/homepage.rst b/homepage.rst
new file mode 100644
index 0000000..5e532b1
--- /dev/null
+++ b/homepage.rst
@@ -0,0 +1,107 @@
+Consider the Redirect
+=======================
+
+In wikis, **redirects** are special pages that silently take readers
+from the page they are visiting to another page in the wiki. In the
+`English Wikipedia`__, redirects make up more than half of all article
+pages.
+
+__ https://en.wikipedia.org/wiki/Main_Page
+
+.. image:: example_image
+
+Different data sources handle redirects differently. `The MediaWiki
+API`__ can automatically "follow" redirects, but the `XML database
+dumps`__ treat redirects like normal articles. In both cases, redirects
+are often invisible to researchers.
+
+__ https://www.mediawiki.org/wiki/API:Main_page
+__ https://meta.wikimedia.org/wiki/Data_dumps
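+
+For example, a single request like the one below (shown purely as an
+illustration, not as part of these tools) asks the API to resolve a
+redirect and report where it points::
+
+  # the "redirects" parameter makes the API follow redirects and list
+  # each requested title together with the title it resolves to
+  curl 'https://en.wikipedia.org/w/api.php?action=query&titles=Mikhail_Alekseevich_Lavrentiev&redirects=1&format=json'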
+
+Because redirects constitute a majority of all pages and see a large
+portion of all traffic, Wikipedia researchers need to take redirects
+into account or their findings may be incomplete or incorrect. For
+example, the following image shows the distribution of edits across
+pages in Wikipedia for all pages and for non-redirects only.
+
+.. image:: redirects_whatever.png
+
+Because redirects are almost never edited, the two distributions are
+very different. Redirects are viewed but almost never edited, so any
+study of views across articles should also take redirects into account.
+
+Because redirects can change over time, the snapshots of redirects
+stored and published by the Wikimedia Foundation are incomplete. Taking
+redirects into account fully involves looking at the content of every
+single revision of every article to determine both when and where pages
+redirect.
+
+Much more detail can be found in `Consider the Redirect: A Missing
+Dimension of Wikipedia Research`__, a short paper that we have written
+to accompany this dataset and these tools. If you use this software or
+these data, we would appreciate it if you cite the paper:
+
+  *Hill, Benjamin Mako and Aaron Shaw. "Consider the Redirect: A Missing
+  Dimension of Wikipedia Research." In Proceedings of the 10th
+  International Symposium on Open Collaboration (OpenSym 2014). ACM
+  Press, 2014.*
+
+__ hill_shaw-consider_the_redirect.pdf
+
+Generating Redirect Spells
+=============================
+
+Generating redirect spells from a MediaWiki XML dump involves two steps:
+
+1. Searching the full text of every revision of every page in a dump to
+   determine whether any given revision is a redirect.
+
+2. Using the results of (1) to generate a list of "spells" that describe
+   the periods of time during which articles in a wiki redirect to other
+   articles.
+
+We provide software in Python and R to do these two steps, released
+under the `GNU GPL version 3`__. The software is designed for people who
+are already comfortable working with MediaWiki XML dumps and the tools
+necessary to do so.
+
+__ gpl
+
+You can download the software from our git repository like this::
+
+  git clone WHATEVER
+
+Detailed documentation on how to use the software is available in our
+README file.
+
+Redirect Spell Data
+=========================
+
+In *Consider the Redirect*, we present an analysis of redirect data from
+English Wikipedia in the dump created on DATE. You can download the dump
+files from HERE. Because generating these data can be computationally
+intensive, we have published the output of the software above run on
+this dump. It includes the 9,277,563 redirect spells that our software
+identified and is the dataset used in the paper.
+
+You can download the dataset in the following formats:
+
+- RData (240MB), suitable for use in GNU R
+- bzip2-compressed tab-separated values
+
+More Information
+==================
+
+For details about the dataset, why it is important, and examples of how
+it can be used to produce better findings in Wikipedia research, read
+the paper.
+
+If you notice issues or bugs in the data or scripts, contact `Benjamin
+Mako Hill`__ or `Aaron Shaw`__.
+
+__ http://mako.cc/contact/
+__ aarono
+
+Patches and improvements are welcome! Details on how to produce and send
+a patch using git are online here.
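+
+As a rough sketch (the branch and file names below are only examples,
+and the project may describe a different workflow), a patch can be
+prepared with standard git commands::
+
+  # commit your improvements on a topic branch
+  git checkout -b my-improvement
+  git commit -a -m "Describe the change"
+
+  # assuming you cloned from our repository, turn the commits that are
+  # not yet in origin/master into a patch file you can send to us
+  git format-patch origin/master --stdout > my-improvement.patch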