
Wikipedia:Database download


All Wikipedia text is licensed under the GNU Free Documentation License; see Wikipedia:Copyrights for more info.

See also Wikipedia:MediaWiki for the software used to run the wiki. A separate page has just the database schema (layout).

Weekly database dumps

SQL database dumps on download.wikimedia.org are updated approximately weekly. The status of the download server is discussed in Wikipedia talk:Database download. These can be read into a MySQL relational database for leisurely analysis, testing of the Wikipedia software, and, with appropriate preprocessing, perhaps offline reading. There is also a fuller archive of database dumps, containing tables other than cur and old.
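
As a rough sketch of that workflow (assuming a local MySQL server with credentials in ~/.my.cnf, the bunzip2 and mysql command-line tools installed, and hypothetical file and database names), a compressed dump can be streamed straight into MySQL without writing the uncompressed SQL to disk:

    import subprocess

    DUMP = "20040717_cur_table.sql.bz2"  # hypothetical file name
    DATABASE = "wikipedia"               # hypothetical database, created beforehand

    # Decompress with bunzip2 and pipe the SQL directly into the mysql client.
    # Credentials are assumed to come from an option file such as ~/.my.cnf so
    # that no interactive password prompt competes with the piped input.
    decompress = subprocess.Popen(["bunzip2", "-c", DUMP], stdout=subprocess.PIPE)
    subprocess.run(["mysql", DATABASE], stdin=decompress.stdout, check=True)
    decompress.stdout.close()
    decompress.wait()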

The database schema is explained in schema.doc. The cur tables contain the current revisions of all pages; the old tables contain the prior edit history. Approximate file sizes are given for the compressed dumps; uncompressed they'll be significantly larger.
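
Once a dump has been imported, the current wikitext of a page can be pulled from the cur table with a few lines of Python. This is only a sketch: the connection details are placeholders, pymysql is just one of several MySQL drivers, and the cur_namespace/cur_title/cur_text column names should be checked against schema.doc:

    import pymysql  # third-party MySQL driver; any DB-API driver would work

    # Placeholder connection details.
    conn = pymysql.connect(host="localhost", user="wiki",
                           password="secret", database="wikipedia")
    try:
        with conn.cursor() as cursor:
            # cur holds the current revision of every page; namespace 0 is articles.
            cursor.execute(
                "SELECT cur_text FROM cur"
                " WHERE cur_namespace = 0 AND cur_title = %s",
                ("Physics",),
            )
            row = cursor.fetchone()
            if row:
                print(row[0][:500])  # first 500 characters of the wikitext
    finally:
        conn.close()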

Windows users may not have a bzip2 decompressor on hand; a command-line Windows version of bzip2 is available for free under a BSD license. 7-Zip, a free GUI file archiver, can also open bz2-compressed files. Mac OS X ships with the command-line bzip2 tool.
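
Alternatively, Python's standard bz2 module can decompress the dumps on any platform without a separate tool; this sketch (with hypothetical file names) streams the archive to disk in chunks so the multi-gigabyte file never has to fit in memory:

    import bz2
    import shutil

    # Hypothetical file names.
    with bz2.open("20040717_cur_table.sql.bz2", "rb") as compressed, \
            open("20040717_cur_table.sql", "wb") as plain:
        shutil.copyfileobj(compressed, plain, length=1024 * 1024)  # 1 MB chunks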

Currently a full database dump is 15483 MB (633 MB for just current revisions). If you thought that's 15.12 gigabytes, you're absolutely correct. On a 5 KB/sec modem connection it would take roughly 37 days to download!
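
The arithmetic behind that estimate is easy to check (a quick back-of-the-envelope sketch; 5 KB/sec is simply a typical dial-up throughput):

    dump_mb = 15483        # full compressed dump, in megabytes
    rate_kb_per_s = 5      # typical dial-up modem throughput

    seconds = dump_mb * 1024 / rate_kb_per_s
    days = seconds / (60 * 60 * 24)
    print(f"{days:.1f} days")  # roughly 37 days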

Images and uploaded files

Unlike the article text, many images are not released under the GFDL or into the public domain. These images are owned by external parties who may not have consented to their use on Wikipedia. Wikipedia uses such images under the fair-use doctrine of United States law. Use of such images outside the context of Wikipedia or similar works may be illegal. Also, many images legally require a credit or other attached copyright information, which is contained within the text dumps available from download.wikimedia.org. Some images may be restricted to non-commercial use, or may even be licensed exclusively to Wikipedia. Hence, download these images at your own risk.

Currently only files from the English Wikipedia are available for download; others may follow later. The file archives, available at download.wikimedia.org#images, are split into 1.9 GB chunks. However, as of 2004-07-19 the image archive is unavailable for unknown reasons (the links point to non-existent files).

Static HTML tree dumps for mirroring or CD distribution

Terodump is an alpha-quality Wikipedia-to-static-HTML dumper, made from Wikipedia code. A static HTML dump (beta quality) is available as wikipedia-terodump-0.1.tar.bz. This dump was made from a database that is some months old. - User:Tero

The wiki2static script is an experimental program for generating HTML dumps, including images, a search function and an alphabetical index. Here are some examples: English (text only, 151 MB), German (complete, 260 MB), Spanish (complete, 50 MB). More examples are available at the script's site. - Alfio

If you'd like to help set up an automatic dump-to-static function, please drop us a note on the developers' mailing list.

See also Wikipedia:TomeRaider database

Possible problems during local import

See Wikipedia:Database dump import problems.

Please do not use a web crawler

Please do not use a web crawler to download large numbers of articles. Aggressive crawling can dramatically slow down Wikipedia. Our robots.txt restricts bots to one page per second and blocks many ill-behaved bots.
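
If you only need a handful of pages programmatically, a polite client checks robots.txt and honours the Crawl-delay setting before each request. Here is a minimal sketch using Python's standard library (the page list is purely illustrative):

    import time
    import urllib.request
    import urllib.robotparser

    robots = urllib.robotparser.RobotFileParser("https://en.wikipedia.org/robots.txt")
    robots.read()

    # Fall back to one request per second if no Crawl-delay is published for us.
    delay = robots.crawl_delay("*") or 1
    pages = ["https://en.wikipedia.org/wiki/Wikipedia:Database_download"]  # illustrative

    for url in pages:
        if robots.can_fetch("*", url):
            with urllib.request.urlopen(url) as response:
                html = response.read()
            # ... process html here ...
        time.sleep(delay)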

Sample blocked crawler email

IP address nnn.nnn.nnn.nnn was retrieving up to 50 pages per second from wikipedia.org addresses. Robots.txt has a rate limit of one per second set using the Crawl-delay setting. Please respect that setting or, if you must exceed it a little, do so only during the least busy times shown in our site load graphs at /media/stats/live/org.wikimedia.all.squid.requests-hits.html. It's worth noting that crawling the whole site at one hit per second will take several weeks. The originating IP is now blocked or will be shortly. Please contact us if you want it unblocked. Please don't try to circumvent it - we'll just block your whole IP range.

If you want information on how to get our content more efficiently, we offer a variety of methods, including weekly database dumps which you can load into MySQL and crawl locally at any rate you find convenient. Tools are also available which will do that for you as often as you like once you have the infrastructure in place. More details are available at http://en.wikipedia.org/wiki/Wikipedia:Database_download.

Instead of an email reply you may prefer to visit #mediawiki at irc.freenode.net to discuss your options with our team.

See also