
User:Broken Viking/Project:Tetwix


Project TetWix


Project status: Initial proposal | Talk page | Proposal at the Village Pump

Project Administrators: Broken Viking

Advisory 2025-06-14: Owing to ongoing illness, BV's progress on this proposal and responses to input might be impeded or subject to delay.

Brief overview


TetWix is a data storage and distribution model proposed by Broken Viking, intended to streamline the distribution of Wikimedia content dumps to end users in a manner that is more efficient and less data-intensive than the present model. It proposes a storage model that makes it possible to update the content of individual articles, and is heavily targeted toward BitTorrent-based distribution.

Its name is a portmanteau of Tetris and Wiki, reflecting the fact that article content is stored at specific positions in a logical space, and that articles, like the blocks in a game of Tetris, are fitted in among and around other articles in that same space.

Perceived problems with the existing model


The existing mechanism for the bulk distribution of Wikimedia content, particularly the Wikipedia database dumps, relies on all articles in a given project's namespace being concatenated into a single XML document of significant length (109GB for the English-language Wikipedia enwiki-pages-articles-multistream dump created on 2025-05-01), which is then compressed with BZip2. A multistream variant is now offered, produced by periodically resetting the compression algorithm, which allows specific sections of compressed data to be extracted from the dump and decompressed on their own. The multistream dump requires an index file (itself hundreds of MiB in size) that links article names to offsets of compressed data within the dump, and it is often difficult to use for those without prior experience of data processing or familiarity with concepts known mostly to Linux users.
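
For illustration, the sketch below shows how an article might be pulled out of a locally stored multistream dump using the index file. It assumes the published index format of one offset:page_id:page_title line per article and uses placeholder file names; error handling and proper XML parsing are omitted.

    import bz2

    # Placeholder file names; in practice these are the downloaded dump and its
    # decompressed index, e.g. enwiki-<date>-pages-articles-multistream*.
    DUMP = "enwiki-20250501-pages-articles-multistream.xml.bz2"
    INDEX = "enwiki-20250501-pages-articles-multistream-index.txt"

    def find_offsets(title):
        """Return (start, end) byte offsets of the compressed stream holding `title`.

        Each index line reads "offset:page_id:page_title"; the stream containing
        the article runs from its own offset to the next distinct offset listed.
        """
        start = None
        with open(INDEX, encoding="utf-8") as idx:
            for line in idx:
                offset, _, page_title = line.rstrip("\n").split(":", 2)
                offset = int(offset)
                if start is None:
                    if page_title == title:
                        start = offset
                elif offset > start:
                    return start, offset
        return start, None   # article may sit in the final stream (or be absent)

    def extract(title):
        """Decompress only the relevant stream and return the raw <page> XML."""
        start, end = find_offsets(title)
        if start is None:
            return None
        with open(DUMP, "rb") as dump:
            dump.seek(start)
            blob = dump.read(end - start if end else -1)
        xml = bz2.decompress(blob).decode("utf-8")
        # Crude scan; a real tool would parse the XML properly.
        for page in xml.split("</page>"):
            if "<title>" + title + "</title>" in page:
                return page + "</page>"
        return None

At worst this decompresses a single stream of pages rather than the whole dump, which is the kind of selective access TetWix aims to make routine.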

This results in a situation where users wanting to keep an up-to-date offline copy of the English Wikipedia may download the latest dump on a fortnightly basis, equating (based on the ~24GB size of recent dumps) to 48GB/month or 576GB/year. Many users forget or neglect to use the mirrors, meaning that the majority of this demand and the data export costs associated with it, where it is not handed off to mirrors by Wikimedia using HTTP redirects, is potentially placed upon the Wikimedia Foundation's equipment, connectivity and data transport budget. A secondary concern is failed downloads, which are more likely for users (like BV himself) who rely on cellular connectivity for their Internet access. In these cases a failed download attempt not only depletes Wikimedia's resources without any gainful outcome, but also depletes the user's metered data allowance, often causing resentment because the failure (often attributable to equipment or data-link faults at the ISP end) consumes an allowance that cannot be recovered.

The format of the Wikipedia database dumps, in which an article located at offset x in one dump will be located at a completely different offset (possibly several GiB away) in the next, means that it is effectively impossible for a locally stored dump to be patched directly with changed data from a newer dump: the end user must decompress the entire dump, update its contents, and then compress it again. It also means that BitTorrent-based distribution works only for complete copies of the original Wikimedia-produced dumps, not for dumps that users have updated themselves to a given version, since those will differ from one user or machine to another.

Although it is possible for users to download only the parts of the dumps that they need on demand, by using the index file to generate a targeted HTTP content-range request that pulls the specific part of the dump from Wikimedia's servers for local decompression and extraction, this requires a level of technical understanding and a number of additional steps that most users will either not be able to follow, or will regard as exhausting and inconvenient compared with simply clicking "Download". It also defeats the purpose of keeping an offline copy of the database as a guard against, for example, connection or server unavailability. The continual expansion of consumer data connectivity in the Western world (fixed-line connections of around 1Gbps are presently available in many parts of the United Kingdom) means that more users will tend to prefer the easier act of regularly downloading the most up-to-date dump in full, rather than making conservative use of any dumps held locally and/or downloading updated sections of the latest dump to refresh locally held data.
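
The remote variant of the same idea is sketched below, assuming the dump is hosted at its usual dumps.wikimedia.org path and that the start and end offsets have already been looked up in the index file as in the earlier sketch; only the requested byte range travels over the wire.

    import bz2
    import urllib.request

    # Illustrative URL; a mirror carrying the same file would serve equally well.
    DUMP_URL = ("https://dumps.wikimedia.org/enwiki/20250501/"
                "enwiki-20250501-pages-articles-multistream.xml.bz2")

    def fetch_stream(start, end):
        """Fetch one compressed stream via an HTTP range request and decompress it.

        `start` and `end` are byte offsets taken from the multistream index file,
        so only (end - start) bytes are downloaded instead of the ~24GB dump.
        """
        req = urllib.request.Request(
            DUMP_URL, headers={"Range": "bytes=%d-%d" % (start, end - 1)})
        with urllib.request.urlopen(req) as resp:
            blob = resp.read()
        return bz2.decompress(blob).decode("utf-8")

    # xml = fetch_stream(start_offset, end_offset)  # offsets taken from the index file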

Consequences of the existing model

  • Many users, often driven by a survivalist mentality, are encouraged to keep the most up-to-date version of the database available for offline access, and are likely to download each new dump as it becomes available.
  • A large percentage of those users will download these dumps directly from Wikimedia's servers, partly for perceived authenticity and partly out of mistrust of mirror sites. (Many might not realise that downloading the checksums from Wikimedia and the dumps from a mirror would assure 100% authenticity while placing the data service burden on a mirror rather than on Wikimedia.)
  • Many users might hold the misplaced notion that data service and equipment operation are free for the Wikimedia Foundation because it is a United States 501(c)(3) non-profit organisation, and/or might not care about the burden their access and frequent downloads place upon the Foundation's resources.
  • The above factors almost certainly impose significant data service volumes and associated costs on the Foundation, as well as contributing to equipment wear and tear and reduced duty cycles.

Loose overview of TetWix and how it addresses these challenges


Note: This section is entirely speculative at this stage and lays down project concepts that may or may not be adopted in any actual implementation of the project.

TetWix attempts to meet these challenges by storing and organising Wikimedia content in such a way that a locally kept database can be brought up to the current version by downloading and applying only the articles that have changed since the previous dump, and in such a way that databases downloaded to separate machines and updated to a given version at different points in time are still an exact binary match to one another. This significantly reduces the bandwidth and data transfer demands of keeping local copies of Wikimedia content databases up to date. It allows entire repositories of Wikimedia content to be shared and updated using BitTorrent, allows all copies of a given content database and version to be restored globally, quickly and exactly, provided that at least one complete copy of that database exists and is online, and allows a significant degree of load-shifting of database/dump-related traffic from Wikimedia resources to a userbase of volunteers for whom the per-head cost of participating in this distribution is significantly lower and more incidental than the cost of downloads taken directly from Wikimedia's servers.
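
No TetWix file formats exist yet, but as a purely hypothetical sketch of the idea, an update file could simply list the centrally assigned offset, reserved slot length and replacement bytes for each changed article; every client applying the same update then arrives at a byte-identical database.

    import json

    def apply_update(db_path, update_path):
        """Apply a hypothetical TetWix update file to a local database.

        The update file is assumed to be JSON of the form
        [{"offset": ..., "slot_length": ..., "data_hex": "..."}, ...], giving the
        centrally assigned byte offset of each changed article, the reserved size
        of its slot, and its new contents.
        """
        with open(update_path, encoding="utf-8") as f:
            records = json.load(f)
        with open(db_path, "r+b") as db:
            for rec in records:
                payload = bytes.fromhex(rec["data_hex"])
                # Pad to the reserved slot length so the unused tail of the slot
                # is zero-filled identically on every client.
                db.seek(rec["offset"])
                db.write(payload.ljust(rec["slot_length"], b"\x00"))

    # apply_update("enwiki.tetwix", "enwiki-delta.json")  # hypothetical file names

Because the resulting files are identical everywhere, standard BitTorrent piece hashes remain valid for every client that has applied the same set of updates.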

One concept central to TetWix's function is the keeping and maintenance of a centrally managed filesystem on each participating client, effectively meaning that the articles kept in the database, and their locations within it, are determined centrally by the Wikimedia Foundation rather than by the end user. This is essential if distributed copies of the same database are to be exact binary matches of one another: it considerably improves compatibility with BitTorrent, maintains global resilience and ready recovery of lost database information through other users without any demands on the Wikimedia Foundation, and is unlikely to be an issue for most users. It also means that database storage will need to be on clients whose filesystems support sparse files.
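
The sparse-file requirement can be illustrated with the short sketch below (the file name, offset and article text are illustrative only): on a POSIX filesystem with sparse-file support, seeking past the end of the file and writing leaves a "hole" rather than physically allocating the intervening space.

    import os

    DB_PATH = "tetwix-test.db"        # illustrative file name
    ASSIGNED_OFFSET = 4 * 1024**3     # e.g. an article slot assigned 4GiB into the space
    ARTICLE = b"<page>...example article text...</page>"

    with open(DB_PATH, "wb") as db:
        db.seek(ASSIGNED_OFFSET)      # creates a hole rather than writing zeros
        db.write(ARTICLE)

    st = os.stat(DB_PATH)
    print("apparent size:", st.st_size)           # roughly 4GiB
    print("disk usage:   ", st.st_blocks * 512)   # only a few KiB actually allocated (POSIX only)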

Another core and necessary concept is the ability for users to download, store and update specific subsets (namespaces) of the overall Wikimedia content database, so that users wishing to keep a local copy of the Wikipedia database in their own language are not burdened with the storage and distribution of Wikipedia in other languages if they do not wish, or cannot spare the disk space, to do so; a complete copy of the Wikimedia content database (i.e. all Wikipedia versions and languages, all namespaces, and all Wikimedia content including data from Commons etc.) would require over 400TiB of disk space at the time of writing. For this reason, parts of the database (and the update files distributed for them) would need to be tagged with the Wikimedia project, language and namespaces to which they relate, allowing users keeping the English, French and German Wikipedias (pages-articles only) to share their data with clients requiring only one or two of those languages and/or sharing the pages-meta-current subsets, and vice versa.
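
Such tagging could be as small as a short metadata record carried by each database segment and update file; the field names below are purely illustrative.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class SubsetTag:
        """Hypothetical tag identifying the slice of Wikimedia content a file belongs to."""
        project: str    # e.g. "wikipedia", "commons"
        language: str   # e.g. "en", "fr", "de"
        subset: str     # e.g. "pages-articles", "pages-meta-current"

    # Subsets this client has chosen to store and seed.
    LOCAL_SUBSETS = {
        SubsetTag("wikipedia", "en", "pages-articles"),
        SubsetTag("wikipedia", "fr", "pages-articles"),
        SubsetTag("wikipedia", "de", "pages-articles"),
    }

    def wanted(tag):
        """Decide whether an incoming update or shared segment is relevant to this client."""
        return tag in LOCAL_SUBSETS

    print(wanted(SubsetTag("wikipedia", "en", "pages-articles")))       # True
    print(wanted(SubsetTag("wikipedia", "en", "pages-meta-current")))   # False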