Draft:BGZF

BGZF (Blocked GNU Zip Format)
Filename extension	.gz
Internet media type	application/gzip
Magic number	`\x1f\x8b\x08\x04` (initial bytes: the standard gzip magic number, followed by the deflate method byte and the flag byte with the extra-field bit set; BGZF adds specific extra fields to the header of each block)
Developed by	SAMtools project / HTSlib
Initial release	c. 2009 (along with SAM/BAM format specification)
Type of format	Compressed file format, Indexed file format
Container for	Commonly used for bioinformatics data like SAM, BAM, VCF records
Extended from	Gzip
Standard	https://samtools.github.io/hts-specs/SAMv1.pdf#page=13.12
Website	www.htslib.org/doc/bgzip.html

Blocked GNU Zip Format (BGZF) is a variant of the gzip file format that uses block compression. In bioinformatics, BGZF is used to compress genomic data in common file formats such as BAM, binary VCF (BCF), and compressed FASTA files.^[1]^[2]^[3] Its block-based design provides efficient storage, random access through indexed queries,^[4]^[5] and parallel processing,^[6] which makes large-scale data processing practical. A BGZF file can be decompressed by any gunzip-compatible utility. BGZF was developed as part of the SAM/BAM specification and SAMtools,^[6] primarily by Bob Handsaker.^[7]

Uses

BGZF is widely used in bioinformatics to compress large datasets for which efficient random access is required.^[1]^[8] Because next-generation sequencing data formats such as SAM produce very large files,^[9] they are commonly converted to the binary BAM format, which uses BGZF compression.^[2]^[10] The format is also widely used to compress variant call format (VCF) files alongside their Tabix indexes,^[10]^[11] as well as other large genomic data files such as BED and GFF/GTF, and occasionally FASTQ when indexed access is needed.^[3]

A broad range of bioinformatics software packages are equipped to read and write BGZF-compressed files; these include well-known tools like SAMtools, HTSlib, Picard tools, the GATK, and libraries such as Biopython.^[12]^[13] The standard command-line utility for creating BGZF-compressed files and their corresponding .gzi indexes is bgzip, which is distributed as part of HTSlib.^[7]

The block-based design of BGZF has also been used as the basis for more efficient, data-specific compression methods and algorithms.^[14]

Design schema

A BGZF file consists of a series of concatenated BGZF blocks. Each block is limited to a maximum size of 64 kilobytes (65,536 bytes), both in its compressed and in its uncompressed form. Each BGZF block is itself a fully compliant gzip archive that adheres to the specification in RFC 1952.^[15]
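As an illustration of this layout, the following minimal Python sketch compresses data as a series of independent gzip members and concatenates them. It uses only the standard gzip module and does not write the BGZF-specific extra field, so the result is a plain multi-member gzip stream rather than a true BGZF file. The function name compress_in_members and the chunk size are assumptions chosen for the example.

```python
import gzip


def compress_in_members(data: bytes, chunk_size: int = 0xFF00) -> bytes:
    """Compress data as independent gzip members, one per chunk, and concatenate them.

    chunk_size is an assumed value kept below 64 KiB so that each chunk stays
    within the block size limit described above.
    """
    members = []
    for start in range(0, len(data), chunk_size):
        # Each chunk becomes a complete gzip member with its own header and trailer.
        members.append(gzip.compress(data[start:start + chunk_size]))
    return b"".join(members)


payload = b"ACGT" * 50_000  # 200,000 bytes of example data
blob = compress_in_members(payload)

# A standard gzip reader treats the concatenated members as one stream.
assert gzip.decompress(blob) == payload
```

Because every member is self-contained, a reader that knows where a member starts can decompress it without touching the members before it.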

Each BGZF block contains a standard gzip file header with the following standard-compliant extensions:

The F.EXTRA bit in the header is set to indicate that extra fields are present.
The extra field used by BGZF uses the two subfield ID values 66 and 67 (ASCII ‘BC’).
The length of the BGZF extra field payload (field LEN in the gzip specification) is 2 (two bytes of payload).
The payload of the BGZF extra field is a 16-bit unsigned integer in little-endian format. This integer gives the total size of the containing BGZF block in bytes, minus one.
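The header fields listed above can be sketched in Python as follows. This is a minimal illustration, not the HTSlib implementation: the function names make_bgzf_block and read_bsize are hypothetical, the compression level of 6 is an assumption, and the byte layout follows the gzip header described in RFC 1952 (a 10-byte fixed header, a 2-byte extra-field length, then the subfields).

```python
import gzip
import struct
import zlib


def make_bgzf_block(data: bytes) -> bytes:
    """Wrap data in one gzip member that carries the 'BC' extra subfield."""
    # Raw DEFLATE stream (negative wbits: no zlib header or trailer).
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)
    deflated = compressor.compress(data) + compressor.flush()
    # 12-byte header + 6-byte BC subfield + payload + 8-byte CRC32/ISIZE trailer.
    block_size = 12 + 6 + len(deflated) + 8
    if block_size > 65536:
        raise ValueError("block exceeds the 64 KiB limit")
    # ID1, ID2, CM=8 (deflate), FLG=4 (extra field), MTIME, XFL, OS, XLEN=6.
    header = struct.pack("<4BI2BH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    # Subfield IDs 66 and 67 ('B', 'C'), payload length 2, block size minus one.
    extra = struct.pack("<2BHH", 66, 67, 2, block_size - 1)
    trailer = struct.pack("<II", zlib.crc32(data), len(data))
    return header + extra + deflated + trailer


def read_bsize(block: bytes) -> int:
    """Return the total block size in bytes, read from the 'BC' subfield."""
    id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack_from("<4BI2BH", block, 0)
    if (id1, id2, cm) != (0x1F, 0x8B, 8) or not flg & 4:
        raise ValueError("not a gzip member with an extra field")
    pos, end = 12, 12 + xlen
    while pos + 4 <= end:
        si1, si2, slen = struct.unpack_from("<2BH", block, pos)
        if (si1, si2, slen) == (66, 67, 2):
            (stored,) = struct.unpack_from("<H", block, pos + 4)
            return stored + 1  # the field stores the block size minus one
        pos += 4 + slen  # skip any unrelated subfield
    raise ValueError("BC subfield not found")


block = make_bgzf_block(b"chr1\t100\t200\n")
assert read_bsize(block) == len(block)
# An ordinary gzip reader skips the extra field and recovers the data.
assert gzip.decompress(block) == b"chr1\t100\t200\n"
```

Storing the block size in the header lets a reader jump from one block to the next without decompressing anything.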

This block design allows an associated index file, which stores the offset of each BGZF block, to be used to fetch and decompress only the blocks that contain the queried data, avoiding the computational overhead of reading and decompressing every BGZF block.^[15] However, as a consequence of the block design, BGZF files tend to be larger than the equivalent gzip-compressed files.^[16]

Random access

EOF marker

The end-of-file (EOF) marker allows software to detect erroneously truncated BGZF files and to generate warnings or errors for the user. The EOF marker is an empty BGZF block (a data block of length zero) encoded with the default zlib compression level, and consists of the following 28 bytes, shown in hexadecimal:

1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 1b 00 03 00 00 00 00 00 00 00 00 00

The presence of an EOF marker does not by itself signal the end of the file. However, an EOF marker at the end of a BGZF file indicates that the physical EOF immediately following it is the end of the file as intended by the program that wrote it.^[17]
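A truncation check based on this marker can be sketched in Python as follows. The function name has_eof_marker is hypothetical, and the example uses in-memory streams in place of real files; it only compares the final 28 bytes against the marker given above and does not validate the rest of the file.

```python
import gzip
import io

# The 28-byte EOF marker block quoted above.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def has_eof_marker(stream) -> bool:
    """Return True if the stream ends with the BGZF EOF marker block."""
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    if size < len(BGZF_EOF):
        return False
    stream.seek(size - len(BGZF_EOF))
    return stream.read(len(BGZF_EOF)) == BGZF_EOF


# The marker is itself a valid gzip member that decompresses to nothing.
assert len(BGZF_EOF) == 28
assert gzip.decompress(BGZF_EOF) == b""

complete = io.BytesIO(gzip.compress(b"data") + BGZF_EOF)
truncated = io.BytesIO(complete.getvalue()[:-5])
assert has_eof_marker(complete)
assert not has_eof_marker(truncated)
```

A writer that always appends the marker gives readers a cheap way to warn about files that were cut short during transfer or by an interrupted process.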

References

1. Carpentieri, Bruno (2019-11-01). "Next Generation Sequencing Data and its Compression". IOP Conference Series: Earth and Environmental Science. 362 (1): 012059. Bibcode:2019E&ES..362a2059C. doi:10.1088/1755-1315/362/1/012059. ISSN 1755-1307.
2. Hernaez, Mikel; Pavlichin, Dmitri; Weissman, Tsachy; Ochoa, Idoia (2019-07-20). "Genomic Data Compression". Annual Review of Biomedical Data Science. 2: 19–37. doi:10.1146/annurev-biodatasci-072018-021229. ISSN 2574-3414.
3. Bonfield, James K; Marshall, John; Danecek, Petr; Li, Heng; Ohan, Valeriu; Whitwham, Andrew; Keane, Thomas; Davies, Robert M (2021-02-01). "HTSlib: C library for reading/writing high-throughput sequencing data". GigaScience. 10 (2): giab007. doi:10.1093/gigascience/giab007. ISSN 2047-217X. PMC 7931820. PMID 33594436.
4. Danecek, Petr; Bonfield, James K; Liddle, Jennifer; Marshall, John; Ohan, Valeriu; Pollard, Martin O; Whitwham, Andrew; Keane, Thomas; McCarthy, Shane A; Davies, Robert M; Li, Heng (2021-02-01). "Twelve years of SAMtools and BCFtools". GigaScience. 10 (2): giab008. doi:10.1093/gigascience/giab008. ISSN 2047-217X. PMC 7931819. PMID 33590861. Quote: "[..] both formats can be either plain (uncompressed) or block-compressed with BGZF for random access and compact size."
5. Yamada, Taiju (2020-04-01). "7bgzf: Replacing samtools bgzip deflation for archiving and real-time compression". Computational Biology and Chemistry. 85: 107207. doi:10.1016/j.compbiolchem.2020.107207. ISSN 1476-9271. PMID 32092548.
6. Li, Heng; Handsaker, Bob; Wysoker, Alec; Fennell, Tim; Ruan, Jue; Homer, Nils; Marth, Gabor; Abecasis, Goncalo; Durbin, Richard; 1000 Genome Project Data Processing Subgroup (2009-08-15). "The Sequence Alignment/Map format and SAMtools". Bioinformatics. 25 (16): 2078–2079. doi:10.1093/bioinformatics/btp352. ISSN 1367-4803. PMC 2723002. PMID 19505943.
7. "bgzip(1) manual page". www.htslib.org. Retrieved 2025-06-03.
8. Lan, Divon; Tobler, Ray; Souilmi, Yassine; Llamas, Bastien (2021-08-25). "Genozip: a universal extensible genomic data compressor". Bioinformatics (Oxford, England). 37 (16): 2225–2230. doi:10.1093/bioinformatics/btab102. ISSN 1367-4811. PMC 8388020. PMID 33585897. Quote: "BGZF-block level indexing that is common in standard indexes of genomic file formats"
9. Weeks, N. T. (2018). "Openmp task parallelism for faster genomic data processing" (PDF). Quote: "Reading, decoding, sorting, encoding, and writing large sequence alignment files (tens or hundreds of GBs) can be time-consuming and resource intensive."
10. Sadikin, Rifki; Arisal, Andria; Omar, Rofithah; Mazni, Nur Hidayah (November 2017). "Processing next generation sequencing data in map-reduce framework using hadoop-BAM in a computer cluster". 2017 2nd International conferences on Information Technology, Information Systems and Electrical Engineering (ICITISEE). pp. 421–425. doi:10.1109/ICITISEE.2017.8285542. ISBN 978-1-5386-0658-2.
11. "bgzip man - Linux Command Library". Retrieved 2025-06-03. Quote: "It is commonly used in bioinformatics to compress files like SAM, BAM, VCF, and BED files, especially when random access is needed, often in conjunction with indexing tools like Tabix."
12. "Bio.bgzf module — Biopython 1.85 documentation". Biopython. Retrieved 2025-06-03.
13. "BlockCompressedOutputStream (htsjdk 2.8.1 API)". SAMtools/HTSlib. Retrieved 2025-06-03.
14. Li, Miaoxin; Li, Jiang; Li, Mulin Jun; Pan, Zhicheng; Hsu, Jacob Shujui; Liu, Dajiang J.; Zhan, Xiaowei; Wang, Junwen; Song, Youqiang; Sham, Pak Chung (2017-05-19). "Robust and rapid algorithms facilitate large-scale whole genome sequencing downstream analysis in an integrative framework". Nucleic Acids Research. 45 (9): e75. doi:10.1093/nar/gkx019. ISSN 1362-4962. PMC 5435951. PMID 28115622.
15. Li, Heng (2011-03-01). "Tabix: fast retrieval of sequence features from generic TAB-delimited files". Bioinformatics. 27 (5): 718–719. doi:10.1093/bioinformatics/btq671. ISSN 1367-4803. PMC 3042176. PMID 21208982.
16. Cock, Peter (2011-11-08). "Blasted Bioinformatics!?: BGZF - Blocked, Bigger & Better GZIP!". Blasted Bioinformatics!?. Retrieved 2025-04-10.
17. "HTS format specifications". samtools.github.io. Retrieved 2025-04-01.
