Draft:TabPFN
| Developer(s) | Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, Frank Hutter, Leo Grinsztajn, Klemens Flöge, Oscar Key & Sauraj Gambhir[1] |
| --- | --- |
| Initial release | September 16, 2023[2][3] |
| Written in | Python[3] |
| Operating system | Linux, macOS, Microsoft Windows[3] |
| Type | Machine learning |
| License | Apache License 2.0 |
| Website | github |
TabPFN (Tabular Prior-data Fitted Network) is a deep learning model based on a transformer architecture, designed for supervised classification and regression tasks on small to medium-sized tabular datasets. It distinguishes itself by being pre-trained once on a vast collection of synthetically generated datasets, enabling it to make predictions on new, unseen tabular data in seconds without requiring dataset-specific hyperparameter tuning.[1][2] Developed by researchers now associated with Prior Labs, its capabilities, especially those of the recent version TabPFN v2, were detailed in the journal Nature.[1]
TabPFN addresses persistent challenges in modeling tabular data. Traditional machine learning models such as gradient-boosted decision trees (GBDTs) have long dominated this area, but they often require extensive, time-consuming hyperparameter tuning and may struggle to generalize, especially on small datasets. Early deep learning attempts did not consistently outperform these methods.[4][5] Large language models (LLMs), despite their success with unstructured text, also face difficulties with structured tabular data, as they are not inherently designed for the two-dimensional relationships and precise numerical reasoning that tables require. TabPFN bridges these gaps by leveraging a transformer architecture pre-trained on diverse synthetic tabular structures.[1][2] The approach is analogous to the rise of foundation models in NLP, aiming at a more universal, pre-trained model for tabular data.[1]
Technical Overview
TabPFN's mechanism is rooted in the Prior-Data Fitted Network (PFN) paradigm[6] and a transformer architecture adapted for in-context learning on tabular data.[1][2]
PFN Paradigm: TabPFN is trained offline once on an extensive corpus of synthetic datasets. This pre-training aims to approximate Bayesian inference across diverse data structures, resulting in a network whose weights encapsulate a general predictive capability.[6][7]
In-Context Learning (ICL): At inference, TabPFN takes a new dataset (training examples and unlabeled test examples) as an input and processes it in one forward pass to yield predictions for the test examples, without updating its pre-trained weights. It dynamically adapts based on the provided labeled data within the input context.[1][2]
Transformer-Based Architecture: It uses a standard transformer encoder architecture. Features and labels are embedded, and the model processes the entire collection of samples simultaneously. An attention mechanism tailored for tables, which alternately attends across rows and columns, allows data cells to contextualize each other.[7] TabPFN v2 also introduced randomized feature tokens, transforming each data instance into a matrix of comparable tokens and eliminating the need for dataset-specific feature token learning.[7]
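The in-context learning workflow can be sketched in a few lines of Python. The stand-in below replaces the pre-trained transformer with a trivial distance-weighted vote (`predict_in_context` is a hypothetical name, not part of TabPFN); the point is only the interface: the labeled context and the unlabeled queries enter one stateless call, and no weights are updated.

```python
import math

def predict_in_context(context_X, context_y, query_X):
    """Toy stand-in for a PFN forward pass: the labeled context and the
    queries are consumed together in a single stateless call."""
    predictions = []
    for q in query_X:
        # Weight each context row's label by inverse distance to the query.
        votes = {}
        for x, y in zip(context_X, context_y):
            dist = math.dist(q, x)
            votes[y] = votes.get(y, 0.0) + 1.0 / (dist + 1e-9)
        predictions.append(max(votes, key=votes.get))
    return predictions

# Two clusters of labeled context rows, two unlabeled queries.
ctx_X = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
ctx_y = ["a", "a", "b", "b"]
print(predict_in_context(ctx_X, ctx_y, [(0.05, 0.1), (5.1, 5.0)]))  # ['a', 'b']
```

In the real model the function body is a transformer forward pass whose pre-trained weights encode the learning algorithm; only the context changes between datasets.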
Key Features and Capabilities
- Speed and Efficiency: Predictions for an entire new dataset within seconds, significantly faster than traditional approaches requiring hyperparameter optimization (HPO).
- No Hyperparameter Tuning: Operates out-of-the-box due to extensive pre-training.[2]
- Performance on Small Datasets: Strong predictive performance on datasets up to 1,000 samples (v1) and 10,000 samples (v2).
- Versatile Data Handling (v2): Natively processes numerical and categorical features, manages missing values, and is robust to uninformative features and outliers.
- Expanded Task Capabilities (v2): Supports supervised classification, regression, and generative tasks like fine-tuning, synthetic data generation, density estimation, and learning reusable embeddings. An adaptation, TabPFN-TS, shows promise for time series forecasting.[8]
- Data Efficiency (v2): Can achieve performance comparable to strong baselines like CatBoost using only half of the training data.[1]
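The "no tuning, out-of-the-box" workflow follows the familiar scikit-learn fit/predict shape, which the published `tabpfn` package also uses.[3] The class below is a hypothetical toy, not TabPFN itself: "fitting" merely stores the context (no weights are trained), and prediction here is a nearest-neighbour stand-in for the transformer's forward pass.

```python
import math

class ToyInContextClassifier:
    """Hypothetical stand-in mirroring the fit/predict interface only."""

    def fit(self, X, y):
        # "Fitting" just stores the labeled context; nothing is optimized,
        # which is why there are no hyperparameters to tune.
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        # Stand-in inference: label of the nearest stored context row.
        return [self.y_[min(range(len(self.X_)),
                            key=lambda i: math.dist(x, self.X_[i]))]
                for x in X]

clf = ToyInContextClassifier().fit([(0, 0), (1, 1), (9, 9)], [0, 0, 1])
print(clf.predict([(0.4, 0.4), (8, 8)]))  # [0, 1]
```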
Model Training
TabPFN's pre-training exclusively uses synthetically generated datasets, avoiding benchmark contamination and the costs of curating real-world data.[2] TabPFN v2 was pre-trained on approximately 130 million such datasets, each serving as a "meta-datapoint".[1]
The synthetic datasets are primarily drawn from a prior distribution embodying causal reasoning principles, using Structural Causal Models (SCMs) or Bayesian Neural Networks (BNNs). Random inputs are passed through these models to generate outputs, with a bias towards simpler causal structures. The process generates diverse datasets that simulate real-world imperfections like missing values, imbalanced data and noise. During pre-training, TabPFN predicts the masked target values of new data points given training data points and their known targets, effectively learning a generic learning algorithm that is executed by running a neural network forward pass.[1]
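A toy structural causal model makes the generation process concrete. The sketch below (all names hypothetical, and far simpler than the published prior) propagates random noise through randomly drawn functions along a fixed causal ordering, designates the last node as the prediction target, and injects missing values to simulate real-world imperfections:

```python
import math
import random

def sample_scm_dataset(n_rows=8, n_nodes=4, p_missing=0.1, seed=0):
    """Generate one tiny synthetic dataset from a random chain-like SCM."""
    rng = random.Random(seed)
    # Each node depends linearly on all earlier nodes, through a
    # randomly chosen nonlinearity (a crude stand-in for random SCMs).
    weights = [[rng.uniform(-1, 1) for _ in range(j)] for j in range(n_nodes)]
    acts = [rng.choice([math.tanh, abs, lambda v: v]) for _ in range(n_nodes)]
    rows = []
    for _ in range(n_rows):
        vals = []
        for j in range(n_nodes):
            noise = rng.gauss(0, 1)  # exogenous noise at every node
            parent_sum = sum(w * v for w, v in zip(weights[j], vals))
            vals.append(acts[j](parent_sum + noise))
        rows.append(vals)
    # The last node becomes the target; the others are features.
    X = [r[:-1] for r in rows]
    y = [r[-1] for r in rows]
    # Simulate real-world imperfection: randomly mask feature cells.
    X = [[None if rng.random() < p_missing else v for v in r] for r in X]
    return X, y

X, y = sample_scm_dataset()
print(len(X), len(X[0]), len(y))  # 8 3 8
```

During pre-training, millions of such datasets are sampled, the targets of held-out rows are masked, and the network is trained to predict them from the rest of the table.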
Performance
Comparison to Traditional Models: TabPFN, especially v2, often shows superior predictive accuracy (e.g., ROC AUC) on small tabular datasets compared to well-tuned boosted trees (XGBoost, CatBoost), while delivering predictions in seconds rather than the hours spent tuning the tree-based models.[2] In seconds, TabPFN v2 can outperform an ensemble of strong baseline models tuned for hours, and it achieves accuracy comparable to CatBoost with half the training data.[9]
TabPFN v1 vs v2: V2 significantly increased scalability (up to 10,000 samples, 500 features vs. v1's ~1,000 samples, 100 numerical features). V2 added native regression, categorical feature/missing value handling, and generative capabilities, becoming a more versatile foundation model.[1]
Limitations:
- Scalability to Large Datasets: TabPFN is primarily designed for small to medium datasets; models like XGBoost may be better suited to substantially larger ones because of the quadratic complexity of TabPFN's transformer attention.[7]
- Number of Classes: v1 limited the number of classes in multi-class classification[2]; v2 and extensions aim to address this.
- Early Stage of Development: A full understanding of TabPFN's inner workings is still evolving in the community, with active research on extensions.[10]
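The quadratic-complexity point can be made concrete with back-of-the-envelope arithmetic: attention that lets every sample attend to every other touches on the order of n² pairs, so a 100x larger table costs roughly 10,000x more:

```python
def attention_pairs(n_samples):
    # Self-attention across rows compares every sample with every other,
    # so compute and memory grow with the square of the dataset size.
    return n_samples ** 2

small = attention_pairs(10_000)      # within TabPFN v2's intended range
large = attention_pairs(1_000_000)   # typical GBDT-scale dataset
print(large // small)  # 10000: a 100x bigger table is ~10,000x more work
```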
Applications and Use Cases
Applications demonstrated or suggested for TabPFN and its variants include:
- Time series forecasting: the TabPFN-TS[8] extension targets applications such as financial forecasting and demand planning.
- Chemoproteomics[11]
- Health insurance classification[12]
- Fault classification in machinery[13]
- Early detection of stillbirth[14]
- Prostate cancer diagnosis[15]
- Outcomes after anterior cervical corpectomy[16]
- Predicting River Algal Blooms[17]
- Metagenomics[18]
- Immunotherapy predictions for patients with cancer[19]
- Predicting dementia in Parkinson's disease[20]
- Pricing models in actuarial science[21]
- Diagnostic prediction of minimal change disease[22]
- Predicting wildfire propagation[23]
- Glucose monitoring[9]
- Classifying lunar meteorite minerals[24]
- Prognosis of distal medium vessel occlusion[25]
History
TabPFN's academic roots trace to work on PFNs.[6] TabPFN v1 was introduced via a paper ("TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second")[2], with a pre-print in July 2022 and formal presentation at ICLR 2023. The core innovation was applying transformer architecture and ICL to tabular data, pre-trained on synthetic datasets from SCMs.[2] This evolved into TabPFN v2, detailed in Nature in January 2025, offering improved scalability and broader capabilities.[7] Prior Labs was co-founded in late 2024 by key contributors to TabPFN to commercialize the research.[10]
References
[edit]- ^ a b c d e f g h i j k Hollmann, N., Müller, S., Purucker, L. et al. Accurate predictions on small data with a tabular foundation model. Nature 637, 319–326 (2025) https://doi.org/10.1038/s41586-024-08328-6 (also https://pubmed.ncbi.nlm.nih.gov/39780007/)
- ^ a b c d e f g h i j k Hollmann, Noah, et al. "TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second." ICLR 2023. https://iclr.cc/virtual/2023/oral/12541 (also https://neurips.cc/virtual/2022/58545)
- ^ a b c Python Package Index (PyPI) - tabpfn https://pypi.org/project/tabpfn/
- ^ Shwartz-Ziv, Ravid, and Amitai Armon. "Tabular data: Deep learning is not all you need." Information Fusion 81 (2022) https://www.sciencedirect.com/science/article/pii/S1566253521002360
- ^ Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. 2022. Why do tree-based models still outperform deep learning on typical tabular data? In Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS '22). Curran Associates Inc., Red Hook, NY, USA, Article 37, 507–520. https://dl.acm.org/doi/10.5555/3600270.3600307
- ^ a b c Müller, Samuel, et al. "Transformers can do bayesian inference" (published as a conference paper at ICLR 2022) https://openreview.net/pdf?id=KSugKcbNf9
- ^ a b c d e Duncan C. McElfresh, The AI tool that can interpret any spreadsheet instantly. Nature, 08 January 2025. https://www.nature.com/articles/d41586-024-03852-x
- ^ a b TabPFN Time Series https://github.com/PriorLabs/tabpfn-time-series
- ^ a b Bender C, Vestergaard P, Cichosz SL. The History, Evolution and Future of Continuous Glucose Monitoring (CGM). Diabetology. 2025; 6(3):17. https://doi.org/10.3390/diabetology6030017
- ^ a b Jeremy Kahn: "AI has struggled to analyze tables and spreadsheets. This German startup thinks its breakthrough is about to change that." https://fortune.com/2025/02/05/prior-labs-9-million-euro-preseed-funding-tabular-data-ai/
- ^ Fabian Offensperger et al., Large-scale chemoproteomics expedites ligand discovery and predicts ligand behavior in cells. Science. https://www.science.org/doi/abs/10.1126/science.adk5864
- ^ J. Z. K. Chu, J. C. M. Than and H. S. Jo, "Deep Learning for Cross-Selling Health Insurance Classification," 2024 International Conference on Green Energy, Computing and Sustainable Technology (GECOST), Miri Sarawak, Malaysia, 2024. https://ieeexplore.ieee.org/document/10475046
- ^ L. Magadán, J. Roldán-Gómez, J. C. Granda and F. J. Suárez, "Early Fault Classification in Rotating Machinery With Limited Data Using TabPFN," in IEEE Sensors Journal, vol. 23, no. 24, pp. 30960-30970, 15 Dec. 2023. https://ieeexplore.ieee.org/abstract/document/10318062
- ^ Sarah A. Alzakari, Asma Aldrees, Muhammad Umer, Lucia Cascone, Nisreen Innab, Imran Ashraf, "Artificial intelligence-driven predictive framework for early detection of still birth", SLAS Technology, Volume 29, Issue 6, 2024, 100203, ISSN 2472-6303, https://doi.org/10.1016/j.slast.2024.100203. (also https://www.sciencedirect.com/science/article/pii/S2472630324000852)
- ^ El-Melegy, M., Mamdouh, A., Ali, S., Badawy, M., El-Ghar, M. A., Alghamdi, N. S., & El-Baz, A. (2024). Prostate Cancer Diagnosis via Visual Representation of Tabular Data and Deep Transfer Learning. Bioengineering, 11(7), 635. https://doi.org/10.3390/bioengineering11070635
- ^ Karabacak M, Schupper A, Carr M, Margetis K. A machine learning-based approach for individualized prediction of short-term outcomes after anterior cervical corpectomy. Asian Spine J. 2024 Aug;18(4):541-549. doi: 10.31616/asj.2024.0048. Epub 2024 Aug 8. PMID: 39113482; PMCID: PMC11366553. https://pmc.ncbi.nlm.nih.gov/articles/PMC11366553/
- ^ Yang, H., & Park, J. (2024). Comparing the Performance of a Deep Learning Model (TabPFN) for Predicting River Algal Blooms with Varying Data Composition. Journal of Wetlands Research, 26(3), 197–203. https://doi.org/10.17663/JWR.2024.26.3.197
- ^ Perciballi, G., Granese, F., Fall, A., Zehraoui, F., Prifti, E., & Zucker, J.-D. (2024). Adapting TabPFN for Zero-Inflated Metagenomic Data. NeurIPS 2024 Workshop TRL. https://openreview.net/pdf?id=3I0bVvUj25
- ^ Dyikanov, D., Zaitsev, A., Vasileva, T., Luginbuhl, A. J., Ataullakhanov, R. I., & Goldberg, M. F. (2024). Comprehensive peripheral blood immunoprofiling reveals five immunotypes with immunotherapy response characteristics in patients with cancer. Cancer Cell, 42(5), 759–779.e12. https://doi.org/10.1016/j.ccell.2024.04.009
- ^ Tran VQ, Byeon H. Predicting dementia in Parkinson’s disease on a small tabular dataset using hybrid LightGBM–TabPFN and SHAP. DIGITAL HEALTH. 2024;10. https://journals.sagepub.com/doi/full/10.1177/20552076241272585
- ^ Brauer, A. Enhancing actuarial non-life pricing models via transformers. Eur. Actuar. J. 14, 991–1012 (2024). https://doi.org/10.1007/s13385-024-00388-2
- ^ Noda, R., Ichikawa, D. & Shibagaki, Y. Machine learning-based diagnostic prediction of minimal change disease: model development study. Sci Rep 14, 23460 (2024). https://doi.org/10.1038/s41598-024-73898-4
- ^ Sadegh Khanmohammadi, Miguel G. Cruz, Daniel D.B. Perrakis, Martin E. Alexander, Mehrdad Arashpour, Using AutoML and generative AI to predict the type of wildfire propagation in Canadian conifer forests, Ecological Informatics, Volume 82, 2024, 102711, ISSN 1574-9541, https://doi.org/10.1016/j.ecoinf.2024.102711. (https://www.sciencedirect.com/science/article/pii/S157495412400253X)
- ^ Eloy Peña-Asensio, Josep M. Trigo-Rodríguez, Jordi Sort, Jordi Ibáñez-Insa, Albert Rimola, Machine learning applications on lunar meteorite minerals: From classification to mechanical properties prediction, International Journal of Mining Science and Technology, Volume 34, Issue 9, 2024, Pages 1283-1292, ISSN 2095-2686, https://doi.org/10.1016/j.ijmst.2024.08.001. (https://www.sciencedirect.com/science/article/pii/S2095268624001010)
- ^ Mert Karabacak, Burak Berksu Ozkara, Tobias D. Faizy, Trevor Hardigan, Jeremy J. Heit, Dhairya A. Lakhani, Konstantinos Margetis, J Mocco, Kambiz Nael, Max Wintermark, Vivek S. Yedavalli. Data-Driven Prognostication in Distal Medium Vessel Occlusions Using Explainable Machine Learning American Journal of Neuroradiology Oct 2024, ajnr.A8547. https://www.ajnr.org/content/early/2024/10/28/ajnr.A8547