TabPFN

From Wikipedia, the free encyclopedia
TabPFN
Developer(s): Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, Frank Hutter, Leo Grinsztajn, Klemens Flöge, Oscar Key, and Sauraj Gambhir[1]
Initial release: September 16, 2023[2][3]
Written in: Python[3]
Operating system: Linux, macOS, Microsoft Windows[3]
Type: Machine learning
License: Apache License 2.0
Website: github.com/PriorLabs/TabPFN

TabPFN (Tabular Prior-data Fitted Network) is a machine learning model that uses a transformer architecture for supervised classification and regression tasks on small to medium-sized tabular datasets, e.g., up to 10,000 samples.[1]

Overview

TabPFN was first developed in 2022; TabPFN v2 was published in 2025 in Nature by Hollmann and co-authors.[1] The source code is published on GitHub under a modified Apache License and on PyPI.[4]

TabPFN v1 was introduced in a 2022 pre-print and presented at ICLR 2023.[2] Prior Labs, founded in 2024, aims to commercialize TabPFN.[5]

TabPFN supports classification, regression and generative tasks,[1] and its TabPFN-TS extension adds time series forecasting.[6]

Training

TabPFN is an instance of a prior-data fitted network:[10] a transformer pre-trained on synthetic tabular datasets rather than fitted to each new dataset, so it does not require extensive hyperparameter optimization.[2][7][11]

TabPFN addresses long-standing challenges in modeling tabular data, a domain in which tree-based models have historically outperformed deep learning.[8][9]

Because it is pre-trained only once, TabPFN can process a new dataset in a single forward pass, adapting to the input without retraining.[2] The model's transformer encoder processes features and labels by alternating attention across rows and columns, capturing relationships within the data.[7] TabPFN v2, an updated version, handles numerical and categorical features as well as missing values, and supports tasks such as regression and synthetic data generation.[1]
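The alternating attention over rows and columns can be sketched in a few lines of NumPy. This is a simplified, single-head illustration of the idea, not the actual TabPFN implementation; all shapes and names here are invented for the example:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the last two axes.
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

# A table embedded as (rows, columns, embedding dim):
# 8 samples, 5 features, 16-dimensional cell embeddings.
rng = np.random.default_rng(0)
table = rng.normal(size=(8, 5, 16))

# Attention across columns: each cell attends to the other
# features of the same sample (row).
across_columns = attention(table, table, table)

# Attention across rows: transpose so each cell attends to the
# same feature in the other samples, then transpose back.
t = np.swapaxes(across_columns, 0, 1)          # (columns, rows, dim)
across_rows = np.swapaxes(attention(t, t, t), 0, 1)

print(across_rows.shape)  # (8, 5, 16)
```

Alternating the two directions lets information flow both between the features of one sample and between samples sharing a feature, which is how the encoder can relate the labeled training rows to the unlabeled query rows in one pass.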

TabPFN's pre-training exclusively uses synthetically generated datasets, avoiding benchmark contamination and the costs of curating real-world data.[2] TabPFN v2 was pre-trained on approximately 130 million such datasets, each serving as a "meta-datapoint".[1]

The synthetic datasets are primarily drawn from a prior distribution embodying causal reasoning principles, using Structural Causal Models (SCMs) or Bayesian Neural Networks (BNNs). Random inputs are passed through these models to generate outputs, with a bias towards simpler causal structures. The process generates diverse datasets that simulate real-world imperfections like missing values, imbalanced data and noise. During pre-training, TabPFN predicts the masked target values of new data points given training data points and their known targets, effectively learning a generic learning algorithm that is executed by running a neural network forward pass.[1]
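A minimal sketch of this data-generating process, using only NumPy, might look as follows. It is a toy stand-in for the prior described above, not TabPFN's actual sampler: the DAG structure, the `tanh` link functions, and the 5% missingness rate are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm_dataset(n_samples=100, n_nodes=6):
    """Draw one synthetic dataset from a random structural causal
    model: nodes are ordered (a simple DAG), and each node is a
    noisy nonlinear function of the nodes before it."""
    values = np.zeros((n_samples, n_nodes))
    for j in range(n_nodes):
        noise = rng.normal(size=n_samples)
        if j == 0:
            values[:, j] = noise                      # root cause: pure noise
        else:
            w = rng.normal(size=j)                    # random causal weights
            values[:, j] = np.tanh(values[:, :j] @ w) + 0.1 * noise
    # One node becomes the prediction target; the rest are features.
    target_col = rng.integers(n_nodes)
    y = values[:, target_col]
    X = np.delete(values, target_col, axis=1)
    # Simulate real-world imperfections: randomly missing cells.
    X[rng.random(X.shape) < 0.05] = np.nan
    return X, y

X, y = sample_scm_dataset()
print(X.shape, y.shape)  # (100, 5) (100,)
```

During pre-training, each dataset drawn this way is split into "training" rows with visible targets and "query" rows whose targets are masked, and the network is trained to predict the masked values.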

Performance

TabPFN v2 has been found to outperform tuned tree-based models like XGBoost or CatBoost in accuracy and speed on small tabular datasets.[2][failed verification] For one application, it matched the accuracy of CatBoost with less training data.[12][failed verification] According to the Nature commentary, traditional models may be more efficient on larger datasets.[7][failed verification] The original v1 release of TabPFN imposed restrictions on multi-class classification tasks, a shortcoming that v2 partially addresses.[2][failed verification]

Understanding and explaining the behavior and performance of TabPFN is an active area of research.[5][failed verification]

Other research and applications

TabPFN has been applied in domains such as time series forecasting,[6] chemoproteomics,[13] insurance risk classification,[14] medical diagnostics,[15][16][17][18] metagenomics,[19] wildfire propagation modeling,[20] and others.[21]


References

  1. ^ a b c d e f g h Hollmann, N.; Müller, S.; Purucker, L. (2025). "Accurate predictions on small data with a tabular foundation model". Nature. 637 (8045): 319–326. Bibcode:2025Natur.637..319H. doi:10.1038/s41586-024-08328-6. PMC 11711098. PMID 39780007.
  2. ^ a b c d e f g Hollmann, Noah (2023). TabPFN: A transformer that solves small tabular classification problems in a second. International Conference on Learning Representations (ICLR).
  3. ^ a b c "tabpfn". Python Package Index (PyPI). https://pypi.org/project/tabpfn/
  4. ^ PriorLabs/TabPFN, Prior Labs, 2025-06-22, retrieved 2025-06-23
  5. ^ a b Kahn, Jeremy (5 February 2025). "AI has struggled to analyze tables and spreadsheets. This German startup thinks its breakthrough is about to change that". Fortune.
  6. ^ a b "TabPFN Time Series". GitHub.
  7. ^ a b c McElfresh, Duncan C. (8 January 2025). "The AI tool that can interpret any spreadsheet instantly". Nature. 637 (8045): 274–275. Bibcode:2025Natur.637..274M. doi:10.1038/d41586-024-03852-x. PMID 39780000.
  8. ^ Shwartz-Ziv, Ravid; Armon, Amitai (2022). "Tabular data: Deep learning is not all you need". Information Fusion. 81: 84–90. arXiv:2106.03253. doi:10.1016/j.inffus.2021.11.011.
  9. ^ Grinsztajn, Léo; Oyallon, Edouard; Varoquaux, Gaël (2022). Why do tree-based models still outperform deep learning on typical tabular data?. Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS '22). pp. 507–520.
  10. ^ Müller, Samuel (2022). Transformers can do Bayesian inference. International Conference on Learning Representations (ICLR).
  11. ^ McCarter, Calvin (May 7, 2024). "What exactly has TabPFN learned to do? | ICLR Blogposts 2024". iclr-blogposts.github.io. Retrieved 2025-06-22.
  12. ^ Bender, C.; Vestergaard, P.; Cichosz, S.L. (2025). "The History, Evolution and Future of Continuous Glucose Monitoring (CGM)". Diabetology. 6 (3): 17. doi:10.3390/diabetology6030017.
  13. ^ Offensperger, Fabian; Tin, Gary; Duran-Frigola, Miquel; Hahn, Elisa; Dobner, Sarah; Ende, Christopher W. am; Strohbach, Joseph W.; Rukavina, Andrea; Brennsteiner, Vincenth; Ogilvie, Kevin; Marella, Nara; Kladnik, Katharina; Ciuffa, Rodolfo; Majmudar, Jaimeen D.; Field, S. Denise; Bensimon, Ariel; Ferrari, Luca; Ferrada, Evandro; Ng, Amanda; Zhang, Zhechun; Degliesposti, Gianluca; Boeszoermenyi, Andras; Martens, Sascha; Stanton, Robert; Müller, André C.; Hannich, J. Thomas; Hepworth, David; Superti-Furga, Giulio; Kubicek, Stefan; Schenone, Monica; Winter, Georg E. (26 April 2024). "Large-scale chemoproteomics expedites ligand discovery and predicts ligand behavior in cells". Science. 384 (6694): eadk5864. Bibcode:2024Sci...384k5864O. doi:10.1126/science.adk5864. PMID 38662832.
  14. ^ Chu, Jasmin Z. K.; Than, Joel C. M.; Jo, Hudyjaya Siswoyo (2024). "Deep Learning for Cross-Selling Health Insurance Classification". 2024 International Conference on Green Energy, Computing and Sustainable Technology (GECOST). pp. 453–457. doi:10.1109/GECOST60902.2024.10475046. ISBN 979-8-3503-5790-5.
  15. ^ Alzakari, Sarah A.; Aldrees, Asma; Umer, Muhammad; Cascone, Lucia; Innab, Nisreen; Ashraf, Imran (December 2024). "Artificial intelligence-driven predictive framework for early detection of still birth". SLAS Technology. 29 (6): 100203. doi:10.1016/j.slast.2024.100203. PMID 39424101.
  16. ^ El-Melegy, Moumen; Mamdouh, Ahmed; Ali, Samia; Badawy, Mohamed; El-Ghar, Mohamed Abou; Alghamdi, Norah Saleh; El-Baz, Ayman (21 June 2024). "Prostate Cancer Diagnosis via Visual Representation of Tabular Data and Deep Transfer Learning". Bioengineering. 11 (7): 635. doi:10.3390/bioengineering11070635. PMC 11274351. PMID 39061717.
  17. ^ Karabacak, Mert; Schupper, Alexander; Carr, Matthew; Margetis, Konstantinos (August 2024). "A machine learning-based approach for individualized prediction of short-term outcomes after anterior cervical corpectomy". Asian Spine Journal. 18 (4): 541–549. doi:10.31616/asj.2024.0048. PMC 11366553. PMID 39113482.
  18. ^ Liu, Yanqing; Su, Zhenyi; Tavana, Omid; Gu, Wei (June 2024). "Understanding the complexity of p53 in a new era of tumor suppression". Cancer Cell. 42 (6): 946–967. doi:10.1016/j.ccell.2024.04.009. PMC 11190820. PMID 38729160.
  19. ^ Perciballi, Giulia; Granese, Federica; Fall, Ahmad; Zehraoui, Farida; Prifti, Edi; Zucker, Jean-Daniel (10 October 2024). Adapting TabPFN for Zero-Inflated Metagenomic Data. Table Representation Learning Workshop at NeurIPS 2024.
  20. ^ Khanmohammadi, Sadegh; Cruz, Miguel G.; Perrakis, Daniel D.B.; Alexander, Martin E.; Arashpour, Mehrdad (September 2024). "Using AutoML and generative AI to predict the type of wildfire propagation in Canadian conifer forests". Ecological Informatics. 82: 102711. doi:10.1016/j.ecoinf.2024.102711.
  21. ^ Peña-Asensio, Eloy; Trigo-Rodríguez, Josep M.; Sort, Jordi; Ibáñez-Insa, Jordi; Rimola, Albert (September 2024). "Machine learning applications on lunar meteorite minerals: From classification to mechanical properties prediction". International Journal of Mining Science and Technology. 34 (9): 1283–1292. Bibcode:2024IJMST..34.1283P. doi:10.1016/j.ijmst.2024.08.001.