Encoding of pretrained large language models mirrors the genetic architectures of human psychological traits.

Abstract

Recent advances in large language models (LLMs) have prompted a rush to use them as universal translators for biomedical terms. However, the black-box nature of LLMs has forced researchers to rely on artificially designed benchmarks without understanding what exactly LLMs encode. We demonstrate that pretrained LLMs can already explain up to 51% of the genetic correlation between items from a psychometrically validated neuroticism questionnaire, without any fine-tuning. For psychiatric diagnoses, we found that disorder names aligned better with genetic relationships than diagnostic descriptions did. Our results indicate that pretrained LLMs have encodings that mirror genetic architectures. These findings highlight LLMs’ potential for validating phenotypes, refining taxonomies, and integrating textual and genetic data in mental health research.
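
To make the kind of comparison described above concrete, the following is a minimal sketch and not the authors' pipeline. It assumes that each questionnaire item is represented by an embedding vector, that semantic similarity is measured with cosine similarity, and that "explained" is read as the squared Pearson correlation across item pairs; the abstract does not state any of these choices. The function names and the toy arrays are hypothetical placeholders, not data from the study.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of `embeddings`."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T


def variance_explained(semantic_sim, genetic_corr):
    """Squared Pearson correlation over the unique off-diagonal item pairs."""
    rows, cols = np.triu_indices_from(semantic_sim, k=1)
    r = np.corrcoef(semantic_sim[rows, cols], genetic_corr[rows, cols])[0, 1]
    return r ** 2


# Toy placeholders only (assumption): random vectors stand in for LLM
# embeddings of five items, and a noisy copy of their similarity matrix
# stands in for a genetic correlation matrix.
rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(5, 16))
semantic_sim = cosine_similarity_matrix(item_embeddings)

noise = rng.normal(scale=0.1, size=semantic_sim.shape)
genetic_corr = semantic_sim + (noise + noise.T) / 2  # kept symmetric

r2 = variance_explained(semantic_sim, genetic_corr)
print(f"Share of variance in the toy genetic correlations explained: {r2:.2f}")
```

Only the upper triangle of each matrix is compared, so every item pair is counted once and the trivial self-similarities on the diagonal do not inflate the estimate.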

Competing Interest Statement

Dr. Paulus advises Spring Care, Inc., receives royalties from an article on methamphetamine in UpToDate, and has a compensated consulting agreement with Boehringer Ingelheim International GmbH.

Funding Statement

This work was partly funded by The William K. Warren Foundation, the National Institute of General Medical Sciences Center (Grant 2 P20GM121312, MPP, RK, KLF), the National Institute on Drug Abuse (U01DA050989, MPP), and the National Institute of Mental Health (R01MH122688, R01MH128959, CCF).

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

All the data used in this study are available in the public domain.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

All data produced in the present work are contained in the manuscript.
