From owner-chemistry@ccl.net Sun Jan 2 23:36:00 2022
From: "Anthony Scemama scemama[]irsamc.ups-tlse.fr"
To: CCL
Subject: CCL: Quantum chemistry interoperability library
Message-Id: <-54566-211231141449-5653-QluyFHLPkH394n21KjIFWQ###server.ccl.net>
X-Original-From: "Anthony Scemama"
Date: Fri, 31 Dec 2021 14:14:47 -0500

Sent to CCL by: "Anthony Scemama" [scemama%a%irsamc.ups-tlse.fr]

Dear all,

I am very pleased to see this discussion on file formats!

In the TREX project, we are working on the definition of a common format
and library for exchanging wave functions: TREXIO
(https://trex-coe.github.io/trexio/index.html). In other words, we would
like to make "the jpeg of wave functions". Such a format is important for
making calculations reproducible.

This project started a few months ago, so it is still at an early stage.
It is an open-source project, so contributions of any kind are welcome:
feedback, new ideas, interfacing codes, etc. It is hosted on GitHub under
the BSD 3-clause license: https://github.com/trex-coe/trexio

TREXIO took inspiration from Q5Cost. However, it differs in some important
aspects:

- The reader of a file should not need to know which code created it. We
  therefore impose strict rules on how the data are stored. For example,
  we impose atomic units everywhere and a fixed ordering of atomic
  orbitals (alphabetical for cartesian: xx, xy, xz, yy, yz, zz, or
  +0, +1, -1, +2, -2, ... for spherical), etc. This can be a strong
  constraint for the codes writing the data, but it greatly simplifies
  the work for all the readers.

- A file should contain all the data required to evaluate the wave
  function, without requiring any external knowledge. For instance, we
  don't store "cc-pVDZ", but we store explicitly all the basis set
  parameters. The normalization coefficients for the primitives and
  atomic orbitals are also stored explicitly, so that a program reading a
  file doesn't need to compute overlap integrals to obtain these
  coefficients.
- TREXIO is written in C, so any language can use it. Today, it is used
  in C, C++, Fortran, Python and Julia. Contributions for bindings in
  other languages are welcome.

- Similarly to Q5Cost, TREXIO uses HDF5 as the default back end, but it
  also provides a text back end. The text back end is useful for
  debugging (making diffs between files), and for storing data in git
  repositories where text files are preferred. TREXIO can also be
  configured without HDF5 support if necessary, so that only a C compiler
  is needed to compile TREXIO. In this way, TREXIO does not contaminate
  users' codes with a long list of dependencies.

- For the Fortran binding we provide an ISO-C-binding interface source
  file, instead of a compiled module (.mod files). This file behaves as a
  C header, to be compiled with the user's source code. This allows the
  TREXIO library to be compiled once and for all and then used with
  different Fortran compilers, because .mod files are not compatible
  between different compilers, or even between different versions of
  gfortran.

- Everything in TREXIO is stored in the highest possible precision: all
  integers and floats are always 64-bit. Precision is sometimes reduced
  in the back end when it can be done safely. For example, when the
  number of orbitals is less than 256, the orbital indices can be stored
  on one byte instead of 8.

- We introduce a particular type for array indices, because some
  languages use 1-based (Fortran) and others 0-based (C, Python, ...)
  indexing. If we store an array of indices, it will contain values
  consistent with the conventions of the language.

- TREXIO files are immutable: the data is written once, and can't be
  overwritten. This might seem inconvenient, but it guarantees some
  consistency between the data. For instance, if a file contains
  canonical molecular orbitals and the corresponding electron repulsion
  integrals, overwriting the MOs with natural orbitals would make the
  integrals inconsistent.
  Instead, the user is expected to create a new file containing the new
  set of MOs.

- The library is automatically generated from its documentation: an
  org-mode file produces a JSON dictionary which is then used to generate
  the functions of the library from templates. This makes it very simple
  for non-experts to add new functions to TREXIO, or even to start the
  definition of a totally new file format for another community using
  the same machinery.

Today, TREXIO can store:

- Metadata
- The molecular geometry
- Basis set
- Expansion from Basis to AOs (ordering of functions + normalization
  factors)
- MO coefficients
- Pseudopotentials
- One- and two-electron integrals required for the energy
- One- and two-electron reduced density matrices

We are now working on the implementation of CI/CSF coefficients.

We have also set up a repo for tools manipulating TREXIO files:
https://github.com/TREX-CoE/trexio_tools
It will contain simple scripts for conversion between formats (molden,
fcidump, ...), checking the consistency of the files (comparing
numerically evaluated overlap integrals with integrals stored in the
file, or computing the energy using integrals and density matrices),
conversion from cartesian to spherical basis sets, conversion from CSF to
determinant representation, ...

Note: The people involved in the TREX project work on expensive methods
like QMC, CI, CC, ... So the goal of TREXIO is not to store MD
trajectories or observables, but to store a single-point wave function,
which can be large.

Any comments, suggestions, feedback, ... will be appreciated. Please get
in touch with us if you are interested in this project!

Anthony

--
Anthony Scemama
IRSAMC / Laboratoire de Chimie et Physique Quantiques
UMR5626 CNRS/UPS
Toulouse
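
To illustrate the index type and the precision reduction described in the
list above, here is a minimal Python sketch. It is not the TREXIO
implementation: the function names are hypothetical, and the 0-based
on-disk convention is an assumption made only for this example.

```python
# Hypothetical helpers for illustration; these are NOT part of the TREXIO API.

def to_file_indices(indices, one_based):
    """Convert caller-side indices to the on-disk convention.

    Assumption for this sketch: indices are stored 0-based on disk.
    `one_based` is True for a Fortran-like caller, False for C or Python.
    """
    shift = 1 if one_based else 0
    return [i - shift for i in indices]


def from_file_indices(stored, one_based):
    """Convert on-disk (assumed 0-based) indices to the caller's convention."""
    shift = 1 if one_based else 0
    return [i + shift for i in stored]


def bytes_per_index(n_orbitals):
    """Smallest integer width, in bytes, used to store orbital indices.

    Follows the rule quoted in the message: fewer than 256 orbitals means
    one byte per index; larger counts fall through to 2, 4 or 8 bytes.
    """
    for width in (1, 2, 4, 8):
        if n_orbitals < 2 ** (8 * width):
            return width
    raise ValueError("too many orbitals for a 64-bit index")


if __name__ == "__main__":
    fortran_side = [1, 2, 3]                      # 1-based, as a Fortran code sees them
    on_disk = to_file_indices(fortran_side, one_based=True)
    python_side = from_file_indices(on_disk, one_based=False)
    print(on_disk, python_side, bytes_per_index(120))
```

With a scheme of this kind, a Fortran writer and a Python reader each see
indices in their own convention, and the API still exposes 64-bit integers
regardless of the width chosen for storage.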