From owner-chemistry@ccl.net Thu Aug 16 20:01:00 2018
From: "Andrew Dalke dalke=-=dalkescientific.com"
To: CCL
Subject: CCL: chemfp 1.5 and the chemfp benchmark
Message-Id: <-53439-180816195649-24787-omq1S9lBvWE7CUbkuSiEKA*server.ccl.net>
X-Original-From: Andrew Dalke
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=us-ascii
Date: Fri, 17 Aug 2018 01:56:39 +0200
Mime-Version: 1.0 (Mac OS X Mail 10.2 \(3259\))

Sent to CCL by: Andrew Dalke [dalke|dalkescientific.com]

Dear CCL subscribers,

chemfp 1.5 is now available from http://dalkescientific.com/releases/chemfp-1.5.tar.gz and from PyPI (the Python package index) through "pip install chemfp". The software is available in source code form under the MIT license.

For more information see the home page at http://chemfp.com/ or the documentation page at https://chemfp.readthedocs.io/en/chemfp-1.5/ .

Chemfp is a set of command-line tools and a Python library for working with cheminformatics fingerprints. It can use OEChem/OEGraphSim, RDKit, or Open Babel to create fingerprints in the FPS format, and it implements a high-speed Tanimoto search.

As far as I can tell, chemfp 1.5 is the fastest free/open source fingerprint search system for the CPU. (Some proprietary/commercial toolkits are faster, including the commercial version of chemfp, and GPU-based search is usually faster than the CPU.)

The main changes for this release are:

 - 10% faster performance for k-nearest search
 - fixed a bug in symmetric k-nearest neighbor when multiple fingerprints have no bits set
 - improved the use of chemfp as a baseline benchmark for similarity search tools

## Similarity search performance benchmark

Concerning the last point, I have assembled a data set which can be used to benchmark similarity search performance for several different search types, fingerprint types, and scoring functions.
This includes pre-computed fingerprints and expected search results, as well as timing numbers for several different versions of chemfp.

My hope is that it evolves into a standard benchmark that helps evaluate search tools - bearing in mind that performance is only one of many factors that go into selecting a tool.

The benchmark files are at https://bitbucket.org/dalke/chemfp_benchmark . Those files which fall under copyright are distributed under the MIT license.

Many thanks to ChEMBL, OpenEye, PubChem, Open Babel, RDKit, and Daniel Lemire for providing the data and resources for putting this benchmark together.

Best regards,

Andrew
dalke*_*dalkescientific.com