A practical guide to building ML models with PaDELPy

A machine learning model is the output of a training process: a mathematical representation of a real-world process. It is trained to recognize patterns in a dataset by running a learning algorithm over that data. Once properly trained, the model can reason about data it has not seen before and make predictions about it. Suppose you want to build an app that recognizes a user’s emotions from their facial expressions. You could train a model on images of faces, each labeled with an emotion, and then include the trained model in an application that recognizes the emotion of any user. Machine learning models fall into two broad categories, supervised and unsupervised, and supervised models are further divided into regression and classification models.

Machine learning algorithms find the model in a training dataset, which is then used to approximate the target function that maps the inputs to the outputs in the available data. In machine learning terms, an algorithm is a procedure that runs on data to create a model. Early work in machine learning (ML) was largely theoretical, exploring how computers could recognize patterns in data and learn from them. Machine learning has since grown far more complex. Although the algorithms have been around for a long time, recent developments have made it much faster and more practical to apply them to large, complex datasets. Applying them well can give an organization a leg up on its competition.

Machine learning algorithms perform pattern recognition by learning from, or adapting to, a set of data. Many have been developed for specific tasks: classification algorithms such as k-nearest neighbors, regression algorithms such as linear regression, and clustering algorithms such as k-means. An algorithm can be described using math and pseudocode, and its efficiency and accuracy can be analyzed and measured. Machine learning algorithms can be implemented in any modern programming language, such as Python or R. These languages provide a wide range of libraries that practitioners can use to build complex, layered algorithms for their projects.

What is PaDELPy?

PaDELPy is an open source library that provides a Python wrapper for PaDEL-Descriptor, a molecular descriptor calculation software. A descriptor is a mathematical representation of a molecule’s properties, used to relate the structure of a compound to its biological activity. PaDEL-Descriptor can calculate molecular fingerprints for specific molecules, which are then used to build scientific machine learning models. PaDEL-Descriptor is Java software, so using it directly means running a Java file. The PaDELPy library makes it easier to calculate molecular fingerprints from Python: there is no need to run the jar file yourself (a Java runtime is still required on the machine), which shortens the installation process and speeds up implementation.

Getting started with the code

In this article, we will create a scientific machine learning model: we use the PaDELPy library to calculate molecular fingerprints and then use a Random Forest to predict the molecular activity of the compounds in the HCV drug dataset. The following code is inspired by the creators of the PaDELPy library, whose GitHub repository can be accessed from the link here. If you want to download the raw dataset used in this implementation, you can use the following link.

Let’s get started!

Library installation

Our first step here will be to install the PaDELPy library. To do this, you can run the following line of code,

#installing the library
!pip install padelpy

Loading PaDELPy files

Next, we download the fingerprint descriptor definition files that PaDEL needs; they are provided in XML format.

#Downloading the XML data files
!wget https://github.com/dataprofessor/padel/raw/main/fingerprints_xml.zip
!unzip fingerprints_xml.zip
#listing and sorting the downloaded files
import glob
xml_files = glob.glob("*.xml")
xml_files.sort()
xml_files

Output:

['AtomPairs2DFingerprintCount.xml',
 'AtomPairs2DFingerprinter.xml',
 'EStateFingerprinter.xml',
 'ExtendedFingerprinter.xml',
 'Fingerprinter.xml',
 'GraphOnlyFingerprinter.xml',
 'KlekotaRothFingerprintCount.xml',
 'KlekotaRothFingerprinter.xml',
 'MACCSFingerprinter.xml',
 'PubchemFingerprinter.xml',
 'SubstructureFingerprintCount.xml',
 'SubstructureFingerprinter.xml']

#Creating a list of fingerprint names, in the same order as the sorted files
FP_list = ['AtomPairs2DCount',
 'AtomPairs2D',
 'EState',
 'CDKextended',
 'CDK',
 'CDKgraphonly',
 'KlekotaRothCount',
 'KlekotaRoth',
 'MACCS',
 'PubChem',
 'SubstructureCount',
 'Substructure']

We now create a dictionary that maps each fingerprint name to its XML file,

#Creating Data Dictionary
fp = dict(zip(FP_list, xml_files))
fp

Output:

{'AtomPairs2D': 'AtomPairs2DFingerprinter.xml',
 'AtomPairs2DCount': 'AtomPairs2DFingerprintCount.xml',
 'CDK': 'Fingerprinter.xml',
 'CDKextended': 'ExtendedFingerprinter.xml',
 'CDKgraphonly': 'GraphOnlyFingerprinter.xml',
 'EState': 'EStateFingerprinter.xml',
 'KlekotaRoth': 'KlekotaRothFingerprinter.xml',
 'KlekotaRothCount': 'KlekotaRothFingerprintCount.xml',
 'MACCS': 'MACCSFingerprinter.xml',
 'PubChem': 'PubchemFingerprinter.xml',
 'Substructure': 'SubstructureFingerprinter.xml',
 'SubstructureCount': 'SubstructureFingerprintCount.xml'}
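
Because dict(zip(...)) pairs the two lists purely by position, the mapping is only correct while the name list and the sorted file list have the same length and order. The following is a minimal sketch of a guard against a silent mismatch, using shortened copies of the lists; the helper name build_fp_dict is hypothetical and not part of PaDELPy.

```python
def build_fp_dict(names, files):
    """Pair fingerprint names with XML files by position.

    Hypothetical helper: sorts the files the same way the tutorial does and
    refuses to continue if the two lists differ in length, because plain
    zip() would silently drop the unmatched entries.
    """
    files = sorted(files)
    if len(names) != len(files):
        raise ValueError(f"{len(names)} names but {len(files)} XML files")
    return dict(zip(names, files))


# Shortened example lists (three of the twelve fingerprints).
names = ["EState", "MACCS", "PubChem"]
files = ["PubchemFingerprinter.xml", "EStateFingerprinter.xml", "MACCSFingerprinter.xml"]

fp_small = build_fp_dict(names, files)
print(fp_small)

# A missing file is reported instead of producing a shorter dictionary.
try:
    build_fp_dict(names, files[:2])
except ValueError as err:
    print("mismatch detected:", err)
```

If a download is incomplete, this fails early rather than mapping a fingerprint name to the wrong XML file.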

Loading the dataset

With the PaDEL descriptor files in place, we next load the dataset whose fingerprints we will calculate.

#Loading the dataset
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/dataprofessor/data/master/HCV_NS5B_Curated.csv')
 
#Loading data head
df.head()
#Loading data tail
df.tail(2)

To calculate molecular descriptors with PaDEL, we first prepare the input file by concatenating the two relevant columns of the dataset, the SMILES string and the compound ID, and writing them to a tab-separated file.

#Concatenating necessary columns
df2 = pd.concat( [df['CANONICAL_SMILES'],df['CMPD_CHEMBLID']], axis=1 )
df2.to_csv('molecule.smi', sep='\t', index=False, header=False)
df2
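
The input file is meant to hold one molecule per line: the SMILES string, a tab, then the identifier. As a small sketch of that layout, the snippet below writes two rows to an in-memory buffer with the standard csv module and reads them back. The two molecules and their names are hypothetical stand-ins for the dataset columns, and the SMILES-then-name layout is our assumption about what PaDEL reads from a .smi file.

```python
import csv
import io

# Hypothetical (SMILES, identifier) pairs standing in for the
# CANONICAL_SMILES and CMPD_CHEMBLID columns.
rows = [("CCO", "MOL_ETHANOL"), ("c1ccccc1", "MOL_BENZENE")]

# Write tab-separated lines without a header, mirroring molecule.smi.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
smi_text = buffer.getvalue()

# Read the lines back and confirm each one splits into exactly two fields.
for line in smi_text.splitlines():
    smiles, name = line.split("\t")
    print(f"{name}: {smiles}")
```

Opening molecule.smi in a text editor and checking for the same two-column layout is a quick way to catch a wrong separator before running the calculation.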

PaDEL provides 12 types of fingerprints. To calculate any of them, we set the descriptortypes input argument to the corresponding entry of the dictionary variable fp,

#listing the dictionary pairs
fp

Output:

{'AtomPairs2D': 'AtomPairs2DFingerprinter.xml',
 'AtomPairs2DCount': 'AtomPairs2DFingerprintCount.xml',
 'CDK': 'Fingerprinter.xml',
 'CDKextended': 'ExtendedFingerprinter.xml',
 'CDKgraphonly': 'GraphOnlyFingerprinter.xml',
 'EState': 'EStateFingerprinter.xml',
 'KlekotaRoth': 'KlekotaRothFingerprinter.xml',
 'KlekotaRothCount': 'KlekotaRothFingerprintCount.xml',
 'MACCS': 'MACCSFingerprinter.xml',
 'PubChem': 'PubchemFingerprinter.xml',
 'Substructure': 'SubstructureFingerprinter.xml',
 'SubstructureCount': 'SubstructureFingerprintCount.xml'}

To calculate a molecular fingerprint, we look up the XML file for the type we want. For example, for the PubChem fingerprint,

#Looking up the XML file for PubChem
fp['PubChem']

Next, we set up the fingerprint calculation module. Note that the code below calculates the Substructure fingerprint,

#Setting the fingerprint module
 
from padelpy import padeldescriptor
 
fingerprint="Substructure"
 
fingerprint_output_file="".join([fingerprint,'.csv']) #Substructure.csv
fingerprint_descriptortypes = fp[fingerprint]
 
padeldescriptor(mol_dir="molecule.smi", 
                d_file=fingerprint_output_file, #'Substructure.csv'
                #descriptortypes="SubstructureFingerprint.xml", 
                descriptortypes= fingerprint_descriptortypes,
                detectaromaticity=True,
                standardizenitro=True,
                standardizetautomers=True,
                threads=2,
                removesalt=True,
                log=True,
                fingerprints=True)
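
The call above computes one fingerprint type. To cover several of the 12 types, the same keyword arguments can be generated once per dictionary entry. The sketch below only builds those argument sets and does not run PaDEL; the helper name build_fingerprint_jobs is hypothetical, and the keyword names are copied from the call above.

```python
def build_fingerprint_jobs(fp_files, mol_file="molecule.smi"):
    """Return one dict of padeldescriptor keyword arguments per fingerprint.

    Hypothetical helper: fp_files maps a fingerprint name to its XML file,
    like the fp dictionary in the tutorial.
    """
    jobs = []
    for name, xml_file in sorted(fp_files.items()):
        jobs.append({
            "mol_dir": mol_file,
            "d_file": f"{name}.csv",      # e.g. 'PubChem.csv'
            "descriptortypes": xml_file,
            "detectaromaticity": True,
            "standardizenitro": True,
            "standardizetautomers": True,
            "threads": 2,
            "removesalt": True,
            "log": True,
            "fingerprints": True,
        })
    return jobs


# Two entries are enough to illustrate the idea.
example_fp = {
    "Substructure": "SubstructureFingerprinter.xml",
    "PubChem": "PubchemFingerprinter.xml",
}
jobs = build_fingerprint_jobs(example_fp)
for job in jobs:
    print(job["descriptortypes"], "->", job["d_file"])

# With padelpy and Java available, each job could then be run as:
#     padeldescriptor(**job)
```

Keeping the argument sets separate from the actual call makes it easy to inspect what will be computed before starting a run that may take a while.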

Display calculated fingerprints,

descriptors = pd.read_csv(fingerprint_output_file)
descriptors
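
The modeling code in this article pairs each fingerprint row with an activity label by position. That is only safe if the output rows come back in the same order as the input file, which is worth checking rather than assuming. A minimal sketch of aligning labels by identifier instead, assuming the Name column of the output holds the identifiers written to molecule.smi; the identifiers, labels, and the helper align_labels are hypothetical.

```python
def align_labels(descriptor_names, compound_ids, activities):
    """Return the activity labels reordered to match descriptor_names.

    Hypothetical helper: looks labels up by identifier rather than relying
    on both tables sharing the same row order.
    """
    activity_by_id = dict(zip(compound_ids, activities))
    missing = [name for name in descriptor_names if name not in activity_by_id]
    if missing:
        raise KeyError(f"no activity label for: {missing}")
    return [activity_by_id[name] for name in descriptor_names]


# Hypothetical dataset order.
compound_ids = ["CHEMBL_A", "CHEMBL_B", "CHEMBL_C"]
activities = ["active", "inactive", "active"]

# Hypothetical descriptor output whose rows came back in a different order.
descriptor_names = ["CHEMBL_B", "CHEMBL_A", "CHEMBL_C"]

y_aligned = align_labels(descriptor_names, compound_ids, activities)
print(y_aligned)
```

When the two orders already agree, this returns the labels unchanged, so the check costs nothing in the common case.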

We will now build a random forest classification model from the processed data,

X = descriptors.drop('Name', axis=1)
y = df['Activity'] #target being predicted

#removing the low variance features
from sklearn.feature_selection import VarianceThreshold
 
def remove_low_variance(input_data, threshold=0.1):
    selection = VarianceThreshold(threshold)
    selection.fit(input_data)
    return input_data[input_data.columns[selection.get_support(indices=True)]]
 
X = remove_low_variance(X, threshold=0.1)
X
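
To get a feel for what a threshold of 0.1 means, assume the fingerprint columns are binary (0 or 1). A column in which a fraction p of the molecules has the bit set has variance p * (1 - p), so the filter keeps roughly the bits that are set in more than about 11% and fewer than about 89% of the molecules. The sketch below works this out with the standard library only; the two example columns are made up.

```python
import math
from statistics import pvariance

threshold = 0.1

# For a 0/1 column with a fraction p of ones, the population variance is
# p * (1 - p). Solving p * (1 - p) = threshold gives the two cut-offs.
low_cut = (1 - math.sqrt(1 - 4 * threshold)) / 2
high_cut = 1 - low_cut
print("bits kept when set in between", round(low_cut, 3), "and", round(high_cut, 3))

# Made-up example columns for 100 molecules.
rare_bit = [1] * 5 + [0] * 95      # set in 5% of molecules
common_bit = [1] * 30 + [0] * 70   # set in 30% of molecules

for label, column in [("rare_bit", rare_bit), ("common_bit", common_bit)]:
    variance = pvariance(column)
    status = "kept" if variance > threshold else "dropped"
    print(label, round(variance, 4), status)
```

Bits that are almost always 0 or almost always 1 carry little information for separating classes, which is why dropping them is a reasonable first filter.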

After the filter, only the fingerprint bits whose variance exceeds the threshold remain as features. Next, we split the data, train the model, and make predictions.

#Splitting into Train And Test
from sklearn.model_selection import train_test_split
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


#Printing Shape
X_train.shape, X_test.shape


((462, 18), (116, 18))


#Implementing Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
 
model = RandomForestClassifier(n_estimators=500, random_state=42)
model.fit(X_train, y_train)

Output:

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features="auto",
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

Predict the molecular activity of the compounds with the trained model,

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

Calculate the performance on the training split using the Matthews correlation coefficient (MCC),

mcc_train = matthews_corrcoef(y_train, y_train_pred)
mcc_train

0.833162800019916

Calculate the performance on the test split using the Matthews correlation coefficient,

mcc_test = matthews_corrcoef(y_test, y_test_pred)
mcc_test

0.5580628933757674
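
For a two-class problem, the Matthews correlation coefficient can be computed directly from the four confusion-matrix counts: (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). It ranges from -1 to 1, with 0 meaning no better than chance. The sketch below is a plain-Python illustration with made-up counts, assuming the activity label has two classes; the function name mcc_from_counts is ours.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the score is 0 when any row or column total is zero.
    return numerator / denominator if denominator else 0.0


# Balanced example: 80 of 100 predictions correct.
balanced = mcc_from_counts(tp=40, tn=40, fp=10, fn=10)

# Imbalanced example: always predicting the majority class gives
# 90% accuracy but an MCC of 0.
majority_only = mcc_from_counts(tp=90, tn=0, fp=10, fn=0)

print(balanced, majority_only)
```

The second example shows why MCC is often preferred over accuracy when one class is much more frequent than the other.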

#performing 5-fold cross validation on the training split
from sklearn.model_selection import cross_val_score
 
rf = RandomForestClassifier(n_estimators=500, random_state=42)
cv_scores = cross_val_score(rf, X_train, y_train, cv=5)
cv_scores

array([0.83870968, 0.80645161, 0.86956522, 0.86956522, 0.81521739])

#calculating the mean over the five folds (these are accuracy scores, because cross_val_score was called without a scoring argument)
mcc_cv = cv_scores.mean()
mcc_cv

0.8399018232819074
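
When cross_val_score is called without a scoring argument, it uses the estimator's own score method, which for classifiers is accuracy. To obtain cross-validated MCC values that are comparable with the training and test MCC, a scorer can be passed explicitly. The sketch below shows this on synthetic data from make_classification, not on the HCV dataset; the variable names and the smaller forest are choices made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import cross_val_score

# Synthetic two-class data standing in for the fingerprint matrix.
X_demo, y_demo = make_classification(n_samples=200, n_features=18, random_state=42)

# Wrap matthews_corrcoef so cross_val_score can use it as the metric.
mcc_scorer = make_scorer(matthews_corrcoef)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)

# Same folds, two metrics: MCC via the explicit scorer, accuracy by default.
cv_mcc = cross_val_score(rf_demo, X_demo, y_demo, cv=5, scoring=mcc_scorer)
cv_acc = cross_val_score(rf_demo, X_demo, y_demo, cv=5)

print("MCC per fold:     ", cv_mcc.round(3))
print("Accuracy per fold:", cv_acc.round(3))
```

On the article's data, passing scoring=mcc_scorer to the existing cross_val_score call would be the corresponding change; the resulting numbers would differ from the accuracy values shown above.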

#collecting the performance metrics in a single dataframe
model_name = pd.Series(['Random forest'], name="Name")
mcc_train_series = pd.Series(mcc_train, name="MCC_train")
mcc_cv_series = pd.Series(mcc_cv, name="MCC_cv")
mcc_test_series = pd.Series(mcc_test, name="MCC_test")
 
performance_metrics = pd.concat([model_name, mcc_train_series, mcc_cv_series, mcc_test_series], axis=1)
performance_metrics

As the performance metrics show, the random forest model reaches an MCC of about 0.83 on the training split and about 0.56 on the test split, so it predicts the molecular activity reasonably well but fits the training data noticeably better than unseen data. You can also try other algorithms and compare their performance!

End Notes

In this article, we looked at what a machine learning model is. We also implemented a scientific machine learning model, using the PaDELPy library to calculate molecular fingerprints with PaDEL-Descriptor, and evaluated its performance metrics. The implementation above is also available as a Colab notebook, which you can access using the link here.
