Tips for Drug Discovery Practitioners, Part 2

🧬 Molecular Databases & Search (HGET/TGET-related)

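ZINC20 (2020 release)

ZINC15 held roughly 200 million compounds; ZINC20 lists more than 1 billion purchasable compounds
Bulk download works with wget from the tranches/ directory
Practical tip: pulling only the lead-like or fragment-like subsets is roughly 80% faster than pulling everything
Docking-ready 3D conformers are also provided (MOL2 format)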

wget -r -l1 -np -nd -A "*.mol2.gz" http://files.docking.org/catalogs/

SureChEMBL vs PatCID

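SureChEMBL: chemical structures extracted from patents, on the order of tens of millions of unique compounds (API: https://www.surechembl.org/api/)
PatCID: PubChem's patent subset, cross-referenced by CID
Combined approach: find structures with SureChEMBL → map bioactivity data through PatCID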

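# SureChEMBL API example (endpoint path as given in these notes; check the current API docs)
import requests
schembl_id = "SCHEMBL1"  # placeholder ID for illustration
r = requests.get(f"https://www.surechembl.org/api/compounds/{schembl_id}", timeout=30)
r.raise_for_status()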

ChemSpider API (free, with lesser-known features)

Search several databases at once by InChIKey (aggregates ChEMBL, PubChem, DrugBank)
Real-time commercial availability check: commercialAvailability endpoint

# Lesser-known endpoint
GET /v2/compounds/{id}/details/suppliers

MolPort API

Real-time price and stock checks across 800+ vendors
Building block screening (synthetic feasibility estimate)
Free tier: 1,000 requests per month

COCONUT (Natural Products)

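About 400,000 natural products, with SMILES and 3D structures
Cleaner data than the older UNPD
MongoDB dump available, so you can host it locally

# Query directly with pymongo (database and collection names below are placeholders)
from pymongo import MongoClient
collection = MongoClient()["COCONUT"]["uniqueNaturalProduct"]
results = collection.find({"molecular_weight": {"$lt": 500}})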

🎯 Target Discovery & Validation (TGET-related)

AlphaFold Protein Structure Database

Predicted structures for over 200 million proteins (nearly all of UniProt)
Direct download: https://ftp.ebi.ac.uk/pub/databases/alphafold/
Filter by confidence using the pLDDT score

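# Keep only residues with pLDDT > 90 (AlphaFold stores pLDDT in the B-factor column)
from Bio.PDB import PDBParser
parser = PDBParser(QUIET=True)
structure = parser.get_structure('protein', 'AF-P12345.pdb')
# Skip residues without a CA atom instead of raising KeyError
high_conf_residues = [r for r in structure.get_residues()
                      if 'CA' in r and r['CA'].get_bfactor() > 90]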
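
For a dependency-free variant of the filter above, the sketch below reads the B-factor column straight from PDB ATOM records. It assumes the file follows the fixed-column PDB layout and that pLDDT sits in the B-factor field, as in AlphaFold downloads. The names `high_confidence_residues` and `atom_line`, and the three-residue sample, are illustrative only.

```python
def atom_line(serial, name, resname, chain, resseq, bfactor):
    """Build a fixed-column PDB ATOM record (coordinates are dummy values)."""
    fmt = "ATOM  {:>5} {:<4} {:>3} {}{:>4}    {:>8.3f}{:>8.3f}{:>8.3f}{:>6.2f}{:>6.2f}"
    return fmt.format(serial, name, resname, chain, resseq, 0.0, 0.0, 0.0, 1.0, bfactor)


def high_confidence_residues(pdb_text, threshold=90.0):
    """Return (chain, residue number, residue name) for CA atoms with pLDDT > threshold."""
    residues = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        if line[12:16].strip() != "CA":   # columns 13-16: atom name
            continue
        plddt = float(line[60:66])        # columns 61-66: B-factor (pLDDT)
        if plddt > threshold:
            residues.append((line[21], int(line[22:26]), line[17:20].strip()))
    return residues


# Made-up sample: three residues with different confidence values
sample = "\n".join([
    atom_line(1, "N", "MET", "A", 1, 95.10),
    atom_line(2, "CA", "MET", "A", 1, 95.10),
    atom_line(3, "CA", "GLY", "A", 2, 62.30),
    atom_line(4, "CA", "LYS", "A", 3, 91.00),
])
confident = high_confidence_residues(sample)
```

Reading one value per residue from the CA atom mirrors the Biopython snippet; averaging over all atoms of a residue would be an equally reasonable choice.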

OpenTargets Platform API

Disease-target association scores (built on genetic evidence among other data types)
The GraphQL endpoint supports complex queries

{
  target(ensemblId: "ENSG00000139618") {
    associatedDiseases {
      rows {
        disease { name }
        score
        datatypeScores { id score }
      }
    }
  }
}
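
One way to send a query like the one above is a JSON POST. The sketch below only builds the request object and does not contact the server. The endpoint URL `https://api.platform.opentargets.org/api/v4/graphql` is an assumption about the current public deployment, and `build_request` is a hypothetical helper name.

```python
import json
import urllib.request

# Assumed public GraphQL endpoint; verify against the Open Targets documentation
ENDPOINT = "https://api.platform.opentargets.org/api/v4/graphql"

QUERY = """
query targetDiseases($ensemblId: String!) {
  target(ensemblId: $ensemblId) {
    associatedDiseases {
      rows {
        disease { name }
        score
        datatypeScores { id score }
      }
    }
  }
}
"""


def build_request(ensembl_id):
    """Package the query and its variable as a JSON POST request."""
    payload = {"query": QUERY, "variables": {"ensemblId": ensembl_id}}
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request("ENSG00000139618")
# To actually run it:
# with urllib.request.urlopen(req) as resp:
#     rows = json.load(resp)["data"]["target"]["associatedDiseases"]["rows"]
```

Passing the Ensembl ID as a GraphQL variable keeps the query text constant, which makes it easier to loop over many targets.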

Pharos / TCRD

NIH database that classifies targets by development level
Best place to find Tdark (understudied) targets
API: https://pharos.nih.gov/idg/api/v1/
In practice: screen for targets with little competition

# Tdark & druggable targets
params = {'tdl': 'Tdark', 'facet': 'Target Development Level'}

STRING DB (Protein Interaction)

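Bulk download of PPI networks
Recommended filter: combined score > 0.7 (700 on the 0-1000 scale used in the download files and API)

# Find proteins that interact with the target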
import requests
r = requests.get('https://string-db.org/api/json/network',
                 params={'identifiers': 'TP53', 'species': 9606})
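
For the bulk files, the score filter can be applied while streaming lines. This is a minimal sketch that assumes the three-column, space-separated layout of the STRING links files (`protein1 protein2 combined_score`, scores on a 0-1000 scale). The function name `filter_links` and the sample identifiers are made up.

```python
def filter_links(lines, min_score=700):
    """Keep interactions whose combined score is at or above min_score."""
    kept = []
    for line in lines:
        fields = line.split()
        # Skip malformed lines and the header row
        if len(fields) != 3 or fields[2] == "combined_score":
            continue
        score = int(fields[2])
        if score >= min_score:
            kept.append((fields[0], fields[1], score))
    return kept


# Made-up sample rows in the assumed file layout
sample_lines = [
    "protein1 protein2 combined_score",
    "9606.PROT_A 9606.PROT_B 905",
    "9606.PROT_A 9606.PROT_C 412",
    "9606.PROT_B 9606.PROT_D 700",
]
high_confidence = filter_links(sample_lines)
```

Because the function takes any iterable of lines, it can be fed an open `gzip` file handle without loading the whole network into memory.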

DepMap Portal (Cancer Dependency)

Gene essentiality data per cancer cell line
CRISPR screening results (1000+ cell lines)
CSV bulk download for local analysis
Combination: high dependency + low expression in normal cells = ideal target
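
The combination rule above can be sketched as a simple two-threshold filter. The example assumes a gene-effect score where more negative means stronger dependency, and an expression value per gene for normal tissue. The cutoffs, gene names, and numbers are all hypothetical placeholders, not DepMap values.

```python
def rank_targets(gene_effect, normal_expression,
                 effect_cutoff=-0.5, max_expression=5.0):
    """Return genes with strong dependency and low normal expression,
    sorted from strongest to weakest dependency."""
    candidates = []
    for gene, effect in gene_effect.items():
        expression = normal_expression.get(gene)
        if expression is None:
            continue  # no expression data, cannot judge
        if effect <= effect_cutoff and expression <= max_expression:
            candidates.append((gene, effect, expression))
    return sorted(candidates, key=lambda row: row[1])


# Hypothetical inputs for illustration only
gene_effect = {"GENE_A": -1.2, "GENE_B": -0.1, "GENE_C": -0.8, "GENE_D": -0.9}
normal_expression = {"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": 40.0, "GENE_D": 4.5}
ranked = rank_targets(gene_effect, normal_expression)
```

Here GENE_B drops out for weak dependency and GENE_C for high normal expression; real analyses would tune both cutoffs per lineage.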

💊 Virtual Screening & Docking (HGET-related)

GNINA (CNN-based docking)

Scoring is generally more accurate than Vina's
GPU acceleration (CUDA)

gnina -r protein.pdb -l ligand.sdf --autobox_ligand ligand.sdf \
      --cnn_scoring rescore --device 0 --exhaustiveness 8

DiffDock (Diffusion Model)

Recent deep learning docking method (2023)
Supports blind docking (no need to know the pocket)
GitHub: gcorso/DiffDock

python inference.py --protein_path protein.pdb \
                    --ligand ligand.sdf

Smina (Vina fork with custom scoring)

Vinardo scoring function (often more accurate than the default)
Custom scoring functions can be added

smina -r receptor.pdb -l ligand.mol2 --autobox_ligand ligand.mol2 \
      --scoring vinardo --exhaustiveness 32

LeDock (developed in China)

Free, with performance reported to be on par with Glide
Very fast (about 5x faster than Vina)
Windows/Linux versions

rDock

Open-source, with automatic cavity detection
Geared toward high-throughput screening

rbcavity -r protein.prm -was  # detect the cavity automatically
rbdock -r protein.prm -p dock.prm -i ligands.sd -o docked -n 100

P2Rank (Pocket Prediction)

ML-based binding site prediction
Also works on AlphaFold structures

prank predict -f protein.pdb

🧪 ADMET & Toxicity Prediction (OGET-related)

ADMETlab 2.0 CLI

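A Python API is available in addition to the web version
50+ ADMET properties

# Module and class names as given in these notes; confirm against the package you install
from admetlab import ADMETlab
predictor = ADMETlab()
results = predictor.predict_smiles("CCO")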

SwissADME API (unofficial)

Automatic checks for the Lipinski, Veber, and Egan rules

import requests
url = "http://www.swissadme.ch/predictions"
# SMILES batches can be sent via POST
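
To make the three rule sets concrete, here is a small offline sketch that checks them on precomputed descriptors. It uses the commonly cited thresholds (Lipinski: MW ≤ 500, logP ≤ 5, HBD ≤ 5, HBA ≤ 10; Veber: rotatable bonds ≤ 10, TPSA ≤ 140; Egan: logP ≤ 5.88, TPSA ≤ 131.6). It is not SwissADME's implementation, and the function name, dictionary keys, and descriptor values are illustrative.

```python
def rule_violations(props):
    """Count violations per rule set for one molecule's descriptors."""
    lipinski = sum([
        props["mw"] > 500,
        props["logp"] > 5,
        props["hbd"] > 5,
        props["hba"] > 10,
    ])
    veber = sum([
        props["rotb"] > 10,
        props["tpsa"] > 140,
    ])
    egan = sum([
        props["logp"] > 5.88,
        props["tpsa"] > 131.6,
    ])
    return {"lipinski": lipinski, "veber": veber, "egan": egan}


# Illustrative descriptor values (not computed here)
small = {"mw": 46.1, "logp": 0.0, "hbd": 1, "hba": 1, "tpsa": 20.2, "rotb": 0}
large = {"mw": 812.0, "logp": 6.3, "hbd": 7, "hba": 12, "tpsa": 190.0, "rotb": 14}

small_result = rule_violations(small)
large_result = rule_violations(large)
```

Returning counts rather than a single pass/fail flag leaves room for the usual "at most one Lipinski violation" convention.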

pkCSM API

Free REST API
30+ ADMET and toxicity endpoints

import requests
r = requests.post('http://biosig.unimelb.edu.au/pkcsm/prediction',
                  data={'smiles': 'CCO'})

Pred-hERG

Specialized in hERG toxicity prediction
QSAR model (reported accuracy 90%+)
GitHub: mol-net/pred-herg

OCHEM (Online CHEmical Modeling)

Train and deploy your own QSAR models
30+ pre-trained models
Covers ADMET, toxicity, and activity
Batch prediction through an API endpoint

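vNN-ADMET (variable Nearest Neighbor)

Based on the variable nearest neighbor (vNN) method, a similarity-based approach rather than a neural network
Models can be built from relatively small datasets

# The vnn_admet package and its API are as written in these notes and are unverified
from vnn_admet import VNNPredictor
model = VNNPredictor.load('herg_blocker')
predictions = model.predict_smiles(['CCO', 'c1ccccc1'])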

ADMET-AI

Built on Chemprop-RDKit graph neural network models
Trained on ADMET datasets from the Therapeutics Data Commons (TDC)
Installable with pip

pip install admet-ai

🧬 Peptide-Specific Tools (RGET-related)

PepFun (Peptide Function Prediction)

Predicts function from sequence alone
Anticancer, antimicrobial, antihypertensive, and more

CAMPR3 (Antimicrobial Peptides)

About 20,000 AMP sequences
SVM/RF-based activity prediction
Web API available

CPPsite 2.0 (Cell-Penetrating Peptides)

CPP database plus predictor
Sequence → CPP probability score

HELM Notation Tools

Less ambiguous than linear notation for peptides
Python library: helm-notation

# Package and class names as given in these notes; unverified
from helm_notation import HELMParser
helm = "PEPTIDE1{A.C.D.E}$$$$"
parser = HELMParser(helm)
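
To show what the notation encodes, the sketch below pulls polymer IDs and monomer lists out of the first section of a HELM string using only the standard library. It is a simplified illustration, not a full HELM parser: it ignores the connection, group, and annotation sections and does not handle inline SMILES monomers. `parse_simple_helm` is a hypothetical name.

```python
import re

# Polymer chunk such as PEPTIDE1{A.C.D.E}
_POLYMER = re.compile(r"^([A-Z]+\d+)\{(.*)\}$")


def parse_simple_helm(helm):
    """Map each polymer ID to its list of monomer symbols."""
    simple_polymers = helm.split("$")[0]   # sections are separated by '$'
    polymers = {}
    for chunk in simple_polymers.split("|"):   # polymers are separated by '|'
        match = _POLYMER.match(chunk)
        if match is None:
            raise ValueError(f"Unrecognized polymer chunk: {chunk!r}")
        polymer_id, body = match.groups()
        # Monomers are separated by '.'; multi-character ones come in brackets
        polymers[polymer_id] = [m.strip("[]") for m in body.split(".")]
    return polymers


single = parse_simple_helm("PEPTIDE1{A.C.D.E}$$$$")
double = parse_simple_helm("PEPTIDE1{A.[dA].C}|PEPTIDE2{G.G}$$$$")
```

The explicit polymer IDs and bracketed monomers are what make HELM less ambiguous than a plain one-letter sequence once non-natural residues or multiple chains are involved.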

PeptideBuilder (Python)

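Builds a 3D structure directly from a sequence
Backbone angles are customizable

from PeptideBuilder import Geometry
import PeptideBuilder
# make_extended_structure needs only the sequence; make_structure also takes phi/psi lists
structure = PeptideBuilder.make_extended_structure("ACDEFGHIKLMNPQRSTVWY")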

ESMFold (Meta AI)

About 100x faster than PepFold3
Comparable accuracy
API: https://api.esmatlas.com/

import requests
# The endpoint expects the raw sequence as the request body
r = requests.post('https://api.esmatlas.com/foldSequence/v1/pdb/',
                  data='MKFLILLFNI')

modlAMP

Peptide descriptor calculation
400+ physicochemical properties

from modlamp.descriptors import PeptideDescriptor
desc = PeptideDescriptor('GLFDIVKKVVGALGSL', 'eisenberg')
desc.calculate_global()  # results are stored in desc.descriptor

📊 Molecular Descriptors & Fingerprints

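mordred (1800+ descriptors)

Built on RDKit
2D (topological) and 3D descriptors

from rdkit import Chem
from mordred import Calculator, descriptors
mols = [Chem.MolFromSmiles(s) for s in ('CCO', 'c1ccccc1', 'CC(=O)O')]
calc = Calculator(descriptors, ignore_3D=True)  # use False only for molecules with 3D conformers
df = calc.pandas(mols)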

molfeat (Unified API)

One interface over many fingerprint types
Transformer embeddings are supported too

from molfeat.trans import MoleculeTransformer
transformer = MoleculeTransformer('maccs', n_jobs=-1)
fps = transformer.transform(['CCO', 'c1ccccc1'])

FPSim2 (Fast Similarity Search)

Very fast Tanimoto search (about a million compounds per second)
A GPU version is also available

from FPSim2 import FPSim2Engine
fp_engine = FPSim2Engine('chembl.h5')
results = fp_engine.similarity('CCO', 0.7, n_workers=4)

MHFP (MinHash Fingerprint)

Approximates Jaccard (Tanimoto) similarity with MinHash, which speeds up large-scale search
Based on molecular shingling

from mhfp.encoder import MHFPEncoder
encoder = MHFPEncoder()
fp = encoder.encode('CCO')
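
The idea behind a MinHash fingerprint can be sketched with the standard library alone. The example assumes each molecule has already been reduced to a set of string "shingles"; how MHFP derives them from substructures is not reproduced here, and the hashing scheme below (salted BLAKE2b) is a stand-in chosen for determinism, not the one the `mhfp` package uses.

```python
import hashlib


def minhash_signature(shingles, num_hashes=256):
    """For each salted hash function, keep the minimum hash over the set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Placeholder shingle sets: 20 shared out of 40 total, so true Jaccard is 0.5
set_a = {f"s{i}" for i in range(0, 30)}
set_b = {f"s{i}" for i in range(10, 40)}
set_c = {f"t{i}" for i in range(30)}

sig_a, sig_b, sig_c = (minhash_signature(s) for s in (set_a, set_b, set_c))
overlap_estimate = estimated_jaccard(sig_a, sig_b)
disjoint_estimate = estimated_jaccard(sig_a, sig_c)
```

Signatures have a fixed length regardless of set size, which is what makes them convenient for approximate nearest-neighbor indexes.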

🤖 AI/ML Libraries (Deep Learning)

TorchDrug

GNN for drug discovery
Many pre-trained models

from torchdrug import datasets, models
dataset = datasets.BACE("~/molecule-datasets/")  # download/cache directory
model = models.GIN(input_dim=dataset.node_feature_dim,
                   hidden_dims=[256, 256, 256])

DeepChem

More capable than you might expect
Includes MolGAN, MPNN, and GraphConv

import deepchem as dc
featurizer = dc.feat.ConvMolFeaturizer()  # the featurizer GraphConvModel expects
model = dc.models.GraphConvModel(n_tasks=1, mode='regression')

MOLFEAT + Hugging Face

Transformer embeddings
ChemBERTa, MolFormer, and others

from molfeat.trans.pretrained import PretrainedHFTransformer
transformer = PretrainedHFTransformer(kind='ChemBERTa-77M-MLM')
embeddings = transformer(['CCO'])

DGL-LifeSci

Deep Graph Library for life sciences
MPNN, AttentiveFP, GCN

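from dgllife.model import GCN
from dgllife.utils import smiles_to_bigraph, CanonicalAtomFeaturizer
# CanonicalAtomFeaturizer yields 74 features per atom, matching in_feats below
g = smiles_to_bigraph('CCO', node_featurizer=CanonicalAtomFeaturizer())
model = GCN(in_feats=74, hidden_feats=[64, 64])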

Chemprop

Message Passing Neural Network
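Well suited to ADMET prediction

import chemprop  # v1.x API; 'checkpoints' is a placeholder for a trained model directory
args = chemprop.args.PredictArgs().parse_args(
    ['--test_path', '/dev/null', '--preds_path', '/dev/null', '--checkpoint_dir', 'checkpoints'])
predictions = chemprop.train.make_predictions(args=args, smiles=[['CCO']])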

🔬 Molecular Dynamics & Simulation

OpenMM-ML

ML force fields (ANI, MACE)
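More accurate than classical MM force fields, at a higher computational cost

from openmmml import MLPotential
potential = MLPotential('ani2x')
system = potential.createSystem(topology)  # topology: an openmm.app.Topology you have loaded
# Build a Simulation from `system`, then set the starting coordinates
simulation.context.setPositions(positions)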

TorchMD-NET

Graph neural network potentials for MD
GPU acceleration (reported up to 100x faster)

torchmd-train --conf config.yaml

ProLIF (Protein-Ligand Interaction Fingerprints)

MD trajectory analysis
Automatic interaction extraction

from prolif import Fingerprint, Molecule
fp = Fingerprint(['Hydrophobic', 'HBDonor', 'HBAcceptor'])
fp.run(trajectory, ligand, protein)  # argument order: trajectory, ligand, protein
df = fp.to_dataframe()

PLUMED-NEST

Repository of enhanced sampling recipes
Metadynamics, umbrella sampling

GetContacts (Ligand Interactions)

MD trajectory → interaction analysis
Generates flare plots

get_dynamic_contacts.py --topology top.pdb \
                        --trajectory traj.dcd \
                        --itypes all \
                        --output contacts.tsv

📈 Visualization & Analysis

py3Dmol + NGLview

3D visualization inside Jupyter

import py3Dmol
view = py3Dmol.view(query='pdb:1ATP')
view.setStyle({'cartoon': {'color': 'spectrum'}})

ProLIF + Plotly

Interaction timeline

import prolif as plf
fp = plf.Fingerprint()
fp.run(...)
fp.plot_barcode()  # ProLIF 2.x barcode plot of interactions over frames

rdkit-utils (hidden gem)

MCS (Maximum Common Substructure), available in RDKit itself through rdFMCS

from rdkit.Chem import rdFMCS
mcs = rdFMCS.FindMCS([mol1, mol2, mol3],
                     timeout=10,
                     completeRingsOnly=True)

mols2grid

Molecule grid visualization (Jupyter)
interactive filtering

import mols2grid
mols2grid.display(df, smiles_col='SMILES',
                  subset=['Name', 'MW', 'LogP'])

RDKit Molecule Drawing Options (hidden feature)

from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
IPythonConsole.drawOptions.addAtomIndices = True
IPythonConsole.drawOptions.addStereoAnnotation = True

🔍 Pattern & Patent Analysis

SureChEMBL + RDKit Cartridge

Build a local database of patent chemical structures

CREATE INDEX molidx ON surechembl USING gist(mol);
SELECT * FROM surechembl WHERE mol @> 'c1ccccc1'::qmol;

PatCID Bulk Download

wget ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-Patent.gz

Google Patents Public Datasets (BigQuery)

Query patents with SQL
Filter by chemistry classification codes

SELECT publication_number, filing_date
FROM `patents-public-data.patents.publications`, UNNEST(cpc) AS c
WHERE filing_date BETWEEN 20230101 AND 20231231
AND c.code LIKE 'A61K%'

Espacenet OPS API

European Patent Office API
Chemical formula and sequence search

🗄️ Building Local Databases

PostgreSQL + RDKit Cartridge

CREATE EXTENSION rdkit;
CREATE TABLE molecules (
    id SERIAL PRIMARY KEY,
    smiles TEXT,
    mol MOL
);

CREATE INDEX molidx ON molecules USING gist(mol);

-- Substructure search
SELECT * FROM molecules WHERE mol @> 'c1ccccc1'::qmol;

-- Similarity search (Tanimoto on Morgan bit-vector fingerprints)
SELECT id, smiles,
       tanimoto_sml(morganbv_fp(mol), morganbv_fp('CCO'::mol)) AS similarity
FROM molecules
WHERE morganbv_fp(mol) % morganbv_fp('CCO'::mol)
ORDER BY similarity DESC
LIMIT 10;

MongoDB for Flexible Schema

from pymongo import MongoClient
client = MongoClient()
db = client.drugdiscovery

# Insert with flexible schema
db.compounds.insert_many([
    {"smiles": "CCO", "properties": {"MW": 46}, "assays": [...]},
    {"smiles": "c1ccccc1", "metadata": {...}}
])

# Index for fast queries
db.compounds.create_index([("properties.MW", 1)])

DuckDB for Analytics (very fast)

import duckdb
con = duckdb.connect('compounds.db')

# Parquet export (compresses well)
con.execute("COPY (SELECT * FROM chembl) TO 'chembl.parquet' (FORMAT PARQUET)")

# SQL aggregations are very fast
result = con.execute("""
    SELECT FLOOR(molecular_weight/100)*100 AS molecular_weight_bucket, COUNT(*) AS n
    FROM compounds
    GROUP BY molecular_weight_bucket
    ORDER BY molecular_weight_bucket
""").fetchdf()

🧰 Practical Workflow Combinations

Hit Discovery Pipeline

# 1. Download a fragment library from ZINC20
# 2. Filter with Lipinski's rule of five using RDKit
from rdkit import Chem
from rdkit.Chem import Descriptors

def lipinski_filter(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparsable SMILES
        return False
    mw = Descriptors.MolWt(mol)
    logp = Descriptors.MolLogP(mol)
    hbd = Descriptors.NumHDonors(mol)
    hba = Descriptors.NumHAcceptors(mol)
    return mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10

# 3. Similarity clustering with FPSim2
# 4. Docking with GNINA
# 5. Interaction analysis with ProLIF

ADMET Prediction Ensemble

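# NOTE: the admetlab and pkcsm packages and their predict() methods are as written
# in these notes and are unverified; both must return same-shaped numeric arrays
from admetlab import ADMETlab
from pkcsm import pkCSM
import numpy as np

def ensemble_admet(smiles):
    pred1 = ADMETlab().predict(smiles)
    pred2 = pkCSM().predict(smiles)
    # Ensemble averaging
    return np.mean([pred1, pred2], axis=0)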
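
Different predictors rarely report the same set of endpoints, so a per-endpoint average is often more practical than averaging whole arrays. The sketch below assumes each predictor's output has already been converted to a dictionary of endpoint name to numeric value. The endpoint names and numbers are made up, and `ensemble_mean` is a hypothetical helper.

```python
from collections import defaultdict


def ensemble_mean(predictions):
    """Average each endpoint over the predictors that report it."""
    collected = defaultdict(list)
    for prediction in predictions:
        for endpoint, value in prediction.items():
            if value is not None:
                collected[endpoint].append(value)
    return {endpoint: sum(values) / len(values)
            for endpoint, values in collected.items()}


# Made-up outputs from two hypothetical predictors
predictor_a = {"herg_probability": 0.25, "logS": -2.0, "bbb_probability": 0.5}
predictor_b = {"herg_probability": 0.75, "logS": -3.0}

consensus = ensemble_mean([predictor_a, predictor_b])
```

Endpoints reported by only one tool pass through unchanged, which keeps the result honest about how much agreement backs each number.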

Patent Freedom-to-Operate Check

# 1. Search for similar structures with SureChEMBL
# 2. Extract the shared scaffold with RDKit MCS
# 3. Check bioactivity through PatCID
# 4. Check patent status with Espacenet

🚀 Performance Optimization

Parallel Processing

from joblib import Parallel, delayed
from rdkit import Chem
from rdkit.Chem import Descriptors

def process_smiles(smi):
    mol = Chem.MolFromSmiles(smi)
    return Descriptors.MolWt(mol) if mol is not None else None

# Use all available cores
results = Parallel(n_jobs=-1)(
    delayed(process_smiles)(smi) for smi in smiles_list
)

Dask for Big Data

import dask.dataframe as dd

# Handles larger-than-memory datasets and runs in parallel (can be far faster than pandas on big files)
df = dd.read_csv('chembl_30.csv')
filtered = df[df['molecular_weight'] < 500].compute()

Cython for Speed

# similarity.pyx
import numpy as np
cimport numpy as np

def tanimoto_cython(np.ndarray[np.uint8_t, ndim=1] fp1,
                    np.ndarray[np.uint8_t, ndim=1] fp2):
    cdef int c = np.sum(fp1 & fp2)
    cdef int a = np.sum(fp1)
    cdef int b = np.sum(fp2)
    cdef int denom = a + b - c
    return c / float(denom) if denom else 0.0  # guard against two empty fingerprints
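
A pure-Python reference is handy for checking a compiled version against known answers. This sketch stores each fingerprint as a Python integer used as a bitset, which is an assumption of the example rather than the array layout of the Cython function above.

```python
def tanimoto(fp1, fp2):
    """Tanimoto coefficient for two fingerprints stored as integer bitsets."""
    common = bin(fp1 & fp2).count("1")                      # bits set in both
    total = bin(fp1).count("1") + bin(fp2).count("1") - common  # bits set in either
    return common / total if total else 0.0


def bits_to_int(on_bits):
    """Build an integer bitset from a collection of set bit positions."""
    value = 0
    for position in on_bits:
        value |= 1 << position
    return value


fp_a = bits_to_int([0, 1, 3])      # 0b1011
fp_b = bits_to_int([0, 1])         # 0b0011
similarity = tanimoto(fp_a, fp_b)  # 2 shared bits out of 3 distinct bits
```

Two empty fingerprints return 0.0 here by convention; some toolkits return 1.0 instead, so pick one and apply it in both implementations before comparing.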

📚 Additional Resources

Awesome Lists (GitHub)

awesome-cheminformatics
awesome-drug-discovery
awesome-molecular-generation

Papers with Code (Drug Discovery)

Latest papers with implementation code
Leaderboards

RDKit Cookbook (unofficial)

Greg Landrum's lesser-known examples
https://github.com/rdkit/rdkit/wiki