🧬 Molecular Databases & Search (HGET/TGET)
ZINC20 (2020 release)
- ZINC15 held about 200 million compounds; ZINC20 lists more than 1 billion purchasable compounds
- Bulk download with wget from the tranches/ directory
- Practical tip: downloading only the lead-like or fragment-like subsets is about 80% faster than pulling everything
- Docking-ready 3D conformers are also provided (MOL2 format)
wget -r -l1 -np -nd -A "*.mol2.gz" http://files.docking.org/catalogs/
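SureChEMBL vs PatCID
- SureChEMBL: about 100 million chemical structures extracted from patents (API: https://www.surechembl.org/api/)
- PatCID: PubChem's patent subset, cross-referenced by CID
- Combined approach: find structures in SureChEMBL, then map them to bioactivity data through PatCID
# SureChEMBL API example
import requests
schembl_id = "SCHEMBL1"  # placeholder; substitute a real SCHEMBL identifier
r = requests.get(f"https://www.surechembl.org/api/compounds/{schembl_id}")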
ChemSpider API (free, with lesser-known features)
- Search several databases at once by InChIKey (ChEMBL, PubChem, DrugBank)
- Real-time commercial availability check: commercialAvailability endpoint
# Lesser-known endpoint
GET /v2/compounds/{id}/details/suppliers
MolPort API
- Real-time price and stock checks across 800+ vendors
- Building block screening (synthesizability estimation)
- Free tier: 1,000 requests per month
COCONUT (Natural Products)
- About 400,000 natural products with SMILES and 3D structures
- Cleaner data than the older UNPD
- MongoDB dump available, so a local instance can be set up
# Query directly with pymongo
collection.find({"molecular_weight": {"$lt": 500}})
🎯 Target Discovery & Validation (TGET)
AlphaFold Protein Structure Database
- About 200 million predicted protein structures (nearly all of UniProt)
- Direct download: https://ftp.ebi.ac.uk/pub/databases/alphafold/
- Filter by confidence using the pLDDT score
# Keep only residues with pLDDT > 90 (AlphaFold stores pLDDT in the B-factor column)
from Bio.PDB import PDBParser
parser = PDBParser(QUIET=True)
structure = parser.get_structure('protein', 'AF-P12345.pdb')
high_conf_residues = [r for r in structure.get_residues()
                      if 'CA' in r and r['CA'].get_bfactor() > 90]
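The same pLDDT filter can be sketched without Biopython, since the score sits in the fixed-width B-factor field of each ATOM record. This is a minimal illustration only: `atom_line` and `high_confidence_residues` are hypothetical helper names, and the sample records are made up rather than taken from a real AlphaFold model.

```python
def atom_line(serial, name, resname, chain, resseq, x, y, z, bfactor):
    """Format one fixed-column PDB ATOM record (occupancy fixed at 1.00)."""
    return "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, name, resname, chain, resseq, x, y, z, 1.00, bfactor)


def high_confidence_residues(pdb_text, cutoff=90.0):
    """Return (chain, residue number) pairs whose CA atom has pLDDT > cutoff."""
    selected = []
    for line in pdb_text.splitlines():
        # PDB columns: atom name 13-16, chain 22, residue number 23-26, B-factor 61-66
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            plddt = float(line[60:66])
            if plddt > cutoff:
                selected.append((line[21], int(line[22:26])))
    return selected


# Illustrative records only (not from a real model)
SAMPLE = "\n".join([
    atom_line(1, " N  ", "MET", "A", 1, 10.000, 6.000, -7.000, 95.32),
    atom_line(2, " CA ", "MET", "A", 1, 11.104, 6.134, -6.504, 95.32),
    atom_line(9, " CA ", "LYS", "A", 2, 12.560, 7.210, -5.100, 62.10),
])

print(high_confidence_residues(SAMPLE))  # [('A', 1)]
```

Reading the column directly is handy for quick triage of many downloaded files; for anything beyond filtering, a real parser is the safer choice.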
OpenTargets Platform API
- Disease-target association scores (based on genetic evidence)
- Complex queries are possible through the GraphQL endpoint
{
  target(ensemblId: "ENSG00000139618") {
    associatedDiseases {
      rows {
        disease { name }
        score
        datatypeScores { id score }
      }
    }
  }
}
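To send a query like the one above from Python, it is wrapped in a JSON payload and POSTed to the GraphQL endpoint. The sketch below only builds the payload and filters a response; the endpoint URL is an assumption to check against the Open Targets documentation, `build_payload` and `top_diseases` are hypothetical helpers, and `MOCK_RESPONSE` is invented data shaped like the query result.

```python
import json

# Assumed public endpoint; verify against the Open Targets Platform docs
OPEN_TARGETS_URL = "https://api.platform.opentargets.org/api/v4/graphql"

QUERY = """
query targetDiseases($ensemblId: String!) {
  target(ensemblId: $ensemblId) {
    associatedDiseases {
      rows {
        disease { name }
        score
        datatypeScores { id score }
      }
    }
  }
}
"""


def build_payload(ensembl_id):
    """Build the JSON body for a GraphQL POST request."""
    return {"query": QUERY, "variables": {"ensemblId": ensembl_id}}


def top_diseases(response_json, min_score=0.5):
    """Return (disease name, score) pairs at or above min_score."""
    rows = response_json["data"]["target"]["associatedDiseases"]["rows"]
    return [(r["disease"]["name"], r["score"]) for r in rows
            if r["score"] >= min_score]


# Invented response with the same shape as the query result
MOCK_RESPONSE = {"data": {"target": {"associatedDiseases": {"rows": [
    {"disease": {"name": "disease A"}, "score": 0.8, "datatypeScores": []},
    {"disease": {"name": "disease B"}, "score": 0.2, "datatypeScores": []},
]}}}}

payload = build_payload("ENSG00000139618")
print(json.dumps(payload["variables"]))
print(top_diseases(MOCK_RESPONSE))
```

With the `requests` package installed, the payload would typically be sent as `requests.post(OPEN_TARGETS_URL, json=payload)`.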
Pharos / TCRD
- NIH database that classifies targets by development level
- Best place to find Tdark (unexplored) targets
- API: https://pharos.nih.gov/idg/api/v1/
- In practice: screen for targets with little competition
# Tdark & druggable targets
params = {'tdl': 'Tdark', 'facet': 'Target Development Level'}
STRING DB (Protein Interaction)
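- Bulk download of PPI networks
- Filtering at combined score > 0.7 is recommended
# Find proteins that interact with the target
import requests
r = requests.get('https://string-db.org/api/json/network',
                 params={'identifiers': 'TP53', 'species': 9606,
                         'required_score': 700})  # 700 corresponds to a combined score of 0.7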
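Once the network JSON is in hand, the 0.7 cutoff can also be applied client-side. The sketch below assumes each edge is a dict with `preferredName_A`, `preferredName_B` and `score` keys, as in STRING's JSON network output; `strong_partners` is a hypothetical helper and the sample edges carry made-up scores.

```python
def strong_partners(edges, query="TP53", min_score=0.7):
    """Return partners of `query` with combined score above min_score, best first."""
    partners = {}
    for edge in edges:
        if edge["score"] <= min_score:
            continue
        a, b = edge["preferredName_A"], edge["preferredName_B"]
        if query not in (a, b):
            continue  # edge between two neighbours, not involving the query
        other = b if a == query else a
        partners[other] = max(edge["score"], partners.get(other, 0.0))
    return sorted(partners.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative edges; scores are invented for the example
SAMPLE_EDGES = [
    {"preferredName_A": "TP53", "preferredName_B": "MDM2", "score": 0.999},
    {"preferredName_A": "EP300", "preferredName_B": "TP53", "score": 0.65},
    {"preferredName_A": "MDM2", "preferredName_B": "MDM4", "score": 0.9},
]

print(strong_partners(SAMPLE_EDGES))  # [('MDM2', 0.999)]
```

With a live response this would be called as `strong_partners(r.json())`.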
DepMap Portal (Cancer Dependency)
- Essential-gene data per cancer cell line
- CRISPR screening results (1,000+ cell lines)
- CSV bulk download for local analysis
- Combination: high dependency + low expression in normal cells = ideal target
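The "high dependency + low normal expression" rule can be sketched as a simple two-threshold filter. Everything here is an assumption for illustration: the gene names and values are invented, `rank_targets` is a hypothetical helper, and the cutoffs (gene effect at or below -0.5, normal-tissue expression at or below 5 TPM) are placeholders to tune. It relies on the DepMap convention that more negative gene-effect scores mean stronger dependency.

```python
def rank_targets(gene_effect, normal_tpm, effect_cutoff=-0.5, tpm_cutoff=5.0):
    """Genes with strong dependency and low normal expression, strongest first."""
    hits = [
        gene for gene, effect in gene_effect.items()
        if effect <= effect_cutoff
        # genes without expression data are treated as "not low" and skipped
        and normal_tpm.get(gene, float("inf")) <= tpm_cutoff
    ]
    return sorted(hits, key=lambda gene: gene_effect[gene])


# Invented values: mean CRISPR gene effect and normal-tissue expression (TPM)
GENE_EFFECT = {"GENE_A": -1.2, "GENE_B": -0.9, "GENE_C": -0.1, "GENE_D": -0.7}
NORMAL_TPM = {"GENE_A": 2.0, "GENE_B": 80.0, "GENE_C": 1.0, "GENE_D": 4.5}

print(rank_targets(GENE_EFFECT, NORMAL_TPM))  # ['GENE_A', 'GENE_D']
```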
💊 Virtual Screening & Docking (HGET)
GNINA (CNN-based docking)
- More accurate scoring than Vina
- GPU acceleration (CUDA)
gnina -r protein.pdb -l ligand.sdf --autobox_ligand ligand.sdf \
      --cnn_scoring rescore --device 0 --exhaustiveness 8
DiffDock (Diffusion Model)
- Recent deep-learning docking method (2023)
- Blind docking is possible (no need to know the pocket)
- GitHub: gcorso/DiffDock
python inference.py --protein_path protein.pdb \
       --ligand ligand.sdf
Smina (Vina fork with custom scoring)
- Vinardo scoring function (more accurate)
- Custom scoring functions can be added
smina -r receptor.pdb -l ligand.mol2 --autobox_ligand ligand.mol2 \
      --scoring vinardo --exhaustiveness 32
LeDock (developed in China)
- Free, with performance comparable to Glide
- Very fast (about 5x faster than Vina)
- Windows and Linux versions
rDock
- Open source, with automatic cavity detection
- Specialized for high-throughput screening
rbcavity -r protein.prm -was   # automatic cavity detection
rbdock -r protein.prm -p dock.prm -i ligands.sd -o docked -n 100
P2Rank (Pocket Prediction)
- ML-based binding site prediction
- Also works on AlphaFold structures
prank predict -f protein.pdb
🧪 ADMET & Toxicity Prediction (OGET)
ADMETlab 2.0 CLI
- Besides the web version, there is a Python API
- 50+ ADMET properties
from admetlab import ADMETlab
predictor = ADMETlab()
results = predictor.predict_smiles("CCO")
SwissADME API (unofficial)
- Automatically checks the Lipinski, Veber and Egan rules
import requests
url = "http://www.swissadme.ch/predictions"
# SMILES can be sent in batches via POST
pkCSM API
- Free REST API
- 30+ ADMET and toxicity endpoints
import requests
r = requests.post('http://biosig.unimelb.edu.au/pkcsm/prediction',
                  data={'smiles': 'CCO'})
Pred-hERG
- Dedicated to hERG toxicity prediction
- QSAR model (reported accuracy 90%+)
- GitHub: mol-net/pred-herg
OCHEM (Online CHEmical Modeling)
- Train and deploy your own QSAR models
- 30+ pre-trained models
- Covers ADMET, toxicity and activity
- Batch prediction through an API endpoint
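vNN-ADMET (variable Nearest Neighbor)
- Variable nearest neighbor (vNN) method, a similarity-based approach rather than a neural network
- Usable with relatively small training sets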
from vnn_admet import VNNPredictor
model = VNNPredictor.load('herg_blocker')
predictions = model.predict_smiles(['CCO', 'c1ccccc1'])
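ADMET-AI
- Developed by Swanson et al.; built on Chemprop graph neural networks
- Trained on the ADMET datasets from the Therapeutics Data Commons (TDC)
- Installable with pip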
pip install admet-ai
🧬 Peptide-Specific Tools (RGET)
PepFun (Peptide Function Prediction)
- Predicts function from sequence alone
- Anticancer, antimicrobial, antihypertensive and more
CAMPR3 (Antimicrobial Peptides)
- About 20,000 AMP sequences
- SVM/RF-based activity prediction
- Web API available
CPPsite 2.0 (Cell-Penetrating Peptides)
- CPP database plus predictor
- Sequence → CPP probability score
HELM Notation Tools
- Less ambiguous peptide representation than plain linear notation
- Python library: helm-notation
from helm_notation import HELMParser
helm = "PEPTIDE1{A.C.D.E}$$$$"
parser = HELMParser(helm)
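To make the notation concrete, the simple-polymer section of a HELM string (everything before the first `$`) can be split by hand. This is a rough sketch for plain linear polymers only, not a HELM-compliant parser: `parse_simple_helm` is a hypothetical helper, and it ignores connections, groups, annotations and bracketed monomers that themselves contain `$` or `.`.

```python
import re


def parse_simple_helm(helm):
    """Map polymer IDs to monomer lists for the simple-polymer section only."""
    polymer_section = helm.split("$", 1)[0]
    polymers = {}
    # Each polymer looks like ID{m1.m2.m3}; several polymers are joined with "|"
    for polymer_id, body in re.findall(r"([A-Z]+\d+)\{([^}]*)\}", polymer_section):
        polymers[polymer_id] = body.split(".")
    return polymers


print(parse_simple_helm("PEPTIDE1{A.C.D.E}$$$$"))
# {'PEPTIDE1': ['A', 'C', 'D', 'E']}
```

For cyclic peptides or conjugates, the connection section after the first `$` carries the information that this sketch throws away.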
PeptideBuilder (Python)
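- Sequence → automatic 3D structure generation
- Backbone angles can be customized
from PeptideBuilder import Geometry
import PeptideBuilder
# make_structure() also needs phi/psi lists; make_extended_structure() takes only the sequence
structure = PeptideBuilder.make_extended_structure("ACDEFGHIKLMNPQRSTVWY")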
ESMFold (Meta AI)
- About 100x faster than PepFold3
- Comparable accuracy
- API: https://api.esmatlas.com/
import requests
# The endpoint expects the raw sequence as the request body
r = requests.post('https://api.esmatlas.com/foldSequence/v1/pdb/',
                  data='MKFLILLFNI')
modlAMP
- Peptide descriptor calculation
- 400+ physicochemical properties
from modlamp.descriptors import PeptideDescriptor
desc = PeptideDescriptor('GLFDIVKKVVGALGSL', 'eisenberg')
desc.calculate_global()
📊 Molecular Descriptors & Fingerprints
mordred (1,800+ descriptors)
- Built on RDKit
- 2D (topological) and 3D descriptors
from mordred import Calculator, descriptors
calc = Calculator(descriptors, ignore_3D=False)
df = calc.pandas([mol1, mol2, mol3])  # RDKit Mol objects; 3D descriptors need conformers
molfeat (Unified API)
- One interface for all common fingerprints
- Transformer embeddings are supported too
from molfeat.trans import MoleculeTransformer
transformer = MoleculeTransformer('maccs', n_jobs=-1)
fps = transformer.transform(['CCO', 'c1ccccc1'])
FPSim2 (Fast Similarity Search)
- Very fast Tanimoto search (about one million compounds per second)
- A GPU version is also available
from FPSim2 import FPSim2Engine
fp_engine = FPSim2Engine('chembl.h5')
results = fp_engine.similarity('CCO', 0.7, n_workers=4)
MHFP (MinHash Fingerprint)
- MinHash signatures allow fast approximate Jaccard (Tanimoto) similarity estimation
- Shingling-based
from mhfp.encoder import MHFPEncoder
encoder = MHFPEncoder()
fp = encoder.encode('CCO')
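The idea behind MinHash can be shown in a few lines: for each of several seeded hash functions, keep the minimum hash over the set's items, and estimate Jaccard similarity as the fraction of positions where two signatures agree. This is a generic sketch, not the MHFP implementation; the "shingles" below are arbitrary string tokens standing in for the molecular shingles MHFP derives from a structure, and both helper names are hypothetical.

```python
import hashlib


def minhash_signature(items, num_perm=64):
    """MinHash signature: per seed, the minimum 64-bit hash over all items."""
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(),
                "big")
            for item in items))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Stand-in shingle sets (illustrative tokens, not real MHFP shingles)
shingles_1 = {"CC", "CO", "CCO"}
shingles_2 = {"CC", "CO", "CCN"}

sig_1 = minhash_signature(shingles_1)
sig_2 = minhash_signature(shingles_2)
print(estimated_jaccard(sig_1, sig_1))  # 1.0 for identical sets
print(estimated_jaccard(sig_1, sig_2))  # approximates the true Jaccard of 0.5
```

Because signatures have a fixed length, they can be bucketed with locality-sensitive hashing, which is where the speed advantage over exhaustive pairwise comparison comes from.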
🤖 AI/ML Libraries (Deep Learning)
TorchDrug
- GNN for drug discovery
- Many pre-trained models
from torchdrug import datasets, models
dataset = datasets.BACE("~/molecule-datasets/")  # download/cache directory
model = models.GIN(input_dim=dataset.node_feature_dim,
                   hidden_dims=[256, 256, 256])
DeepChem
- More capable than it is often given credit for
- Includes MolGAN, MPNN and GraphConv
import deepchem as dc
featurizer = dc.feat.ConvMolFeaturizer()  # GraphConvModel expects ConvMol features
model = dc.models.GraphConvModel(n_tasks=1, mode='regression')
MOLFEAT + Hugging Face
- Transformer embeddings
- ChemBERTa, MolFormer and others
from molfeat.trans.pretrained import PretrainedHFTransformer
transformer = PretrainedHFTransformer(kind='ChemBERTa-77M-MLM')
embeddings = transformer(['CCO'])
DGL-LifeSci
- Deep Graph Library for life sciences
- MPNN, AttentiveFP, GCN
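from dgllife.model import GCN
from dgllife.utils import smiles_to_bigraph, CanonicalAtomFeaturizer
g = smiles_to_bigraph('CCO', node_featurizer=CanonicalAtomFeaturizer())
model = GCN(in_feats=74, hidden_feats=[64, 64])  # 74 = CanonicalAtomFeaturizer feature size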
Chemprop
- Message Passing Neural Network
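- Well suited to ADMET prediction
import chemprop  # Chemprop v1 Python API; 'model_dir' is a placeholder checkpoint directory
args = chemprop.args.PredictArgs().parse_args(
    ['--test_path', '/dev/null', '--preds_path', '/dev/null', '--checkpoint_dir', 'model_dir'])
predictions = chemprop.train.make_predictions(args=args, smiles=[['CCO']])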
🔬 Molecular Dynamics & Simulation
OpenMM-ML
- ML force fields (ANI, MACE)
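- Closer to QM accuracy than classical MM force fields, at far lower cost than QM
from openmmml import MLPotential
potential = MLPotential('ani2x')
system = potential.createSystem(topology)  # topology: an openmm.app.Topology
# build a Simulation from `system`, then set the coordinates
simulation.context.setPositions(positions)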
TorchMD-NET
- Graph neural network MD
- GPU acceleration (up to about 100x faster)
torchmd-train --conf config.yaml
ProLIF (Protein-Ligand Interaction Fingerprints)
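- MD trajectory analysis
- Automatic interaction extraction
import MDAnalysis as mda
from prolif import Fingerprint
u = mda.Universe('top.pdb', 'traj.dcd')  # placeholder file names
ligand = u.select_atoms('resname LIG')   # placeholder ligand selection
protein = u.select_atoms('protein')
fp = Fingerprint(['Hydrophobic', 'HBDonor', 'HBAcceptor'])
fp.run(u.trajectory, ligand, protein)    # ligand selection first, then protein
df = fp.to_dataframe()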
PLUMED-NEST
- Repository of enhanced sampling recipes
- Metadynamics, umbrella sampling
GetContacts (Ligand Interactions)
- MD trajectory → interaction analysis
- Flare plot generation
get_dynamic_contacts.py --topology top.pdb \
                        --trajectory traj.dcd \
                        --itypes all \
                        --output contacts.tsv
📈 Visualization & Analysis
py3Dmol + NGLview
import py3Dmol
view = py3Dmol.view(query='pdb:1ATP')
view.setStyle({'cartoon': {'color': 'spectrum'}})
ProLIF + Plotly
import prolif as plf
fp = plf.Fingerprint()
fp.run(...)
fp.plot_barcode()  # barcode plot of interactions over frames (ProLIF 2.x)
rdkit-utils (hidden gem)
- MCS (Maximum Common Substructure)
from rdkit.Chem import rdFMCS
mcs = rdFMCS.FindMCS([mol1, mol2, mol3],
                     timeout=10,
                     completeRingsOnly=True)
mols2grid
- Molecule grid visualization (Jupyter)
- Interactive filtering
import mols2grid
mols2grid.display(df, smiles_col='SMILES',
                  subset=['Name', 'MW', 'LogP'])
RDKit Molecule Drawing Options (lesser-known feature)
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
IPythonConsole.drawOptions.addAtomIndices = True
IPythonConsole.drawOptions.addStereoAnnotation = True
🔍 Pattern & Patent Analysis
SureChEMBL + RDKit Cartridge
CREATE INDEX molidx ON surechembl USING gist(mol);
SELECT * FROM surechembl WHERE mol @> 'c1ccccc1'::qmol;
PatCID Bulk Download
wget ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-Patent.gz
Google Patents Public Datasets (BigQuery)
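-- cpc is a repeated field in this table, so it is unnested before filtering on the code
SELECT p.publication_number
FROM `patents-public-data.patents.publications` AS p, UNNEST(p.cpc) AS c
WHERE p.application_number LIKE '%/2023%'
AND c.code LIKE 'A61K%'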
Espacenet OPS API
🗄️ Building Local Databases
PostgreSQL + RDKit Cartridge
CREATE EXTENSION rdkit;
CREATE TABLE molecules (
id SERIAL PRIMARY KEY,
smiles TEXT,
mol MOL
);
CREATE INDEX molidx ON molecules USING gist(mol);
-- Substructure search
SELECT * FROM molecules WHERE mol @> 'c1ccccc1'::qmol;
-- Similarity search
SELECT id, smiles, tanimoto_sml(mol, 'CCO'::mol) AS similarity
FROM molecules
WHERE mol % 'CCO'::mol
ORDER BY similarity DESC
LIMIT 10;
MongoDB for Flexible Schema
from pymongo import MongoClient
client = MongoClient()
db = client.drugdiscovery
# Insert with flexible schema
db.compounds.insert_many([
    {"smiles": "CCO", "properties": {"MW": 46}, "assays": [...]},
    {"smiles": "c1ccccc1", "metadata": {...}}
])
# Index for fast queries
db.compounds.create_index([("properties.MW", 1)])
DuckDB for Analytics (very fast)
import duckdb
con = duckdb.connect('compounds.db')
# Parquet export (compresses well)
con.execute("COPY (SELECT * FROM chembl) TO 'chembl.parquet' (FORMAT PARQUET)")
# SQL aggregations are very fast
result = con.execute("""
    SELECT FLOOR(molecular_weight/100)*100 AS molecular_weight_bucket, COUNT(*)
    FROM compounds
    GROUP BY molecular_weight_bucket
""").fetchdf()
🧰 Practical Workflow Combinations
Hit Discovery Pipeline
# 1. Download a fragment library from ZINC20
# 2. Lipinski filtering with RDKit
from rdkit import Chem
from rdkit.Chem import Descriptors
def lipinski_filter(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparsable SMILES
        return False
    mw = Descriptors.MolWt(mol)
    logp = Descriptors.MolLogP(mol)
    hbd = Descriptors.NumHDonors(mol)
    hba = Descriptors.NumHAcceptors(mol)
    return mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10
# 3. Similarity clustering with FPSim2
# 4. Docking with GNINA
# 5. Interaction analysis with ProLIF
ADMET Prediction Ensemble
from admetlab import ADMETlab
from pkcsm import pkCSM
import numpy as np
def ensemble_admet(smiles):
    pred1 = ADMETlab().predict(smiles)
    pred2 = pkCSM().predict(smiles)
    # Ensemble averaging; only valid if both predictors return
    # numeric arrays with the same endpoints in the same order
    return np.mean([pred1, pred2], axis=0)
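Different predictors rarely report the same endpoints in the same order, so averaging is safer when keyed by endpoint name. The sketch below assumes each predictor's output has already been converted to a dict of endpoint name to numeric value; `ensemble_average` is a hypothetical helper and the numbers are invented.

```python
from statistics import mean


def ensemble_average(*predictions):
    """Average per-endpoint values over the endpoints shared by all predictors."""
    if not predictions:
        return {}
    shared = set(predictions[0]).intersection(*predictions[1:])
    return {name: mean(p[name] for p in predictions) for name in sorted(shared)}


# Invented outputs from two predictors for the same molecule
pred_a = {"logS": -1.0, "hERG_prob": 0.25}
pred_b = {"logS": -2.0, "hERG_prob": 0.75, "BBB_prob": 0.9}

print(ensemble_average(pred_a, pred_b))  # {'hERG_prob': 0.5, 'logS': -1.5}
```

Endpoints reported by only one tool (here `BBB_prob`) are dropped rather than passed through, so every value in the result is a genuine average.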
Patent Freedom-to-Operate Check
# 1. Search for similar structures in SureChEMBL
# 2. Extract the common scaffold with RDKit MCS
# 3. Check bioactivity through PatCID
# 4. Check patent status in Espacenet
🚀 Performance Optimization
Parallel Processing
from joblib import Parallel, delayed
from rdkit import Chem
from rdkit.Chem import Descriptors
def process_smiles(smi):
    mol = Chem.MolFromSmiles(smi)
    return Descriptors.MolWt(mol) if mol is not None else None
# Use all available cores
results = Parallel(n_jobs=-1)(
    delayed(process_smiles)(smi) for smi in smiles_list
)
Dask for Big Data
import dask.dataframe as dd
# Reads the file in parallel partitions; helps most when the dataset does not fit in memory
df = dd.read_csv('chembl_30.csv')
filtered = df[df['molecular_weight'] < 500].compute()
Cython for Speed
# similarity.pyx (expects one 0/1 value per array element)
import numpy as np
cimport numpy as np
def tanimoto_cython(np.ndarray[np.uint8_t, ndim=1] fp1,
                    np.ndarray[np.uint8_t, ndim=1] fp2):
    cdef int c = np.sum(fp1 & fp2)
    cdef int a = np.sum(fp1)
    cdef int b = np.sum(fp2)
    if a + b - c == 0:  # both fingerprints empty
        return 0.0
    return c / float(a + b - c)
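A pure-Python reference is useful for checking a compiled version against known values. The sketch below computes Tanimoto as |A ∩ B| / |A ∪ B| over equal-length 0/1 sequences; `tanimoto` is a hypothetical helper name, and returning 0.0 for two empty fingerprints is a convention chosen here, not a standard.

```python
def tanimoto(fp1, fp2):
    """Tanimoto coefficient for two equal-length sequences of 0/1 values."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(fp1, fp2) if x and y)    # bits set in both
    either = sum(1 for x, y in zip(fp1, fp2) if x or y)   # bits set in at least one
    return both / either if either else 0.0


print(tanimoto([1, 1, 0, 0], [1, 0, 1, 0]))  # one shared bit out of three set: 1/3
print(tanimoto([1, 0, 1], [1, 0, 1]))        # 1.0
```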
📚 Additional Resources
Awesome Lists (GitHub)
awesome-cheminformatics
awesome-drug-discovery
awesome-molecular-generation
Papers with Code (Drug Discovery)
- Latest papers with implementation code
- Leaderboards
RDKit Cookbook (unofficial)
- Greg Landrum's lesser-known examples
- https://github.com/rdkit/rdkit/wiki