# PredictStructure CLI Reference

The `predict-structure` command-line tool is the workhorse beneath the BV-BRC Protein Structure Prediction Service. It exposes the same five engines (Boltz-2, OpenFold 3, Chai-1, AlphaFold 2, ESMFold) through a single Click-based interface with shared global options and per-tool subcommands.

You can run it three ways:

1. **Directly** on a workstation with the tools installed (or via Docker / Apptainer / Singularity containers).
2. **As CWL workflow** — every subcommand has a matching `cwl/tools/<tool>.cwl` definition.
3. **Through BV-BRC AppService** — the Perl wrapper `service-scripts/App-PredictStructure.pl` builds and executes the same CLI under the hood.

> **Source:** [CEPI-dxkb/PredictStructureApp](https://github.com/CEPI-dxkb/PredictStructureApp) · `predict_structure/cli.py`

## Installation

### Prerequisites

- Python 3.10+ (3.12 recommended)
- conda or miniconda
- One or more of: Docker / Apptainer / Singularity (if you want to use the per-tool containers instead of native installs)
- A GPU (any tool other than ESMFold; ESMFold has a CPU-capable path)

### Quick start

```bash
conda create -n predict-structure python=3.12 -y
conda activate predict-structure

git clone https://github.com/CEPI-dxkb/PredictStructureApp.git
cd PredictStructureApp
pip install -e ".[all]"

predict-structure --version
predict-structure --help
```

### Optional dependency groups

| Group | Adds | Install |
|---|---|---|
| `chai` | PyArrow (A3M → Parquet MSA conversion) | `pip install -e ".[chai]"` |
| `esmfold` | PyTorch, Transformers, Accelerate | `pip install -e ".[esmfold]"` |
| `cwl` | cwltool | `pip install -e ".[cwl]"` |
| `dev` | pytest, black, ruff, mypy | `pip install -e ".[dev]"` |
| `all` | Everything above | `pip install -e ".[all]"` |

The prediction tools themselves (`boltz`, `chai-lab`, `alphafold`, etc.) are installed separately or run inside their Docker images.

## Command structure

```
predict-structure [GLOBAL_FLAGS] <TOOL> [TOOL_FLAGS] --protein <FASTA> -o <DIR>
predict-structure --job jobs.yaml -o <DIR>
```

Where `<TOOL>` is one of `auto`, `boltz`, `openfold`, `chai`, `alphafold`, or `esmfold`. The CLI uses `click.group()`, so `predict-structure <tool> --help` shows the per-tool flag set.

## Entity flags (inputs)

Inputs are specified with explicit entity flags. Every flag is **repeatable** to build multi-entity complexes.

| Flag | Type | Description |
|---|---|---|
| `--protein` | file path | Protein FASTA (a single multi-record FASTA is treated as one multi-chain complex, not separate jobs). Boltz YAML manifests pass through automatically. |
| `--dna` | file path | DNA FASTA |
| `--rna` | file path | RNA FASTA |
| `--ligand` | string | Ligand CCD code (e.g. `ATP`). Glycans use CCD codes too (`NAG`, `MAN`). |
| `--smiles` | string | SMILES string for arbitrary small molecules |

## Global options (every subcommand)

| Flag | Type | Default | Description |
|---|---|---|---|
| `-o`, `--output-dir` | path | **required** | Output directory |
| `-n`, `--num-samples` | int | 1 | Number of structure samples (diffusion samples for Boltz/OpenFold/Chai) |
| `--num-recycles` | int | 3 | Recycling iterations |
| `--seed` | int | (none) | Random seed |
| `--msa` | path | (none) | Pre-computed MSA (`.a3m`, `.sto`, `.pqt`) |
| `--output-format` | enum | `pdb` | `pdb` or `mmcif` |
| `--verbose` | flag | off | Verbose logging |
| `--debug` | flag | off | Print the command instead of executing it |

## Execution options

| Flag | Type | Default | Description |
|---|---|---|---|
| `--backend` | enum | `subprocess` | `subprocess`, `docker`, `apptainer`, or `cwl` |
| `--device` | enum | `gpu` | `gpu` or `cpu` (CPU path is meaningful only for ESMFold) |
| `--image` | string | (per-tool) | Override Docker image (docker / apptainer backends) |
| `--cwl-runner` | string | `cwltool` | CWL runner command (cwl backend only) |
| `--cwl-tool` | path | (auto) | CWL tool definition path (cwl backend only) |

## Tool-specific options

### `boltz` — Boltz-2

```bash
predict-structure boltz --protein input.fasta -o output/ \
  --num-samples 5 --sampling-steps 200 --use-potentials
```

| Flag | Type | Default | Description |
|---|---|---|---|
| `--sampling-steps` | int | 200 | Diffusion sampling steps |
| `--use-msa-server` | flag | off | Use the ColabFold MSA server (MMseqs2 against UniRef + ColabFoldDB) instead of requiring a pre-computed MSA file |
| `--msa-server-url` | string | (none) | Custom MSA server URL (implies `--use-msa-server`) |
| `--use-potentials` | flag | off | Enable potential terms (steered diffusion) |

### `openfold` — OpenFold 3

| Flag | Type | Default | Description |
|---|---|---|---|
| `--num-diffusion-samples` | int | 5 | Diffusion samples per query |
| `--num-model-seeds` | int | 1 | Independent model seeds |
| `--use-msa-server / --no-msa-server` | flag | on | ColabFold MSA server |
| `--use-templates / --no-templates` | flag | on | Use template structures |
| `--checkpoint` | string | (latest) | Model checkpoint name |

### `chai` — Chai-1

```bash
predict-structure chai --protein input.fasta -o output/ \
  --num-samples 5 --use-msa-server --no-esm-embeddings
```

| Flag | Type | Default | Description |
|---|---|---|---|
| `--sampling-steps` | int | 200 | Diffusion timesteps |
| `--use-msa-server` | flag | off | Use remote MSA server |
| `--msa-server-url` | string | (none) | Custom MSA server URL |
| `--no-esm-embeddings` | flag | off | Disable ESM2 language-model embeddings |
| `--use-templates-server` | flag | off | Use PDB template server |
| `--constraint-path` | path | (none) | Constraint JSON file |
| `--template-hits-path` | path | (none) | Pre-computed template hits |
| `--num-trunk-samples` | int | 1 | Independent trunk forward passes |
| `--recycle-msa-subsample` | int | 0 | MSA subsample per recycle (0 = all) |
| `--no-low-memory` | flag | off | Disable low-memory mode |

### `alphafold` — AlphaFold 2

```bash
predict-structure alphafold --protein input.fasta -o output/ \
  --af2-data-dir /databases --af2-model-preset monomer
```

| Flag | Type | Default | Description |
|---|---|---|---|
| `--af2-data-dir` | path | **required** | AlphaFold database directory (~2 TB) |
| `--af2-model-preset` | string | `monomer` | `monomer`, `monomer_casp14`, or `multimer` |
| `--af2-db-preset` | string | `reduced_dbs` | `reduced_dbs` or `full_dbs` |
| `--af2-max-template-date` | YYYY-MM-DD | `2022-01-01` | Maximum template date |

### `esmfold` — ESMFold

```bash
predict-structure esmfold --protein input.fasta -o output/ --fp16 --device cpu
```

| Flag | Type | Default | Description |
|---|---|---|---|
| `--fp16` | flag | off | Half-precision inference (faster, lower memory) |
| `--chunk-size` | int | (none) | Chunk size for long sequences |
| `--max-tokens-per-batch` | int | (none) | Max tokens per batch |

## Auto subcommand

`predict-structure auto --protein input.fasta -o output/` runs the auto-selector. The selection algorithm:

```
if device == cpu and only protein:
    → ESMFold
for tool in [boltz, openfold, chai, esmfold, alphafold]:
    if tool in {alphafold, esmfold} and any non-protein entity:
        skip
    if tool in {boltz, openfold, chai} and protein and no MSA:
        skip   # diffusion tools need real MSA; dummy single-sequence MSA produces unusable output
    if tool == alphafold and AF database dir missing:
        skip
    if tool not installed:
        skip
    return tool
raise: no prediction tool found
```

## Batch jobs (`--job`)

The `--job` flag runs multiple independent predictions from a YAML manifest. It is **mutually exclusive** with the subcommands — you cannot combine `--job` with `boltz`, `chai`, etc.

```bash
predict-structure --job jobs.yaml -o output/
```

Each job lands in `output/job_000/`, `output/job_001/`, … Job manifest schema:

```yaml
- protein: [/path/to/protein1.fasta]
  options:
    num_samples: 5
    device: gpu

- protein: [/path/to/protein2.fasta]
  ligands: [ATP]
  tool: boltz
  options:
    num_samples: 3
    use_potentials: true

- protein: [/path/to/protein3.fasta]
  dna: [/path/to/dna.fasta]
  tool: chai
  options:
    sampling_steps: 100
```

| Key | Type | Description |
|---|---|---|
| `protein` | list of paths | Protein FASTA files |
| `dna` | list of paths | DNA FASTA files |
| `rna` | list of paths | RNA FASTA files |
| `ligands` | list of strings | Ligand CCD codes |
| `smiles` | list of strings | SMILES strings |
| `tool` | string | Tool name (optional — auto-selects if omitted) |
| `options` | dict | Any shared or tool-specific option |

## Parameter mapping (shared → native)

The unified flags are mapped to each tool's native option name internally:

| Shared flag | Boltz-2 | OpenFold 3 | Chai-1 | AlphaFold 2 | ESMFold |
|---|---|---|---|---|---|
| `--output-dir` | `--out_dir` | `--output_dir` | positional | `--output_dir` | `-o` |
| `--num-samples` | `--diffusion_samples` | `--num_diffusion_samples` | `--num-diffn-samples` | (N/A) | (N/A) |
| `--num-recycles` | `--recycling_steps` | `--num_recycles` | `--num-trunk-recycles` | implicit | `--num-recycles` |
| `--seed` | (N/A) | `--seed` | `--seed` | `--random_seed` | (N/A) |
| `--device` | `--accelerator` | `--device` | `--device` | implicit | `--cpu-only` |
| `--msa` | injected into Boltz YAML `msa:` | JSON `main_msa_file_paths` | A3M → Parquet converted | (uses local DBs) | ignored |

## Examples

```bash
# Protein structure prediction with Boltz-2
predict-structure boltz --protein input.fasta -o output/

# Protein with ESMFold (CPU-capable, FP16)
predict-structure esmfold --protein input.fasta -o output/ --fp16

# Chai-1 with pre-computed MSA
predict-structure chai --protein input.fasta -o output/ --msa alignment.a3m

# AlphaFold 2 with local databases
predict-structure alphafold --protein input.fasta -o output/ \
  --af2-data-dir /databases

# Auto: pick the best available tool
predict-structure auto --protein input.fasta -o output/

# Multi-entity protein–DNA complex
predict-structure boltz --protein protein.fasta --dna dna.fasta -o output/

# Protein–ligand with CCD code
predict-structure boltz --protein protein.fasta --ligand ATP -o output/

# Protein with SMILES ligand
predict-structure boltz --protein protein.fasta --smiles "CCO" -o output/

# Multi-chain protein with Chai
predict-structure chai --protein chainA.fasta --protein chainB.fasta \
  --ligand ATP -o output/

# Dry-run — print the underlying command without executing
predict-structure boltz --protein input.fasta -o output/ --debug
```

## Exit codes

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Generic failure (see logs) |
| 2 | Usage / argument error (Click) |
| 3 | Input validation error (missing required entity, conflicting flags) |
| 4 | Tool / dependency not found |
| 5 | Runtime error inside the underlying engine |
| 124 | Time-out (killed by external scheduler) |

## Logging and debugging

- `--verbose` enables `INFO`-level logging from the CLI and adapters.
- `--debug` is a **dry-run**: prints the full underlying command line and exits 0. Use it to verify parameter mapping before submitting a long job.
- Setting `P3_DEBUG=1` and `P3_LOG_LEVEL=DEBUG` in the environment turns on the Perl AppService trace path (mirrors the **Debug Mode** checkbox in the BV-BRC form).
- Every run writes `metadata/runtime.json` with the resolved command, environment, peak memory, and wall time — the first thing to look at when a job behaves strangely.

## CWL workflows

The CLI is mirrored by a set of CWL tool definitions for use in pipeline runners:

```
cwl/tools/
  predict-structure-app.cwl   # Entry point
  predict-structure.cwl       # Unified CLI wrapper
  boltz.cwl
  chai.cwl
  openfold.cwl
  alphafold.cwl
  esmfold.cwl
  ...
cwl/workflows/
  protein-structure-prediction.cwl
  multi-tool-comparison.cwl   # Run all engines side-by-side
  boltz-report.cwl
  alphafold-report.cwl
  ...
```

Run a workflow with cwltool or GoWe:

```bash
cwltool cwl/workflows/protein-structure-prediction.cwl cwl/jobs/test-predict-alphafold.yml
```

See `docs/CWL_WORKFLOWS.md` in the source repository for the full CWL reference.

## See also

- [Quick Reference](/quick_references/services/predict_structure_service)
- [API Reference](/quick_references/services/predict_structure_api)
- [Folding Tools Comparison](https://github.com/wilke/ProteinStructurePrediction/blob/main/docs/folding_tools_comparison.md)
- [Protein Folding Primer](https://github.com/wilke/ProteinStructurePrediction/blob/main/docs/protein_folding_primer.md)
- Source: [CEPI-dxkb/PredictStructureApp](https://github.com/CEPI-dxkb/PredictStructureApp)