AstroData Onboarding Guide¶

Welcome to the team! This guide will take you from “I just cloned the repo” to “I generated my first project and ran it” in under 30 minutes, without any side-channel help.

1. What is cookiecutter-astrodata-squeleton?¶

The AstroData Squeleton is a TEMPLATE GENERATOR, not a project itself. Its purpose is to provide you with a professional, production-ready starting point for data science applications. Think of it as a “blueprint”: you run the skeleton once to generate a new project folder (the “generated project”), and then you spend 100% of your time working inside that new folder.

This repository contains two independent templates:

data-science/: Use this for 95% of your work. It creates a full environment with a FastAPI backend, a Streamlit dashboard, and a robust data science module.
sphinx-docs/: Use this ONLY if you need a separate, standalone documentation website without any data science or API code.

2. Prerequisites¶

Before you start, you must have these tools installed on your computer.

Git: The version control system used to track your code changes. Install here.
uv (Min version 0.1.0): An ultra-fast Python package and project manager written in Rust. It replaces pip, virtualenv, and pyenv in one tool. This is your primary tool.
- Install: $ curl -LsSf https://astral.sh/uv/install.sh | sh
Docker & Docker Compose (Optional): Used to run your project in isolated “containers” that match production perfectly. Install Desktop here.
Cookiecutter: You do not need to install this separately. It will be invoked automatically through uv.

3. Generating a new Data Science project (step by step)¶

Follow these 6 steps to create my_project from scratch:

Step 1: Run the skeleton command

$ uv run --with cookiecutter cookiecutter git@github.com:malganis35/cookiecutter-astrodata-squeleton.git --directory="data-science"

Why: This downloads the latest template and starts the interactive setup without installing anything permanently on your system.

Step 2: Answer the interactive prompts Use these values for your first run:

project_name: My Project
repo_name: my_project
package_name: my_project_pkg
author_name: Your Name
author_email: your.email@example.com
install_precommit: yes
initialize_sphinx_documentation: yes Why: This configures your project identity and enables quality controls like linting and automated documentation.

Step 3: Post-generation automation The hook automatically performs these 5 actions for you:

Removes the LICENSE file if “No license” was chosen.
Removes .pre-commit-config.yaml if you opted out of pre-commits.
Generates a Sphinx documentation sub-project in docs/project_documentation/.
Initializes Git, runs uv sync to install dependencies, and makes the initial commit.
Copies .env.example to .env for your local settings. Why: This ensures you start with a “ready-to-code” project folder already under version control.

Step 4: Navigate into the generated project

$ cd my_project

Why: You must always run project commands from the root of your newly generated folder.

Step 5: Connect to your remote repository

# For GitHub
$ git remote add origin https://github.com/your-org/my_project.git
$ git push -u origin main

Why: This backs up your code and allows collaboration with your team.

Step 6: Verify the setup

$ make check

Why: This runs all quality checks (linting, types, tests) to confirm the project is correctly configured.

4. Generating a new Sphinx Documentation project¶

If you only need a standalone documentation site (without any Data Science code):

$ uv run --with cookiecutter cookiecutter git@github.com:malganis35/cookiecutter-astrodata-squeleton.git --directory="sphinx-docs"

Why: This uses a different template optimized for text and diagrams. In a standard Data Science project, a docs/ folder is already generated for you.

5. Generated project structure — annotated¶

Here is what your my_project folder structure looks like:

my_project/
├── .github/              # GitHub Actions CI/CD workflows
├── .gitlab/              # GitLab CI/CD pipelines
├── app/                  
│   └── streamlit_app.py  # Your Dashboard (Streamlit)
├── api/                  
│   └── main.py           # Your Backend (FastAPI)
├── data/                 # Raw/Processed data (Git ignored)
│   ├── raw/              # Source data
│   └── processed/        # Model-ready data
├── dockerfiles/          # Recipes for production images
├── src/                  
│   └── my_project_pkg/   # Your reusable Python code
│       ├── core/         # Paths, versioning, and shared logic
│       ├── data_io/      # Loading/saving logic
│       └── models/       # Training/inference logic
├── Makefile              # Your command shortcuts
└── pyproject.toml        # Master dependency list (managed by uv)

Path	Role & Newcomer Notes
`src/my_project_pkg/`	The core package. Any code here is importable in notebooks or APIs.
`app/`	The Streamlit data studio. Run it with `make run`.
`api/`	The FastAPI server. Access interactive docs at `http://localhost:8000/docs`.
`data/raw/`	Immutable source data. Never modify files here manually.
`.env`	Your private keys and secrets. Never commit this to Git.
`Makefile`	The entry point for all dev tasks. Run `make` to see help.

6. Core design decisions¶

a. uv as the single dependency manager¶

Rule: No pip, no conda, no virtualenv.
Why: uv is 10x faster and uses uv.lock to ensure deterministic builds.
Example: uv add pandas updates both your project and your lockfile.
Failure: Using pip manually leads to version mismatches across the team.

b. Immutable raw data pipeline¶

Rule: Data flows one-way: raw → interim → processed.
Why: Ensures your data science experiments are reproducible.
Example: Use a script to clean raw files and save them into processed.
Failure: Overwriting raw data means you can never re-verify your source.

c. Centralized path management¶

Rule: Never hardcode strings like "../../data/raw" or use os.getcwd().
Why: Code should run the same from a notebook, a terminal, or a Docker container.
Example: Use from my_project_pkg.core.utils.paths import RAW_DATA_DIR.
Failure: Your code works in Jupyter but crashes in the production API.

d. Notebooks vs. Source Code¶

Rule: Notebooks are for exploration; src/ is for reusable tested code.
Why: Notebooks are hard to test and version. Code in src/ is clean and professional.
Example: Prototype a model in a notebook, then move the class to src/my_project_pkg/models/.
Failure: You end up with 50 undocumented notebooks and no clear model pipeline.

e. Conventional Commits + Commitizen¶

Rule: Every commit must start with a type (e.g., feat:, fix:, docs:).
Why: Allows automated generation of the CHANGELOG and version numbers.
Example: feat: add xgboost model to pipeline.
Failure: You lose track of which version contains which feature.

7. Day-to-day workflows¶

7.1 Clone and Setup¶

$ git clone <repo_url>
$ make dev-install
$ cp .env.example .env  # Then edit .env with your keys

7.2 Running Locally¶

Dashboard: $ make run (Visualization and demo).
API: $ make run_api (Serving models).

7.3 Running with Docker¶

$ make docker-build
$ make up
# Explore at http://localhost:8501 (App) and http://localhost:8000/docs (API)
$ make down

7.4 Working with Data¶

Source files go to data/raw/.
Large files (>10MB) should be tracked with DVC (Data Version Control).

7.5 Import Pattern in Notebooks¶

Always use this pattern to access your project code:

from my_project_pkg.core.utils.paths import PROJECT_ROOT, RAW_DATA_DIR
from my_project_pkg.core.data_io import loaders

7.6 Quality Checks¶

Run these before every commit:

$ make format  # Prettifies code
$ make check   # Runs linting, type checks, and tests

8. All Makefile commands — reference table¶

Command	What it does	When to use it
`make install`	`uv sync` (Prod + Dev)	Basic environment setup.
`make dev-install`	`uv sync` + pre-commit	First time onboarding.
`make run`	Launch Streamlit	Exploring data.
`make run_api`	Launch FastAPI	Testing endpoints.
`make check`	Full QA (Lint + Type + Test)	Before every push.
`make bump`	Version bump (`cz bump`)	Preparing a release.

9. Environment variables reference¶

Variable	Required?	Default	Description
`API_MAX_UPLOAD_SIZE`	No	52428800	Max upload size (50MB) for API files.
`GITHUB_TOKEN`	Yes	-	Used by CI/CD workflows.
`DATABRICKS_TOKEN`	No	-	Access to remote Databricks clusters.

10. Troubleshooting / FAQ¶

Q1: Did the generation work? A: Check if my_project exists. Move into it and run ls -a. If no error appeared at the end of the command, it worked.

Q2: uv is not found A: Restart your terminal. If it still fails, ensure your ~/.local/bin is in your PATH.

Q3: Import error in Notebook A: Ensure you ran make dev-install and restarted your Jupyter kernel.

Q4: API returns 500 errors A: Check terminal logs. Ensure your .env is correctly filled and data is in data/raw/.

Q5: Data files are too large A: These are ignored by Git. Use DVC or a shared bucket for large datasets.

11. Contributing to the skeleton¶

To improve the skeleton itself:

Validate: $ make validate (generates a test project).
Digest: $ make digest (creates context for LLMs).
Release: $ make bump then $ make release.