Scheduled Jobs

Repository: credinvest/cred-scheduled-jobs

The cred-scheduled-jobs repository is the single source of truth for all scheduled task definitions in CRED's infrastructure. It manages Cloud Scheduler jobs, Google Cloud Workflows, and the Cloud Run launcher services that bridge the two.

Repository Structure

cred-scheduled-jobs/
├── cloud-scheduler-jobs/      # Cloud Scheduler JSON definitions
│   ├── dev/                   # Development environment jobs
│   ├── staging/               # Staging environment jobs
│   ├── prod/                  # Production environment jobs
│   └── shared/                # Single-environment jobs (model, data pipelines)
├── google-workflows/          # Google Cloud Workflows (single environment)
│   └── <workflow-name>/
│       ├── manifest.json      # Workflow metadata (name, service account, etc.)
│       └── workflow.yaml      # Workflow DSL source
├── scheduled-cloud-run-code/  # Cloud Run launcher services
│   └── execute-*/             # One service per target application
│       ├── main.py
│       ├── requirements.txt
│       └── deploy.md
├── monitoring/                # Observability assets
│   └── cred-scheduler/
│       ├── grafana-dashboard.json
│       └── kube-state-metrics.yaml
├── scripts/                   # CI/CD helpers
│   ├── apply_scheduler_job.py
│   ├── apply_workflow.py
│   └── tests/
└── .github/workflows/         # GitHub Actions pipelines
    ├── apply-cloud-scheduler-jobs.yml
    └── apply-google-workflows.yml

How It Works

graph LR
    CS[Cloud Scheduler] -->|HTTP POST| CR[Cloud Run Launcher]
    CR -->|creates| K8[Kubernetes Job]
    CS -->|or triggers| GW[Google Workflow]
    GW -->|orchestrates| BQ[(BigQuery)]
    GW -->|orchestrates| DF[Dataflow]
    GW -->|orchestrates| DBT[dbt Cloud]
    K8 -->|runs in| GKE[GKE Cluster]

The general flow is:

  1. Cloud Scheduler fires on a cron schedule
  2. It calls a Cloud Run launcher (execute-* service) via HTTP
  3. The launcher creates a Kubernetes Job in the scheduler cluster
  4. Alternatively, Cloud Scheduler can trigger a Google Workflow directly for multi-step orchestration

Cloud Scheduler Jobs

Cloud Scheduler jobs are defined as JSON files, one per job, organized by environment.

Environment Layout

| Folder   | Environment                      | Applied on merge to main |
|----------|----------------------------------|--------------------------|
| dev/     | Development                      | Yes                      |
| staging/ | Staging                          | Yes                      |
| prod/    | Production                       | Yes                      |
| shared/  | Single-environment (model/data)  | Yes                      |

Job JSON Structure

Each JSON file contains the full Cloud Scheduler job definition:

{
  "schedule": "0 7 * * *",
  "timeZone": "Etc/UTC",
  "description": "CompanyFields",
  "httpTarget": {
    "uri": "https://us-central1-cred-1556636033881.cloudfunctions.net/execute-*",
    "httpMethod": "POST",
    "body": "<base64-encoded payload>",
    "oidcToken": {
      "serviceAccountEmail": "...",
      "audience": "..."
    }
  },
  "retryConfig": { "..." : "..." }
}

Key fields:

  • schedule — Cron expression (UTC unless timeZone overrides)
  • httpTarget.uri — The Cloud Run launcher endpoint to invoke
  • httpTarget.body — Base64-encoded JSON payload with jobLabel and command (see the encoding sketch after this list)
  • x-job-label — Custom label used for Grafana monitoring correlation
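
A minimal sketch of how such a body could be produced, assuming the payload is a JSON object with jobLabel and command keys as described above; the label and command values here are hypothetical, not taken from a real job definition:

import base64
import json

# Hypothetical payload; each real job defines its own jobLabel and command.
payload = {
    "jobLabel": "company-fields-sync",
    "command": ["python", "manage.py", "sync_company_fields"],
}

# httpTarget.body carries the JSON document base64-encoded.
encoded_body = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
print(encoded_body)

# The Cloud Run launcher reverses the process to recover the payload.
decoded = json.loads(base64.b64decode(encoded_body))
assert decoded == payload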

Job Categories

The 900+ scheduler jobs span these areas:

  • BQ-to-PG syncs — Transfer data from BigQuery to PostgreSQL (model and commercial)
  • BQ-to-ES syncs — Transfer data from BigQuery to Elasticsearch
  • Data exploration/scraping — Coresignal, LinkedIn, Transfermarkt, Wikidata, etc.
  • CRM sync — Salesforce, HubSpot, MS Dynamics read/write/update cycles
  • Notifications — News, saved search, deal takeaway, CXO position alerts
  • Maintenance — Table partitions, deleted row acknowledgment, stalled task recovery
  • Nylas sync — Email accounts, calendars, contacts, messages, threads
  • Sequences — Outreach email/LinkedIn send, pause, retry, recover steps

Google Cloud Workflows

Google Cloud Workflows handle multi-step orchestrations that are too complex for a single Cloud Scheduler → K8s Job pattern. They coordinate across BigQuery, Dataflow, dbt Cloud, Cloud Storage, and more.

Single Environment

All workflows live in a single GCP project (cred-1556636033881 / us-central1). Environment differentiation (dev/staging/prod) is handled via runtime parameters passed at execution time, not via folder layout.
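
As a hedged illustration (not the repository's own tooling), an execution could pass the environment as a runtime argument via the google-cloud-workflows Python client; the "env" parameter name is an assumption here, since each workflow.yaml defines the parameters it actually expects:

import json

from google.cloud.workflows.executions_v1 import Execution, ExecutionsClient

# All workflows live in one project/region; only the runtime argument differs per environment.
parent = "projects/cred-1556636033881/locations/us-central1/workflows/build-and-sync-model"

# Hypothetical argument shape ("env" key); the real parameter names come from workflow.yaml.
execution = Execution(argument=json.dumps({"env": "staging"}))

response = ExecutionsClient().create_execution(parent=parent, execution=execution)
print(response.name)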

Workflow Structure

Each workflow lives in its own folder under google-workflows/:

google-workflows/<workflow-name>/
├── manifest.json    # Workflow metadata
└── workflow.yaml    # Workflow DSL source

manifest.json defines the workflow resource:

{
  "name": "projects/cred-1556636033881/locations/us-central1/workflows/<workflow-name>",
  "serviceAccount": "projects/.../serviceAccounts/work-flow-run@cred-1556636033881.iam.gserviceaccount.com",
  "sourceFile": "workflow.yaml",
  "executionHistoryLevel": "EXECUTION_HISTORY_BASIC",
  "userEnvVars": {}
}

workflow.yaml contains the Google Workflows DSL — steps, conditionals, loops, API calls, and error handling.

Available Workflows

| Workflow                                     | Purpose                                           |
|----------------------------------------------|---------------------------------------------------|
| apptopia-traffic                             | Apptopia app traffic data import                  |
| build-and-sync-model                         | Build and sync model data                         |
| ch-xbrl-pipeline                             | Companies House XBRL financial data import        |
| coresignal-import-direct                     | Direct Coresignal data import to BigQuery         |
| coresignal-import-with-dbt                   | Coresignal import with dbt transformation         |
| coresignal-import-with-intermediary-workflow | Coresignal import via intermediary orchestration  |
| coresignal-person-import-workflow            | Coresignal person data import                     |
| daily-all-coresignal-companies               | Daily full Coresignal company sync                |
| daily-all-coresignal-persons                 | Daily full Coresignal person sync                 |
| daily-coresignal-linkedin-jobs               | Daily Coresignal LinkedIn job listings sync       |
| dbt-job-cleanup                              | Remove failed workflow jobs from dbt Cloud        |
| seed-coresignal-company                      | Seed initial Coresignal company data              |
| similarweb-keywords                          | SimilarWeb keyword data import                    |
| wikidata-processing                          | Wikidata entity import and processing             |

Secrets Management

No Inline Secrets

Secrets are never stored in userEnvVars or manifest files. They must be fetched at runtime via Secret Manager from inside workflow.yaml:

- get_dbt_token:
    call: googleapis.secretmanager.v1.projects.secrets.versions.accessString
    args:
        secret_id: dbt-cloud-api-token
    result: dbtToken

The shared secret dbt-cloud-api-token (project cred-1556636033881) is accessible by both service accounts used in workflows.

The userEnvVars field is always declared (typically as {}) in every manifest. This ensures any value set out-of-band via the GCP console is wiped on the next apply — enforcing that all configuration lives in Git.

Adding a New Workflow

  1. Create a new folder: google-workflows/<workflow-name>/
  2. Add a manifest.json with the workflow metadata (use an existing one as a template)
  3. Add a workflow.yaml with the workflow steps
  4. Validate locally:
    python3 scripts/apply_workflow.py --validate-only --dir google-workflows/<workflow-name>
    
  5. Open a PR — the workflow is applied to GCP automatically on merge to main

Cloud Run Launchers

The scheduled-cloud-run-code/execute-* folders contain lightweight Python services that:

  1. Receive an HTTP POST from Cloud Scheduler
  2. Decode the base64 payload to get the command and jobLabel
  3. Create a Kubernetes Job in the scheduler GKE cluster

Each launcher targets a specific application (e.g., execute-cred-model-cmd, execute-cred-api-commercial-cmd, execute-cred-elasticsearch-ingest-cmd).
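
A minimal sketch of that pattern, assuming a Flask entry point, a base64-encoded JSON body as described in the steps above, and the official kubernetes Python client; the namespace, container image, label key, and authentication call are illustrative rather than the real services' values:

import base64
import json

from flask import Flask, request
from kubernetes import client, config

app = Flask(__name__)

@app.route("/", methods=["POST"])
def launch():
    # Steps 1-2: receive the POST from Cloud Scheduler and decode the base64 payload.
    payload = json.loads(base64.b64decode(request.get_data()))
    job_label = payload["jobLabel"]
    command = payload["command"]

    # Step 3: create a Kubernetes Job in the scheduler GKE cluster.
    config.load_kube_config()  # assumption: cluster credentials are already configured
    job = client.V1Job(
        metadata=client.V1ObjectMeta(
            generate_name=f"{job_label}-",
            labels={"job-label": job_label},  # hypothetical key; correlates with the Grafana job label
        ),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[
                        client.V1Container(
                            name="task",
                            image="example.pkg.dev/cred/app:latest",  # hypothetical image
                            command=command,
                        )
                    ],
                )
            )
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="scheduler", body=job)  # hypothetical namespace
    return {"status": "created", "jobLabel": job_label}, 201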

CI/CD Pipelines

Both pipelines trigger on push to main and run unit tests before applying changes.

Cloud Scheduler Pipeline

File: .github/workflows/apply-cloud-scheduler-jobs.yml

  1. Runs unit tests (scripts/tests/)
  2. Detects which scheduler JSON files changed (see the sketch after this list)
  3. Applies changed jobs per environment matrix (dev, staging, prod, shared)
  4. Authenticates to GCP via Workload Identity Federation
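
As an illustration of the change-detection step (a sketch, not the pipeline's actual implementation), a git diff over cloud-scheduler-jobs/ is enough to scope the apply to changed definitions:

import subprocess

# Hypothetical helper: list scheduler JSON files touched between two refs.
def changed_scheduler_jobs(base_ref: str = "HEAD~1", head_ref: str = "HEAD") -> list[str]:
    diff = subprocess.run(
        ["git", "diff", "--name-only", base_ref, head_ref, "--", "cloud-scheduler-jobs/"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [path for path in diff.stdout.splitlines() if path.endswith(".json")]

if __name__ == "__main__":
    for path in changed_scheduler_jobs():
        print(path)  # each changed job is then applied in its environment's matrix job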

Google Workflows Pipeline

File: .github/workflows/apply-google-workflows.yml

  1. Runs unit tests (scripts/tests/)
  2. Detects which workflow folders have changed (manifest.json or workflow.yaml)
  3. Validates manifests with --validate-only
  4. Applies changed workflows to GCP
  5. Authenticates via Workload Identity Federation using the prod GitHub Environment

Full Re-Apply

The Google Workflows pipeline supports workflow_dispatch with an apply_all option to re-apply all workflow folders, not just changed ones.

No Deletion Support

The pipeline fails fast if it detects a workflow folder was deleted. Workflow deletion from GCP must be done manually — the apply helper does not support it.

Monitoring

Job execution is tracked via a Grafana dashboard:

  • URL: https://grafana.cred.internal/d/cred-scheduler-jobs/cred-scheduler-jobs
  • Access: Twingate VPN or Google IAP access to the pgwatch VM

Filter by Environment, Namespace, and Scheduler job label to find specific job runs.

The monitoring/cred-scheduler/ folder contains:

  • grafana-dashboard.json — The dashboard definition
  • kube-state-metrics.yaml — Kubernetes metrics collection config

Quick Reference

| Task                      | How                                                                          |
|---------------------------|------------------------------------------------------------------------------|
| Find a scheduler job      | Browse cloud-scheduler-jobs/{env}/ for the JSON file                          |
| Check job schedule        | Read schedule and timeZone fields in the JSON                                 |
| Find the launcher         | Match httpTarget.uri to scheduled-cloud-run-code/execute-*                    |
| Check runtime status      | Grafana dashboard, filter by Environment + Namespace + job label              |
| Validate a scheduler JSON | python3 scripts/apply_scheduler_job.py --validate-only --file <path>          |
| Validate a workflow       | python3 scripts/apply_workflow.py --validate-only --dir google-workflows/<name> |
| Run unit tests            | python3 -m unittest discover -v -s scripts/tests -p "test_*.py"               |
| Add a new workflow        | Create folder in google-workflows/, add manifest.json + workflow.yaml, open PR |
| Re-apply all workflows    | Trigger Apply Google Workflows action with apply_all: true                    |