Scheduled Jobs

Repository: credinvest/cred-scheduled-jobs

The cred-scheduled-jobs repository is the single source of truth for all scheduled task definitions in CRED's infrastructure. It manages Cloud Scheduler jobs, Google Cloud Workflows, and the Cloud Run launcher services that bridge the two.

Repository Structure

cred-scheduled-jobs/
├── cloud-scheduler-jobs/      # Cloud Scheduler JSON definitions
│   ├── dev/                   # Development environment jobs
│   ├── staging/               # Staging environment jobs
│   ├── prod/                  # Production environment jobs
│   └── shared/                # Single-environment jobs (model, data pipelines)
├── google-workflows/          # Google Cloud Workflows (single environment)
│   └── <workflow-name>/
│       ├── manifest.json      # Workflow metadata (name, service account, etc.)
│       └── workflow.yaml      # Workflow DSL source
├── scheduled-cloud-run-code/  # Cloud Run launcher services
│   └── execute-*/             # One service per target application
│       ├── main.py
│       ├── requirements.txt
│       └── deploy.md
├── monitoring/                # Observability assets
│   └── cred-scheduler/
│       ├── grafana-dashboard.json
│       └── kube-state-metrics.yaml
├── scripts/                   # CI/CD helpers
│   ├── apply_scheduler_job.py
│   ├── apply_workflow.py
│   └── tests/
└── .github/workflows/         # GitHub Actions pipelines
    ├── apply-cloud-scheduler-jobs.yml
    └── apply-google-workflows.yml

How It Works

graph LR
    CS[Cloud Scheduler] -->|HTTP POST| CR[Cloud Run Launcher]
    CR -->|creates| K8[Kubernetes Job]
    CS -->|or triggers| GW[Google Workflow]
    GW -->|orchestrates| BQ[(BigQuery)]
    GW -->|orchestrates| DF[Dataflow]
    GW -->|orchestrates| DBT[dbt Cloud]
    K8 -->|runs in| GKE[GKE Cluster]

The general flow is:

  1. Cloud Scheduler fires on a cron schedule
  2. It calls a Cloud Run launcher (execute-* service) via HTTP
  3. The launcher creates a Kubernetes Job in the scheduler cluster
  4. Alternatively, Cloud Scheduler can trigger a Google Workflow directly for multi-step orchestration

Cloud Scheduler Jobs

Cloud Scheduler jobs are defined as JSON files, one per job, organized by environment.

Environment Layout

| Folder   | Environment                      | Applied on merge to main |
|----------|----------------------------------|--------------------------|
| dev/     | Development                      | Yes                      |
| staging/ | Staging                          | Yes                      |
| prod/    | Production                       | Yes                      |
| shared/  | Single-environment (model/data)  | Yes                      |

Job JSON Structure

Each JSON file contains the full Cloud Scheduler job definition:

{
  "schedule": "0 7 * * *",
  "timeZone": "Etc/UTC",
  "description": "CompanyFields",
  "httpTarget": {
    "uri": "https://us-central1-cred-1556636033881.cloudfunctions.net/execute-*",
    "httpMethod": "POST",
    "body": "<base64-encoded payload>",
    "oidcToken": {
      "serviceAccountEmail": "...",
      "audience": "..."
    }
  },
  "retryConfig": { "..." : "..." }
}

Key fields:

  • schedule — Cron expression (UTC unless timeZone overrides)
  • httpTarget.uri — The Cloud Run launcher endpoint to invoke
  • httpTarget.body — Base64-encoded JSON payload with jobLabel and command (see the encoding sketch after this list)
  • x-job-label — Custom label used for Grafana monitoring correlation
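
A minimal sketch of how such a body could be produced, assuming the payload is a JSON object with jobLabel and command keys as described above; the label and command values here are hypothetical, not taken from a real job definition:

import base64
import json

# Hypothetical payload; each real job defines its own jobLabel and command.
payload = {
    "jobLabel": "company-fields-sync",
    "command": ["python", "manage.py", "sync_company_fields"],
}

# httpTarget.body carries the JSON document base64-encoded.
encoded_body = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
print(encoded_body)

# The Cloud Run launcher reverses the process to recover the payload.
decoded = json.loads(base64.b64decode(encoded_body))
assert decoded == payload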

Job Categories

The 900+ scheduler jobs span these areas:

  • BQ-to-PG syncs — Transfer data from BigQuery to PostgreSQL (model and commercial)
  • BQ-to-ES syncs — Transfer data from BigQuery to Elasticsearch
  • Data exploration/scraping — Coresignal, LinkedIn, Transfermarkt, Wikidata, etc.
  • CRM sync — Salesforce, HubSpot, MS Dynamics read/write/update cycles
  • Notifications — News, saved search, deal takeaway, CXO position alerts
  • Maintenance — Table partitions, deleted row acknowledgment, stalled task recovery
  • Nylas sync — Email accounts, calendars, contacts, messages, threads
  • Sequences — Outreach email/LinkedIn send, pause, retry, recover steps

Google Cloud Workflows

Google Cloud Workflows handle multi-step orchestrations that are too complex for a single Cloud Scheduler → K8s Job pattern. They coordinate across BigQuery, Dataflow, dbt Cloud, Cloud Storage, and more.

Single Environment

All workflows live in a single GCP project (cred-1556636033881 / us-central1). Environment differentiation (dev/staging/prod) is handled via runtime parameters passed at execution time, not via folder layout.
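
As a hedged illustration (not the repository's own tooling), an execution could pass the environment as a runtime argument via the google-cloud-workflows Python client; the "env" parameter name is an assumption here, since each workflow.yaml defines the parameters it actually expects:

import json

from google.cloud.workflows.executions_v1 import Execution, ExecutionsClient

# All workflows live in one project/region; only the runtime argument differs per environment.
parent = "projects/cred-1556636033881/locations/us-central1/workflows/build-and-sync-model"

# Hypothetical argument shape ("env" key); the real parameter names come from workflow.yaml.
execution = Execution(argument=json.dumps({"env": "staging"}))

response = ExecutionsClient().create_execution(parent=parent, execution=execution)
print(response.name)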

Workflow Structure

Each workflow lives in its own folder under google-workflows/:

google-workflows/<workflow-name>/
├── manifest.json    # Workflow metadata
└── workflow.yaml    # Workflow DSL source

manifest.json defines the workflow resource:

{
  "name": "projects/cred-1556636033881/locations/us-central1/workflows/<workflow-name>",
  "serviceAccount": "projects/.../serviceAccounts/work-flow-run@cred-1556636033881.iam.gserviceaccount.com",
  "sourceFile": "workflow.yaml",
  "executionHistoryLevel": "EXECUTION_HISTORY_BASIC",
  "userEnvVars": {}
}

workflow.yaml contains the Google Workflows DSL — steps, conditionals, loops, API calls, and error handling.

Available Workflows

| Workflow                                     | Purpose                                           |
|----------------------------------------------|---------------------------------------------------|
| apptopia-traffic                             | Apptopia app traffic data import                  |
| build-and-sync-model                         | Build and sync model data                         |
| ch-xbrl-pipeline                             | Companies House XBRL financial data import        |
| coresignal-import-direct                     | Direct Coresignal data import to BigQuery         |
| coresignal-import-with-dbt                   | Coresignal import with dbt transformation         |
| coresignal-import-with-intermediary-workflow | Coresignal import via intermediary orchestration  |
| coresignal-person-import-workflow            | Coresignal person data import                     |
| daily-all-coresignal-companies               | Daily full Coresignal company sync                |
| daily-all-coresignal-persons                 | Daily full Coresignal person sync                 |
| daily-coresignal-linkedin-jobs               | Daily Coresignal LinkedIn job listings sync       |
| dbt-job-cleanup                              | Remove failed workflow jobs from dbt Cloud        |
| seed-coresignal-company                      | Seed initial Coresignal company data              |
| similarweb-keywords                          | SimilarWeb keyword data import                    |
| wikidata-processing                          | Wikidata entity import and processing             |

Secrets Management

No Inline Secrets

Secrets are never stored in userEnvVars or manifest files. They must be fetched at runtime via Secret Manager from inside workflow.yaml:

- get_dbt_token:
    call: googleapis.secretmanager.v1.projects.secrets.versions.accessString
    args:
        secret_id: dbt-cloud-api-token
    result: dbtToken

The shared secret dbt-cloud-api-token (project cred-1556636033881) is accessible by both service accounts used in workflows.

The userEnvVars field is always declared (typically as {}) in every manifest. This ensures any value set out-of-band via the GCP console is wiped on the next apply — enforcing that all configuration lives in Git.

Adding a New Workflow

  1. Create a new folder: google-workflows/<workflow-name>/
  2. Add a manifest.json with the workflow metadata (use an existing one as a template)
  3. Add a workflow.yaml with the workflow steps
  4. Validate locally:
    python3 scripts/apply_workflow.py --validate-only --dir google-workflows/<workflow-name>
    
  5. Open a PR — the workflow is applied to GCP automatically on merge to main

Cloud Run Launchers

The scheduled-cloud-run-code/execute-* folders contain lightweight Python services that:

  1. Receive an HTTP POST from Cloud Scheduler
  2. Decode the base64 payload to get the command and jobLabel
  3. Create a Kubernetes Job in the scheduler GKE cluster

Each launcher targets a specific application (e.g., execute-cred-model-cmd, execute-cred-api-commercial-cmd, execute-cred-elasticsearch-ingest-cmd).
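
A minimal sketch of that pattern, assuming a Flask entry point, a base64-encoded JSON body as described in the steps above, and the official kubernetes Python client; the namespace, container image, label key, and authentication call are illustrative rather than the real services' values:

import base64
import json

from flask import Flask, request
from kubernetes import client, config

app = Flask(__name__)

@app.route("/", methods=["POST"])
def launch():
    # Steps 1-2: receive the POST from Cloud Scheduler and decode the base64 payload.
    payload = json.loads(base64.b64decode(request.get_data()))
    job_label = payload["jobLabel"]
    command = payload["command"]

    # Step 3: create a Kubernetes Job in the scheduler GKE cluster.
    config.load_kube_config()  # assumption: cluster credentials are already configured
    job = client.V1Job(
        metadata=client.V1ObjectMeta(
            generate_name=f"{job_label}-",
            labels={"job-label": job_label},  # hypothetical key; correlates with the Grafana job label
        ),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[
                        client.V1Container(
                            name="task",
                            image="example.pkg.dev/cred/app:latest",  # hypothetical image
                            command=command,
                        )
                    ],
                )
            )
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="scheduler", body=job)  # hypothetical namespace
    return {"status": "created", "jobLabel": job_label}, 201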

CI/CD Pipelines

Both pipelines trigger on push to main and run unit tests before applying changes.

Cloud Scheduler Pipeline

File: .github/workflows/apply-cloud-scheduler-jobs.yml

  1. Runs unit tests (scripts/tests/)
  2. Detects which scheduler JSON files changed (see the sketch after this list)
  3. Applies changed jobs per environment matrix (dev, staging, prod, shared)
  4. Authenticates to GCP via Workload Identity Federation
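
As an illustration of the change-detection step (a sketch, not the pipeline's actual implementation), a git diff over cloud-scheduler-jobs/ is enough to scope the apply to changed definitions:

import subprocess

# Hypothetical helper: list scheduler JSON files touched between two refs.
def changed_scheduler_jobs(base_ref: str = "HEAD~1", head_ref: str = "HEAD") -> list[str]:
    diff = subprocess.run(
        ["git", "diff", "--name-only", base_ref, head_ref, "--", "cloud-scheduler-jobs/"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [path for path in diff.stdout.splitlines() if path.endswith(".json")]

if __name__ == "__main__":
    for path in changed_scheduler_jobs():
        print(path)  # each changed job is then applied in its environment's matrix job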

Google Workflows Pipeline

File: .github/workflows/apply-google-workflows.yml

  1. Runs unit tests (scripts/tests/)
  2. Detects which workflow folders have changed (manifest.json or workflow.yaml)
  3. Validates manifests with --validate-only
  4. Applies changed workflows to GCP
  5. Authenticates via Workload Identity Federation using the prod GitHub Environment

Full Re-Apply

The Google Workflows pipeline supports workflow_dispatch with an apply_all option to re-apply all workflow folders, not just changed ones.

No Deletion Support

The pipeline fails fast if it detects a workflow folder was deleted. Workflow deletion from GCP must be done manually — the apply helper does not support it.

Monitoring

Job execution is tracked via a Grafana dashboard:

  • URL: https://grafana.cred.internal/d/cred-scheduler-jobs/cred-scheduler-jobs
  • Access: Twingate VPN or Google IAP access to the pgwatch VM

Filter by Environment, Namespace, and Scheduler job label to find specific job runs.

The monitoring/cred-scheduler/ folder contains:

  • grafana-dashboard.json — The dashboard definition
  • kube-state-metrics.yaml — Kubernetes metrics collection config

Quick Reference

| Task                      | How                                                                          |
|---------------------------|------------------------------------------------------------------------------|
| Find a scheduler job      | Browse cloud-scheduler-jobs/{env}/ for the JSON file                          |
| Check job schedule        | Read schedule and timeZone fields in the JSON                                 |
| Find the launcher         | Match httpTarget.uri to scheduled-cloud-run-code/execute-*                    |
| Check runtime status      | Grafana dashboard, filter by Environment + Namespace + job label              |
| Validate a scheduler JSON | python3 scripts/apply_scheduler_job.py --validate-only --file <path>          |
| Validate a workflow       | python3 scripts/apply_workflow.py --validate-only --dir google-workflows/<name> |
| Run unit tests            | python3 -m unittest discover -v -s scripts/tests -p "test_*.py"               |
| Add a new workflow        | Create folder in google-workflows/, add manifest.json + workflow.yaml, open PR |
| Re-apply all workflows    | Trigger Apply Google Workflows action with apply_all: true                    |