# Scheduled Jobs

**Repository:** `credinvest/cred-scheduled-jobs`
The cred-scheduled-jobs repository is the single source of truth for all scheduled task definitions in CRED's infrastructure. It manages Cloud Scheduler jobs, Google Cloud Workflows, and the Cloud Run launcher services that bridge the two.
## Repository Structure

```text
cred-scheduled-jobs/
├── cloud-scheduler-jobs/        # Cloud Scheduler JSON definitions
│   ├── dev/                     # Development environment jobs
│   ├── staging/                 # Staging environment jobs
│   ├── prod/                    # Production environment jobs
│   └── shared/                  # Single-environment jobs (model, data pipelines)
├── google-workflows/            # Google Cloud Workflows (single environment)
│   └── <workflow-name>/
│       ├── manifest.json        # Workflow metadata (name, service account, etc.)
│       └── workflow.yaml        # Workflow DSL source
├── scheduled-cloud-run-code/    # Cloud Run launcher services
│   └── execute-*/               # One service per target application
│       ├── main.py
│       ├── requirements.txt
│       └── deploy.md
├── monitoring/                  # Observability assets
│   └── cred-scheduler/
│       ├── grafana-dashboard.json
│       └── kube-state-metrics.yaml
├── scripts/                     # CI/CD helpers
│   ├── apply_scheduler_job.py
│   ├── apply_workflow.py
│   └── tests/
└── .github/workflows/           # GitHub Actions pipelines
    ├── apply-cloud-scheduler-jobs.yml
    └── apply-google-workflows.yml
```
## How It Works

```mermaid
graph LR
    CS[Cloud Scheduler] -->|HTTP POST| CR[Cloud Run Launcher]
    CR -->|creates| K8[Kubernetes Job]
    CS -->|or triggers| GW[Google Workflow]
    GW -->|orchestrates| BQ[(BigQuery)]
    GW -->|orchestrates| DF[Dataflow]
    GW -->|orchestrates| DBT[dbt Cloud]
    K8 -->|runs in| GKE[GKE Cluster]
```
The general flow is:

- Cloud Scheduler fires on a cron schedule
- It calls a Cloud Run launcher (`execute-*` service) via HTTP, sending the payload sketched below
- The launcher creates a Kubernetes Job in the scheduler cluster
- Alternatively, Cloud Scheduler can trigger a Google Workflow directly for multi-step orchestration
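For illustration, the body a scheduler job POSTs to a launcher can be produced with a few lines of Python. The `jobLabel` and `command` keys come from the job JSON contract described below; the label and command values here are invented:

```python
import base64
import json

# Illustrative payload for an execute-* launcher. The jobLabel/command
# keys match the httpTarget.body contract; the values are made up.
payload = {
    "jobLabel": "company-fields-sync",
    "command": ["python", "manage.py", "sync-company-fields"],
}

# The job JSON stores this body base64-encoded (see "Job JSON Structure").
body = base64.b64encode(json.dumps(payload).encode()).decode()
print(body)
```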
## Cloud Scheduler Jobs

Cloud Scheduler jobs are defined as JSON files, one per job, organized by environment.

### Environment Layout

| Folder | Environment | Applied on merge to `main` |
|---|---|---|
| `dev/` | Development | Yes |
| `staging/` | Staging | Yes |
| `prod/` | Production | Yes |
| `shared/` | Single-environment (model/data) | Yes |
### Job JSON Structure

Each JSON file contains the full Cloud Scheduler job definition:

```json
{
  "schedule": "0 7 * * *",
  "timeZone": "Etc/UTC",
  "description": "CompanyFields",
  "httpTarget": {
    "uri": "https://us-central1-cred-1556636033881.cloudfunctions.net/execute-*",
    "httpMethod": "POST",
    "body": "<base64-encoded payload>",
    "oidcToken": {
      "serviceAccountEmail": "...",
      "audience": "..."
    }
  },
  "retryConfig": { "...": "..." }
}
```
Key fields:

- `schedule` — Cron expression (UTC unless `timeZone` overrides)
- `httpTarget.uri` — The Cloud Run launcher endpoint to invoke
- `httpTarget.body` — Base64-encoded JSON payload with `jobLabel` and `command`
- `x-job-label` — Custom label used for Grafana monitoring correlation
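To check what an existing job actually sends, you can decode its `httpTarget.body` locally. A minimal sketch; the job file path below is invented:

```python
import base64
import json
from pathlib import Path

# Load any job definition; this path is a made-up example.
job = json.loads(Path("cloud-scheduler-jobs/prod/company-fields.json").read_text())

# The POST body is stored base64-encoded in the job JSON.
payload = json.loads(base64.b64decode(job["httpTarget"]["body"]))
print(job["schedule"], job.get("timeZone", "Etc/UTC"))
print(payload["jobLabel"], payload["command"])
```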
### Job Categories

The roughly 900 scheduler jobs span these areas:
- BQ-to-PG syncs — Transfer data from BigQuery to PostgreSQL (model and commercial)
- BQ-to-ES syncs — Transfer data from BigQuery to Elasticsearch
- Data exploration/scraping — Coresignal, LinkedIn, Transfermarkt, Wikidata, etc.
- CRM sync — Salesforce, HubSpot, MS Dynamics read/write/update cycles
- Notifications — News, saved search, deal takeaway, CXO position alerts
- Maintenance — Table partitions, deleted row acknowledgment, stalled task recovery
- Nylas sync — Email accounts, calendars, contacts, messages, threads
- Sequences — Outreach email/LinkedIn send, pause, retry, recover steps
## Google Cloud Workflows

Google Cloud Workflows handle multi-step orchestrations that are too complex for a single Cloud Scheduler → K8s Job pattern. They coordinate across BigQuery, Dataflow, dbt Cloud, Cloud Storage, and more.

### Single Environment

All workflows live in a single GCP project (`cred-1556636033881` / `us-central1`). Environment differentiation (dev/staging/prod) is handled via runtime parameters passed at execution time, not via folder layout.
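As a sketch of what that means in practice, an execution can be started with the environment passed as a runtime argument. The workflow name is real (see the table below), but the `{"env": ...}` argument shape is an assumption for this example, not a documented contract:

```python
import json

from google.cloud.workflows import executions_v1

# The argument JSON shape here is assumed for illustration.
client = executions_v1.ExecutionsClient()
parent = "projects/cred-1556636033881/locations/us-central1/workflows/build-and-sync-model"
execution = client.create_execution(
    parent=parent,
    execution=executions_v1.Execution(argument=json.dumps({"env": "staging"})),
)
print(execution.name)
```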
### Workflow Structure

Each workflow lives in its own folder under `google-workflows/`:

```text
google-workflows/<workflow-name>/
├── manifest.json   # Workflow metadata
└── workflow.yaml   # Workflow DSL source
```
`manifest.json` defines the workflow resource:

```json
{
  "name": "projects/cred-1556636033881/locations/us-central1/workflows/<workflow-name>",
  "serviceAccount": "projects/.../serviceAccounts/work-flow-run@cred-1556636033881.iam.gserviceaccount.com",
  "sourceFile": "workflow.yaml",
  "executionHistoryLevel": "EXECUTION_HISTORY_BASIC",
  "userEnvVars": {}
}
```

`workflow.yaml` contains the Google Workflows DSL — steps, conditionals, loops, API calls, and error handling.
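A minimal sketch of the DSL's shape, assuming a trivial BigQuery step (the step names and `SELECT 1` query are invented; the BigQuery connector call is a standard Workflows connector):

```yaml
main:
  params: [args]
  steps:
    - init:
        assign:
          - project: "cred-1556636033881"
    - run_query:
        call: googleapis.bigquery.v2.jobs.query
        args:
          projectId: ${project}
          body:
            query: "SELECT 1"
            useLegacySql: false
        result: queryResult
    - done:
        return: ${queryResult}
```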
### Available Workflows

| Workflow | Purpose |
|---|---|
| `apptopia-traffic` | Apptopia app traffic data import |
| `build-and-sync-model` | Build and sync model data |
| `ch-xbrl-pipeline` | Companies House XBRL financial data import |
| `coresignal-import-direct` | Direct Coresignal data import to BigQuery |
| `coresignal-import-with-dbt` | Coresignal import with dbt transformation |
| `coresignal-import-with-intermediary-workflow` | Coresignal import via intermediary orchestration |
| `coresignal-person-import-workflow` | Coresignal person data import |
| `daily-all-coresignal-companies` | Daily full Coresignal company sync |
| `daily-all-coresignal-persons` | Daily full Coresignal person sync |
| `daily-coresignal-linkedin-jobs` | Daily Coresignal LinkedIn job listings sync |
| `dbt-job-cleanup` | Remove failed workflow jobs from dbt Cloud |
| `seed-coresignal-company` | Seed initial Coresignal company data |
| `similarweb-keywords` | SimilarWeb keyword data import |
| `wikidata-processing` | Wikidata entity import and processing |
### Secrets Management

**No Inline Secrets**

Secrets are never stored in `userEnvVars` or manifest files. They must be fetched at runtime via Secret Manager from inside `workflow.yaml`:

```yaml
- get_dbt_token:
    call: googleapis.secretmanager.v1.projects.secrets.versions.accessString
    args:
      secret_id: dbt-cloud-api-token
    result: dbtToken
```

The shared secret `dbt-cloud-api-token` (project `cred-1556636033881`) is accessible by both service accounts used in workflows.

The `userEnvVars` field is always declared (typically as `{}`) in every manifest. This ensures any value set out-of-band via the GCP console is wiped on the next apply — enforcing that all configuration lives in Git.
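A quick way to enforce this convention in review, as an illustrative check (not an existing script in `scripts/`):

```python
import json
import sys
from pathlib import Path

# Illustrative lint, not part of this repo: every manifest must declare
# userEnvVars, and it must be empty so no config sneaks in outside Git.
failures = []
for manifest in Path("google-workflows").glob("*/manifest.json"):
    data = json.loads(manifest.read_text())
    if "userEnvVars" not in data or data["userEnvVars"]:
        failures.append(str(manifest))

if failures:
    sys.exit(f"Missing or non-empty userEnvVars in: {', '.join(failures)}")
print("All manifests clean.")
```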
### Adding a New Workflow

- Create a new folder: `google-workflows/<workflow-name>/`
- Add a `manifest.json` with the workflow metadata (use an existing one as a template)
- Add a `workflow.yaml` with the workflow steps
- Validate locally: `python3 scripts/apply_workflow.py --validate-only --dir google-workflows/<workflow-name>`
- Open a PR — the workflow is applied to GCP automatically on merge to `main`
## Cloud Run Launchers

The `scheduled-cloud-run-code/execute-*` folders contain lightweight Python services that:

- Receive an HTTP POST from Cloud Scheduler
- Decode the base64 payload to get the `command` and `jobLabel`
- Create a Kubernetes Job in the scheduler GKE cluster

Each launcher targets a specific application (e.g., `execute-cred-model-cmd`, `execute-cred-api-commercial-cmd`, `execute-cred-elasticsearch-ingest-cmd`).
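For orientation, here is a minimal sketch of what such a launcher does, assuming Flask and the official `kubernetes` client. The container image, namespace, and cluster auth are placeholders; the real `execute-*` services will differ:

```python
import base64
import json
import os

from flask import Flask, request
from kubernetes import client, config

app = Flask(__name__)

# Placeholder auth: the real services authenticate to the scheduler
# GKE cluster; a local kubeconfig stands in for that here.
config.load_kube_config()


@app.post("/")
def launch():
    # Decode the base64 payload described in "Job JSON Structure".
    payload = json.loads(base64.b64decode(request.get_data()))

    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(
            generate_name=f"{payload['jobLabel']}-",
            labels={"x-job-label": payload["jobLabel"]},  # Grafana correlation
        ),
        spec=client.V1JobSpec(
            backoff_limit=0,
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[
                        client.V1Container(
                            name="task",
                            image=os.environ["TARGET_IMAGE"],  # assumed env var
                            command=payload["command"],
                        )
                    ],
                )
            ),
        ),
    )
    # The "scheduler" namespace is an assumption for this sketch.
    client.BatchV1Api().create_namespaced_job(namespace="scheduler", body=job)
    return "", 204
```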
## CI/CD Pipelines

Both pipelines trigger on push to `main` and run unit tests before applying changes.

### Cloud Scheduler Pipeline

File: `.github/workflows/apply-cloud-scheduler-jobs.yml`

- Runs unit tests (`scripts/tests/`)
- Detects which scheduler JSON files changed
- Applies changed jobs per environment matrix (dev, staging, prod, shared)
- Authenticates to GCP via Workload Identity Federation
### Google Workflows Pipeline

File: `.github/workflows/apply-google-workflows.yml`

- Runs unit tests (`scripts/tests/`)
- Detects which workflow folders have changed (`manifest.json` or `workflow.yaml`)
- Validates manifests with `--validate-only`
- Applies changed workflows to GCP
- Authenticates via Workload Identity Federation using the `prod` GitHub Environment
**Full Re-Apply**

The Google Workflows pipeline supports `workflow_dispatch` with an `apply_all` option to re-apply all workflow folders, not just changed ones.

**No Deletion Support**

The pipeline fails fast if it detects a workflow folder was deleted. Workflow deletion from GCP must be done manually — the apply helper does not support it.
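An illustrative version of such a fail-fast check (the real pipeline's implementation may differ, including how it picks the diff range):

```python
import subprocess
import sys

# Illustrative fail-fast check, not the pipeline's actual code: abort
# the apply if any file under google-workflows/ was deleted in the push.
diff = subprocess.run(
    ["git", "diff", "--name-status", "HEAD~1..HEAD", "--", "google-workflows/"],
    capture_output=True, text=True, check=True,
).stdout

deleted = [line.split("\t", 1)[1] for line in diff.splitlines() if line.startswith("D")]
if deleted:
    sys.exit(f"Workflow files were deleted; remove them from GCP manually: {deleted}")
```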
## Monitoring

Job execution is tracked via a Grafana dashboard:

- URL: `https://grafana.cred.internal/d/cred-scheduler-jobs/cred-scheduler-jobs`
- Access: Twingate VPN or Google IAP access to the `pgwatch` VM

Filter by Environment, Namespace, and Scheduler job label to find specific job runs.

The `monitoring/cred-scheduler/` folder contains:

- `grafana-dashboard.json` — The dashboard definition
- `kube-state-metrics.yaml` — Kubernetes metrics collection config
## Quick Reference

| Task | How |
|---|---|
| Find a scheduler job | Browse `cloud-scheduler-jobs/{env}/` for the JSON file |
| Check job schedule | Read `schedule` and `timeZone` fields in the JSON |
| Find the launcher | Match `httpTarget.uri` to `scheduled-cloud-run-code/execute-*` |
| Check runtime status | Grafana dashboard, filter by Environment + Namespace + job label |
| Validate a scheduler JSON | `python3 scripts/apply_scheduler_job.py --validate-only --file <path>` |
| Validate a workflow | `python3 scripts/apply_workflow.py --validate-only --dir google-workflows/<name>` |
| Run unit tests | `python3 -m unittest discover -v -s scripts/tests -p "test_*.py"` |
| Add a new workflow | Create folder in `google-workflows/`, add `manifest.json` + `workflow.yaml`, open PR |
| Re-apply all workflows | Trigger the Apply Google Workflows action with `apply_all: true` |