Skip to content

Data Architecture

Data Science

Matching

MLOps

QA

Data Engineering

Data Ingestion Automation

Coresignal Data Ingestion

  • Google Cloud Workflows Console - Seeding Coresignal data is handled by Workflows
  • Google Cloud Scheduler Console - Workflows are triggered by cloud scheduler
  • To run a workflow outside the schedule - you need to select the job and click Force run it will trigger import from Coresignal datasource into BQ tables

Daily Batch Workflows

The following workflows run daily to ingest Coresignal data into BigQuery tables:

Companies Workflow:

Persons Workflow:

  • Daily All Coresignal Persons - Daily batch processing for person data
  • Target Table: AllCoresignalLinkedinPersons
  • Schedule: Daily execution

Both workflows are configured to run automatically via Google Cloud Scheduler and can be manually triggered through the Google Cloud Workflows console when needed.

Coresignal Daily Metrics

These two tables monitor and analyze changes in LinkedIn data from Coresignal between refreshes. They provide daily aggregated statistics for both LinkedIn person and company data.

Tables:

  • coresignal.CoresignalLinkedinPersonDailyStats
  • coresignal.CoresignalLinkedinCompanyDailyStats

How both tables work:

  • Each row represents one specific date with aggregated statistics for that day
  • The refresh_date field shows when data was processed (based on updated_at for person data, last_updated for company data)

Important: All metrics are calculated ONLY for records that were updated on that specific date, NOT the total table

Key metrics per row:

The following metrics apply to both tables:

Volume metrics

  • unique_profiles/companies = Number of unique records updated on this date
  • new_profiles/companies = Records seen for the first time on this date
  • updated_profiles/companies = Records that already existed but were updated on this date
  • deleted_profiles/companies = Records marked as deleted on this date

Data quality metrics (coverage percentages)

Below we have some examples of information you will find, but please check the table for the rest.

  • Person data: headline_coverage_pct, location_coverage_pct, experience_coverage_pct, education_coverage_pct
  • Company data: name_coverage_pct, description_coverage_pct, headquarters_country_coverage_pct

Example interpretation:

refresh_date | unique_profiles | new_profiles | updated_profiles | headline_coverage_pct
2024-01-15  | 1000           | 50          | 950            | 85.00
2024-01-16  | 1200           | 30          | 1170           | 88.50

This means:

  • Jan 15: 1000 profiles were updated, 50 were new, 950 were updates, 85% had headlines
  • Jan 16: 1200 profiles were updated, 30 were new, 1170 were updates, 88.5% had headlines

Usage:

SELECT * FROM coresignal.CoresignalLinkedinPersonDailyStats
WHERE refresh_date >= '2024-01-01'
ORDER BY refresh_date DESC;

SELECT * FROM coresignal.CoresignalLinkedinCompanyDailyStats
WHERE refresh_date >= '2024-01-01'
ORDER BY refresh_date DESC;

Alerts and Monitoring

1. Volume Drop Alert

Trigger: When daily unique profiles/companies drop

WITH daily_comparison AS (
  SELECT
    refresh_date,
    unique_profiles,
    LAG(unique_profiles) OVER (ORDER BY refresh_date) as prev_unique_profiles
  FROM coresignal.CoresignalLinkedinPersonDailyStats
)
SELECT refresh_date, unique_profiles, prev_unique_profiles,
       (unique_profiles - prev_unique_profiles) / prev_unique_profiles * 100 as pct_change
FROM daily_comparison

2. Data Quality Alert

Trigger: When field coverage drops significantly

-- Monitor headline coverage drop
WITH coverage_trend AS (
  SELECT
    refresh_date,
    headline_coverage_pct,
    LAG(headline_coverage_pct) OVER (ORDER BY refresh_date) as prev_coverage
  FROM coresignal.CoresignalLinkedinPersonDailyStats
)
SELECT refresh_date, headline_coverage_pct, prev_coverage,
       headline_coverage_pct - prev_coverage as coverage_change
FROM coverage_trend

Monitoring Queries

Daily Health Check

SELECT
  'Person Data' as data_type,
  unique_profiles as unique_count,
  new_profiles as new_count,
  updated_profiles as updated_count,
  headline_coverage_pct as coverage_pct_1,
  location_coverage_pct as coverage_pct_2
FROM coresignal.CoresignalLinkedinPersonDailyStats
WHERE refresh_date = CURRENT_DATE()
UNION ALL
SELECT
  'Company Data' as data_type,
  unique_companies as unique_count,
  new_companies as new_count,
  updated_companies as updated_count,
  name_coverage_pct as coverage_pct_1,
  description_coverage_pct as coverage_pct_2
FROM coresignal.CoresignalLinkedinCompanyDailyStats
WHERE refresh_date = CURRENT_DATE();