API Reference

Endpoints

POST /batch-trigger/

Receives batches of LinkedIn profiles from Pub/Sub and triggers BrightData scraping.

Service: priority-pipeline-api

Endpoint: POST /batch-trigger/

Request Body

Pub/Sub push format:

{
  "message": {
    "data": "base64_encoded_batch_data",
    "attributes": {
      "batch_number": "1",
      "total_batches": "50"
    }
  }
}

Batch Data Format (after base64-decoding message.data):

{
  "profiles": [
    {
      "linkedin_username": "john-doe-123456",
      "person_id": "uuid-here"
    }
  ]
}
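The decode step can be sketched in Python. The helper name decode_batch is illustrative, not part of the service:

```python
import base64
import json

def decode_batch(push_body):
    """Decode the base64 `data` field of a Pub/Sub push message
    into the batch's list of profiles."""
    raw = base64.b64decode(push_body["message"]["data"])
    return json.loads(raw)["profiles"]

# Build a push body the way Pub/Sub would deliver it.
batch = {"profiles": [{"linkedin_username": "john-doe-123456",
                       "person_id": "uuid-here"}]}
push_body = {
    "message": {
        "data": base64.b64encode(json.dumps(batch).encode()).decode(),
        "attributes": {"batch_number": "1", "total_batches": "50"},
    }
}

print(decode_batch(push_body)[0]["linkedin_username"])  # john-doe-123456
```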

Response

{
  "status": "success",
  "profiles_triggered": 20,
  "batch_number": 1
}

POST /webhook/linkedin

Receives scraped profile data from BrightData.

Service: priority-pipeline-webhook

Endpoint: POST /webhook/linkedin

Request Body

BrightData webhook format:

{
  "url": "https://linkedin.com/in/john-doe-123456",
  "data": {
    "profile": {
      "name": "John Doe",
      "headline": "CEO at Company",
      "location": "San Francisco, CA"
    }
  }
}

Response

{
  "status": "success",
  "saved_to_bigquery": true,
  "backed_up_to_gcs": true
}
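The webhook payload identifies the profile only by its url, so the service presumably recovers the LinkedIn username (the requestResource value in BigQuery) from the /in/ path segment. A minimal sketch, with a hypothetical helper name:

```python
from urllib.parse import urlparse

def username_from_url(url):
    """Extract the LinkedIn username from a profile URL,
    i.e. the segment after `/in/` in the path."""
    parts = urlparse(url).path.strip("/").split("/")
    return parts[1] if len(parts) > 1 and parts[0] == "in" else parts[-1]

print(username_from_url("https://linkedin.com/in/john-doe-123456"))
# john-doe-123456
```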

GET /health

Health check endpoint for both services.

Endpoints:

- GET /health (API service)
- GET /health (Webhook service)

Response

{
  "status": "healthy",
  "service": "priority-pipeline-api",
  "environment": "production"
}

Manual Execution

Trigger Coordinator Job

Execute the monthly coordinator job manually:

gcloud run jobs execute priority-pipeline-coordinator --region=us-central1
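To confirm the run started (and see its status), list the job's recent executions with the standard gcloud subcommand:

```shell
gcloud run jobs executions list \
  --job=priority-pipeline-coordinator \
  --region=us-central1
```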

Check Profile Scrape Status

Query BigQuery to check if a profile has been scraped:

SELECT 
  requestResource, 
  requestDate,
  requestStatus
FROM `cred-1556636033881.linkedin.LinkedinApiCall`
WHERE requestType = 'BRIGHTDATA_API_PERSON'
  AND requestResource = 'john-doe-123456'
ORDER BY requestDate DESC
LIMIT 1;

View Recent Scrapes

SELECT 
  requestResource,
  requestDate,
  requestStatus,
  COUNT(*) as scrape_count
FROM `cred-1556636033881.linkedin.LinkedinApiCall`
WHERE requestType = 'BRIGHTDATA_API_PERSON'
  AND requestDate >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY requestResource, requestDate, requestStatus
ORDER BY requestDate DESC
LIMIT 100;

BigQuery Tables

LinkedinApiCall

Stores all scraped profile data.

Table: linkedin.LinkedinApiCall

Key Columns:

- requestResource - LinkedIn username
- requestType - Always 'BRIGHTDATA_API_PERSON' for this pipeline
- requestDate - When the profile was scraped
- requestStatus - Success/failure status
- responseData - JSON with the scraped profile data

Source Tables

PersonFields (credentity.PersonFields)

- Source for priority profiles
- Filter: isPriority = TRUE

PersonIdentifier (credmodel_google.PersonIdentifier)

- Contains LinkedIn usernames
- Filter: identifierType = 'LINKEDIN'


GCS Backup

All scraped data is backed up to Google Cloud Storage for disaster recovery.

Bucket: brightdata-monthly-priority-people

File Format: {linkedin_username}_{timestamp}.json

Example:

brightdata-monthly-priority-people/
  john-doe-123456_20250119120000.json
  jane-smith-789012_20250119120030.json
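The object names above follow a YYYYMMDDHHMMSS timestamp. A sketch of how such a name might be built (the helper is illustrative, not the service's code):

```python
from datetime import datetime, timezone

def backup_object_name(linkedin_username, now=None):
    """Build a GCS object name of the form
    {linkedin_username}_{timestamp}.json with a UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    return f"{linkedin_username}_{now:%Y%m%d%H%M%S}.json"

name = backup_object_name(
    "john-doe-123456",
    datetime(2025, 1, 19, 12, 0, 0, tzinfo=timezone.utc))
print(name)  # john-doe-123456_20250119120000.json
```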


Pub/Sub Topic

Topic: linkedin-scraping-batches

Purpose: The coordinator job publishes batches here; the API service receives them via a push subscription to POST /batch-trigger/

Message Format:

{
  "profiles": [
    {"linkedin_username": "user1", "person_id": "id1"},
    {"linkedin_username": "user2", "person_id": "id2"}
  ]
}
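When publishing, the coordinator sends the raw JSON bytes plus string-valued attributes; Pub/Sub itself base64-encodes the payload for push delivery, producing the batch-trigger request body shown earlier. A minimal sketch of building such a message (the helper name is an assumption):

```python
import json

def make_batch_message(profiles, batch_number, total_batches):
    """Serialize a batch for Pub/Sub: raw JSON bytes as the message
    body, plus string-valued attributes for batch bookkeeping."""
    data = json.dumps({"profiles": profiles}).encode("utf-8")
    attributes = {"batch_number": str(batch_number),
                  "total_batches": str(total_batches)}
    return data, attributes

data, attrs = make_batch_message(
    [{"linkedin_username": "user1", "person_id": "id1"},
     {"linkedin_username": "user2", "person_id": "id2"}],
    batch_number=1, total_batches=50)
print(attrs["batch_number"])  # 1
```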