Data Pipelines

Data Pipelines sync external business systems — CRMs, support tools, project management, and more — into your organization's knowledge graph. Once connected, your AI agents can query live business data alongside everything else in the graph.

You can set up pipelines in two ways:

  • AI-assisted setup: Ask your AI agent to set up a pipeline using the built-in "Setup Data Pipeline" skill (recommended for most users)
  • Manual setup: Create and configure pipelines directly in the Control Room UI

This guide focuses on the AI-assisted approach, which walks you through each step interactively.

Prerequisites

Before setting up a data pipeline, make sure you have:

  • An active Airlock organization with a knowledge graph (memory server) provisioned
  • API credentials for the source system you want to connect (see Supported Data Sources for details)
  • An AI agent connected to your organization's MCP endpoint

How AI-Assisted Setup Works

Your organization comes with a built-in skill called Setup Data Pipeline. When activated, it guides your AI agent through a multi-step wizard to connect a data source, map fields to your knowledge graph, configure enrichments, and set a sync schedule.

The skill automatically uses two tool sets:

  • Knowledge graph tools — to inspect existing entities and schema in your graph
  • Pipeline tools — to manage sources, stream mappings, enrichments, and scheduling

Triggering the Skill

Start by asking your AI agent something like:

  • "Set up a data pipeline"
  • "Connect HubSpot to my knowledge graph"
  • "Sync data from my CRM"
  • "Import data from PhantomBuster"
  • "Feed data into the knowledge graph"
  • "Connect my CRM"
  • "Add a data source"

The agent recognizes these phrases and activates the Setup Data Pipeline skill automatically.

Setup Walkthrough

Beta: Pipeline execution is not yet available. You can complete Steps 1–7 to fully configure your pipeline now. Steps 8–9 (verification run and monitoring) will become functional when execution launches.

The wizard has nine steps. The agent adapts its recommendations based on your domain and data sources — you don't need to memorize every detail.

Step 1: Domain Detection

The agent asks what kind of data you want to sync. Based on your answer, it tailors its recommendations for entity types, naming conventions, and enrichments. Supported domains include:

| Domain | Typical entities | Example sources |
| --- | --- | --- |
| Go-To-Market | Person, Company, Deal | HubSpot, Salesforce |
| Engineering | Repository, Issue, PullRequest | GitHub, Jira |
| Support | Ticket, Customer, Article | Zendesk, Intercom |
| Finance | Invoice, Transaction, Account | Stripe, QuickBooks |

If your use case doesn't fit neatly into one domain, the agent adapts its recommendations rather than forcing a match.

Step 2: Schema Check

The agent inspects your existing knowledge graph to see what entity types are already present. It recommends mapping to existing types when possible to avoid fragmentation — for example, if your graph already has a Person type, new contacts from HubSpot will map to that same type.

Step 3: Source Connection

The agent walks you through connecting your data source:

  1. Shows available source types
  2. Guides you through credential setup (API keys, tokens)
  3. Connects to the source and discovers available data streams and fields

For webhook-based sources like PhantomBuster, the agent also helps you configure the webhook URL so extraction results are delivered automatically.

Step 4: Stream Mapping

For each data stream you want to sync, the agent helps you:

  • Choose which entity type to map the data to (e.g., HubSpot contacts → Person)
  • Set a name template that determines how entities are named (e.g., ${first_name} ${last_name})
  • Configure dedup keys to prevent duplicate entities across syncs (e.g., use email as the dedup key for people)
  • Map individual fields from the source to graph properties
  • Preview the mapping with sample data to verify it looks correct
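To make the name-template and dedup-key settings concrete, here is a minimal sketch of how they might behave. The `${first_name} ${last_name}` syntax matches Python's `string.Template`; the function names and the contact record are illustrative, not Airlock internals.

```python
from string import Template

def render_name(template: str, record: dict) -> str:
    """Render a name template like '${first_name} ${last_name}' against a source record."""
    return Template(template).safe_substitute(record)

def dedup_key(record: dict, key_field: str) -> str:
    """Normalize the dedup key so 'Jane@Acme.com' and 'jane@acme.com' merge into one entity."""
    return str(record.get(key_field, "")).strip().lower()

contact = {"first_name": "Jane", "last_name": "Doe", "email": "Jane@Acme.com"}
print(render_name("${first_name} ${last_name}", contact))  # Jane Doe
print(dedup_key(contact, "email"))                         # jane@acme.com
```

Normalizing the dedup key (trimming whitespace, lowercasing emails) is what keeps the same person from appearing twice across syncs.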

Step 5: Relationship Mapping

The agent suggests relationships between your mapped entities based on common patterns:

| Source entity | Target entity | Relationship |
| --- | --- | --- |
| Person | Company | WORKS_AT |
| Deal | Company | BELONGS_TO |
| Deal | Person | OWNED_BY |
| Ticket | Customer | SUBMITTED_BY |
| Issue | Repository | BELONGS_TO |

You can accept, modify, or skip any suggested relationship.
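Conceptually, these relationships come from foreign-key-style fields on the source records. A sketch, assuming hypothetical field names (`company_id`, `owner_email`) — real source schemas vary:

```python
def suggest_edges(deal: dict) -> list[tuple[str, str, str]]:
    """Derive relationship edges for a Deal record from its reference fields.
    Field names here are hypothetical examples, not a fixed Airlock schema."""
    edges = []
    if deal.get("company_id"):
        edges.append((f"Deal:{deal['id']}", "BELONGS_TO", f"Company:{deal['company_id']}"))
    if deal.get("owner_email"):
        edges.append((f"Deal:{deal['id']}", "OWNED_BY", f"Person:{deal['owner_email']}"))
    return edges

deal = {"id": "D-17", "company_id": "C-3", "owner_email": "jane@acme.com"}
for src, rel, dst in suggest_edges(deal):
    print(src, rel, dst)
```

Note how the `owner_email` edge only works if Person entities use email as their dedup key (Step 4) — relationship mapping and dedup keys reinforce each other.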

Step 6: Enrichment Setup

Enrichments are computed properties that the pipeline calculates after loading data. For example:

  • meeting_count: How many meetings a person has attended
  • deal_value_total: Total value of deals owned by a person
  • days_since_contact: Days since the last interaction

The agent proposes enrichments based on your domain and helps you define the queries. Enrichments run automatically after each sync.
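For intuition, the two enrichment flavors above reduce to an aggregation and a date difference. A sketch with made-up sample records, not the pipeline's actual query language:

```python
from collections import Counter
from datetime import date

# meeting_count: aggregate meetings per person (sample records are hypothetical)
meetings = [{"person": "jane"}, {"person": "jane"}, {"person": "bob"}]
meeting_count = Counter(m["person"] for m in meetings)

def days_since_contact(last_interaction: str, today: date) -> int:
    """days_since_contact: days elapsed since an ISO-formatted last interaction date."""
    return (today - date.fromisoformat(last_interaction)).days

print(meeting_count["jane"])                                 # 2
print(days_since_contact("2024-01-01", date(2024, 1, 31)))   # 30
```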

Step 7: Schedule Configuration

Set how often the pipeline syncs data. Common schedules:

| Schedule | Best for |
| --- | --- |
| Daily at 3:00 AM | Small to medium datasets |
| Every 6 hours | Larger datasets or time-sensitive data |
| Every hour | Near-real-time sync needs |
| Weekly (Sunday 3:00 AM) | Infrequently changing data |

The agent asks for your timezone preference and sets the schedule accordingly.
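The timezone question matters because "3:00 AM" is a local time, not UTC. A sketch of how a daily schedule resolves to a concrete next run in a given timezone — Airlock's internal schedule representation isn't documented here, so this is purely illustrative:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

def next_daily_run(now: datetime, hour: int, tz: str) -> datetime:
    """Next occurrence of `hour`:00 local time in timezone `tz` (a daily schedule)."""
    local = now.astimezone(ZoneInfo(tz))
    run = local.replace(hour=hour, minute=0, second=0, microsecond=0)
    if run <= local:
        run += timedelta(days=1)  # already past today's slot; schedule tomorrow
    return run

now = datetime(2024, 1, 1, 4, 0, tzinfo=ZoneInfo("UTC"))
print(next_daily_run(now, 3, "UTC"))  # 2024-01-02 03:00 UTC: today's 3 AM already passed
```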

Step 8: Verification Run

The agent triggers a test run to verify everything works:

  1. Starts the pipeline
  2. Monitors progress through extract, transform, load, and enrich stages
  3. Reports entity counts (created, updated, merged) and any errors

For webhook-based sources, the first run may take a few minutes while waiting for the external service to deliver results. This is expected behavior.

Step 9: Summary

The agent presents a complete summary of your configured pipeline: sources connected, streams mapped, enrichments configured, schedule set, and first run results.

Supported Data Sources

HubSpot

Syncs contacts, companies, deals, and owners from HubSpot CRM.

Credential setup:

  1. In HubSpot, go to Settings > Integrations > Private Apps
  2. Create a new app named "Airlock Pipeline"
  3. Grant read scopes: crm.objects.contacts.read, crm.objects.companies.read, crm.objects.deals.read, crm.objects.owners.read
  4. Copy the access token

Sync mode: Incremental (only syncs changes since last run) or full refresh.
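For context on what incremental sync does under the hood: HubSpot's CRM v3 search API can filter objects by their last-modified timestamp, so each run only fetches records changed since the previous one. A sketch of the request body (contacts expose this as the `lastmodifieddate` property; companies and deals use `hs_lastmodifieddate`) — how Airlock actually issues the request is an internal detail:

```python
def incremental_search_payload(since_epoch_ms: int, limit: int = 100) -> dict:
    """Body for POST /crm/v3/objects/contacts/search, filtering to records
    modified after the last successful run (epoch milliseconds)."""
    return {
        "filterGroups": [{
            "filters": [{
                "propertyName": "lastmodifieddate",
                "operator": "GT",
                "value": str(since_epoch_ms),
            }]
        }],
        "limit": limit,
    }

payload = incremental_search_payload(1704067200000)
print(payload["filterGroups"][0]["filters"][0]["value"])  # 1704067200000
```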

PhantomBuster

Syncs output from PhantomBuster automation agents (LinkedIn enrichment, lead generation, etc.).

Credential setup:

  1. In PhantomBuster, go to Settings > API Keys
  2. Copy the API key

Webhook setup (required): PhantomBuster uses webhooks to deliver results. After adding the source, the agent guides you through configuring a webhook in PhantomBuster that points to your Airlock API. Without this webhook, pipeline runs will time out.

Sample data: Because PhantomBuster output schemas vary by agent type, the agent asks you to provide sample JSON output from a previous run. This allows it to discover field names and types for mapping.
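Field discovery from a sample record boils down to reading its keys and value types. A minimal sketch, with an entirely hypothetical sample record (real PhantomBuster output depends on the agent type you run):

```python
import json

def discover_fields(sample_json: str) -> dict[str, str]:
    """Infer field names and value types from one sample record."""
    record = json.loads(sample_json)
    return {name: type(value).__name__ for name, value in record.items()}

sample = '{"fullName": "Jane Doe", "companyName": "Acme", "connectionCount": 412}'
print(discover_fields(sample))  # {'fullName': 'str', 'companyName': 'str', 'connectionCount': 'int'}
```

Pasting one representative record is usually enough; the agent uses it to propose the Step 4 field mappings.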

After Setup

Note: The features in this section require pipeline execution, which is not yet available during beta. These will work once execution launches.

Monitoring Runs

View pipeline run history from the pipeline detail page in the Control Room (Pipelines > your pipeline > Runs tab). Each run shows:

  • Status: Running, Success, Partial, or Failed
  • Stage breakdown: Timing and results for extract, transform, load, and enrich stages
  • Entity counts: How many entities were created, updated, merged, or failed
  • Error details: Specific errors for any failed stages

You can also ask your AI agent to check pipeline status: "How did my last pipeline run go?" or "Show me recent pipeline runs."

Viewing Results in the Knowledge Graph

After a successful run, synced data appears in your knowledge graph. You can:

  • Browse entities in the Memory page in the Control Room
  • Ask your AI agent to query the graph: "Show me all companies from HubSpot" or "Who are the contacts at Acme Corp?"
  • Use the query_analytics tool for aggregate queries: "How many contacts were synced this week?"

Triggering Manual Runs

You don't have to wait for the next scheduled sync. Trigger a manual run from the pipeline detail page by clicking Trigger Run, or ask your agent: "Run the HubSpot pipeline now."

Resuming Incomplete Setup

This feature works today — it does not require pipeline execution.

If you stop partway through the setup wizard, the agent can pick up where you left off. Just ask it to continue: "Continue setting up my data pipeline." It checks the current pipeline configuration and resumes from the first incomplete step.

Tips and Best Practices

  • Start small. Connect one data source and map one or two streams first. Verify the data looks correct before adding more.
  • Check the preview. The agent offers to preview mappings with sample data — always review this before finalizing. It's much easier to fix a mapping before the first sync than after.
  • Use meaningful dedup keys. Email addresses for people, domain names for companies, and external IDs for deals prevent duplicates and keep your graph clean.
  • Leverage enrichments. Computed properties like meeting_count or days_since_contact make your graph more useful for agents without requiring extra source data.
  • Match existing entity types. If your graph already has Person entities from another source, map new contacts to the same type rather than creating a new one.
  • Set schedules conservatively. Daily syncs work well for most use cases. Only increase frequency if your data changes often and your agents need near-real-time information.
  • Monitor early runs. (Available once execution launches.) Check the first few pipeline runs to catch any mapping issues or credential problems early.
