Building a Company Enrichment Agent with Nimble, Databricks, LangChain, and Claude Sonnet
A company enrichment agent is a critical tool for sales, marketing, and research teams. Whether you're building prospect lists, conducting market research, or analyzing competitors, you need accurate, up-to-date information about companies: their headquarters, funding history, team size, investors, and founders.
Manual research is time-consuming and doesn't scale. What if you could automate this process using AI agents that intelligently search the web, extract relevant information, and structure it in a database, all while running in a production-grade data platform?
In this post, I'll show you how to build a company enrichment agent using:
- Databricks: For data processing and Delta table storage
- LangChain: For agent orchestration
- Nimble Real-Time Search API: For live web data extraction (not stale indexed search)
- Claude Sonnet 4.5: For reasoning and structured output
The result is a powerful, automated system that can enrich hundreds of companies with minimal manual effort - combining real-time web intelligence with production-grade governance and observability.
The Problem: Manual Company Enrichment Tools Don't Scale
Sales, marketing, and research teams constantly need to enrich company data.
For example, you want to gather information like:
- Full headquarters address
- Total funding raised
- Employee count and growth trajectory
- List of investors and latest funding round
- Names of founders and key executives
For a list of even 100 companies, the manual process is painful:
- Google each company name
- Navigate between multiple sources (company website, Crunchbase, LinkedIn, news articles)
- Copy-paste data fragments from each site
- Normalize and format information consistently
- Store it in your database
This isn't a one-time task. Company data changes constantly - new funding rounds, leadership transitions, headcount growth. What took days to compile becomes stale in weeks, forcing you to repeat the entire process.
The Solution: An AI Company Enrichment Agent That Does the Research For You
Our approach uses an AI agent with a two-step strategy powered by Nimble's Real-Time Search API: a fast real-time search pass, followed by targeted extraction when needed.
Why Real-Time Search Matters
Traditional index-based search APIs might return stale, cached results that can be days or weeks old. For data enrichment, this is a critical problem:
- A company's funding status changes overnight
- Leadership transitions aren't reflected in cached indexes
- New companies don't appear in search results for days
Nimble takes a fundamentally different approach: real-time web browsing. Instead of querying a static index, Nimble spins up headless browsers that navigate live websites and extract fresh data in real time. This is an important distinction if you're comparing approaches commonly used in RAG systems.
Databricks Platform Benefits: Governed Intelligence and Observability
While Nimble provides the real-time external context from the web, Databricks provides the governed intelligence and observability:
Model Flexibility: By using Databricks Model Serving (via `ChatDatabricks`), you're not locked into any specific provider. Use models from Anthropic, OpenAI, Google, Meta, and more with a unified interface, billing, guardrails, and enterprise security. Swap models without rewriting code.
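For instance, switching providers is just a different endpoint name. A minimal sketch (the endpoint names below are illustrative and depend on which serving endpoints your workspace exposes):

```python
from databricks_langchain import ChatDatabricks

# Same client interface regardless of the underlying model provider;
# only the serving endpoint name changes (names here are illustrative).
claude = ChatDatabricks(endpoint="databricks-claude-sonnet-4-5")
llama = ChatDatabricks(endpoint="databricks-meta-llama-3-3-70b-instruct")
```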
MLflow Observability: Track and debug agent execution with MLflow—see which tools were called, compare different models or retrieval approaches, and quickly diagnose issues as you iterate on your enrichment pipeline.
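Enabling tracing is typically a one-liner. A sketch (exact tracing coverage varies with your MLflow version):

```python
import mlflow

# Automatically log LangChain runs as MLflow traces, so tool calls,
# prompts, and model responses show up in the MLflow UI for debugging.
mlflow.langchain.autolog()
```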
Unity Catalog Governance: Enriched data is automatically governed under Unity Catalog with full lineage tracking, ready for downstream use in dashboards, AI/BI Genie queries, or other data workflows.
This combination makes the difference between a prototype and a production system that scales reliably across your organization.
Step 1: Fast Real-Time Data Enrichment Search
The agent searches with `deep_search=false`. Nimble's browsers navigate live sites (Crunchbase, LinkedIn, company websites) and return structured JSON data, not HTML to parse. Your agent receives `{"address": "123 Main St", "funding": "$50M"}` with data accurate as of *right now*.
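Outside the agent loop, a direct tool call looks roughly like this. This is a sketch that assumes langchain-nimble accepts these argument names; check its documentation for the exact schema:

```python
from langchain_nimble import NimbleSearchTool

search_tool = NimbleSearchTool()
# Assumed argument names; fast search returns snippets and URLs
# without deep crawling, mirroring the deep_search=false behavior above.
results = search_tool.invoke({
    "query": "Nimble company headquarters total funding",
    "deep_search": False,  # assumed flag name
})
```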
Step 2: Targeted Extraction (if needed)
For deeper data needs, Nimble's site-specialized data extraction kicks in. A Crunchbase-trained agent understands exactly where funding data lives. A LinkedIn agent knows company page structure. This site-awareness delivers enterprise-grade accuracy that generic crawlers can't match.
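A direct extraction call might look like this; again a sketch with assumed argument names, and the URL is illustrative (in practice you'd use a URL returned by the Step 1 search):

```python
from langchain_nimble import NimbleExtractTool

extract_tool = NimbleExtractTool()
# Assumed argument name; points Nimble's site-specialized extraction
# at a specific page surfaced by the fast search.
page_data = extract_tool.invoke({"url": "https://www.linkedin.com/company/nimble"})
```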
This strategy balances speed (fast search) with completeness (targeted extraction), while real-time browsing ensures data freshness that indexed search vendors simply cannot provide.
Architecture Overview
The pipeline reads pending companies from a Delta table, runs each through a LangChain agent that combines Claude Sonnet 4.5 (served via Databricks Model Serving) with Nimble's real-time search and extraction tools, and writes the structured results back to the governed Delta table.
Implementation
Step 1: Set Up Delta Table
First, we create a Delta table to store our companies and their enrichment data:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.sql.types import StringType
from delta.tables import DeltaTable

TABLE_NAME = "users.ilanc.company_enrichment_demo"

spark = SparkSession.builder.getOrCreate()

# Sample companies to enrich
companies_data = [
    ("Anthropic", "anthropic.com"),
    ("OpenAI", "openai.com"),
    ("Databricks", "databricks.com"),
    ("Nimble", "nimbleway.com")
]

# Create DataFrame with enrichment columns
df = spark.createDataFrame(companies_data, ["company_name", "website"])
df = df.withColumn("address", lit(None).cast(StringType())) \
       .withColumn("funding", lit(None).cast(StringType())) \
       .withColumn("employees", lit(None).cast(StringType())) \
       .withColumn("investors", lit(None).cast(StringType())) \
       .withColumn("founders", lit(None).cast(StringType())) \
       .withColumn("enrichment_status", lit("pending"))

# Save as Delta table
df.write.format("delta").mode("overwrite").saveAsTable(TABLE_NAME)
```
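A quick check confirms the seed rows landed with `enrichment_status` set to pending:

```python
# Sanity-check the seeded table before wiring up the agent
spark.table(TABLE_NAME).show(truncate=False)
```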
Step 2: Configure the LangChain Agent
Next, we set up a LangChain agent with Claude Sonnet 4.5 and Nimble's search and extraction tools (you'll need a Nimble API key):
```python
import os
import getpass
from typing import List

from pydantic import BaseModel, Field
from databricks_langchain import ChatDatabricks
from langchain.agents import create_agent
from langchain_nimble import NimbleExtractTool, NimbleSearchTool
from langchain.agents.middleware import SummarizationMiddleware

# Set up API key
if not os.environ.get("NIMBLE_API_KEY"):
    os.environ["NIMBLE_API_KEY"] = getpass.getpass("NIMBLE_API_KEY:\n")

# Here you switch between models from Anthropic, OpenAI, Google, Meta...
llm_model = ChatDatabricks(endpoint="databricks-claude-sonnet-4-5")

# Define the agent's strategy
prompt_template = """
You are a company enrichment agent. Use this two-step approach:

**Step 1: Fast Search**
Use search_tool with deep_search=false to get quick snippets and URLs.
Extract as much information as possible from the snippets.

**Step 2: Targeted Extraction (if needed)**
If information is missing, use extract_tool on 1-2 relevant URLs from the search results.
Focus on official company websites, LinkedIn, or Crunchbase.

**Required Information:**
- address: Full headquarters address
- funding: Total funding (e.g., "$100M Series B")
- employees: Count or range (e.g., "500-1000")
- investors: List of investor names
- founders: List of founder names

Return "Not found" for missing strings, empty list [] for missing arrays.
"""

# Define structured output schema
class CompanyInfo(BaseModel):
    """Company enrichment information"""
    address: str = Field(description="Company headquarters address")
    funding: str = Field(description="Total funding raised")
    employees: str = Field(description="Employee count or range")
    investors: List[str] = Field(description="List of investors")
    founders: List[str] = Field(description="List of founders")

# Create agent
agent = create_agent(
    model=llm_model,
    tools=[NimbleSearchTool(), NimbleExtractTool()],
    system_prompt=prompt_template,
    response_format=CompanyInfo
)
```
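Before batching, a single smoke test confirms the structured schema comes back as expected (Databricks notebook cells support top-level await):

```python
# One-off smoke test of the agent's structured output
result = await agent.ainvoke(
    {"messages": [{"role": "user", "content": "Find address, funding, employees, investors, and founders for Nimble (website: nimbleway.com)"}]}
)
print(result["structured_response"])  # a CompanyInfo instance
```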
Step 3: Define the Data Enrichment Function
The enrichment function calls the agent and returns structured data:
```python
import json

async def enrich_company(company_name: str, website: str) -> dict:
    """Use agent to enrich company data with structured output"""
    try:
        query = f"Find address, funding, employees, investors, and founders for {company_name} (website: {website})"

        # Stream agent execution
        async for step in agent.astream(
            {"messages": [{"role": "user", "content": query}]},
            stream_mode="values",
        ):
            pass  # Process streaming steps silently

        # Extract structured response
        structured = step["structured_response"]
        result = structured.model_dump()

        # Convert lists to JSON strings for Delta table storage
        result = {
            "address": result.get("address", "Not found"),
            "funding": result.get("funding", "Not found"),
            "employees": result.get("employees", "Not found"),
            "investors": json.dumps(result.get("investors", [])),
            "founders": json.dumps(result.get("founders", []))
        }
        return result

    except Exception as e:
        print(f"❌ Error enriching {company_name}: {str(e)}")
        return {
            "address": "Error",
            "funding": "Error",
            "employees": "Error",
            "investors": "[]",
            "founders": "[]"
        }
```
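Trying it on one company first keeps debugging simple:

```python
# Single-company test; list fields come back JSON-encoded for Delta storage
data = await enrich_company("Anthropic", "anthropic.com")
print(json.dumps(data, indent=2))
```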
Step 4: Run the Enrichment Pipeline
Finally, we process all pending companies and update the Delta table:
```python
from pyspark.sql.functions import col, lit

delta_table = DeltaTable.forName(spark, TABLE_NAME)
pending = spark.table(TABLE_NAME).filter(col("enrichment_status") == "pending").collect()

print(f"🚀 Enriching {len(pending)} companies...\n")

success_count = 0
error_count = 0

for idx, row in enumerate(pending, 1):
    company = row.company_name
    website = row.website
    print(f"[{idx}/{len(pending)}] Processing {company}...")

    try:
        # Enrich with agent
        data = await enrich_company(company, website)

        # Update Delta table
        delta_table.update(
            condition=col("company_name") == company,
            set={
                "address": lit(data["address"]),
                "funding": lit(data["funding"]),
                "employees": lit(data["employees"]),
                "investors": lit(data["investors"]),
                "founders": lit(data["founders"]),
                "enrichment_status": lit("completed" if data["address"] != "Error" else "failed")
            }
        )

        if data["address"] != "Error":
            print(f"  ✅ {data['address']}\n")
            success_count += 1
        else:
            error_count += 1

    except Exception as e:
        print(f"  ❌ Unexpected error: {str(e)}\n")
        error_count += 1

print(f"🎉 Complete! Success: {success_count}, Failed: {error_count}")
```
Next Steps: Scaling to Production
The basic implementation works well for small datasets, but what if you need to enrich 10,000 companies?
Here are two critical improvements for production-scale workloads.
1. Parallel Processing with Spark
The Problem: Our current implementation processes companies sequentially, one at a time. For 1,000 companies, this could take hours.
The Solution: Use Spark's `pandas_udf` to distribute enrichment across your cluster. Wrap your agent logic in a UDF that processes rows in parallel across multiple nodes.
Key optimization: Initialize the agent once per partition (not per row) to avoid overhead.
Results: 10-100x speedup depending on cluster size. Databricks autoscaling clusters dynamically adjust resources based on workload.
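Here's a minimal sketch of that pattern using an iterator-style pandas UDF, which lets expensive setup run once per partition. `build_agent` and `enrich_sync` are hypothetical helpers: one recreates the Step 2 agent, the other wraps `enrich_company` synchronously (e.g., via `asyncio.run`):

```python
from typing import Iterator, Tuple
import json
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def enrich_udf(batches: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
    # Runs once per partition: build the agent here, not per row
    agent = build_agent()  # hypothetical helper recreating the Step 2 agent
    for names, websites in batches:
        yield pd.Series([
            json.dumps(enrich_sync(agent, n, w))  # hypothetical sync wrapper
            for n, w in zip(names, websites)
        ])

# Distribute enrichment across the cluster as a single JSON column
df = spark.table(TABLE_NAME)
enriched = df.withColumn("enrichment_json", enrich_udf("company_name", "website"))
```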
2. Multi-Agent Architecture for Complex Enrichment
For complex enrichment needs, a single agent handling all tasks faces context engineering challenges. When one agent must extract addresses, funding, investors, founders, and executives, its prompt becomes bloated with instructions for every domain, leading to lower accuracy and higher hallucination rates.
The Solution: Specialized Agents with Focused Context
Split enrichment into domain-expert agents, each with laser-focused prompts and specialized knowledge:
- Company Info Agent: Focuses on headquarters, founding year, business model. Knows to prioritize official websites, Wikipedia, LinkedIn company pages.
- Funding Agent: Extracts Series rounds, valuations, investors. Specialized in Crunchbase, PitchBook, SEC filings, and press releases.
- People Agent: Finds founders and executives. Expert in LinkedIn profiles, company About pages, and executive bios.
Why Source Specialization Matters
Each agent knows the authoritative sources for its domain. The funding agent doesn't waste tokens searching LinkedIn profiles; the people agent skips Crunchbase tables. This source awareness prevents agents from searching irrelevant sources and improves accuracy by focusing on where reliable data actually lives.
A supervisor agent coordinates the specialized agents, ensuring consistency (handling alternative company names), detecting conflicts between agents, and assigning confidence scores based on cross-agent validation.
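A minimal sketch of this layout, reusing `llm_model` and the Nimble tools from Step 2 (the prompts are abbreviated, and the final-message handling is an assumption about `create_agent`'s output state):

```python
import asyncio

def make_specialist(prompt: str):
    # Each specialist gets the same tools but a narrow, domain-focused prompt
    return create_agent(model=llm_model,
                        tools=[NimbleSearchTool(), NimbleExtractTool()],
                        system_prompt=prompt)

company_agent = make_specialist("Extract headquarters, founding year, and business model. Prefer official sites, Wikipedia, and LinkedIn company pages.")
funding_agent = make_specialist("Extract funding rounds, valuations, and investors. Prefer Crunchbase, PitchBook, SEC filings, and press releases.")
people_agent = make_specialist("Find founders and executives. Prefer LinkedIn profiles, About pages, and executive bios.")

async def supervise(company: str) -> dict:
    query = {"messages": [{"role": "user", "content": f"Research {company}"}]}
    # Run specialists concurrently; a fuller supervisor would also reconcile
    # conflicts and attach cross-agent confidence scores.
    infos = await asyncio.gather(company_agent.ainvoke(query),
                                 funding_agent.ainvoke(query),
                                 people_agent.ainvoke(query))
    # Assumes the final message holds each agent's findings
    return {"sections": [r["messages"][-1].content for r in infos]}
```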
Beyond architecture, a few additional measures harden the pipeline for production:
- Smart Caching: Store search results in Delta tables to avoid redundant API calls for the same queries.
- Confidence Scoring: Route low-confidence results (below 0.8) to a human review queue for validation.
- Error Handling: Implement retry logic with exponential backoff for transient failures (a sketch follows this list), and use async/await patterns for I/O-bound operations to maximize throughput, alongside broader production concerns like orchestration and pipeline monitoring.
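A retry wrapper for the async enrichment call can be a small, dependency-free sketch:

```python
import asyncio
import random

async def with_retries(make_call, max_attempts: int = 4, base_delay: float = 1.0):
    """Retry an async call with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except Exception:
            if attempt == max_attempts:
                raise
            # 1s, 2s, 4s, ... plus up to 1s of jitter
            await asyncio.sleep(base_delay * 2 ** (attempt - 1) + random.random())

# Usage: data = await with_retries(lambda: enrich_company("Nimble", "nimbleway.com"))
```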
Build a Scalable Company Enrichment Agent with Nimble
Building a company enrichment agent is surprisingly straightforward with the right tools. By combining Databricks' data platform, LangChain's agent framework, Nimble's Real-Time Search API, and Claude Sonnet's reasoning, you can automate hours of manual research.
The two-step search strategy (fast search → targeted extraction) is particularly powerful because Nimble handles the hardest part of web data collection. With real-time web browsing, proprietary JavaScript rendering, site-specialized agents, and structured JSON output, Nimble delivers parse-ready data that your AI agents can immediately reason over. No HTML parsing, no browser management, no site-specific scrapers to maintain.
Structured output with Pydantic ensures data quality, while Delta tables provide production-grade storage with ACID guarantees. The result: a scalable, reliable enrichment pipeline that processes thousands of companies without manual intervention.
You can adapt this pattern for other enrichment tasks - contact enrichment, product research, competitive analysis, or market intelligence gathering - anywhere you need AI agents to autonomously gather web data at scale.
Ready to try it yourself? Check out the full notebook on GitHub and start enriching your own company data with Nimble's Search API.
Resources
Multi-Agent Research & Context Engineering:
- How we built our multi-agent research system - Anthropic Engineering
- Effective context engineering for AI agents - Anthropic Engineering
- AI Agent Systems: Modular Engineering for Reliable Enterprise AI Applications - Databricks Blog