JavaScript LinkedIn Ads Scraper: Complete Node.js Tutorial 2025

Build a production-ready LinkedIn ads scraper with Node.js, Puppeteer, and advanced anti-detection techniques

January 30, 2025 · 45 min read · Advanced Level

Educational Purpose: This tutorial is for educational and research purposes. Always respect LinkedIn's Terms of Service and implement appropriate rate limiting and ethical practices.

What You'll Build

In this comprehensive tutorial, you'll learn to build a professional-grade JavaScript LinkedIn scraper using Node.js and Puppeteer. This isn't just another basic scraping tutorial: we'll cover everything from initial setup to production deployment, including advanced anti-detection techniques and proxy integration.

What You'll Learn:

  • ✅ Complete Node.js LinkedIn scraper implementation
  • ✅ Puppeteer and Playwright browser automation
  • ✅ Advanced anti-detection and stealth techniques
  • ✅ Proxy rotation with BrightData, Oxylabs, and SmartProxy
  • ✅ Production deployment and monitoring
  • ✅ Database integration and data persistence
  • ✅ Error handling and retry mechanisms
  • ✅ Rate limiting and ethical scraping practices

Prerequisites and Environment Setup

Before we dive into building our JavaScript LinkedIn scraper, let's ensure you have the right environment. You'll need Node.js 18+ and basic familiarity with JavaScript async/await patterns.

Initial Project Setup

# Create new Node.js project
mkdir linkedin-scraper-js
cd linkedin-scraper-js
npm init -y

# Install core dependencies
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
npm install playwright axios cheerio mongodb mongoose
npm install dotenv winston rate-limiter-flexible
npm install express https-proxy-agent

# Install development dependencies  
npm install --save-dev @types/node @types/express typescript ts-node nodemon
npm install --save-dev jest ts-jest @types/jest

# Initialize TypeScript
npx tsc --init
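
The Dockerfile and the tests later in this tutorial rely on build, start, and test scripts that npm init does not create. The exact commands below are suggestions, not requirements; they assume tsconfig.json's outDir is set to dist. Add them to the "scripts" section of package.json:

{
  "scripts": {
    "build": "tsc",
    "start": "node dist/index.js",
    "dev": "nodemon --exec ts-node src/index.ts",
    "test": "jest"
  }
}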

Create your project structure:

linkedin-scraper-js/
├── src/
│   ├── scrapers/
│   │   ├── LinkedInScraper.ts
│   │   └── PlaywrightScraper.ts
│   ├── proxies/
│   │   ├── ProxyManager.ts
│   │   └── providers/
│   ├── database/
│   │   ├── models/
│   │   └── connection.ts
│   ├── utils/
│   │   ├── stealth.ts
│   │   └── logger.ts
│   └── index.ts
├── config/
│   └── default.json
├── tests/
└── logs/
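
The logger utility imported by nearly every module below isn't shown elsewhere in this tutorial, so here is a minimal sketch built on winston (installed above). The log level, format, and transports are assumptions; adjust them to your environment.

// src/utils/logger.ts (minimal sketch)
import winston from 'winston';

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [
    // Console output for development, file output for the logs/ directory above
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'logs/scraper.log' }),
  ],
});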

Building the Core LinkedIn Scraper

Let's start with our main Puppeteer LinkedIn scraper class. This will handle browser management, navigation, and data extraction with advanced anti-detection measures.

Advanced Puppeteer LinkedIn Scraper

// src/scrapers/LinkedInScraper.ts
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import { Browser, Page } from 'puppeteer';
import { ProxyManager } from '../proxies/ProxyManager';
import { logger } from '../utils/logger';
import { RateLimiterMemory } from 'rate-limiter-flexible';

// Register the stealth plugin once so every launched browser gets its evasions
puppeteer.use(StealthPlugin());

interface LinkedInAd {
  id: string;
  company: string;
  title: string;
  description: string;
  imageUrl?: string;
  landingUrl?: string;
  timestamp: Date;
  location?: string;
  demographics?: string[];
}

interface ScrapingOptions {
  headless?: boolean;
  proxy?: string;
  maxResults?: number;
  delay?: number;
  retries?: number;
}

export class LinkedInScraper {
  private browser: Browser | null = null;
  private page: Page | null = null;
  private proxyManager: ProxyManager;
  private rateLimiter: RateLimiterMemory;
  
  constructor() {
    this.proxyManager = new ProxyManager();
    
    // Rate limiting: 2 scrape runs per minute
    this.rateLimiter = new RateLimiterMemory({
      keyPrefix: 'linkedin-scrape',
      points: 2,
      duration: 60,
    });
  }

  async initialize(options: ScrapingOptions = {}) {
    try {
      const proxy = options.proxy || await this.proxyManager.getRotatingProxy();
      
      // Advanced browser configuration with stealth
      const launchOptions = {
        headless: options.headless ?? true,
        args: [
          '--no-sandbox',
          '--disable-setuid-sandbox', 
          '--disable-dev-shm-usage',
          '--disable-accelerated-2d-canvas',
          '--no-first-run',
          '--no-zygote',
          '--disable-gpu',
          '--disable-features=TranslateUI',
          '--disable-ipc-flooding-protection',
          `--proxy-server=${proxy}`,
          '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
        ],
        ignoreDefaultArgs: ['--enable-automation'],
        defaultViewport: null,
      };

      this.browser = await puppeteer.launch(launchOptions);
      this.page = await this.browser.newPage();
      
      // Advanced stealth configurations
      await this.configureStealth();
      
      logger.info('LinkedIn scraper initialized successfully', { proxy });
      
    } catch (error) {
      logger.error('Failed to initialize scraper', error);
      throw error;
    }
  }

  private async configureStealth() {
    if (!this.page) throw new Error('Page not initialized');

    // Override webdriver detection
    await this.page.evaluateOnNewDocument(() => {
      Object.defineProperty(navigator, 'webdriver', {
        get: () => undefined,
      });
      
      // Remove automation indicators
      delete (window as any).cdc_adoQpoasnfa76pfcZLmcfl_Array;
      delete (window as any).cdc_adoQpoasnfa76pfcZLmcfl_Promise;
      delete (window as any).cdc_adoQpoasnfa76pfcZLmcfl_Symbol;
    });

    // Set realistic headers
    await this.page.setExtraHTTPHeaders({
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
      'Accept-Language': 'en-US,en;q=0.5',
      'Accept-Encoding': 'gzip, deflate, br',
      'DNT': '1',
      'Connection': 'keep-alive',
      'Upgrade-Insecure-Requests': '1',
      'Sec-Fetch-Dest': 'document',
      'Sec-Fetch-Mode': 'navigate',
      'Sec-Fetch-Site': 'none',
      'Cache-Control': 'max-age=0',
    });

    // Randomize viewport
    const viewports = [
      { width: 1366, height: 768 },
      { width: 1920, height: 1080 },
      { width: 1440, height: 900 },
      { width: 1536, height: 864 },
    ];
    
    const randomViewport = viewports[Math.floor(Math.random() * viewports.length)];
    await this.page.setViewport(randomViewport);
  }

  async scrapeLinkedInAds(searchUrl: string, options: ScrapingOptions = {}): Promise<LinkedInAd[]> {
    if (!this.page) {
      await this.initialize(options);
    }

    const ads: LinkedInAd[] = [];
    const maxResults = options.maxResults || 100;
    const delay = options.delay || 2000;
    let retries = options.retries || 3;

    while (retries > 0) {
      try {
        await this.rateLimiter.consume('scrape');
        
        logger.info('Navigating to LinkedIn Ad Library', { url: searchUrl });
        
        // Navigate with timeout and wait for network idle
        await this.page!.goto(searchUrl, { 
          waitUntil: 'networkidle2',
          timeout: 30000 
        });

        // Wait for ads to load
        await this.page!.waitForSelector('.ad-item, [data-testid="ad-item"]', { 
          timeout: 15000 
        });

        // Scroll to load more ads
        await this.infiniteScroll(maxResults);

        // Extract ad data
        const extractedAds = await this.extractAdsData();
        ads.push(...extractedAds);

        logger.info(`Scraped ${extractedAds.length} ads successfully`);
        break;

      } catch (error) {
        retries--;
        logger.warn(`Scraping attempt failed, ${retries} retries left`, { error: error.message });
        
        if (retries === 0) {
          throw new Error(`Failed to scrape after multiple attempts: ${error.message}`);
        }
        
        await this.randomDelay(5000, 10000);
      }
    }

    return ads.slice(0, maxResults);
  }

  private async infiniteScroll(targetCount: number) {
    let previousCount = 0;
    let stableCount = 0;
    
    while (stableCount < 3) {
      // Scroll to bottom
      await this.page!.evaluate(() => {
        window.scrollTo(0, document.body.scrollHeight);
      });

      await this.randomDelay(2000, 4000);

      // Check current ad count
      const currentCount = await this.page!.$$eval(
        '.ad-item, [data-testid="ad-item"]', 
        els => els.length
      );

      logger.info(`Current ad count: ${currentCount}`);

      if (currentCount >= targetCount) {
        break;
      }

      if (currentCount === previousCount) {
        stableCount++;
      } else {
        stableCount = 0;
      }

      previousCount = currentCount;
    }
  }

  private async extractAdsData(): Promise<LinkedInAd[]> {
    return await this.page!.evaluate(() => {
      const adElements = document.querySelectorAll('.ad-item, [data-testid="ad-item"]');
      const ads: LinkedInAd[] = [];

      adElements.forEach((element, index) => {
        try {
          const titleElement = element.querySelector('.ad-title, [data-testid="ad-title"]');
          const companyElement = element.querySelector('.company-name, [data-testid="company-name"]');
          const descriptionElement = element.querySelector('.ad-description, [data-testid="ad-description"]');
          const imageElement = element.querySelector('img');
          const linkElement = element.querySelector<HTMLAnchorElement>('a[href*="linkedin.com"]');

          const ad: LinkedInAd = {
            id: `ad-${Date.now()}-${index}`,
            title: titleElement?.textContent?.trim() || '',
            company: companyElement?.textContent?.trim() || '',
            description: descriptionElement?.textContent?.trim() || '',
            imageUrl: imageElement?.src || undefined,
            landingUrl: linkElement?.href || undefined,
            timestamp: new Date(),
          };

          if (ad.title && ad.company) {
            ads.push(ad);
          }
        } catch (error) {
          console.warn('Error extracting ad data:', error);
        }
      });

      return ads;
    });
  }

  private async randomDelay(min: number, max: number) {
    const delay = Math.floor(Math.random() * (max - min + 1)) + min;
    await new Promise(resolve => setTimeout(resolve, delay));
  }

  async close() {
    if (this.browser) {
      await this.browser.close();
      this.browser = null;
      this.page = null;
      logger.info('LinkedIn scraper closed');
    }
  }
}
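
A quick way to exercise the class standalone is a small script like the one below. The file location and the Ad Library URL are placeholders; only the class API shown above is assumed.

// src/run-once.ts (hypothetical location) - quick standalone run
import { LinkedInScraper } from './scrapers/LinkedInScraper';

async function main() {
  const scraper = new LinkedInScraper();
  try {
    await scraper.initialize({ headless: true });
    const ads = await scraper.scrapeLinkedInAds(
      'https://www.linkedin.com/ad-library/search?companyIds=1337', // placeholder URL
      { maxResults: 20 }
    );
    console.log(`Scraped ${ads.length} ads`);
  } finally {
    await scraper.close();
  }
}

main().catch(console.error);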

Proxy Integration for Production

Professional LinkedIn scraping requires reliable proxy rotation. Let's implement a robust proxy manager that works with major providers.

Advanced Proxy Manager

// src/proxies/ProxyManager.ts
import axios from 'axios';
import { HttpsProxyAgent } from 'https-proxy-agent';
import { logger } from '../utils/logger';

interface ProxyConfig {
  host: string;
  port: number;
  username?: string;
  password?: string;
  country?: string;
  sticky?: boolean;
}

interface ProxyProvider {
  name: string;
  getProxy(): Promise<ProxyConfig>;
  healthCheck(proxy: ProxyConfig): Promise<boolean>;
}

class BrightDataProvider implements ProxyProvider {
  name = 'BrightData';
  private endpoint: string;
  private username: string;
  private password: string;

  constructor() {
    this.endpoint = process.env.BRIGHTDATA_ENDPOINT || '';
    this.username = process.env.BRIGHTDATA_USERNAME || '';
    this.password = process.env.BRIGHTDATA_PASSWORD || '';
  }

  async getProxy(): Promise<ProxyConfig> {
    // BrightData sticky session format
    const sessionId = Math.random().toString(36).substring(7);
    
    return {
      host: 'zproxy.lum-superproxy.io',
      port: 22225,
      username: `${this.username}-session-${sessionId}`,
      password: this.password,
      country: 'US',
      sticky: true,
    };
  }

  async healthCheck(proxy: ProxyConfig): Promise<boolean> {
    try {
      const proxyUrl = `http://${proxy.username}:${proxy.password}@${proxy.host}:${proxy.port}`;
      
      const response = await axios.get('https://httpbin.org/ip', {
        proxy: false,
        httpsAgent: new HttpsProxyAgent(proxyUrl),
        timeout: 10000,
      });

      return response.status === 200;
    } catch {
      return false;
    }
  }
}

class OxylabsProvider implements ProxyProvider {
  name = 'Oxylabs';
  private username: string;
  private password: string;

  constructor() {
    this.username = process.env.OXYLABS_USERNAME || '';
    this.password = process.env.OXYLABS_PASSWORD || '';
  }

  async getProxy(): Promise<ProxyConfig> {
    return {
      host: 'pr.oxylabs.io',
      port: 7777,
      username: this.username,
      password: this.password,
      country: 'US',
      sticky: false,
    };
  }

  async healthCheck(proxy: ProxyConfig): Promise<boolean> {
    try {
      const proxyUrl = `http://${proxy.username}:${proxy.password}@${proxy.host}:${proxy.port}`;
      
      const response = await axios.get('https://ipinfo.io/json', {
        proxy: false,
        httpsAgent: new HttpsProxyAgent(proxyUrl),
        timeout: 10000,
      });

      return response.status === 200 && response.data.ip;
    } catch {
      return false;
    }
  }
}

class SmartProxyProvider implements ProxyProvider {
  name = 'SmartProxy';
  private username: string;
  private password: string;

  constructor() {
    this.username = process.env.SMARTPROXY_USERNAME || '';
    this.password = process.env.SMARTPROXY_PASSWORD || '';
  }

  async getProxy(): Promise<ProxyConfig> {
    const endpoints = [
      'gate.smartproxy.com:7000',
      'gate.smartproxy.com:7001', 
      'gate.smartproxy.com:7002',
    ];
    
    const randomEndpoint = endpoints[Math.floor(Math.random() * endpoints.length)];
    const [host, port] = randomEndpoint.split(':');

    return {
      host,
      port: parseInt(port),
      username: this.username,
      password: this.password,
      country: 'US',
      sticky: false,
    };
  }

  async healthCheck(proxy: ProxyConfig): Promise<boolean> {
    try {
      const proxyUrl = `http://${proxy.username}:${proxy.password}@${proxy.host}:${proxy.port}`;
      
      const response = await axios.get('https://ipinfo.io/json', {
        proxy: false,
        httpsAgent: new HttpsProxyAgent(proxyUrl),
        timeout: 8000,
      });

      return response.status === 200;
    } catch {
      return false;
    }
  }
}

export class ProxyManager {
  private providers: ProxyProvider[];
  private currentProviderIndex: number = 0;
  private proxyPool: ProxyConfig[] = [];
  private failedProxies: Set<string> = new Set();

  constructor() {
    this.providers = [
      new BrightDataProvider(),
      new OxylabsProvider(), 
      new SmartProxyProvider(),
    ];
    
    this.initializeProxyPool();
  }

  private async initializeProxyPool() {
    logger.info('Initializing proxy pool...');
    
    for (const provider of this.providers) {
      try {
        const proxy = await provider.getProxy();
        const isHealthy = await provider.healthCheck(proxy);
        
        if (isHealthy) {
          this.proxyPool.push(proxy);
          logger.info(`Added ${provider.name} proxy to pool`, { 
            host: proxy.host, 
            port: proxy.port 
          });
        }
      } catch (error) {
        logger.warn(`Failed to initialize ${provider.name} proxy`, { error: error.message });
      }
    }

    logger.info(`Proxy pool initialized with ${this.proxyPool.length} proxies`);
  }

  async getRotatingProxy(): Promise<string> {
    if (this.proxyPool.length === 0) {
      await this.initializeProxyPool();
    }

    // Find healthy proxy
    for (let i = 0; i < this.proxyPool.length; i++) {
      const proxy = this.proxyPool[this.currentProviderIndex % this.proxyPool.length];
      this.currentProviderIndex++;

      const proxyId = `${proxy.host}:${proxy.port}`;
      
      if (!this.failedProxies.has(proxyId)) {
        if (proxy.username && proxy.password) {
          return `http://${proxy.username}:${proxy.password}@${proxy.host}:${proxy.port}`;
        } else {
          return `http://${proxy.host}:${proxy.port}`;
        }
      }
    }

    // If all proxies failed, reset and try again
    this.failedProxies.clear();
    logger.warn('All proxies marked as failed, resetting failure list');
    
    return await this.getRotatingProxy();
  }

  markProxyAsFailed(proxyUrl: string) {
    const proxyId = proxyUrl.split('@')[1] || proxyUrl.replace('http://', '');
    this.failedProxies.add(proxyId);
    logger.warn('Marked proxy as failed', { proxyId });
  }

  async refreshProxyPool() {
    this.proxyPool = [];
    this.failedProxies.clear();
    await this.initializeProxyPool();
  }
}
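
The providers above read their credentials from environment variables. dotenv was installed during setup but still has to be loaded before any provider is constructed; adding import 'dotenv/config'; as the first line of src/index.ts is enough. A sample .env follows (variable names match the providers above; every value is a placeholder):

# .env - never commit this file
BRIGHTDATA_ENDPOINT=your-brightdata-endpoint
BRIGHTDATA_USERNAME=your-brightdata-username
BRIGHTDATA_PASSWORD=your-brightdata-password
OXYLABS_USERNAME=your-oxylabs-username
OXYLABS_PASSWORD=your-oxylabs-password
SMARTPROXY_USERNAME=your-smartproxy-username
SMARTPROXY_PASSWORD=your-smartproxy-password
MONGODB_URI=mongodb://localhost:27017/linkedin-scraper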

Playwright Alternative Implementation

While Puppeteer is excellent, Playwright offers additional benefits for LinkedIn scraping, including better stealth capabilities and multi-browser support.

// src/scrapers/PlaywrightScraper.ts
import { chromium, Browser, Page, BrowserContext } from 'playwright';
import { ProxyManager } from '../proxies/ProxyManager';
import { logger } from '../utils/logger';

export class PlaywrightLinkedInScraper {
  private browser: Browser | null = null;
  private context: BrowserContext | null = null;
  private page: Page | null = null;
  private proxyManager: ProxyManager;

  constructor() {
    this.proxyManager = new ProxyManager();
  }

  async initialize(options: any = {}) {
    const proxy = options.proxy || await this.proxyManager.getRotatingProxy();

    // Parse "http://user:pass@host:port" (credentials optional) into Playwright's proxy shape
    const { hostname, port, username, password } = new URL(proxy);

    this.browser = await chromium.launch({
      headless: options.headless ?? true,
      proxy: {
        server: `http://${hostname}:${port}`,
        username: decodeURIComponent(username) || undefined,
        password: decodeURIComponent(password) || undefined,
      },
    });

    this.context = await this.browser.newContext({
      userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
      viewport: { width: 1366, height: 768 },
      locale: 'en-US',
      geolocation: { longitude: -74.006, latitude: 40.7128 }, // New York
      permissions: ['geolocation'],
      extraHTTPHeaders: {
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
      },
    });

    this.page = await this.context.newPage();

    // Advanced stealth techniques
    await this.applyStealth();
  }

  private async applyStealth() {
    if (!this.page) return;

    // Override webdriver detection
    await this.page.addInitScript(() => {
      Object.defineProperty(navigator, 'webdriver', {
        get: () => undefined,
      });

      // Remove automation indicators
      delete (window as any).cdc_adoQpoasnfa76pfcZLmcfl_Array;
      delete (window as any).cdc_adoQpoasnfa76pfcZLmcfl_Promise;
      delete (window as any).cdc_adoQpoasnfa76pfcZLmcfl_Symbol;

      // Mock plugins
      Object.defineProperty(navigator, 'plugins', {
        get: () => [
          {
            0: { type: 'application/x-google-chrome-pdf', suffixes: 'pdf', description: 'Portable Document Format' },
            description: 'Portable Document Format',
            filename: 'internal-pdf-viewer',
            length: 1,
            name: 'Chrome PDF Plugin'
          }
        ],
      });
    });

    // Simulate human-like mouse movements
    await this.page.mouse.move(100, 100);
    await this.page.mouse.move(200, 200);
  }

  async scrapeWithPlaywright(searchUrl: string): Promise<any[]> {
    if (!this.page) await this.initialize();

    try {
      await this.page!.goto(searchUrl, { 
        waitUntil: 'networkidle',
        timeout: 30000 
      });

      // Wait for content with retry logic
      await this.page!.waitForSelector('.ad-item, [data-testid="ad-item"]', { 
        timeout: 15000 
      });

      // Scroll with human-like behavior
      await this.humanScroll();

      // Extract data
      const ads = await this.page!.evaluate(() => {
        const elements = document.querySelectorAll('.ad-item, [data-testid="ad-item"]');
        return Array.from(elements).map((el, index) => ({
          id: `playwright-ad-${Date.now()}-${index}`,
          title: el.querySelector('.ad-title')?.textContent?.trim() || '',
          company: el.querySelector('.company-name')?.textContent?.trim() || '', 
          description: el.querySelector('.ad-description')?.textContent?.trim() || '',
          imageUrl: el.querySelector('img')?.src || '',
          timestamp: new Date().toISOString(),
        }));
      });

      logger.info(`Scraped ${ads.length} ads with Playwright`);
      return ads;

    } catch (error) {
      logger.error('Playwright scraping failed', { error: error.message });
      throw error;
    }
  }

  private async humanScroll() {
    const scrolls = Math.floor(Math.random() * 5) + 3; // 3-7 scrolls
    
    for (let i = 0; i < scrolls; i++) {
      const scrollDistance = Math.floor(Math.random() * 800) + 400;
      
      await this.page!.evaluate((distance) => {
        window.scrollBy(0, distance);
      }, scrollDistance);

      // Random delay between scrolls
      await new Promise(resolve => 
        setTimeout(resolve, Math.floor(Math.random() * 2000) + 1000)
      );
    }
  }

  async close() {
    if (this.browser) {
      await this.browser.close();
      this.browser = null;
      this.context = null;
      this.page = null;
    }
  }
}
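
Usage mirrors the Puppeteer class; the URL below is a placeholder.

// Usage sketch - run inside any async function
const scraper = new PlaywrightLinkedInScraper();
try {
  await scraper.initialize({ headless: true });
  const ads = await scraper.scrapeWithPlaywright(
    'https://www.linkedin.com/ad-library/search?companyIds=1337' // placeholder URL
  );
  console.log(`Playwright scraped ${ads.length} ads`);
} finally {
  await scraper.close();
}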

Database Integration and Data Persistence

A production JavaScript LinkedIn scraper needs robust data storage. Let's implement MongoDB integration with proper schemas and indexing.

// src/database/models/LinkedInAd.ts
import mongoose, { Schema, Document } from 'mongoose';

export interface ILinkedInAd extends Document {
  id: string;
  company: string;
  title: string;
  description: string;
  imageUrl?: string;
  landingUrl?: string;
  timestamp: Date;
  location?: string;
  demographics?: string[];
  scraperVersion: string;
  source: 'puppeteer' | 'playwright';
  metadata: {
    scrapedAt: Date;
    proxy?: string;
    userAgent?: string;
    viewport?: { width: number; height: number };
  };
}

const LinkedInAdSchema: Schema = new Schema({
  id: { type: String, required: true, unique: true },
  company: { type: String, required: true, index: true },
  title: { type: String, required: true },
  description: { type: String, required: true },
  imageUrl: { type: String },
  landingUrl: { type: String },
  timestamp: { type: Date, required: true },
  location: { type: String },
  demographics: [{ type: String }],
  scraperVersion: { type: String, required: true },
  source: { type: String, enum: ['puppeteer', 'playwright'], required: true },
  metadata: {
    scrapedAt: { type: Date, default: Date.now },
    proxy: { type: String },
    userAgent: { type: String },
    viewport: {
      width: { type: Number },
      height: { type: Number },
    },
  },
}, {
  timestamps: true,
  collection: 'linkedin_ads'
});

// Indexes for performance
LinkedInAdSchema.index({ company: 1, timestamp: -1 });
LinkedInAdSchema.index({ 'metadata.scrapedAt': -1 });
LinkedInAdSchema.index({ title: 'text', description: 'text' });

export default mongoose.model<ILinkedInAd>('LinkedInAd', LinkedInAdSchema);

// src/database/connection.ts
import mongoose from 'mongoose';
import { logger } from '../utils/logger';

export class DatabaseManager {
  private static instance: DatabaseManager;
  private isConnected: boolean = false;

  private constructor() {}

  public static getInstance(): DatabaseManager {
    if (!DatabaseManager.instance) {
      DatabaseManager.instance = new DatabaseManager();
    }
    return DatabaseManager.instance;
  }

  async connect(): Promise<void> {
    if (this.isConnected) return;

    try {
      const mongoUri = process.env.MONGODB_URI || 'mongodb://localhost:27017/linkedin-scraper';
      
      await mongoose.connect(mongoUri, {
        maxPoolSize: 10,
        serverSelectionTimeoutMS: 5000,
        socketTimeoutMS: 45000,
        family: 4
      });

      this.isConnected = true;
      logger.info('MongoDB connected successfully');

      mongoose.connection.on('error', (error) => {
        logger.error('MongoDB connection error:', error);
      });

      mongoose.connection.on('disconnected', () => {
        logger.warn('MongoDB disconnected');
        this.isConnected = false;
      });

    } catch (error) {
      logger.error('Failed to connect to MongoDB:', error);
      throw error;
    }
  }

  async disconnect(): Promise<void> {
    if (!this.isConnected) return;

    await mongoose.disconnect();
    this.isConnected = false;
    logger.info('MongoDB disconnected');
  }

  isConnectionHealthy(): boolean {
    return this.isConnected && mongoose.connection.readyState === 1;
  }
}
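
With the model and connection in place, stored ads can be queried directly through Mongoose. A small sketch (field names are those defined in the schema above):

// Example: fetch the 20 most recent ads for one company
import { DatabaseManager } from './database/connection';
import LinkedInAd from './database/models/LinkedInAd';

export async function recentAdsForCompany(company: string) {
  await DatabaseManager.getInstance().connect();

  // Newest first, plain objects instead of full documents
  return LinkedInAd.find({ company })
    .sort({ timestamp: -1 })
    .limit(20)
    .lean();
}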

Production Deployment and Monitoring

Deploy your Node.js LinkedIn scraper to production with proper monitoring, logging, and error handling.

Production-Ready Main Application

// src/index.ts
import express from 'express';
import { LinkedInScraper } from './scrapers/LinkedInScraper';
import { PlaywrightLinkedInScraper } from './scrapers/PlaywrightScraper';
import { DatabaseManager } from './database/connection';
import LinkedInAd from './database/models/LinkedInAd';
import { logger } from './utils/logger';
import { RateLimiterMemory, RateLimiterRes } from 'rate-limiter-flexible';

const app = express();
app.use(express.json());

// Rate limiting
const rateLimiter = new RateLimiterMemory({
  points: 10, // 10 requests
  duration: 60, // per minute
});

// Health check endpoint
app.get('/health', (req, res) => {
  const db = DatabaseManager.getInstance();
  res.json({
    status: 'healthy',
    timestamp: new Date().toISOString(),
    database: db.isConnectionHealthy() ? 'connected' : 'disconnected',
  });
});

// Main scraping endpoint
app.post('/scrape', async (req, res) => {
  try {
    await rateLimiter.consume(req.ip);
    
    const { url, engine = 'puppeteer', maxResults = 50 } = req.body;
    
    if (!url || !url.includes('linkedin.com')) {
      return res.status(400).json({ 
        error: 'Valid LinkedIn URL required' 
      });
    }

    let scraper;
    let ads = [];

    if (engine === 'playwright') {
      scraper = new PlaywrightLinkedInScraper();
      ads = await scraper.scrapeWithPlaywright(url);
    } else {
      scraper = new LinkedInScraper();
      ads = await scraper.scrapeLinkedInAds(url, { maxResults });
    }

    // Save to database
    const savedAds = await Promise.all(
      ads.map(async (ad) => {
        const adDoc = new LinkedInAd({
          ...ad,
          scraperVersion: '2.0.0',
          source: engine,
          metadata: {
            scrapedAt: new Date(),
            userAgent: req.headers['user-agent'],
          },
        });

        try {
          return await adDoc.save();
        } catch (error) {
          if (error.code === 11000) {
            // Duplicate key, update existing
            return await LinkedInAd.findOneAndUpdate(
              { id: ad.id },
              adDoc.toObject(),
              { new: true, upsert: true }
            );
          }
          throw error;
        }
      })
    );

    await scraper.close();

    res.json({
      success: true,
      count: savedAds.length,
      ads: savedAds,
      engine,
      timestamp: new Date().toISOString(),
    });

  } catch (error) {
    logger.error('Scraping request failed', { 
      error: error.message,
      url: req.body.url,
      ip: req.ip 
    });

    if (error instanceof RateLimiterRes) {
      return res.status(429).json({
        error: 'Rate limit exceeded. Please try again later.',
        retryAfter: Math.round(error.msBeforeNext / 1000),
      });
    }

    res.status(500).json({
      error: 'Scraping failed',
      message: error.message,
    });
  }
});

// Analytics endpoint
app.get('/stats', async (req, res) => {
  try {
    const totalAds = await LinkedInAd.countDocuments();
    const todayAds = await LinkedInAd.countDocuments({
      'metadata.scrapedAt': {
        $gte: new Date(new Date().setHours(0, 0, 0, 0))
      }
    });

    const topCompanies = await LinkedInAd.aggregate([
      { $group: { _id: '$company', count: { $sum: 1 } } },
      { $sort: { count: -1 } },
      { $limit: 10 }
    ]);

    res.json({
      totalAds,
      todayAds,
      topCompanies,
      timestamp: new Date().toISOString(),
    });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

async function startServer() {
  try {
    // Connect to database
    const db = DatabaseManager.getInstance();
    await db.connect();

    const port = process.env.PORT || 3000;
    app.listen(port, () => {
      logger.info(`LinkedIn scraper server running on port ${port}`);
    });

  } catch (error) {
    logger.error('Failed to start server:', error);
    process.exit(1);
  }
}

// Graceful shutdown
process.on('SIGTERM', async () => {
  logger.info('SIGTERM received, shutting down gracefully');
  
  const db = DatabaseManager.getInstance();
  await db.disconnect();
  
  process.exit(0);
});

process.on('SIGINT', async () => {
  logger.info('SIGINT received, shutting down gracefully');
  
  const db = DatabaseManager.getInstance();
  await db.disconnect();
  
  process.exit(0);
});

startServer();
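
With the server running, the endpoints can be exercised from the command line. The Ad Library URL below is a placeholder.

# Health check
curl http://localhost:3000/health

# Trigger a scrape
curl -X POST http://localhost:3000/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.linkedin.com/ad-library/search?companyIds=1337", "engine": "puppeteer", "maxResults": 25}'

# Aggregated stats
curl http://localhost:3000/stats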

Advanced Anti-Detection Techniques

Professional LinkedIn scraping requires sophisticated anti-detection measures. Here are advanced techniques to avoid blocks.

// src/utils/stealth.ts
import { Page } from 'puppeteer';

export class StealthManager {
  static async applyStealth(page: Page) {
    // 1. Override webdriver property
    await page.evaluateOnNewDocument(() => {
      Object.defineProperty(navigator, 'webdriver', {
        get: () => undefined,
      });
    });

    // 2. Mock plugins
    await page.evaluateOnNewDocument(() => {
      Object.defineProperty(navigator, 'plugins', {
        get: () => ({
          length: 1,
          0: {
            name: 'Chrome PDF Plugin',
            filename: 'internal-pdf-viewer',
            description: 'Portable Document Format',
          },
        }),
      });
    });

    // 3. Mock languages
    await page.evaluateOnNewDocument(() => {
      Object.defineProperty(navigator, 'languages', {
        get: () => ['en-US', 'en'],
      });
    });

    // 4. Override permissions query so headless "denied" notification prompts don't give us away
    await page.evaluateOnNewDocument(() => {
      const originalQuery = window.navigator.permissions.query.bind(window.navigator.permissions);
      (window.navigator.permissions as any).query = (parameters: any) =>
        parameters.name === 'notifications'
          ? Promise.resolve({ state: Notification.permission })
          : originalQuery(parameters);
    });

    // 5. Remove automation traces
    await page.evaluateOnNewDocument(() => {
      delete (window as any).cdc_adoQpoasnfa76pfcZLmcfl_Array;
      delete (window as any).cdc_adoQpoasnfa76pfcZLmcfl_Promise;
      delete (window as any).cdc_adoQpoasnfa76pfcZLmcfl_Symbol;
    });

    // 6. Mock chrome runtime
    await page.evaluateOnNewDocument(() => {
      (window as any).chrome = {
        runtime: {},
      };
    });

    // 7. Set realistic headers
    await page.setExtraHTTPHeaders({
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
      'Accept-Encoding': 'gzip, deflate, br',
      'Accept-Language': 'en-US,en;q=0.9',
      'Cache-Control': 'no-cache',
      'Pragma': 'no-cache',
      'Sec-Fetch-Mode': 'navigate',
      'Sec-Fetch-Site': 'none',
      'Sec-Fetch-User': '?1',
      'Upgrade-Insecure-Requests': '1',
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    });
  }

  static async humanizeInteractions(page: Page) {
    // Simulate human mouse movements
    const viewport = page.viewport(); // viewport() is synchronous in Puppeteer
    if (viewport) {
      const { width, height } = viewport;
      
      // Random mouse movements
      for (let i = 0; i < 3; i++) {
        const x = Math.floor(Math.random() * width);
        const y = Math.floor(Math.random() * height);
        
        await page.mouse.move(x, y, {
          steps: Math.floor(Math.random() * 10) + 10
        });
        
        await this.randomDelay(100, 500);
      }
    }

    // Random scrolling patterns (mouse.wheel expects deltaX/deltaY)
    const scrollPatterns = [
      { deltaX: 0, deltaY: 300 },
      { deltaX: 0, deltaY: -150 },
      { deltaX: 0, deltaY: 500 },
      { deltaX: 0, deltaY: -100 },
    ];

    for (const scroll of scrollPatterns) {
      await page.mouse.wheel(scroll);
      await this.randomDelay(1000, 3000);
    }
  }

  static async randomDelay(min: number, max: number): Promise<void> {
    const delay = Math.floor(Math.random() * (max - min + 1)) + min;
    return new Promise(resolve => setTimeout(resolve, delay));
  }

  static generateRandomUserAgent(): string {
    const userAgents = [
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15',
    ];

    return userAgents[Math.floor(Math.random() * userAgents.length)];
  }
}

Error Handling and Retry Logic

Robust error handling is crucial for production LinkedIn scrapers. Implement comprehensive retry mechanisms and graceful degradation.

// src/utils/retryHandler.ts
import { logger } from './logger';
import { LinkedInScraper } from '../scrapers/LinkedInScraper';

export interface RetryOptions {
  maxRetries: number;
  baseDelay: number;
  maxDelay: number;
  backoffFactor: number;
  retryCondition?: (error: any) => boolean;
}

export class RetryHandler {
  static async executeWithRetry<T>(
    operation: () => Promise<T>,
    options: RetryOptions
  ): Promise<T> {
    const {
      maxRetries,
      baseDelay,
      maxDelay,
      backoffFactor,
      retryCondition = () => true
    } = options;

    let lastError: any;
    
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return await operation();
      } catch (error) {
        lastError = error;
        
        if (attempt === maxRetries || !retryCondition(error)) {
          throw error;
        }

        const delay = Math.min(
          baseDelay * Math.pow(backoffFactor, attempt),
          maxDelay
        );
        
        logger.warn(`Operation failed, retrying in ${delay}ms`, {
          attempt: attempt + 1,
          maxRetries,
          error: error.message
        });

        await new Promise(resolve => setTimeout(resolve, delay));
      }
    }

    throw lastError;
  }

  static isRetryableError(error: any): boolean {
    // Network errors
    if (error.code === 'ENOTFOUND' || error.code === 'ECONNRESET') {
      return true;
    }

    // Timeout errors
    if (error.message?.includes('timeout')) {
      return true;
    }

    // LinkedIn specific errors
    if (error.message?.includes('blocked') || error.message?.includes('rate limit')) {
      return true;
    }

    // HTTP status codes that should be retried
    if (error.response?.status >= 500) {
      return true;
    }

    if ([429, 503, 502, 504].includes(error.response?.status)) {
      return true;
    }

    return false;
  }
}

// Enhanced scraper with retry logic
export class ReliableLinkedInScraper extends LinkedInScraper {
  async scrapeWithRetry(searchUrl: string, options: any = {}) {
    return RetryHandler.executeWithRetry(
      () => this.scrapeLinkedInAds(searchUrl, options),
      {
        maxRetries: 3,
        baseDelay: 5000,
        maxDelay: 30000,
        backoffFactor: 2,
        retryCondition: RetryHandler.isRetryableError
      }
    );
  }
}
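
The handler is not tied to the scraper class; any async operation can be wrapped the same way. In the sketch below, fetchAdLibraryPage is a hypothetical helper shown only to illustrate the call shape.

const page = await RetryHandler.executeWithRetry(
  () => fetchAdLibraryPage(searchUrl), // hypothetical helper
  {
    maxRetries: 4,
    baseDelay: 2000,
    maxDelay: 20000,
    backoffFactor: 2,
    retryCondition: RetryHandler.isRetryableError,
  }
);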

Testing and Quality Assurance

Test your JavaScript LinkedIn scraper thoroughly to ensure reliability and performance.

// tests/LinkedInScraper.test.ts
import { LinkedInScraper } from '../src/scrapers/LinkedInScraper';
import { DatabaseManager } from '../src/database/connection';

describe('LinkedInScraper', () => {
  let scraper: LinkedInScraper;
  
  beforeEach(() => {
    scraper = new LinkedInScraper();
  });

  afterEach(async () => {
    await scraper.close();
  });

  test('should initialize successfully', async () => {
    await expect(scraper.initialize({ headless: true })).resolves.not.toThrow();
  });

  test('should scrape LinkedIn ads', async () => {
    const testUrl = 'https://www.linkedin.com/ad-library/search?companyIds=1337';
    const results = await scraper.scrapeLinkedInAds(testUrl, { maxResults: 5 });
    
    expect(results).toBeInstanceOf(Array);
    expect(results.length).toBeGreaterThan(0);
    expect(results[0]).toHaveProperty('id');
    expect(results[0]).toHaveProperty('company');
    expect(results[0]).toHaveProperty('title');
  });

  test('should handle errors gracefully', async () => {
    const invalidUrl = 'https://invalid-linkedin-url.com';
    
    await expect(
      scraper.scrapeLinkedInAds(invalidUrl)
    ).rejects.toThrow();
  });
});

// Performance test
describe('Performance Tests', () => {
  test('should scrape 50 ads within 60 seconds', async () => {
    const scraper = new LinkedInScraper();
    const startTime = Date.now();
    
    const results = await scraper.scrapeLinkedInAds(
      'https://www.linkedin.com/ad-library/search?companyIds=1337',
      { maxResults: 50 }
    );
    
    const duration = Date.now() - startTime;
    
    expect(duration).toBeLessThan(60000);
    expect(results.length).toBe(50);
    
    await scraper.close();
  }, 65000);
});
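
These tests assume Jest compiles TypeScript through ts-jest (added to the dev dependencies during setup). A minimal config sketch:

// jest.config.ts
import type { Config } from 'jest';

const config: Config = {
  preset: 'ts-jest',
  testEnvironment: 'node',
  roots: ['<rootDir>/tests'],
  testTimeout: 30000, // browser-driven tests are slow
};

export default config;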

Deployment with Docker

# Dockerfile
FROM node:18-alpine

# Install dependencies for Puppeteer
RUN apk update && apk add --no-cache \
    chromium \
    nss \
    freetype \
    freetype-dev \
    harfbuzz \
    ca-certificates \
    ttf-freefont \
    curl

# Set Puppeteer to use installed Chromium
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true \
    PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser

# Create app directory
WORKDIR /usr/src/app

# Copy package files
COPY package*.json ./

# Install all dependencies (dev dependencies are needed for the TypeScript build)
RUN npm ci

# Copy source code
COPY . .

# Build TypeScript, then drop dev dependencies to slim the image
RUN npm run build && npm prune --omit=dev

# Create non-root user
RUN addgroup -g 1001 -S nodejs
RUN adduser -S nodejs -u 1001

# Change ownership of app directory
RUN chown -R nodejs:nodejs /usr/src/app
USER nodejs

# Expose port
EXPOSE 3000

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1

# Start application
CMD ["npm", "start"]

# docker-compose.yml
version: '3.8'

services:
  linkedin-scraper:
    build: .
    ports:
      - "3000:3000"
    environment:
      - NODE_ENV=production
      - MONGODB_URI=mongodb://mongo:27017/linkedin-scraper
      - BRIGHTDATA_USERNAME=${BRIGHTDATA_USERNAME}
      - BRIGHTDATA_PASSWORD=${BRIGHTDATA_PASSWORD}
    depends_on:
      - mongo
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 2G
          cpus: '1.0'

  mongo:
    image: mongo:6
    volumes:
      - mongodb_data:/data/db
    ports:
      - "27017:27017"
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    restart: unless-stopped

volumes:
  mongodb_data:
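
Build and start the stack; Compose reads the BrightData credentials from a .env file placed next to docker-compose.yml.

# Build the image and start all services in the background
docker compose up -d --build

# Follow the scraper logs
docker compose logs -f linkedin-scraper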

Monitoring and Analytics

Set up comprehensive monitoring for your production Node.js LinkedIn scraper.

// src/utils/monitoring.ts
import { EventEmitter } from 'events';
import { logger } from './logger';

interface ScrapingMetrics {
  totalRequests: number;
  successfulScrapes: number;
  failedScrapes: number;
  averageResponseTime: number;
  proxyFailures: number;
  rateLimitHits: number;
}

export class MetricsCollector extends EventEmitter {
  private metrics: ScrapingMetrics = {
    totalRequests: 0,
    successfulScrapes: 0,
    failedScrapes: 0,
    averageResponseTime: 0,
    proxyFailures: 0,
    rateLimitHits: 0,
  };

  private responseTimes: number[] = [];

  recordRequest() {
    this.metrics.totalRequests++;
    this.emit('request', this.metrics);
  }

  recordSuccess(responseTime: number) {
    this.metrics.successfulScrapes++;
    this.responseTimes.push(responseTime);
    this.updateAverageResponseTime();
    this.emit('success', { responseTime, metrics: this.metrics });
  }

  recordFailure(error: any) {
    this.metrics.failedScrapes++;
    
    if (error.message?.includes('proxy')) {
      this.metrics.proxyFailures++;
    }
    
    if (error.status === 429) {
      this.metrics.rateLimitHits++;
    }
    
    this.emit('failure', { error, metrics: this.metrics });
  }

  private updateAverageResponseTime() {
    if (this.responseTimes.length > 0) {
      const sum = this.responseTimes.reduce((a, b) => a + b, 0);
      this.metrics.averageResponseTime = sum / this.responseTimes.length;
    }
  }

  getMetrics(): ScrapingMetrics & { successRate: number } {
    const successRate = this.metrics.totalRequests > 0 
      ? (this.metrics.successfulScrapes / this.metrics.totalRequests) * 100 
      : 0;

    return {
      ...this.metrics,
      successRate,
    };
  }

  reset() {
    this.metrics = {
      totalRequests: 0,
      successfulScrapes: 0,
      failedScrapes: 0,
      averageResponseTime: 0,
      proxyFailures: 0,
      rateLimitHits: 0,
    };
    this.responseTimes = [];
  }
}

// Usage in main scraper
const metricsCollector = new MetricsCollector();

// Log metrics every 5 minutes
setInterval(() => {
  const metrics = metricsCollector.getMetrics();
  logger.info('Scraping metrics', metrics);
}, 5 * 60 * 1000);
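
Wiring the collector around the scraper itself is straightforward; a sketch using the LinkedInScraper class and the metricsCollector instance from above:

const scraper = new LinkedInScraper();

async function monitoredScrape(url: string) {
  metricsCollector.recordRequest();
  const started = Date.now();

  try {
    const ads = await scraper.scrapeLinkedInAds(url, { maxResults: 25 });
    metricsCollector.recordSuccess(Date.now() - started);
    return ads;
  } catch (error) {
    metricsCollector.recordFailure(error);
    throw error;
  }
}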

Best Practices and Ethical Considerations

Ethical Scraping Guidelines

  • Rate Limiting: Implement delays between requests (2-5 seconds minimum; a small helper sketch follows this list)
  • Respect robots.txt: Always check and follow robots.txt guidelines
  • Use APIs First: Prefer official APIs when available
  • User Agent: Use descriptive, honest user agents
  • Data Usage: Only collect necessary data and respect privacy
  • Terms of Service: Review and comply with platform terms
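
A small helper for the first point, enforcing a minimum gap between consecutive requests. This is a sketch; the 2-5 second window is the guideline above and the file location is arbitrary.

// src/utils/politeDelay.ts (sketch)
export class PoliteDelay {
  private lastRequestAt = 0;

  constructor(private minMs = 2000, private maxMs = 5000) {}

  // Wait until at least minMs..maxMs have passed since the previous request
  async wait(): Promise<void> {
    const gap = this.minMs + Math.random() * (this.maxMs - this.minMs);
    const elapsed = Date.now() - this.lastRequestAt;
    if (elapsed < gap) {
      await new Promise(resolve => setTimeout(resolve, gap - elapsed));
    }
    this.lastRequestAt = Date.now();
  }
}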

Performance Optimization

Optimize your JavaScript LinkedIn scraper for maximum performance and reliability.

// src/utils/performance.ts
import { LinkedInScraper } from '../scrapers/LinkedInScraper';

export class PerformanceOptimizer {
  static async optimizePage(page: any) {
    // Disable images and CSS for faster loading
    await page.setRequestInterception(true);
    
    page.on('request', (req: any) => {
      const resourceType = req.resourceType();
      
      if (['image', 'stylesheet', 'font'].includes(resourceType)) {
        req.abort();
      } else {
        req.continue();
      }
    });

    // Set aggressive timeouts
    page.setDefaultNavigationTimeout(30000);
    page.setDefaultTimeout(15000);
  }

  static async enableCaching(browser: any) {
    // Enable cache for faster subsequent loads
    const context = await browser.createIncognitoBrowserContext();
    const page = await context.newPage();
    
    await page.setCacheEnabled(true);
    return { context, page };
  }

  static async parallelScraping(urls: string[], maxConcurrency = 3) {
    const results = [];
    const executing = [];

    for (const url of urls) {
      const promise = this.scrapeUrl(url).then(result => {
        executing.splice(executing.indexOf(promise), 1);
        return result;
      });

      results.push(promise);
      executing.push(promise);

      if (executing.length >= maxConcurrency) {
        await Promise.race(executing);
      }
    }

    return Promise.all(results);
  }

  private static async scrapeUrl(url: string) {
    const scraper = new LinkedInScraper();
    try {
      return await scraper.scrapeLinkedInAds(url);
    } finally {
      await scraper.close();
    }
  }
}

Troubleshooting Common Issues

🚫 Browser Detection Issues

Problem: LinkedIn detects automated browsers
Solution: Use puppeteer-extra-plugin-stealth and randomize browser fingerprints

⚠️ Rate Limiting

Problem: Getting 429 errors or blocks
Solution: Implement exponential backoff and rotate proxies more frequently

🔧 Memory Issues

Problem: High memory usage with long-running scrapers
Solution: Close browser instances regularly and implement memory monitoring
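
A simple periodic check based on process.memoryUsage() covers the monitoring half; the interval and the 1.5 GB threshold below are arbitrary examples.

// Log heap usage every minute and warn past an arbitrary threshold
setInterval(() => {
  const { heapUsed, rss } = process.memoryUsage();
  const heapMb = Math.round(heapUsed / 1024 / 1024);
  const rssMb = Math.round(rss / 1024 / 1024);

  logger.info('Memory usage', { heapMb, rssMb });

  if (rssMb > 1536) {
    logger.warn('Memory usage above threshold - consider restarting browser instances');
  }
}, 60 * 1000);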

Conclusion

You've now built a comprehensive JavaScript LinkedIn scraper using Node.js and Puppeteer with enterprise-grade features. This tutorial covered everything from basic scraping to advanced anti-detection techniques, proxy integration, and production deployment.

Key Takeaways:

  • Puppeteer provides powerful browser automation for LinkedIn scraping
  • Proxy rotation is essential for avoiding IP blocks
  • Anti-detection techniques help maintain stealth operations
  • Error handling and retry logic ensure reliability
  • Database integration enables data persistence and analytics
  • Monitoring and metrics help optimize performance

Remember to always scrape responsibly, respect robots.txt files, implement appropriate delays, and consider the legal and ethical implications of your scraping activities. Happy scraping!

Ready to scrape LinkedIn ads?

Try our managed LinkedIn scraper API - no setup required, enterprise-grade infrastructure, and built-in proxy rotation.

Start scraping now