JavaScript LinkedIn Ads Scraper: Complete Node.js Tutorial 2025
Build a production-ready LinkedIn ads scraper with Node.js, Puppeteer, and advanced anti-detection techniques
Educational Purpose: This tutorial is for educational and research purposes. Always respect LinkedIn's Terms of Service and implement appropriate rate limiting and ethical practices.
What You'll Build
In this comprehensive tutorial, you'll learn to build a professional-grade JavaScript LinkedIn scraper using Node.js and Puppeteer. This isn't just another basic scraping tutorial - we'll cover everything from initial setup to production deployment, including advanced anti-detection techniques and proxy integration.
What You'll Learn:
- ✅ Complete Node.js LinkedIn scraper implementation
- ✅ Puppeteer and Playwright browser automation
- ✅ Advanced anti-detection and stealth techniques
- ✅ Proxy rotation with BrightData, Oxylabs, and SmartProxy
- ✅ Production deployment and monitoring
- ✅ Database integration and data persistence
- ✅ Error handling and retry mechanisms
- ✅ Rate limiting and ethical scraping practices
Prerequisites and Environment Setup
Before we dive into building our JavaScript LinkedIn scraper, let's ensure you have the right environment. You'll need Node.js 18+ and basic familiarity with JavaScript async/await patterns.
Initial Project Setup
# Create new Node.js project
mkdir linkedin-scraper-js
cd linkedin-scraper-js
npm init -y
# Install core dependencies
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
npm install playwright axios cheerio mongodb mongoose
npm install express dotenv winston rate-limiter-flexible https-proxy-agent
# Install development dependencies
npm install --save-dev @types/node @types/express typescript ts-node nodemon
npm install --save-dev jest ts-jest @types/jest
# Initialize TypeScript
npx tsc --init
Create your project structure:
linkedin-scraper-js/
├── src/
│ ├── scrapers/
│ │ ├── LinkedInScraper.ts
│ │ └── PlaywrightScraper.ts
│ ├── proxies/
│ │ ├── ProxyManager.ts
│ │ └── providers/
│ ├── database/
│ │ ├── models/
│ │ └── connection.ts
│ ├── utils/
│ │ ├── stealth.ts
│ │ └── logger.ts
│ └── index.ts
├── config/
│ └── default.json
├── tests/
└── logs/
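Every module below imports a shared logger from src/utils/logger.ts. That file isn't shown elsewhere in this tutorial, so here is a minimal sketch built on the winston package installed above; the log file name and levels are illustrative defaults, not a prescribed setup.
// src/utils/logger.ts - minimal winston logger sketch (file name and levels are illustrative)
import winston from 'winston';

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [
    // Structured JSON logs on disk, human-readable output on the console
    new winston.transports.File({ filename: 'logs/scraper.log' }),
    new winston.transports.Console({ format: winston.format.simple() }),
  ],
});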
Building the Core LinkedIn Scraper
Let's start with our main Puppeteer LinkedIn scraper class. This will handle browser management, navigation, and data extraction with advanced anti-detection measures.
Advanced Puppeteer LinkedIn Scraper
// src/scrapers/LinkedInScraper.ts
import puppeteer from 'puppeteer-extra';
import { Browser, Page } from 'puppeteer';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';
import { ProxyManager } from '../proxies/ProxyManager';
import { logger } from '../utils/logger';
import { RateLimiterMemory } from 'rate-limiter-flexible';
// Register the stealth plugin so every launched browser gets its evasions
puppeteer.use(StealthPlugin());
interface LinkedInAd {
id: string;
company: string;
title: string;
description: string;
imageUrl?: string;
landingUrl?: string;
timestamp: Date;
location?: string;
demographics?: string[];
}
interface ScrapingOptions {
headless?: boolean;
proxy?: string;
maxResults?: number;
delay?: number;
retries?: number;
}
export class LinkedInScraper {
private browser: Browser | null = null;
private page: Page | null = null;
private proxyManager: ProxyManager;
private rateLimiter: RateLimiterMemory;
constructor() {
this.proxyManager = new ProxyManager();
// Rate limiting: 2 scrape operations per minute (the key is supplied when calling consume())
this.rateLimiter = new RateLimiterMemory({
points: 2,
duration: 60,
});
}
async initialize(options: ScrapingOptions = {}) {
try {
const proxy = options.proxy || await this.proxyManager.getRotatingProxy();
// Advanced browser configuration with stealth
const launchOptions = {
headless: options.headless ?? true,
args: [
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-dev-shm-usage',
'--disable-accelerated-2d-canvas',
'--no-first-run',
'--no-zygote',
'--disable-gpu',
'--disable-features=TranslateUI',
'--disable-ipc-flooding-protection',
`--proxy-server=${proxy}`,
'--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
],
ignoreDefaultArgs: ['--enable-automation'],
defaultViewport: null,
};
this.browser = await puppeteer.launch(launchOptions);
this.page = await this.browser.newPage();
// Advanced stealth configurations
await this.configureStealth();
logger.info('LinkedIn scraper initialized successfully', { proxy });
} catch (error) {
logger.error('Failed to initialize scraper', error);
throw error;
}
}
private async configureStealth() {
if (!this.page) throw new Error('Page not initialized');
// Override webdriver detection
await this.page.evaluateOnNewDocument(() => {
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined,
});
// Remove automation indicators
delete (window as any).cdc_adoQpoasnfa76pfcZLmcfl_Array;
delete (window as any).cdc_adoQpoasnfa76pfcZLmcfl_Promise;
delete (window as any).cdc_adoQpoasnfa76pfcZLmcfl_Symbol;
});
// Set realistic headers
await this.page.setExtraHTTPHeaders({
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate, br',
'DNT': '1',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Cache-Control': 'max-age=0',
});
// Randomize viewport
const viewports = [
{ width: 1366, height: 768 },
{ width: 1920, height: 1080 },
{ width: 1440, height: 900 },
{ width: 1536, height: 864 },
];
const randomViewport = viewports[Math.floor(Math.random() * viewports.length)];
await this.page.setViewport(randomViewport);
}
async scrapeLinkedInAds(searchUrl: string, options: ScrapingOptions = {}): Promise<LinkedInAd[]> {
if (!this.page) {
await this.initialize(options);
}
const ads: LinkedInAd[] = [];
const maxResults = options.maxResults || 100;
const delay = options.delay || 2000;
let retries = options.retries || 3;
while (retries > 0) {
try {
await this.rateLimiter.consume('scrape');
logger.info('Navigating to LinkedIn Ad Library', { url: searchUrl });
// Navigate with timeout and wait for network idle
await this.page!.goto(searchUrl, {
waitUntil: 'networkidle2',
timeout: 30000
});
// Wait for ads to load
await this.page!.waitForSelector('.ad-item, [data-testid="ad-item"]', {
timeout: 15000
});
// Scroll to load more ads
await this.infiniteScroll(maxResults);
// Extract ad data
const extractedAds = await this.extractAdsData();
ads.push(...extractedAds);
logger.info(`Scraped ${extractedAds.length} ads successfully`);
break;
} catch (error) {
retries--;
logger.warn(`Scraping attempt failed, ${retries} retries left`, { error: error.message });
if (retries === 0) {
throw new Error(`Failed to scrape after multiple attempts: ${error.message}`);
}
await this.randomDelay(5000, 10000);
}
}
return ads.slice(0, maxResults);
}
private async infiniteScroll(targetCount: number) {
let previousCount = 0;
let stableCount = 0;
while (stableCount < 3) {
// Scroll to bottom
await this.page!.evaluate(() => {
window.scrollTo(0, document.body.scrollHeight);
});
await this.randomDelay(2000, 4000);
// Check current ad count
const currentCount = await this.page!.$$eval(
'.ad-item, [data-testid="ad-item"]',
els => els.length
);
logger.info(`Current ad count: ${currentCount}`);
if (currentCount >= targetCount) {
break;
}
if (currentCount === previousCount) {
stableCount++;
} else {
stableCount = 0;
}
previousCount = currentCount;
}
}
private async extractAdsData(): Promise<LinkedInAd[]> {
return await this.page!.evaluate(() => {
const adElements = document.querySelectorAll('.ad-item, [data-testid="ad-item"]');
const ads: LinkedInAd[] = [];
adElements.forEach((element, index) => {
try {
const titleElement = element.querySelector('.ad-title, [data-testid="ad-title"]');
const companyElement = element.querySelector('.company-name, [data-testid="company-name"]');
const descriptionElement = element.querySelector('.ad-description, [data-testid="ad-description"]');
const imageElement = element.querySelector('img');
const linkElement = element.querySelector<HTMLAnchorElement>('a[href*="linkedin.com"]');
const ad: LinkedInAd = {
id: `ad-${Date.now()}-${index}`,
title: titleElement?.textContent?.trim() || '',
company: companyElement?.textContent?.trim() || '',
description: descriptionElement?.textContent?.trim() || '',
imageUrl: imageElement?.src || undefined,
landingUrl: linkElement?.href || undefined,
timestamp: new Date(),
};
if (ad.title && ad.company) {
ads.push(ad);
}
} catch (error) {
console.warn('Error extracting ad data:', error);
}
});
return ads;
});
}
private async randomDelay(min: number, max: number) {
const delay = Math.floor(Math.random() * (max - min + 1)) + min;
await new Promise(resolve => setTimeout(resolve, delay));
}
async close() {
if (this.browser) {
await this.browser.close();
this.browser = null;
this.page = null;
logger.info('LinkedIn scraper closed');
}
}
}
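With the class in place, a minimal usage sketch looks like this. The Ad Library URL is a placeholder (the same example company ID is used later in the tests); swap in your own search URL.
// Example usage sketch for LinkedInScraper (run from src/)
import { LinkedInScraper } from './scrapers/LinkedInScraper';

async function run() {
  const scraper = new LinkedInScraper();
  try {
    await scraper.initialize({ headless: true });
    const ads = await scraper.scrapeLinkedInAds(
      'https://www.linkedin.com/ad-library/search?companyIds=1337', // placeholder URL
      { maxResults: 20 }
    );
    console.log(`Scraped ${ads.length} ads`, ads.slice(0, 3));
  } finally {
    await scraper.close(); // always release the browser
  }
}

run().catch(console.error);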
Proxy Integration for Production
Professional LinkedIn scraping requires reliable proxy rotation. Let's implement a robust proxy manager that works with major providers.
Advanced Proxy Manager
// src/proxies/ProxyManager.ts
import axios from 'axios';
import { HttpsProxyAgent } from 'https-proxy-agent';
import { logger } from '../utils/logger';
interface ProxyConfig {
host: string;
port: number;
username?: string;
password?: string;
country?: string;
sticky?: boolean;
}
interface ProxyProvider {
name: string;
getProxy(): Promise<ProxyConfig>;
healthCheck(proxy: ProxyConfig): Promise<boolean>;
}
class BrightDataProvider implements ProxyProvider {
name = 'BrightData';
private endpoint: string;
private username: string;
private password: string;
constructor() {
this.endpoint = process.env.BRIGHTDATA_ENDPOINT || '';
this.username = process.env.BRIGHTDATA_USERNAME || '';
this.password = process.env.BRIGHTDATA_PASSWORD || '';
}
async getProxy(): Promise<ProxyConfig> {
// BrightData sticky session format
const sessionId = Math.random().toString(36).substring(7);
return {
host: 'zproxy.lum-superproxy.io',
port: 22225,
username: `${this.username}-session-${sessionId}`,
password: this.password,
country: 'US',
sticky: true,
};
}
async healthCheck(proxy: ProxyConfig): Promise<boolean> {
try {
const proxyUrl = `http://${proxy.username}:${proxy.password}@${proxy.host}:${proxy.port}`;
const response = await axios.get('https://httpbin.org/ip', {
proxy: false,
httpsAgent: new HttpsProxyAgent(proxyUrl),
timeout: 10000,
});
return response.status === 200;
} catch {
return false;
}
}
}
class OxylabsProvider implements ProxyProvider {
name = 'Oxylabs';
private username: string;
private password: string;
constructor() {
this.username = process.env.OXYLABS_USERNAME || '';
this.password = process.env.OXYLABS_PASSWORD || '';
}
async getProxy(): Promise<ProxyConfig> {
return {
host: 'pr.oxylabs.io',
port: 7777,
username: this.username,
password: this.password,
country: 'US',
sticky: false,
};
}
async healthCheck(proxy: ProxyConfig): Promise<boolean> {
try {
const proxyUrl = `http://${proxy.username}:${proxy.password}@${proxy.host}:${proxy.port}`;
const response = await axios.get('https://ipinfo.io/json', {
proxy: false,
httpsAgent: new HttpsProxyAgent(proxyUrl),
timeout: 10000,
});
return response.status === 200 && response.data.ip;
} catch {
return false;
}
}
}
class SmartProxyProvider implements ProxyProvider {
name = 'SmartProxy';
private username: string;
private password: string;
constructor() {
this.username = process.env.SMARTPROXY_USERNAME || '';
this.password = process.env.SMARTPROXY_PASSWORD || '';
}
async getProxy(): Promise<ProxyConfig> {
const endpoints = [
'gate.smartproxy.com:7000',
'gate.smartproxy.com:7001',
'gate.smartproxy.com:7002',
];
const randomEndpoint = endpoints[Math.floor(Math.random() * endpoints.length)];
const [host, port] = randomEndpoint.split(':');
return {
host,
port: parseInt(port),
username: this.username,
password: this.password,
country: 'US',
sticky: false,
};
}
async healthCheck(proxy: ProxyConfig): Promise<boolean> {
try {
const proxyUrl = `http://${proxy.username}:${proxy.password}@${proxy.host}:${proxy.port}`;
const response = await axios.get('https://ipinfo.io/json', {
proxy: false,
httpsAgent: new HttpsProxyAgent(proxyUrl),
timeout: 8000,
});
return response.status === 200;
} catch {
return false;
}
}
}
export class ProxyManager {
private providers: ProxyProvider[];
private currentProviderIndex: number = 0;
private proxyPool: ProxyConfig[] = [];
private failedProxies: Set<string> = new Set();
constructor() {
this.providers = [
new BrightDataProvider(),
new OxylabsProvider(),
new SmartProxyProvider(),
];
this.initializeProxyPool();
}
private async initializeProxyPool() {
logger.info('Initializing proxy pool...');
for (const provider of this.providers) {
try {
const proxy = await provider.getProxy();
const isHealthy = await provider.healthCheck(proxy);
if (isHealthy) {
this.proxyPool.push(proxy);
logger.info(`Added ${provider.name} proxy to pool`, {
host: proxy.host,
port: proxy.port
});
}
} catch (error) {
logger.warn(`Failed to initialize ${provider.name} proxy`, { error: error.message });
}
}
logger.info(`Proxy pool initialized with ${this.proxyPool.length} proxies`);
}
async getRotatingProxy(): Promise<string> {
if (this.proxyPool.length === 0) {
await this.initializeProxyPool();
}
// Find healthy proxy
for (let i = 0; i < this.proxyPool.length; i++) {
const proxy = this.proxyPool[this.currentProviderIndex % this.proxyPool.length];
this.currentProviderIndex++;
const proxyId = `${proxy.host}:${proxy.port}`;
if (!this.failedProxies.has(proxyId)) {
if (proxy.username && proxy.password) {
return `http://${proxy.username}:${proxy.password}@${proxy.host}:${proxy.port}`;
} else {
return `http://${proxy.host}:${proxy.port}`;
}
}
}
// If all proxies failed, reset and try again
this.failedProxies.clear();
logger.warn('All proxies marked as failed, resetting failure list');
return await this.getRotatingProxy();
}
markProxyAsFailed(proxyUrl: string) {
const proxyId = proxyUrl.split('@')[1] || proxyUrl.replace('http://', '');
this.failedProxies.add(proxyId);
logger.warn('Marked proxy as failed', { proxyId });
}
async refreshProxyPool() {
this.proxyPool = [];
this.failedProxies.clear();
await this.initializeProxyPool();
}
}
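The providers above read their credentials from environment variables. A sample .env file might look like the following; the variable names match the ones used in the code, and every value is a placeholder you must replace with your own credentials. The dotenv package installed earlier loads this file via the import 'dotenv/config' line in src/index.ts.
# .env (placeholders only)
BRIGHTDATA_ENDPOINT=your-brightdata-endpoint
BRIGHTDATA_USERNAME=your-brightdata-username
BRIGHTDATA_PASSWORD=your-brightdata-password
OXYLABS_USERNAME=your-oxylabs-username
OXYLABS_PASSWORD=your-oxylabs-password
SMARTPROXY_USERNAME=your-smartproxy-username
SMARTPROXY_PASSWORD=your-smartproxy-password
MONGODB_URI=mongodb://localhost:27017/linkedin-scraper
PORT=3000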
Playwright Alternative Implementation
While Puppeteer is excellent, Playwright offers additional benefits for LinkedIn scraping, including better stealth capabilities and multi-browser support.
// src/scrapers/PlaywrightScraper.ts
import { chromium, Browser, Page, BrowserContext } from 'playwright';
import { ProxyManager } from '../proxies/ProxyManager';
import { logger } from '../utils/logger';
export class PlaywrightLinkedInScraper {
private browser: Browser | null = null;
private context: BrowserContext | null = null;
private page: Page | null = null;
private proxyManager: ProxyManager;
constructor() {
this.proxyManager = new ProxyManager();
}
async initialize(options: any = {}) {
const proxy = options.proxy || await this.proxyManager.getRotatingProxy();
const [, credentials, server] = proxy.match(/^https?:\/\/(?:(.+)@)?(.+)$/) || [];
const [username, password] = credentials?.split(':') || [];
const [host, port] = server?.split(':') || [];
this.browser = await chromium.launch({
headless: options.headless ?? true,
proxy: proxy ? {
server: `http://${host}:${port}`,
username,
password,
} : undefined,
});
this.context = await this.browser.newContext({
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
viewport: { width: 1366, height: 768 },
locale: 'en-US',
geolocation: { longitude: -74.006, latitude: 40.7128 }, // New York
permissions: ['geolocation'],
extraHTTPHeaders: {
'Accept-Language': 'en-US,en;q=0.9',
'Accept-Encoding': 'gzip, deflate, br',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
},
});
this.page = await this.context.newPage();
// Advanced stealth techniques
await this.applyStealth();
}
private async applyStealth() {
if (!this.page) return;
// Override webdriver detection
await this.page.addInitScript(() => {
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined,
});
// Remove automation indicators
delete (window as any).cdc_adoQpoasnfa76pfcZLmcfl_Array;
delete (window as any).cdc_adoQpoasnfa76pfcZLmcfl_Promise;
delete (window as any).cdc_adoQpoasnfa76pfcZLmcfl_Symbol;
// Mock plugins
Object.defineProperty(navigator, 'plugins', {
get: () => [
{
0: { type: 'application/x-google-chrome-pdf', suffixes: 'pdf', description: 'Portable Document Format' },
description: 'Portable Document Format',
filename: 'internal-pdf-viewer',
length: 1,
name: 'Chrome PDF Plugin'
}
],
});
});
// Simulate human-like mouse movements
await this.page.mouse.move(100, 100);
await this.page.mouse.move(200, 200);
}
async scrapeWithPlaywright(searchUrl: string): Promise<any[]> {
if (!this.page) await this.initialize();
try {
await this.page!.goto(searchUrl, {
waitUntil: 'networkidle',
timeout: 30000
});
// Wait for content with retry logic
await this.page!.waitForSelector('.ad-item, [data-testid="ad-item"]', {
timeout: 15000
});
// Scroll with human-like behavior
await this.humanScroll();
// Extract data
const ads = await this.page!.evaluate(() => {
const elements = document.querySelectorAll('.ad-item, [data-testid="ad-item"]');
return Array.from(elements).map((el, index) => ({
id: `playwright-ad-${Date.now()}-${index}`,
title: el.querySelector('.ad-title')?.textContent?.trim() || '',
company: el.querySelector('.company-name')?.textContent?.trim() || '',
description: el.querySelector('.ad-description')?.textContent?.trim() || '',
imageUrl: el.querySelector('img')?.src || '',
timestamp: new Date().toISOString(),
}));
});
logger.info(`Scraped ${ads.length} ads with Playwright`);
return ads;
} catch (error) {
logger.error('Playwright scraping failed', { error: error.message });
throw error;
}
}
private async humanScroll() {
const scrolls = Math.floor(Math.random() * 5) + 3; // 3-7 scrolls
for (let i = 0; i < scrolls; i++) {
const scrollDistance = Math.floor(Math.random() * 800) + 400;
await this.page!.evaluate((distance) => {
window.scrollBy(0, distance);
}, scrollDistance);
// Random delay between scrolls
await new Promise(resolve =>
setTimeout(resolve, Math.floor(Math.random() * 2000) + 1000)
);
}
}
async close() {
if (this.browser) {
await this.browser.close();
this.browser = null;
this.context = null;
this.page = null;
}
}
}
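Usage mirrors the Puppeteer class; a short sketch (again with a placeholder URL):
// Example usage sketch for PlaywrightLinkedInScraper (run from src/)
import { PlaywrightLinkedInScraper } from './scrapers/PlaywrightScraper';

async function run() {
  const scraper = new PlaywrightLinkedInScraper();
  try {
    const ads = await scraper.scrapeWithPlaywright(
      'https://www.linkedin.com/ad-library/search?companyIds=1337' // placeholder URL
    );
    console.log(`Playwright scraped ${ads.length} ads`);
  } finally {
    await scraper.close();
  }
}

run().catch(console.error);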
Database Integration and Data Persistence
A production JavaScript LinkedIn scraper needs robust data storage. Let's implement MongoDB integration with proper schemas and indexing.
// src/database/models/LinkedInAd.ts
import mongoose, { Schema, Document } from 'mongoose';
export interface ILinkedInAd extends Document {
id: string;
company: string;
title: string;
description: string;
imageUrl?: string;
landingUrl?: string;
timestamp: Date;
location?: string;
demographics?: string[];
scraperVersion: string;
source: 'puppeteer' | 'playwright';
metadata: {
scrapedAt: Date;
proxy?: string;
userAgent?: string;
viewport?: { width: number; height: number };
};
}
const LinkedInAdSchema: Schema = new Schema({
id: { type: String, required: true, unique: true },
company: { type: String, required: true, index: true },
title: { type: String, required: true },
description: { type: String, required: true },
imageUrl: { type: String },
landingUrl: { type: String },
timestamp: { type: Date, required: true },
location: { type: String },
demographics: [{ type: String }],
scraperVersion: { type: String, required: true },
source: { type: String, enum: ['puppeteer', 'playwright'], required: true },
metadata: {
scrapedAt: { type: Date, default: Date.now },
proxy: { type: String },
userAgent: { type: String },
viewport: {
width: { type: Number },
height: { type: Number },
},
},
}, {
timestamps: true,
collection: 'linkedin_ads'
});
// Indexes for performance
LinkedInAdSchema.index({ company: 1, timestamp: -1 });
LinkedInAdSchema.index({ 'metadata.scrapedAt': -1 });
LinkedInAdSchema.index({ title: 'text', description: 'text' });
export default mongoose.model<ILinkedInAd>('LinkedInAd', LinkedInAdSchema);
// src/database/connection.ts
import mongoose from 'mongoose';
import { logger } from '../utils/logger';
export class DatabaseManager {
private static instance: DatabaseManager;
private isConnected: boolean = false;
private constructor() {}
public static getInstance(): DatabaseManager {
if (!DatabaseManager.instance) {
DatabaseManager.instance = new DatabaseManager();
}
return DatabaseManager.instance;
}
async connect(): Promise<void> {
if (this.isConnected) return;
try {
const mongoUri = process.env.MONGODB_URI || 'mongodb://localhost:27017/linkedin-scraper';
await mongoose.connect(mongoUri, {
maxPoolSize: 10,
serverSelectionTimeoutMS: 5000,
socketTimeoutMS: 45000,
family: 4
});
this.isConnected = true;
logger.info('MongoDB connected successfully');
mongoose.connection.on('error', (error) => {
logger.error('MongoDB connection error:', error);
});
mongoose.connection.on('disconnected', () => {
logger.warn('MongoDB disconnected');
this.isConnected = false;
});
} catch (error) {
logger.error('Failed to connect to MongoDB:', error);
throw error;
}
}
async disconnect(): Promise<void> {
if (!this.isConnected) return;
await mongoose.disconnect();
this.isConnected = false;
logger.info('MongoDB disconnected');
}
isConnectionHealthy(): boolean {
return this.isConnected && mongoose.connection.readyState === 1;
}
}
Production Deployment and Monitoring
Deploy your Node.js LinkedIn scraper to production with proper monitoring, logging, and error handling.
Production-Ready Main Application
// src/index.ts
import 'dotenv/config';
import express from 'express';
import { LinkedInScraper } from './scrapers/LinkedInScraper';
import { PlaywrightLinkedInScraper } from './scrapers/PlaywrightScraper';
import { DatabaseManager } from './database/connection';
import LinkedInAd from './database/models/LinkedInAd';
import { logger } from './utils/logger';
import { RateLimiterMemory } from 'rate-limiter-flexible';
const app = express();
app.use(express.json());
// Rate limiting
const rateLimiter = new RateLimiterMemory({
points: 10, // 10 requests
duration: 60, // per minute
});
// Health check endpoint
app.get('/health', (req, res) => {
const db = DatabaseManager.getInstance();
res.json({
status: 'healthy',
timestamp: new Date().toISOString(),
database: db.isConnectionHealthy() ? 'connected' : 'disconnected',
});
});
// Main scraping endpoint
app.post('/scrape', async (req, res) => {
try {
await rateLimiter.consume(req.ip);
const { url, engine = 'puppeteer', maxResults = 50 } = req.body;
if (!url || !url.includes('linkedin.com')) {
return res.status(400).json({
error: 'Valid LinkedIn URL required'
});
}
let scraper;
let ads = [];
if (engine === 'playwright') {
scraper = new PlaywrightLinkedInScraper();
ads = await scraper.scrapeWithPlaywright(url);
} else {
scraper = new LinkedInScraper();
ads = await scraper.scrapeLinkedInAds(url, { maxResults });
}
// Save to database
const savedAds = await Promise.all(
ads.map(async (ad) => {
const adDoc = new LinkedInAd({
...ad,
scraperVersion: '2.0.0',
source: engine,
metadata: {
scrapedAt: new Date(),
userAgent: req.headers['user-agent'],
},
});
try {
return await adDoc.save();
} catch (error) {
if (error.code === 11000) {
// Duplicate key, update existing
return await LinkedInAd.findOneAndUpdate(
{ id: ad.id },
adDoc.toObject(),
{ new: true, upsert: true }
);
}
throw error;
}
})
);
await scraper.close();
res.json({
success: true,
count: savedAds.length,
ads: savedAds,
engine,
timestamp: new Date().toISOString(),
});
} catch (error) {
logger.error('Scraping request failed', {
error: error.message,
url: req.body.url,
ip: req.ip
});
// rate-limiter-flexible rejects with a RateLimiterRes (which carries msBeforeNext) when the limit is hit
if (typeof error.msBeforeNext === 'number') {
return res.status(429).json({
error: 'Rate limit exceeded. Please try again later.',
retryAfter: Math.round(error.msBeforeNext / 1000),
});
}
res.status(500).json({
error: 'Scraping failed',
message: error.message,
});
}
});
// Analytics endpoint
app.get('/stats', async (req, res) => {
try {
const totalAds = await LinkedInAd.countDocuments();
const todayAds = await LinkedInAd.countDocuments({
'metadata.scrapedAt': {
$gte: new Date(new Date().setHours(0, 0, 0, 0))
}
});
const topCompanies = await LinkedInAd.aggregate([
{ $group: { _id: '$company', count: { $sum: 1 } } },
{ $sort: { count: -1 } },
{ $limit: 10 }
]);
res.json({
totalAds,
todayAds,
topCompanies,
timestamp: new Date().toISOString(),
});
} catch (error) {
res.status(500).json({ error: error.message });
}
});
async function startServer() {
try {
// Connect to database
const db = DatabaseManager.getInstance();
await db.connect();
const port = process.env.PORT || 3000;
app.listen(port, () => {
logger.info(`LinkedIn scraper server running on port ${port}`);
});
} catch (error) {
logger.error('Failed to start server:', error);
process.exit(1);
}
}
// Graceful shutdown
process.on('SIGTERM', async () => {
logger.info('SIGTERM received, shutting down gracefully');
const db = DatabaseManager.getInstance();
await db.disconnect();
process.exit(0);
});
process.on('SIGINT', async () => {
logger.info('SIGINT received, shutting down gracefully');
const db = DatabaseManager.getInstance();
await db.disconnect();
process.exit(0);
});
startServer();
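Once the server is running you can exercise it with curl. The body fields (url, engine, maxResults) are exactly the ones the /scrape handler reads; the LinkedIn URL below is a placeholder.
# Health check
curl http://localhost:3000/health

# Trigger a scrape
curl -X POST http://localhost:3000/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.linkedin.com/ad-library/search?companyIds=1337", "engine": "puppeteer", "maxResults": 25}'

# Aggregated stats
curl http://localhost:3000/stats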
Advanced Anti-Detection Techniques
Professional LinkedIn scraping requires sophisticated anti-detection measures. Here are advanced techniques to avoid blocks.
// src/utils/stealth.ts
import { Page } from 'puppeteer';
export class StealthManager {
static async applyStealth(page: Page) {
// 1. Override webdriver property
await page.evaluateOnNewDocument(() => {
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined,
});
});
// 2. Mock plugins
await page.evaluateOnNewDocument(() => {
Object.defineProperty(navigator, 'plugins', {
get: () => ({
length: 1,
0: {
name: 'Chrome PDF Plugin',
filename: 'internal-pdf-viewer',
description: 'Portable Document Format',
},
}),
});
});
// 3. Mock languages
await page.evaluateOnNewDocument(() => {
Object.defineProperty(navigator, 'languages', {
get: () => ['en-US', 'en'],
});
});
// 4. Override permissions
await page.evaluateOnNewDocument(() => {
const originalQuery = window.navigator.permissions.query.bind(window.navigator.permissions);
(window.navigator.permissions as any).query = (parameters: any) =>
parameters.name === 'notifications'
? Promise.resolve({ state: Notification.permission } as PermissionStatus)
: originalQuery(parameters);
});
// 5. Remove automation traces
await page.evaluateOnNewDocument(() => {
delete (window as any).cdc_adoQpoasnfa76pfcZLmcfl_Array;
delete (window as any).cdc_adoQpoasnfa76pfcZLmcfl_Promise;
delete (window as any).cdc_adoQpoasnfa76pfcZLmcfl_Symbol;
});
// 6. Mock chrome runtime
await page.evaluateOnNewDocument(() => {
(window as any).chrome = {
runtime: {},
};
});
// 7. Set realistic headers
await page.setExtraHTTPHeaders({
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Cache-Control': 'no-cache',
'Pragma': 'no-cache',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
});
}
static async humanizeInteractions(page: Page) {
// Simulate human mouse movements
const viewport = await page.viewport();
if (viewport) {
const { width, height } = viewport;
// Random mouse movements
for (let i = 0; i < 3; i++) {
const x = Math.floor(Math.random() * width);
const y = Math.floor(Math.random() * height);
await page.mouse.move(x, y, {
steps: Math.floor(Math.random() * 10) + 10
});
await this.randomDelay(100, 500);
}
}
// Random scrolling patterns (Puppeteer's mouse.wheel expects deltaX/deltaY)
const scrollPatterns = [
{ deltaX: 0, deltaY: 300 },
{ deltaX: 0, deltaY: -150 },
{ deltaX: 0, deltaY: 500 },
{ deltaX: 0, deltaY: -100 },
];
for (const scroll of scrollPatterns) {
await page.mouse.wheel(scroll);
await this.randomDelay(1000, 3000);
}
}
static async randomDelay(min: number, max: number): Promise<void> {
const delay = Math.floor(Math.random() * (max - min + 1)) + min;
return new Promise(resolve => setTimeout(resolve, delay));
}
static generateRandomUserAgent(): string {
const userAgents = [
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15',
];
return userAgents[Math.floor(Math.random() * userAgents.length)];
}
}
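Wiring the helpers into a scraper only takes a couple of calls. A sketch of how they might be used around a navigation follows; the import path assumes the caller sits next to stealth.ts, and targetUrl is whatever page you intend to visit.
// Sketch: apply the stealth helpers around a navigation
import puppeteer from 'puppeteer';
import { StealthManager } from './stealth';

async function openStealthily(targetUrl: string) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await StealthManager.applyStealth(page);          // patch fingerprints before any navigation
  await page.goto(targetUrl, { waitUntil: 'networkidle2' });
  await StealthManager.humanizeInteractions(page);  // mouse movements and scrolls after load
  return { browser, page };
}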
Error Handling and Retry Logic
Robust error handling is crucial for production LinkedIn scrapers. Implement comprehensive retry mechanisms and graceful degradation.
// src/utils/retryHandler.ts
import { logger } from './logger';
export interface RetryOptions {
maxRetries: number;
baseDelay: number;
maxDelay: number;
backoffFactor: number;
retryCondition?: (error: any) => boolean;
}
export class RetryHandler {
static async executeWithRetry<T>(
operation: () => Promise<T>,
options: RetryOptions
): Promise<T> {
const {
maxRetries,
baseDelay,
maxDelay,
backoffFactor,
retryCondition = () => true
} = options;
let lastError: any;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await operation();
} catch (error) {
lastError = error;
if (attempt === maxRetries || !retryCondition(error)) {
throw error;
}
const delay = Math.min(
baseDelay * Math.pow(backoffFactor, attempt),
maxDelay
);
logger.warn(`Operation failed, retrying in ${delay}ms`, {
attempt: attempt + 1,
maxRetries,
error: error.message
});
await new Promise(resolve => setTimeout(resolve, delay));
}
}
throw lastError;
}
static isRetryableError(error: any): boolean {
// Network errors
if (error.code === 'ENOTFOUND' || error.code === 'ECONNRESET') {
return true;
}
// Timeout errors
if (error.message?.includes('timeout')) {
return true;
}
// LinkedIn specific errors
if (error.message?.includes('blocked') || error.message?.includes('rate limit')) {
return true;
}
// HTTP status codes that should be retried
if (error.response?.status >= 500) {
return true;
}
if ([429, 503, 502, 504].includes(error.response?.status)) {
return true;
}
return false;
}
}
// Enhanced scraper with retry logic
import { LinkedInScraper } from '../scrapers/LinkedInScraper';
export class ReliableLinkedInScraper extends LinkedInScraper {
async scrapeWithRetry(searchUrl: string, options: any = {}) {
return RetryHandler.executeWithRetry(
() => this.scrapeLinkedInAds(searchUrl, options),
{
maxRetries: 3,
baseDelay: 5000,
maxDelay: 30000,
backoffFactor: 2,
retryCondition: RetryHandler.isRetryableError
}
);
}
}
Testing and Quality Assurance
Test your JavaScript LinkedIn scraper thoroughly to ensure reliability and performance.
// tests/LinkedInScraper.test.ts
import { LinkedInScraper } from '../src/scrapers/LinkedInScraper';
import { DatabaseManager } from '../src/database/connection';
describe('LinkedInScraper', () => {
let scraper: LinkedInScraper;
beforeEach(() => {
scraper = new LinkedInScraper();
});
afterEach(async () => {
await scraper.close();
});
test('should initialize successfully', async () => {
await expect(scraper.initialize({ headless: true })).resolves.not.toThrow();
});
test('should scrape LinkedIn ads', async () => {
const testUrl = 'https://www.linkedin.com/ad-library/search?companyIds=1337';
const results = await scraper.scrapeLinkedInAds(testUrl, { maxResults: 5 });
expect(results).toBeInstanceOf(Array);
expect(results.length).toBeGreaterThan(0);
expect(results[0]).toHaveProperty('id');
expect(results[0]).toHaveProperty('company');
expect(results[0]).toHaveProperty('title');
});
test('should handle errors gracefully', async () => {
const invalidUrl = 'https://invalid-linkedin-url.com';
await expect(
scraper.scrapeLinkedInAds(invalidUrl)
).rejects.toThrow();
});
});
// Performance test
describe('Performance Tests', () => {
test('should scrape 50 ads within 60 seconds', async () => {
const scraper = new LinkedInScraper();
const startTime = Date.now();
const results = await scraper.scrapeLinkedInAds(
'https://www.linkedin.com/ad-library/search?companyIds=1337',
{ maxResults: 50 }
);
const duration = Date.now() - startTime;
expect(duration).toBeLessThan(60000);
expect(results.length).toBe(50);
await scraper.close();
}, 65000);
});
Deployment with Docker
# Dockerfile
FROM node:18-alpine
# Install dependencies for Puppeteer (curl is added for the HEALTHCHECK below)
RUN apk update && apk add --no-cache \
chromium \
nss \
freetype \
freetype-dev \
harfbuzz \
ca-certificates \
ttf-freefont \
curl
# Set Puppeteer to use installed Chromium
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true \
PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser
# Create app directory
WORKDIR /usr/src/app
# Copy package files
COPY package*.json ./
# Install all dependencies (devDependencies are needed for the TypeScript build)
RUN npm ci
# Copy source code
COPY . .
# Build TypeScript, then drop devDependencies from the final image
RUN npm run build && npm prune --omit=dev
# Create non-root user
RUN addgroup -g 1001 -S nodejs
RUN adduser -S nodejs -u 1001
# Change ownership of app directory
RUN chown -R nodejs:nodejs /usr/src/app
USER nodejs
# Expose port
EXPOSE 3000
# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD curl -f http://localhost:3000/health || exit 1
# Start application
CMD ["npm", "start"]
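The Dockerfile assumes build and start scripts exist in package.json and that tsconfig.json compiles to a dist/ folder. A typical scripts block (a sketch; adjust the paths to your own layout) would be:
{
  "scripts": {
    "build": "tsc",
    "start": "node dist/index.js",
    "dev": "nodemon --exec ts-node src/index.ts",
    "test": "jest"
  }
}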
# docker-compose.yml
version: '3.8'
services:
linkedin-scraper:
build: .
ports:
- "3000:3000"
environment:
- NODE_ENV=production
- MONGODB_URI=mongodb://mongo:27017/linkedin-scraper
- BRIGHTDATA_USERNAME=${BRIGHTDATA_USERNAME}
- BRIGHTDATA_PASSWORD=${BRIGHTDATA_PASSWORD}
depends_on:
- mongo
restart: unless-stopped
deploy:
resources:
limits:
memory: 2G
cpus: '1.0'
mongo:
image: mongo:6
volumes:
- mongodb_data:/data/db
ports:
- "27017:27017"
restart: unless-stopped
redis:
image: redis:7-alpine
ports:
- "6379:6379"
restart: unless-stopped
volumes:
mongodb_data:
Monitoring and Analytics
Set up comprehensive monitoring for your production Node.js LinkedIn scraper.
// src/utils/monitoring.ts
import { EventEmitter } from 'events';
import { logger } from './logger';
interface ScrapingMetrics {
totalRequests: number;
successfulScrapes: number;
failedScrapes: number;
averageResponseTime: number;
proxyFailures: number;
rateLimitHits: number;
}
export class MetricsCollector extends EventEmitter {
private metrics: ScrapingMetrics = {
totalRequests: 0,
successfulScrapes: 0,
failedScrapes: 0,
averageResponseTime: 0,
proxyFailures: 0,
rateLimitHits: 0,
};
private responseTimes: number[] = [];
recordRequest() {
this.metrics.totalRequests++;
this.emit('request', this.metrics);
}
recordSuccess(responseTime: number) {
this.metrics.successfulScrapes++;
this.responseTimes.push(responseTime);
this.updateAverageResponseTime();
this.emit('success', { responseTime, metrics: this.metrics });
}
recordFailure(error: any) {
this.metrics.failedScrapes++;
if (error.message?.includes('proxy')) {
this.metrics.proxyFailures++;
}
if (error.status === 429) {
this.metrics.rateLimitHits++;
}
this.emit('failure', { error, metrics: this.metrics });
}
private updateAverageResponseTime() {
if (this.responseTimes.length > 0) {
const sum = this.responseTimes.reduce((a, b) => a + b, 0);
this.metrics.averageResponseTime = sum / this.responseTimes.length;
}
}
getMetrics(): ScrapingMetrics & { successRate: number } {
const successRate = this.metrics.totalRequests > 0
? (this.metrics.successfulScrapes / this.metrics.totalRequests) * 100
: 0;
return {
...this.metrics,
successRate,
};
}
reset() {
this.metrics = {
totalRequests: 0,
successfulScrapes: 0,
failedScrapes: 0,
averageResponseTime: 0,
proxyFailures: 0,
rateLimitHits: 0,
};
this.responseTimes = [];
}
}
// Usage in main scraper
const metricsCollector = new MetricsCollector();
// Log metrics every 5 minutes
setInterval(() => {
const metrics = metricsCollector.getMetrics();
logger.info('Scraping metrics', metrics);
}, 5 * 60 * 1000);
Best Practices and Ethical Considerations
Ethical Scraping Guidelines
• Rate Limiting: Implement delays between requests (2-5 seconds minimum; see the sketch after this list)
• Respect robots.txt: Always check and follow robots.txt guidelines
• Use APIs First: Prefer official APIs when available
• User Agent: Use descriptive, honest user agents
• Data Usage: Only collect necessary data and respect privacy
• Terms of Service: Review and comply with platform terms
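A small helper covering the first two points, pacing requests and checking robots.txt before scraping, could look like the sketch below. The file name is hypothetical and the robots.txt parsing is deliberately simplistic (it ignores user-agent groups); a production setup would use a dedicated parser.
// src/utils/politeness.ts - hypothetical helper with simplistic robots.txt handling
import axios from 'axios';

// Wait a random 2-5 seconds between requests
export async function politeDelay(minMs = 2000, maxMs = 5000): Promise<void> {
  const delay = Math.floor(Math.random() * (maxMs - minMs + 1)) + minMs;
  await new Promise(resolve => setTimeout(resolve, delay));
}

// Rough robots.txt check: true if no Disallow rule prefixes the path
export async function isPathAllowed(baseUrl: string, path: string): Promise<boolean> {
  try {
    const robotsUrl = new URL('/robots.txt', baseUrl).toString();
    const { data } = await axios.get<string>(robotsUrl, { timeout: 5000 });
    const disallowed = data
      .split('\n')
      .filter(line => line.toLowerCase().startsWith('disallow:'))
      .map(line => line.slice('disallow:'.length).trim());
    return !disallowed.some(rule => rule !== '' && path.startsWith(rule));
  } catch {
    // Could not fetch robots.txt; treat this cautiously in production
    return true;
  }
}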
Performance Optimization
Optimize your JavaScript LinkedIn scraper for maximum performance and reliability.
// src/utils/performance.ts
import { LinkedInScraper } from '../scrapers/LinkedInScraper';
export class PerformanceOptimizer {
static async optimizePage(page: any) {
// Disable images and CSS for faster loading
await page.setRequestInterception(true);
page.on('request', (req: any) => {
const resourceType = req.resourceType();
if (['image', 'stylesheet', 'font'].includes(resourceType)) {
req.abort();
} else {
req.continue();
}
});
// Set aggressive timeouts
page.setDefaultNavigationTimeout(30000);
page.setDefaultTimeout(15000);
}
static async enableCaching(browser: any) {
// Enable cache for faster subsequent loads
const context = await browser.createIncognitoBrowserContext();
const page = await context.newPage();
await page.setCacheEnabled(true);
return { context, page };
}
static async parallelScraping(urls: string[], maxConcurrency = 3) {
const results = [];
const executing = [];
for (const url of urls) {
const promise = this.scrapeUrl(url).then(result => {
executing.splice(executing.indexOf(promise), 1);
return result;
});
results.push(promise);
executing.push(promise);
if (executing.length >= maxConcurrency) {
await Promise.race(executing);
}
}
return Promise.all(results);
}
private static async scrapeUrl(url: string) {
const scraper = new LinkedInScraper();
try {
return await scraper.scrapeLinkedInAds(url);
} finally {
await scraper.close();
}
}
}
Troubleshooting Common Issues
🚫 Browser Detection Issues
Problem: LinkedIn detects automated browsers
Solution: Use puppeteer-extra-plugin-stealth and randomize browser fingerprints
⚠️ Rate Limiting
Problem: Getting 429 errors or blocks
Solution: Implement exponential backoff and rotate proxies more frequently
🔧 Memory Issues
Problem: High memory usage with long-running scrapers
Solution: Close browser instances regularly and implement memory monitoring
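For the memory issue above, a lightweight watchdog based on process.memoryUsage() can tell you when it is time to recycle the browser. The threshold below is illustrative; tune it to your container limits.
// Memory watchdog sketch (threshold is illustrative)
const MEMORY_LIMIT_MB = 1500;

setInterval(() => {
  const heapUsedMb = process.memoryUsage().heapUsed / 1024 / 1024;
  if (heapUsedMb > MEMORY_LIMIT_MB) {
    // In the real project, log via the winston logger and trigger a browser restart
    console.warn(`Heap usage at ${heapUsedMb.toFixed(0)} MB, consider closing and relaunching the browser`);
  }
}, 60 * 1000);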
Conclusion
You've now built a comprehensive JavaScript LinkedIn scraper using Node.js and Puppeteer with enterprise-grade features. This tutorial covered everything from basic scraping to advanced anti-detection techniques, proxy integration, and production deployment.
Key Takeaways:
- ✅ Puppeteer provides powerful browser automation for LinkedIn scraping
- ✅ Proxy rotation is essential for avoiding IP blocks
- ✅ Anti-detection techniques help maintain stealth operations
- ✅ Error handling and retry logic ensure reliability
- ✅ Database integration enables data persistence and analytics
- ✅ Monitoring and metrics help optimize performance
Remember to always scrape responsibly, respect robots.txt files, implement appropriate delays, and consider the legal and ethical implications of your scraping activities. Happy scraping!
Ready to scrape LinkedIn ads?
Try our managed LinkedIn scraper API - no setup required, enterprise-grade infrastructure, and built-in proxy rotation.