DLE Parser PRO
DLE Parser PRO — a professional module for automating the parsing and publishing of content from external sources in DataLife Engine. Supports three modes: HTML parsing (CSS selectors/XPath), import from RSS/Atom, and hybrid mode. Automatically detects CMSs (WordPress, Joomla, Drupal, etc.), downloads and converts images to WebP, and performs AI rewriting via DeepSeek. The built-in Round-Robin scheduler evenly distributes materials among sources.
Buy nowDLE Parser PRO is a comprehensive enterprise-level solution for website owners on DataLife Engine who need full automation of the process of filling a website with high-quality content. The module is a powerful system for extracting, processing, and publishing materials from external sources using advanced artificial intelligence technologies.
Module architecture: three parsing modes
HTML Parser — classic web scraping
- Extracting content directly from the HTML structure of web pages
- Support for complex pagination with customizable navigation patterns
- Automatic detection of the site structure and CMS
- Precise extraction via CSS selectors and XPath expressions
- Processing dynamic content and AJAX loads
- Support for bidirectional parsing (from newest to oldest / from oldest to newest)
- Configuring page ranges with automatic progress tracking
- Automatic downloading of files, images, videos, and galleries into DLE additional fields — via CSS selectors directly from the article HTML page
- Support for all extraction types: href, src, data-src, data-href, content, text, html
- Saving full HTML blocks (specification tables, formatted descriptions) into additional fields through the content cleaning filter
RSS/Atom Parser — working with news feeds
- Native support for RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 formats
- Intelligent image extraction from multiple sources (enclosure, media:content, media:thumbnail, media:group)
- Automatic processing of namespaces (media, content, dc, atom)
- Extracting metadаta: author, publication date, categories
- Support for full and short content (content:encoded, description)
- Filtering and cleaning RSS content from advertising blocks
- Priority retrieval of the main image via meta[property="og:image"] and meta[property="twitter:image"] directly from the article page; the RSS image is used as a fallback source
Hybrid Parser — the optimal combination of RSS and HTML
- Using RSS to obtain a list of current materials
- Parsing full content from the HTML version of the page
- Priority data selection system (HTML takes precedence over RSS)
- Merging metadata from both sources
- Optimal processing speed with maximum extraction quality
- Automatic detection of the most complete image source
Technological foundation and infrastructure
Intelligent CMS detection system
- Automatic recognition of 18+ popular CMSs and frameworks
- Supported platforms: WordPress, Joomla, Drupal, 1C-Bitrix, DLE, MODX, OpenCart
- Blogging platforms: Ghost, Medium, Blogger, Tilda, Webflow
- jаvascript frameworks: Next.js, Gatsby, Hugo, Jekyll
- E-commerce: Shopify, WooCommerce, Magento
- Analysis of HTTP headers and meta tags for accurate detection
- Automatic suggestion of optimal CSS selectors for each CMS
AI rewriting via DeepSeek API
- Integration with DeepSeek-V3 — an advanced language model with 671B parameters
- Chunk-based processing: splitting long articles into optimal fragments
- Preserving the HTML structure during rewriting (tags, formatting, lists)
- Three-level processing: headings, short description, full text
- Customizable prompts for each type of content
- Automatic removal of AI artifacts (code blocks, explanations)
- Rate limiting and API error handling with automatic retries
- Cost efficiency: processing cost 20 times lower than GPT-4
Two-level protection bypass system:
- Level 1: Enhanced cURL
- HTTP/2 support with full Chrome 131 emulation
- Sec-Fetch-* headers for bypassing basic filtering
- Cookie persistence between requests
- Automatic detection of Cloudflare challenges
- Level 2: FlareSolverr Integration (optional)
- Full-fledged headless Chrome for bypassing jаvascript challenges
- Automatic Cloudflare captcha solving
- Support for Turnstile and other protection mechanisms
- Transparent switching when a block is detected
- Intelligent detection of bypass necessity:
- Checking for \"Just a moment\", \"Checking your browser\"
- Detecting cf-browser-verification
- Automatic fallback to standard cURL when available
- System requirements for Cloudflare bypass:
- Docker (for FlareSolverr)
- Minimum 1GB RAM
- VPS with container support
Professional image processing
- Automatic image downloading with HTTPS and redirect support
- Conversion to WebP to save 30-50% of disk space
- Intelligent resize while preserving proportions (GD/Imagick)
- Support for multiple formats: JPEG, PNG, GIF, WebP
- Saving the main image in xfield with metadata
- Replacing all images in content with local copies
- Automatic generation of unique file names
- File structure organization by date (YYYY-MM)
Round-Robin task scheduler
- Even load distribution across all active sources
- Automatic source rotation for balanced import
- Progress tracking for each source individually
- Configuring the number of posts per CRON run
- Protecting the CRON endpoint with a Secret Key (32-character token)
- Detailed logging of all parsing operations
- Support for both old (engine/ajax/controller.php) and new (index.php?controller=ajax) DLE versions
Category management system
Intelligent category mapping
- Automatic collection of categories from RSS feeds and HTML structure
- Batch processing of articles to extract all unique categories
- Visual interface for mapping source categories to DLE categories
- Support for hierarchical DLE categories
- Default category for unmapped materials
- Multiple categories for a single item
Protection and reliability
Duplicate prevention system
- Checking for the existence of a material by the source URL in xfields
- Tracking the last processed position (page/URL)
- Automatic skipping of already imported materials
- Saving progress to the database for each source
Operational stability
- Automatic reconnection to the database on timeouts
- cURL error handling with detailed logging
- Support for SSL certificates and bypassing blocks
- User-Agent rotation to simulate browser requests
- Timeout control for long operations
Cloudflare Bypass via FlareSolverr
- Integration with FlareSolverr to bypass Cloudflare Bot Management
- Automatic switching to a headless browser when protection is detected
- Optional activation via settings (not required for all sources)
- Graceful degradation: works with regular sites when FlareSolverr is disabled
- Docker-based solution with automatic session management
- Support for jаvascript challenges and cookie-based checks
- Detailed logging of protection bypass attempts
Advanced features
Additional fields: downloading files, media, and galleries
- Configure any number of additional fields for each source прямо from the add/edit form
- For each field, set: the element CSS selector, the extraction attribute (href, src, data-src, data-href, content, text, html), and the action type
- Supported action types: saving URL/text, file download, image download with metadata, video download, external video link (YouTube/Vimeo), gallery with bulk image download, gallery from a URL list
- Gallery mode: automatic traversal of all found elements by selector, downloading each one and saving in DLE gallery format into a single field
- Video files and downloadable files are saved in uploads/public_files/ with date-based organization (YYYY-MM)
- Images from additional fields are saved in uploads/posts/ with automatic size detection and metadata generation in DLE format (width×height, file size)
- Video fields are formatted in native DLE format: type 3 (local video) or type 1 (external link)
- The extractExtraFieldsFromDom() method has been moved to the base BaseParser class (protected) — available for both HTML and Hybrid parsers without code duplication
Pagination and navigation setup
- Support for standard patterns: /page/{page}/, ?page={page}, /p/{page}, /offset/{page}
- Custom patterns for non-standard sites
- Query parameters and complex URL schemes
- Automatic construction of the next page URL
- Configurable page range (start_page, end_page)
- Specify the number of posts per page for accurate tracking
Flexible selector configuration
- Support for CSS selectors of any complexity (classes, IDs, attributes, pseudo-classes)
- XPath compatibility for complex structures
- Exclusion selectors to remove ads and noise
- Built-in tester with result preview
- Selector validation before saving
Administrative panel
- Intuitive interface for managing sources
- Detailed statistics for each source (processed materials, progress, last run)
- Quick enable/disable of sources
- Reset progress for reprocessing
- Editing sources while preserving progress
- Built-in module update checking system
- Logging all actions in admin_logs
Intelligent image preservation system during AI processing:
-
- Media element extraction before rewriting:
- Automatic detection of <img>, <figure>, <picture>, <iframe>, <video>
- Replacement with HTML comment placeholders
- Preserving positions in the document structure
- Three-level restoration system:
- Level 1: Direct matching by markers
- Level 2: Intelligent insertion between paragraphs
- Level 3: Appending to the end of the document in case of complete loss
- Final cleanup:
- Removing accidentally saved markers from title/description
- Normalizing HTML structure
- Validating media elements
- Media element extraction before rewriting:
Multiple sources for extracting the main image:
-
- Open Graph and Twitter meta tags:
- meta[property=\"og:image\"]
- meta[name=\"twitter:image\"]
- meta[name=\"twitter:image:src\"]
- Responsive images:
- Support for the srcset attribute
- Automatic selection of the highest resolution
- Fallback to data-src and data-lazy-src
- Nested structures:
- Extraction from <figure>, <picture> containers
- Searching for img inside wrapper elements
- Support for CSS background-image
- Open Graph and Twitter meta tags:
Advantages of use
- Time savings: full automation of the site content filling process — from parsing to publishing
- Content uniqueness: AI rewriting ensures the originality of texts that pass plagiarism checks
- SEO optimization: automatic generation of SEF URLs (alt_name), structured data
- Low cost: using DeepSeek reduces AI costs by 20 times compared to GPT-4
- Scalability: unlimited number of sources with Round-Robin balancing
- Reliability: protection against duplicates, automatic connection recovery
- Ease of setup: CMS auto-detection, built-in selector tester
- Versatility: support for any sites with HTML structure, RSS feeds and hybrid schemes
- Modularity: flexible architecture with the ability to disable unnecessary components
- Performance: chunk-based processing, optimized SQL queries
- Bypassing site protection: automatic bypass of Cloudflare and other anti-bot systems without proxy services
- Configuration flexibility: ability to work with both protected and regular sources
- Proxy savings: FlareSolverr — a free alternative to paid proxy services
Use Cases
- News aggregators: automatic collection of news from several regional sources
- Topical blogs: translation and adaptation of foreign content for a Russian-speaking audience
- Review portals: import of reviews of technologies, gadgets, and software
- Regional media: aggregation of local news followed by rewriting
- Entertainment resources: automatic filling of sections with articles, guides, and top lists
- Educational platforms: import of educational materials, articles, and manuals
- Business portals: collection of industry news and analytics
Technical requirements and compatibility
- DLE versions: 13.x, 14.x, 15.x, 16.x, 17.x, 18.x, 19.x, 19.1 (full compatibility)
- PHP: 7.4+ (recommended 8.0+)
- PHP extensions: CURL, DOM, XPath, libxml, GD or Imagick, JSON, mbstring
- MySQL: 5.7+ or MariaDB 10.2+
- Access rights: write access to /uploads/posts/, /engine/data/, /engine/cache/
- External APIs: DeepSeek API (optional, for AI rewriting)
- CRON: access to crontab task configuration
Screenshots
Choose a suitable plan
We offer flexible licensing options depending on your needs.













