DLEModModules for DLE • DLE Parser PRO

DLE Parser PRO

DLE Parser PRO — a professional module for automating the parsing and publishing of content from external sources in DataLife Engine. Supports three modes: HTML parsing (CSS selectors/XPath), import from RSS/Atom, and hybrid mode. Automatically detects CMSs (WordPress, Joomla, Drupal, etc.), downloads and converts images to WebP, and performs AI rewriting via DeepSeek. The built-in Round-Robin scheduler evenly distributes materials among sources.

Buy now
Module version3.0.0
PHP version7.4 - 8.4
DLE version13.x - 19․1

DLE Parser PRO is a comprehensive enterprise-level solution for website owners on DataLife Engine who need full automation of the process of filling a website with high-quality content. The module is a powerful system for extracting, processing, and publishing materials from external sources using advanced artificial intelligence technologies.

Module architecture: three parsing modes

HTML Parser — classic web scraping

  • Extracting content directly from the HTML structure of web pages
  • Support for complex pagination with customizable navigation patterns
  • Automatic detection of the site structure and CMS
  • Precise extraction via CSS selectors and XPath expressions
  • Processing dynamic content and AJAX loads
  • Support for bidirectional parsing (from newest to oldest / from oldest to newest)
  • Configuring page ranges with automatic progress tracking
  • Automatic downloading of files, images, videos, and galleries into DLE additional fields — via CSS selectors directly from the article HTML page
  • Support for all extraction types: href, src, data-src, data-href, content, text, html
  • Saving full HTML blocks (specification tables, formatted descriptions) into additional fields through the content cleaning filter

RSS/Atom Parser — working with news feeds

  • Native support for RSS 2.0, RSS 1.0 (RDF), and Atom 1.0 formats
  • Intelligent image extraction from multiple sources (enclosure, media:content, media:thumbnail, media:group)
  • Automatic processing of namespaces (media, content, dc, atom)
  • Extracting metadаta: author, publication date, categories
  • Support for full and short content (content:encoded, description)
  • Filtering and cleaning RSS content from advertising blocks
  • Priority retrieval of the main image via meta[property="og:image"] and meta[property="twitter:image"] directly from the article page; the RSS image is used as a fallback source

Hybrid Parser — the optimal combination of RSS and HTML

  • Using RSS to obtain a list of current materials
  • Parsing full content from the HTML version of the page
  • Priority data selection system (HTML takes precedence over RSS)
  • Merging metadata from both sources
  • Optimal processing speed with maximum extraction quality
  • Automatic detection of the most complete image source

Technological foundation and infrastructure

Intelligent CMS detection system

  • Automatic recognition of 18+ popular CMSs and frameworks
  • Supported platforms: WordPress, Joomla, Drupal, 1C-Bitrix, DLE, MODX, OpenCart
  • Blogging platforms: Ghost, Medium, Blogger, Tilda, Webflow
  • jаvascript frameworks: Next.js, Gatsby, Hugo, Jekyll
  • E-commerce: Shopify, WooCommerce, Magento
  • Analysis of HTTP headers and meta tags for accurate detection
  • Automatic suggestion of optimal CSS selectors for each CMS

AI rewriting via DeepSeek API

  • Integration with DeepSeek-V3 — an advanced language model with 671B parameters
  • Chunk-based processing: splitting long articles into optimal fragments
  • Preserving the HTML structure during rewriting (tags, formatting, lists)
  • Three-level processing: headings, short description, full text
  • Customizable prompts for each type of content
  • Automatic removal of AI artifacts (code blocks, explanations)
  • Rate limiting and API error handling with automatic retries
  • Cost efficiency: processing cost 20 times lower than GPT-4

Two-level protection bypass system:

  • Level 1: Enhanced cURL
    • HTTP/2 support with full Chrome 131 emulation
    • Sec-Fetch-* headers for bypassing basic filtering
    • Cookie persistence between requests
    • Automatic detection of Cloudflare challenges
  • Level 2: FlareSolverr Integration (optional)
    • Full-fledged headless Chrome for bypassing jаvascript challenges
    • Automatic Cloudflare captcha solving
    • Support for Turnstile and other protection mechanisms
    • Transparent switching when a block is detected
  • Intelligent detection of bypass necessity:
    • Checking for \"Just a moment\", \"Checking your browser\"
    • Detecting cf-browser-verification
    • Automatic fallback to standard cURL when available
  • System requirements for Cloudflare bypass:
    • Docker (for FlareSolverr)
    • Minimum 1GB RAM
    • VPS with container support

Professional image processing

  • Automatic image downloading with HTTPS and redirect support
  • Conversion to WebP to save 30-50% of disk space
  • Intelligent resize while preserving proportions (GD/Imagick)
  • Support for multiple formats: JPEG, PNG, GIF, WebP
  • Saving the main image in xfield with metadata
  • Replacing all images in content with local copies
  • Automatic generation of unique file names
  • File structure organization by date (YYYY-MM)

Round-Robin task scheduler

  • Even load distribution across all active sources
  • Automatic source rotation for balanced import
  • Progress tracking for each source individually
  • Configuring the number of posts per CRON run
  • Protecting the CRON endpoint with a Secret Key (32-character token)
  • Detailed logging of all parsing operations
  • Support for both old (engine/ajax/controller.php) and new (index.php?controller=ajax) DLE versions

Category management system

Intelligent category mapping

  • Automatic collection of categories from RSS feeds and HTML structure
  • Batch processing of articles to extract all unique categories
  • Visual interface for mapping source categories to DLE categories
  • Support for hierarchical DLE categories
  • Default category for unmapped materials
  • Multiple categories for a single item

Protection and reliability

Duplicate prevention system

  • Checking for the existence of a material by the source URL in xfields
  • Tracking the last processed position (page/URL)
  • Automatic skipping of already imported materials
  • Saving progress to the database for each source

Operational stability

  • Automatic reconnection to the database on timeouts
  • cURL error handling with detailed logging
  • Support for SSL certificates and bypassing blocks
  • User-Agent rotation to simulate browser requests
  • Timeout control for long operations

Cloudflare Bypass via FlareSolverr

  • Integration with FlareSolverr to bypass Cloudflare Bot Management
  • Automatic switching to a headless browser when protection is detected
  • Optional activation via settings (not required for all sources)
  • Graceful degradation: works with regular sites when FlareSolverr is disabled
  • Docker-based solution with automatic session management
  • Support for jаvascript challenges and cookie-based checks
  • Detailed logging of protection bypass attempts

Advanced features

Additional fields: downloading files, media, and galleries

  • Configure any number of additional fields for each source прямо from the add/edit form
  • For each field, set: the element CSS selector, the extraction attribute (href, src, data-src, data-href, content, text, html), and the action type
  • Supported action types: saving URL/text, file download, image download with metadata, video download, external video link (YouTube/Vimeo), gallery with bulk image download, gallery from a URL list
  • Gallery mode: automatic traversal of all found elements by selector, downloading each one and saving in DLE gallery format into a single field
  • Video files and downloadable files are saved in uploads/public_files/ with date-based organization (YYYY-MM)
  • Images from additional fields are saved in uploads/posts/ with automatic size detection and metadata generation in DLE format (width×height, file size)
  • Video fields are formatted in native DLE format: type 3 (local video) or type 1 (external link)
  • The extractExtraFieldsFromDom() method has been moved to the base BaseParser class (protected) — available for both HTML and Hybrid parsers without code duplication

Pagination and navigation setup

  • Support for standard patterns: /page/{page}/, ?page={page}, /p/{page}, /offset/{page}
  • Custom patterns for non-standard sites
  • Query parameters and complex URL schemes
  • Automatic construction of the next page URL
  • Configurable page range (start_page, end_page)
  • Specify the number of posts per page for accurate tracking

Flexible selector configuration

  • Support for CSS selectors of any complexity (classes, IDs, attributes, pseudo-classes)
  • XPath compatibility for complex structures
  • Exclusion selectors to remove ads and noise
  • Built-in tester with result preview
  • Selector validation before saving

Administrative panel

  • Intuitive interface for managing sources
  • Detailed statistics for each source (processed materials, progress, last run)
  • Quick enable/disable of sources
  • Reset progress for reprocessing
  • Editing sources while preserving progress
  • Built-in module update checking system
  • Logging all actions in admin_logs

Intelligent image preservation system during AI processing:

    • Media element extraction before rewriting:
      • Automatic detection of <img>, <figure>, <picture>, <iframe>, <video>
      • Replacement with HTML comment placeholders
      • Preserving positions in the document structure
    • Three-level restoration system:
      • Level 1: Direct matching by markers
      • Level 2: Intelligent insertion between paragraphs
      • Level 3: Appending to the end of the document in case of complete loss
    • Final cleanup:
      • Removing accidentally saved markers from title/description
      • Normalizing HTML structure
      • Validating media elements

Multiple sources for extracting the main image:

    • Open Graph and Twitter meta tags:
      • meta[property=\"og:image\"]
      • meta[name=\"twitter:image\"]
      • meta[name=\"twitter:image:src\"]
    • Responsive images:
      • Support for the srcset attribute
      • Automatic selection of the highest resolution
      • Fallback to data-src and data-lazy-src
    • Nested structures:
      • Extraction from <figure>, <picture> containers
      • Searching for img inside wrapper elements
      • Support for CSS background-image

Advantages of use

  • Time savings: full automation of the site content filling process — from parsing to publishing
  • Content uniqueness: AI rewriting ensures the originality of texts that pass plagiarism checks
  • SEO optimization: automatic generation of SEF URLs (alt_name), structured data
  • Low cost: using DeepSeek reduces AI costs by 20 times compared to GPT-4
  • Scalability: unlimited number of sources with Round-Robin balancing
  • Reliability: protection against duplicates, automatic connection recovery
  • Ease of setup: CMS auto-detection, built-in selector tester
  • Versatility: support for any sites with HTML structure, RSS feeds and hybrid schemes
  • Modularity: flexible architecture with the ability to disable unnecessary components
  • Performance: chunk-based processing, optimized SQL queries
  • Bypassing site protection: automatic bypass of Cloudflare and other anti-bot systems without proxy services
  • Configuration flexibility: ability to work with both protected and regular sources
  • Proxy savings: FlareSolverr — a free alternative to paid proxy services

Use Cases

  • News aggregators: automatic collection of news from several regional sources
  • Topical blogs: translation and adaptation of foreign content for a Russian-speaking audience
  • Review portals: import of reviews of technologies, gadgets, and software
  • Regional media: aggregation of local news followed by rewriting
  • Entertainment resources: automatic filling of sections with articles, guides, and top lists
  • Educational platforms: import of educational materials, articles, and manuals
  • Business portals: collection of industry news and analytics


Technical requirements and compatibility

  • DLE versions: 13.x, 14.x, 15.x, 16.x, 17.x, 18.x, 19.x, 19.1 (full compatibility)
  • PHP: 7.4+ (recommended 8.0+)
  • PHP extensions: CURL, DOM, XPath, libxml, GD or Imagick, JSON, mbstring
  • MySQL: 5.7+ or MariaDB 10.2+
  • Access rights: write access to /uploads/posts/, /engine/data/, /engine/cache/
  • External APIs: DeepSeek API (optional, for AI rewriting)
  • CRON: access to crontab task configuration

Screenshots

Choose a suitable plan

We offer flexible licensing options depending on your needs.

Standard

5000 ₽
  • Unlimited number of sites
  • Open source code
  • Basic
  • No further updates

Extended

6000 ₽
  • Unlimited number of sites
  • Open source code
  • priority
  • Free updates — (12 months)

Premium

11000 ₽
  • Unlimited number of sites
  • Open source code
  • Priority support + consultation
  • Free updates — forever
  • Module installation and setup
  • Adapted to your website (including reasonable code refinement for individual requirements)

История изменений

Все версии (9)
Все версии (9)
Версия 3.0.0
Версия 2.1.4
Версия 2.1.3
Версия 2.1.2
Версия 2.1.1
Версия 2.1.0
Версия 2.0.0
Версия 1.0.1
Версия 1.0.0
Релизов: 9
Функций: 21
Исправлений: 14
Улучшений: 13
Версия 3.0.0 27.04.2026
Новое
Добавлена полноценная интеграция с DLE Multi-Language: автоматическое сохранение переводов в title_{iso}, short_story_{iso}, full_story_{iso} и tags_{iso}.
Новое
Добавлен новый режим парсинга Sitemap с поддержкой больших sitemap-файлов, вложенных sitemap index и кеширования списка URL.
Новое
Добавлен реальный dry-run режим тестирования: проверка теперь выполняет симуляцию полного парсинга без записи в базу данных и показывает итоговый publish payload.
Новое
Добавлены структурированные логи парсинга со стадиями обработки, статусами, временем выполнения, source_id, item_url и информацией об ошибках.
Новое
Добавлен мониторинг состояния источников: health status, fail streak, duplicate rate, average fetch/run time и время последнего успешного запуска.
Улучшение
Полностью переработана логика HTML-парсинга списка материалов: теперь обрабатываются все найденные контейнеры, а не только первый matched node.
Улучшение
HTML progress переведен на URL/cursor модель вместо count-based прогресса, что снижает риск пропуска новых материалов.
Улучшение
Исправлена стратегия cursor для RSS, Hybrid и Sitemap в режиме new_to_old, чтобы новые материалы в верхней части источника не пропускались.
Улучшение
Улучшен Hybrid режим: добавлена обработка ошибок по материалам, advancement cursor при сбоях и защита от бесконечного застревания на одном item.
Улучшение
Добавлена поддержка HTML category selector в Hybrid режиме и политика объединения категорий RSS/HTML.
Улучшение
Усилен механизм поиска дублей: добавлена нормализация URL, GUID/external id, fingerprint заголовка и hash контента.
Улучшение
Улучшена нормализация URL перед проверкой дублей: учитываются trailing slash, fragment, tracking-параметры и различия в формате ссылок.
Улучшение
Усилен CSS selector engine: добавлена поддержка групп, комбинаторов, атрибутных селекторов и ряда pseudo-селекторов.
Улучшение
Добавлены предупреждения о поддерживаемом subset CSS-селекторов в help-разделе и test result.
Улучшение
Улучшена AI-обработка HTML: сохранение структуры тегов, защита media/code/pre блоков, повторная проверка неполных переводов и более стабильная работа с длинным контентом.
Улучшение
Улучшена генерация и перевод тегов, включая fallback-механизм, если AI не вернул корректный результат.
Исправление
Исправлено сохранение изображений при отключенном reformat: теперь сохраняется реальный исходный формат файла.
Исправление
Исправлены случаи, когда AI мог вернуть ссылки или HTML, не соответствующие настройкам очистки контента.
Исправление
Исправлена обработка figure/img блоков: изображения корректно извлекаются, очищаются и могут быть загружены на сервер.
Исправление
Исправлены случаи, когда code/pre блоки могли быть пропущены или удалены во время AI-обработки.
Исправление
Исправлены проблемы с незакрытыми ul/ol/li тегами в AI-переводах.
Исправление
Исправлена совместимость DB reconnect check с PHP 8 и mysqli.
Исправление
Найдены и исправлены другие мелкие ошибки.
Версия 2.1.4 11.03.2026
Исправление
Обнаружены и исправлены некоторые баги.
Comments 4
  1. Прошу уточнить приблизительную стоимость рерайта ДипСиком за 1000 знаков. Возможна ли настройка на сокращение текста, например рерайт статьи 5000 знаков в статью 1000 знаков. Если уже есть рерайт модули, работающие на других ИИ, можно ли их включить в данный модуль и выбирать в настройках оптимальную нейронку под определенную тему. Например, по теме медицины пишет Джемини, по теме ИТ - Клод, по искусству - ДипСик и тд. Или же исходя из стоимости токенов - где выгоднее, на тот ИИ и переключиться.
  2. Здравствуйте. Я какая будет цена AI-рерайта через DeepSeek к примеру за одну новость. И как пополнят DeepSeek
  3. +3
    Здравствуйте! Планирую купить DLE Parser PRO. Подскажите, смогу ли я использовать одну лицензию на 2-3 сайтах одновременно? Или она строго привязывается к одному домену?
    1. +1
      Здравствуйте. Да, можете. Модуль с открытым исходным кодом и не привязан к домену. Купив один раз, вы можете использовать его на неограниченном количестве сайтов.