
The Export Pipeline System

Exporting data usually involves a number of tasks: Data needs to be collected, converted and organised into a target structure such as XML files or folders containing images and text data.

Epigraf allows you to configure the export process as pipelines. A pipeline is a collection of tasks: it defines the types of tasks, their sequence, and the necessary parameters. Common tasks include:

  • Collecting data from the database in XML, JSON or CSV format
  • Performing search-and-replace operations using regular expressions
  • Transforming data with XSLT
  • Copying files from the file system
  • Bundling multiple output files into a single file
  • Creating a ZIP archive

Running a pipeline generates a job. In the job, the defined tasks are processed one after the other. Each task may contribute content or transformations to an output file. Finally, the resulting file is stored in the file system or made available for download.
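The flow described above can be sketched as follows. Epigraf itself is implemented in CakePHP, so this is only a language-neutral illustration in Python; the names `Task`, `Pipeline` and `run` are hypothetical, not Epigraf's actual classes.

```python
# Illustrative sketch: a pipeline is an ordered list of tasks, and running
# it creates a job that processes the tasks one after the other. Each task
# contributes content or a transformation to the accumulated output.

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    name: str
    apply: Callable[[str], str]  # transforms the accumulated output

@dataclass
class Pipeline:
    tasks: list[Task] = field(default_factory=list)

    def run(self) -> str:
        """Process the tasks sequentially and return the resulting output."""
        output = ""
        for task in self.tasks:
            output = task.apply(output)
        return output

# Example: first collect data, then perform a search-and-replace step.
pipeline = Pipeline([
    Task("collect", lambda out: out + "<items><item>Foo</item></items>"),
    Task("replace", lambda out: out.replace("Foo", "Bar")),
])
print(pipeline.run())  # <items><item>Bar</item></items>
```

In the real system, the final output of such a run is stored in the file system or offered for download.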

Batch Manipulation

Epigraf’s job system allows multiple records to be imported, transferred, exported or modified in a batch process. Import, export and mutate operations are implemented in job classes found in the src/Model/Entity/Jobs directory. Transfer operations import records from a source database into a target database.

Each job contains tasks. The available task classes can be found in the src/Model/Entity/Tasks directory.

The available mutate operations, i.e. batch operations for modifying records, are listed in the table classes. For example, the fulltext index can be regenerated, and articles can be moved to another project or deleted in batches (see ArticlesTable.php).
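A mutate operation of this kind can be pictured as a function applied to records chunk by chunk, so that large tables are not processed in one go. Again a Python sketch with hypothetical names (`mutate_in_batches`, `move_to_project`), not Epigraf's PHP implementation:

```python
# Illustrative sketch of a batch mutate operation: apply a mutation to
# records in fixed-size chunks, as a job would when modifying many rows.

def mutate_in_batches(records, mutate, batch_size=2):
    """Apply the mutate function to the records, one chunk at a time."""
    for start in range(0, len(records), batch_size):
        for record in records[start:start + batch_size]:
            mutate(record)

articles = [{"id": i, "project_id": 1} for i in range(5)]

def move_to_project(article, target=2):
    # Corresponds to a mutate operation such as moving articles
    # to another project.
    article["project_id"] = target

mutate_in_batches(articles, move_to_project)
print(articles[0])  # {'id': 0, 'project_id': 2}
```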

Background Jobs

Background jobs can be used for long-running tasks that should not block the user interface. To run jobs as background tasks:

  • In the jobs configuration in config/app.php, configure the connection to a Redis server. Redis is used to manage queues and job statuses.
  • In the jobs configuration, set the delayed flag to true. With this flag set, jobs are not executed immediately but are put into a queue; see JobsController::execute() for details.
  • Start the worker that processes queued jobs by running bin/cake jobs process. See the JobsCommand class in the src/command folder for implementation details.
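The delayed-job pattern behind these steps can be sketched as follows. This is an illustration only: a Python deque and dict stand in for the Redis queue and status store, and all names (`execute`, `run_job`, `worker`) are hypothetical, not Epigraf's actual CakePHP code.

```python
# Sketch of delayed execution: with the delayed flag set, executing a job
# only enqueues it; a separate worker process drains the queue and runs
# each job, updating its status.

from collections import deque

queue = deque()   # stand-in for the Redis queue
status = {}       # stand-in for job statuses managed in Redis

def execute(job_id, delayed=True):
    if delayed:
        queue.append(job_id)       # enqueue instead of running immediately
        status[job_id] = "queued"
    else:
        run_job(job_id)

def run_job(job_id):
    status[job_id] = "done"        # the real worker would process the tasks

def worker():
    """Analogous to the worker process: drain the queue, run each job."""
    while queue:
        run_job(queue.popleft())

execute(1)
execute(2)
worker()
print(status)  # {1: 'done', 2: 'done'}
```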

API Packages

The experimental Epigraf package (R) and the Epygraf package (Python) are under active development to facilitate working with Epigraf data. They provide functions for data transfer via the Epigraf APIs, for example to prepare data imports from social media datasets and to prepare data analyses.