Backup considerations


A Funnelback installation contains both software files and data files:

  • Software files are created and unpacked during the installation, and are not dependent on the collections and indexes configured on the system (e.g. Java JAR & WAR files, Perl scripts, patches, etc.)
  • Data files encompass:
    • Global configuration
    • User accounts
    • Collection-specific data: configuration, indexes, WARC data files, query logs, etc.

Different backup strategies apply to these file types, as detailed below.

Backup priorities

The choice of backup strategy usually depends on how urgently each of the services provided by Funnelback must be restored:

  • Public UI: Query processing functionality
  • Admin UI: Access to Analytics Reports
  • Admin UI: Ability to run updates & crawls
  • Admin UI / File system: Access to query and click logs

Query processing usually has the highest priority: restoring the search service of the organisation's website or intranet is more important than providing access to Analytics Reports. Similarly, crawls can often be postponed without major impact.

Backup strategies

Back up the whole Funnelback folder

This is the simplest strategy: it consists of taking a complete copy of the Funnelback installation folder, which contains both the software files and the data files.

Restoring such a backup is simple as the archive just needs to be unpacked in place. If the archive is unpacked on a different server than the original one, the Funnelback services will need to be re-created using $SEARCH_HOME/bin/setup/start_funnelback_on_boot.pl.

Additionally, individual files can be restored selectively from the archive.
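As an illustration of this strategy on Linux, the sketch below archives the whole installation folder with tar. The function name backup_full and the archive naming scheme are hypothetical, and the destination folder is assumed to exist and to sit outside the installation folder.

```shell
# Sketch: archive a whole Funnelback installation folder with tar.
# backup_full and the archive file name are hypothetical, for illustration only.
backup_full() {
    local search_home="$1"   # Funnelback installation folder ($SEARCH_HOME)
    local dest_dir="$2"      # existing folder outside the installation folder
    local stamp archive
    stamp="$(date +%Y%m%d-%H%M%S)"
    archive="$dest_dir/funnelback-full-$stamp.tar.gz"

    # -C makes the paths stored in the archive relative to the parent folder,
    # so the archive can later be unpacked in place.
    tar -czf "$archive" -C "$(dirname "$search_home")" "$(basename "$search_home")" || return 1
    echo "$archive"
}
```

Calling backup_full "$SEARCH_HOME" /backups prints the path of the new archive. Unpacking it with tar -xzf and -C pointing at the parent folder restores the installation, and naming a path stored in the archive after the archive file name extracts only that file.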

The main drawback of this approach is that it may waste disk space:

  • The software files never change after installation (except after patching) and don't need to be repeatedly backed up
  • Different backup intervals cannot be applied to different files, so files that have not changed between two backups are archived again. For instance, it may be desirable to back up small configuration files every hour even if the actual collection data and indexes (large files) change only once a week.

Back up files individually

This strategy applies different backup intervals depending on the types of the files being backed up.

Software files

Software files need to be backed up only once after the installation, and again every time a patch is applied. Software files are all the files not listed as data files below (generally speaking, every folder except conf/ and data/).

Data files

Data files are located in different places in the Funnelback installation folder. The table below lists the data file locations, their purpose, and suggested backup intervals.

Directory | Purpose | Suggested backup interval
conf/ | Server-wide configuration | Weekly - Global configuration usually doesn't change often
conf/<collection>/ | Collection-specific configuration | Hourly during implementation / development. Daily afterwards - To account for daily changes in best bets, synonyms, curator, etc.
admin/users/ | User accounts | Weekly
data/<collection>/ | Collection-specific data | Consider backing up only specific folders (below) rather than the whole data folder
data/<collection>/archive/ | Collection query & click logs | Daily - Meaning at worst 1 day of logs will be lost. Note that the archiving of logs is tied to the collection update schedule.
data/<collection>/live/ | Collection live index and data | Tied to the crawl / update interval (usually daily) - This is needed to restore query processing functionality
data/<collection>/live/logs/ | Latest query & click logs since last update | Hourly - Only if restoring query & click log data from the last hour is required. Otherwise the backup of the archive folder is enough
data/<collection>/offline/ | Staging area for the current update | Backing up this folder is not necessary as it's not used for query processing, and only contains data from the currently running update, or the previous one.
admin/reports/<collection>/reports.sqlitedb | Analytics database | Tied to the Analytics update interval (usually daily) - The database can be reconstructed from the query & click logs present in the collection data archive folder.
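One way to apply different intervals on Linux is to archive a named set of paths and to schedule one call per interval. The sketch below assumes tar is available; backup_set, the labels and the collection name "example" are hypothetical.

```shell
# Sketch: archive a chosen set of data folders under a label such as "hourly".
# backup_set, the labels and the archive file name are hypothetical.
# Usage: backup_set LABEL SEARCH_HOME DEST_DIR PATH...
backup_set() {
    local label="$1" search_home="$2" dest_dir="$3"
    shift 3
    local stamp archive
    stamp="$(date +%Y%m%d-%H%M%S)"
    archive="$dest_dir/funnelback-$label-$stamp.tar.gz"

    # The remaining arguments are paths relative to the installation folder.
    tar -czf "$archive" -C "$search_home" "$@" || return 1
    echo "$archive"
}

# Possible calls for a hypothetical collection named "example":
#   backup_set hourly "$SEARCH_HOME" /backups conf/example data/example/live/logs
#   backup_set daily  "$SEARCH_HOME" /backups conf/example data/example/archive
#   backup_set weekly "$SEARCH_HOME" /backups conf admin/users
```

Because live/ is a symbolic link, passing data/<collection>/live itself to tar stores only the link, not the index files. Adding tar's -h option (follow symbolic links) is one way to archive the index files as well.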

Collection update schedules

Funnelback's collection update schedule uses the operating system scheduler (Task Scheduler under Windows, crontab under Linux). These OS-level configuration files should also be backed up.
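On Linux, the scheduler configuration can be captured with crontab -l. The sketch below assumes the updates are scheduled in the crontab of the user running the script, which may not match every installation; save_crontab and the output file name are hypothetical.

```shell
# Sketch: save the current user's crontab to a dated file.
# save_crontab and the output file name are hypothetical.
save_crontab() {
    local dest_dir="$1"
    local out
    out="$dest_dir/crontab-$(id -un)-$(date +%Y%m%d).txt"

    # crontab -l prints the invoking user's crontab and fails if none is installed.
    if crontab -l > "$out" 2>/dev/null; then
        echo "$out"
    else
        rm -f "$out"
        return 1
    fi
}
```

The saved file can be loaded again on the restored server by passing it to crontab as its only argument, which replaces that user's current crontab.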

Data files restoration

Data files can be restored by simply copying them back to their original locations in a Funnelback installation.

In addition, the live/ and offline/ folders inside a collection data folder ($SEARCH_HOME/data/<collection>/) are symbolic links and may need to be re-created if they were not part of the backup. These links point to the one/ and two/ sibling folders under the collection data folder. To identify which folder should be linked to live/, inspect the one/log/ and two/log/ folders to determine which holds the most recent update. Create the live/ symbolic link pointing to this folder, and the offline/ link pointing to the other.

Symlink creation on Linux
ln -s $SEARCH_HOME/data/<collection>/one $SEARCH_HOME/data/<collection>/live
ln -s $SEARCH_HOME/data/<collection>/two $SEARCH_HOME/data/<collection>/offline
Symlink creation on Windows
mklink /d %SEARCH_HOME%\data\<collection>\live %SEARCH_HOME%\data\<collection>\one
mklink /d %SEARCH_HOME%\data\<collection>\offline %SEARCH_HOME%\data\<collection>\two
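On Linux, the choice between one/ and two/ can be scripted. The sketch below compares the most recently modified file in each log folder, which is only a heuristic: it assumes file modification times survived the restore and reflect the last update. The function names newest_file and relink_collection are hypothetical.

```shell
# Sketch: re-create live/ and offline/ based on the newest file in one/log and two/log.
# newest_file and relink_collection are hypothetical helper names.

# Prints the most recently modified entry of a folder (nothing if it is empty or missing).
newest_file() {
    local dir="$1" name
    name="$(ls -t "$dir" 2>/dev/null | head -n 1)"
    if [ -n "$name" ]; then
        printf '%s\n' "$dir/$name"
    fi
}

relink_collection() {
    local data_dir="$1"   # e.g. $SEARCH_HOME/data/<collection>
    local one_latest two_latest live offline name

    # Refuse to touch live/ or offline/ if they exist as real folders.
    for name in live offline; do
        if [ -e "$data_dir/$name" ] && [ ! -L "$data_dir/$name" ]; then
            echo "$data_dir/$name exists and is not a symbolic link" >&2
            return 1
        fi
    done

    one_latest="$(newest_file "$data_dir/one/log")"
    two_latest="$(newest_file "$data_dir/two/log")"

    # Default to one/ unless two/ holds the more recently modified log file.
    live=one; offline=two
    if [ -n "$two_latest" ] && { [ -z "$one_latest" ] || [ "$two_latest" -nt "$one_latest" ]; }; then
        live=two; offline=one
    fi

    # -f replaces an existing link; -n avoids following a link that points to a folder.
    ln -sfn "$data_dir/$live" "$data_dir/live" || return 1
    ln -sfn "$data_dir/$offline" "$data_dir/offline" || return 1
    echo "$live"
}
```

Calling relink_collection "$SEARCH_HOME/data/<collection>" prints the name of the folder now linked to live/. Checking the contents of the log folders by hand remains advisable before relying on the result.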

Backing up Push collections

Push collections need to be backed up differently as they maintain internal state: simply copying the files at a given time will not result in a consistent backup.

Backing up Push collections should be done through the Snapshot API endpoint. This API creates a snapshot of the Push collection on the Funnelback server, which can then be copied to a separate location for backup.

Please see the corresponding documentation for more information about Push snapshots, backups and restoration procedures.
