Backup considerations
A Funnelback installation contains both software files and data files:
- Software files are created and unpacked during the installation, and are not dependent on the collections and indexes configured on the system (e.g. Java JAR & WAR files, Perl scripts, patches, etc.)
- Data files encompass:
- Global configuration
- User accounts
- Collection-specific data: configuration, indexes, WARC data files, query logs, etc.
Different backup strategies apply to these file types; they are detailed below.
Backup priorities
Backup strategies will usually be chosen depending on the priority of restoring services provided by Funnelback:
- Public UI: Query processing functionality
- Admin UI: Access to Analytics Reports
- Admin UI: Ability to run updates & crawls
- Admin UI / File system: Access to query and click logs
The query processing functionality usually has the highest priority - restoring the search service of the organisation's website or intranet is more important than providing access to Analytics Reports. Similarly, crawls can often be postponed without major impact.
Backup strategies
Backup the whole Funnelback folder
This is the simplest strategy: it consists of taking a complete copy of the Funnelback installation folder, which contains both the software files and the data files.
Restoring such a backup is simple as the archive just needs to be unpacked in place. If the archive is unpacked on a different server than the original one, the Funnelback services will need to be re-created using $SEARCH_HOME/bin/setup/start_funnelback_on_boot.pl.
Additionally, individual files can be restored selectively from the archive.
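As a sketch, a whole-folder backup, full restore, and selective restore could look like the following on Linux. The temp directories stand in for the real `$SEARCH_HOME` and backup locations, and the file names are illustrative assumptions, not actual Funnelback files.

```shell
# Demo: temp dirs stand in for the real $SEARCH_HOME and backup location.
SEARCH_HOME="$(mktemp -d)"
BACKUP_DIR="$(mktemp -d)"

# Simulate an installation containing a configuration file.
mkdir -p "$SEARCH_HOME/conf" "$SEARCH_HOME/lib"
echo "service_name=funnelback" > "$SEARCH_HOME/conf/global.cfg"

# Back up the whole Funnelback folder.
tar czf "$BACKUP_DIR/funnelback-full.tar.gz" -C "$SEARCH_HOME" .

# Restore: unpack the archive in place (here into an emptied folder).
rm -rf "$SEARCH_HOME" && mkdir -p "$SEARCH_HOME"
tar xzf "$BACKUP_DIR/funnelback-full.tar.gz" -C "$SEARCH_HOME"

# A single file can also be restored selectively from the archive:
tar xzf "$BACKUP_DIR/funnelback-full.tar.gz" -C "$SEARCH_HOME" ./conf/global.cfg

# On a different server, re-create the services afterwards with:
#   $SEARCH_HOME/bin/setup/start_funnelback_on_boot.pl
```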
The main drawback of this approach is that it may waste disk space:
- The software files never change after installation (except when a patch is applied) and don't need to be repeatedly backed up
- Different backup intervals cannot be applied to different sets of files, so files that haven't changed between two backups are archived again. For instance, it may be desirable to back up small configuration files every hour even if the actual collection data and indexes (large files) change only once a week.
Backup files individually
This strategy applies different backup intervals depending on the types of the files being backed up.
Software files
Software files need to be backed up only once after the installation, and every time a patch is applied. The software files are all files not listed as data files below (generally speaking, every folder except conf/ and data/).
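A software-files-only backup can be sketched by excluding the data-file locations. The demo below uses a temp directory standing in for `$SEARCH_HOME`, with made-up file names; admin/ is also excluded since it holds data files (user accounts and reports, per the table below).

```shell
# Demo: temp dir stands in for $SEARCH_HOME; file names are made up.
SEARCH_HOME="$(mktemp -d)"
mkdir -p "$SEARCH_HOME/bin" "$SEARCH_HOME/lib" \
         "$SEARCH_HOME/conf" "$SEARCH_HOME/data" "$SEARCH_HOME/admin/users"
echo "software" > "$SEARCH_HOME/lib/funnelback.jar"
echo "data"     > "$SEARCH_HOME/conf/global.cfg"

# Back up everything except the data-file folders.
tar czf /tmp/funnelback-software.tar.gz -C "$SEARCH_HOME" \
    --exclude=conf --exclude=data --exclude=admin .
```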
Data files
Data files are located in different places within the Funnelback installation folder. The table below lists the different data file locations, their purpose, and suggested backup intervals.
Directory | Purpose | Suggested backup interval |
---|---|---|
conf/ | Server-wide configuration | Weekly - global configuration usually doesn't change often |
conf/<collection>/ | Collection-specific configuration | Hourly during implementation / development, daily afterwards - to account for daily changes in best bets, synonyms, curator, etc. |
admin/users/ | User accounts | Weekly |
data/<collection>/ | Collection-specific data | Consider backing up only specific folders (below) rather than the whole data folder |
data/<collection>/archive/ | Collection query & click logs | Daily - meaning at worst 1 day of logs will be lost. Note that the archiving of logs is tied to the collection update schedule. |
data/<collection>/live/ | Collection live index and data | Tied to the crawl / update interval (usually daily) - needed to restore query processing functionality |
data/<collection>/live/logs/ | Latest query & click logs since the last update | Hourly - only if restoring query & click logs from the last hour is required; otherwise the backup of the archive folder is enough |
data/<collection>/offline/ | Staging area for the current update | Backing up this folder is not necessary: it is not used for query processing and only contains data from the currently running update or the previous one |
admin/reports/<collection>/reports.sqlitedb | Analytics database | Tied to the analytics update interval (usually daily) - the database can be reconstructed from the query & click logs in the collection's archive folder |
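As an illustration, the suggested intervals above could be driven by cron on Linux. The paths, collection name and backup.sh wrapper script below are assumptions, not part of Funnelback:

```
# Illustrative crontab entries (backup.sh is a hypothetical wrapper script)
0 * * * *  /opt/backup/backup.sh /opt/funnelback/conf/mycoll           # hourly: collection config
0 1 * * *  /opt/backup/backup.sh /opt/funnelback/data/mycoll/archive   # daily: archived query & click logs
0 2 * * 0  /opt/backup/backup.sh /opt/funnelback/conf                  # weekly: global configuration
```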
Collection update schedules
Funnelback's collection update schedule uses the operating system scheduler (Task Scheduler under Windows, crontab under Linux). These OS-level configuration files should also be backed up.
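For example, the scheduler configuration could be exported alongside the other backups; the user name and destination paths below are assumptions:

```
# Linux: save the crontab of the user running Funnelback
crontab -l -u search > /backup/funnelback-crontab.txt
# Windows: export the scheduled tasks as XML
schtasks /query /xml > C:\backup\funnelback-tasks.xml
```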
Data files restoration
Data files can be restored by simply copying them in place in a Funnelback installation.
Note that the live/ and offline/ folders inside a collection data folder ($SEARCH_HOME/data/<collection>/) are symbolic links and may need to be re-created if they were not part of the backup. These links point to the one/ and two/ sibling folders under the collection data folder. To identify which folder should be linked to live/, inspect the one/log/ and two/log/ folders to determine which holds the most recent update. Create the live/ symbolic link pointing to this folder, and the offline/ link to the other.
Linux:
ln -s $SEARCH_HOME/data/<collection>/one $SEARCH_HOME/data/<collection>/live
ln -s $SEARCH_HOME/data/<collection>/two $SEARCH_HOME/data/<collection>/offline
Windows:
mklink /d %SEARCH_HOME%\data\<collection>\live %SEARCH_HOME%\data\<collection>\one
mklink /d %SEARCH_HOME%\data\<collection>\offline %SEARCH_HOME%\data\<collection>\two
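The inspection step can be sketched on Linux by comparing file timestamps. The demo below uses a temp directory standing in for `$SEARCH_HOME`, a made-up collection name, and a made-up log file name:

```shell
# Demo: temp dir stands in for $SEARCH_HOME, "mycoll" is a made-up collection.
SEARCH_HOME="$(mktemp -d)"
COLL="$SEARCH_HOME/data/mycoll"
mkdir -p "$COLL/one/log" "$COLL/two/log"
touch -d "2 days ago" "$COLL/one/log/update.log"   # older update
touch "$COLL/two/log/update.log"                   # most recent update

# Link the folder whose logs are newer as live/, the other as offline/.
if [ "$COLL/one/log/update.log" -nt "$COLL/two/log/update.log" ]; then
    LIVE=one OFFLINE=two
else
    LIVE=two OFFLINE=one
fi
ln -s "$COLL/$LIVE" "$COLL/live"
ln -s "$COLL/$OFFLINE" "$COLL/offline"
```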
Backing up Push collections
Backing up Push collections should be done through the Snapshot API endpoint. This API creates a snapshot of the Push collections on the Funnelback server; the snapshot can then be copied to a separate location for backup.
Please see the corresponding documentation for more information about Push snapshots, backups and restoration procedures.