Alfresco Data Migration: All you need to know - Part 1

Alfresco is one of the most widely used document and content management systems in the world. Startups are picking it up, and enterprises use it extensively. If the hardware permits and the database is solid, it can store very large volumes of data. However, before we can even begin using Alfresco, the first challenge is migrating data into it. A content management system first needs content to manage.

Most Alfresco use cases fall into two categories:

1. New content is actively generated and stored directly in Alfresco.
2. Old content first needs to be migrated to Alfresco from legacy systems, and all new content is then added to Alfresco directly.

For the first category, all you have to do is install Alfresco. For the second, however, installing Alfresco is just the first step in the much larger problem of data migration. It’s even harder for companies with more than 100 GB of data.

The best way to migrate content to Alfresco

Alfresco is a very flexible tool and gives us multiple ways to interact with the software, all of which can be used to inject data into the system. Some of them are:

Alfresco Bulk Import Tool – http://docs.alfresco.com/4.2/concepts/bulk-import-importing.html
Alfresco JLan Server – http://sourceforge.net/projects/alfresco/files/JLAN/
Alfresco APIs (CMIS, RESTful, SOAP) – https://hub.alfresco.com/t5/alfresco-content-services-hub/cmis/ba-p/289965 and https://hub.alfresco.com/t5/alfresco-content-services-hub/restful-api/ba-p/290318#Alfresco_RESTful_API_Reference
Open Migrate Tool – http://www.tsgrp.com/Open_Source/OpenMigrate/open-migrate.jsp
Bulk Filesystem Import – https://code.google.com/p/alfresco-bulk-filesystem-import/

UPDATE: The project at code.google.com was moved to GitHub and then updated to a new version. The new GitHub URL is https://github.com/pmonks/alfresco-bulk-import.

All of these were popular ways to migrate data at one point; however, each of them has some inherent problems:

Alfresco JLan Server and ACP transfer are quite difficult to configure. Their performance is also not as good, and they are painful to use if you have to migrate more than 100 GB of data.
The Alfresco APIs mean a lot of coding. They give great performance but carry security risks, and the development cost is higher than with the other solutions. They are best suited to use cases where you have to migrate data in real time from one system to another, or in other words, integrate another system with Alfresco.
Open Migrate and Bulk Filesystem Import are third-party tools.

So in the end we are left with one solution, which is also the most used tool for large-scale data migration: Alfresco’s Bulk Import Tool. Since it is built by the Alfresco team itself, it is secure and tested.

Bulk Import Tool

The bulk import tool is a great way to import existing content into the repository from the Alfresco server's file system. It can copy new content as well as replace existing content, but it is not designed to fully synchronize the repository with the local file system and therefore does not perform deletions. Besides content, it can also migrate the metadata and version history of files.
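As a rough illustration of how metadata can travel with the files, the sketch below writes a "shadow" metadata file next to a content file. It assumes the convention described in Alfresco's bulk import documentation, where a file named `<name>.metadata.properties.xml` in Java XML properties format sits beside `<name>`. The helper name and the sample values are hypothetical; check the exact keys against the documentation for your version.

```python
from pathlib import Path
from xml.sax.saxutils import escape, quoteattr


def write_metadata_sidecar(content_file, node_type, aspects, properties):
    """Write <content_file>.metadata.properties.xml beside the content file.

    Assumption: the importer reads a Java XML properties file with the
    special keys "type" and "aspects" plus one entry per property.
    """
    content_file = Path(content_file)
    entries = {"type": node_type, "aspects": ",".join(aspects), **properties}
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">',
        "<properties>",
    ]
    for key, value in entries.items():
        # Escape keys and values so characters like & and < stay valid XML.
        lines.append(f"  <entry key={quoteattr(key)}>{escape(str(value))}</entry>")
    lines.append("</properties>")
    sidecar = content_file.with_name(content_file.name + ".metadata.properties.xml")
    sidecar.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return sidecar
```

For example, `write_metadata_sidecar("/data/in/report.pdf", "cm:content", ["cm:titled"], {"cm:title": "Annual report"})` would create `/data/in/report.pdf.metadata.properties.xml` in the same folder as the PDF.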

This bulk import tool has two versions:

Streaming Bulk Import (available in all editions)
In-Place Bulk Import (available in the Enterprise edition only)

Since In-Place is for the Enterprise edition only, let’s skip it for now and focus on Streaming Bulk Import. You can run either Streaming Bulk Import or In-Place Bulk Import through the web user interface or from a coded program. Both ways are useful: the web interface is easier, whereas a coded program is faster (once you have written the program, that is!).

Streaming Bulk Import using the User Interface
The bulk import tool is exposed through two web scripts:

An HTTP GET web script to start the tool manually: http://localhost:8080/alfresco/service/bulkfsimport
To start the tool via a program, you can use an initiating web script that launches the tool and performs import based on parameters passed to it. The parameters can include source directory, target space, and so on. It is an HTTP POST web script and its path is: http://localhost:8080/alfresco/service/bulkfsimport/initiate
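To make the programmatic route more concrete, here is a minimal sketch that builds the HTTP POST for the initiate web script using only the Python standard library. The form field names `sourceDirectory` and `targetPath`, the use of HTTP Basic authentication, and the URL-encoded body are assumptions rather than something this article confirms, so verify them against the documentation for your Alfresco version. The function name is hypothetical.

```python
import base64
import urllib.parse
import urllib.request

INITIATE_URL = "http://localhost:8080/alfresco/service/bulkfsimport/initiate"


def build_initiate_request(source_directory, target_path, username, password,
                           url=INITIATE_URL, **extra_params):
    """Build (but do not send) the POST request that starts a streaming import.

    Assumption: the web script accepts the fields "sourceDirectory" and
    "targetPath"; any extra keyword arguments are passed through unchanged.
    """
    fields = {"sourceDirectory": source_directory, "targetPath": target_path}
    fields.update(extra_params)
    body = urllib.parse.urlencode(fields).encode("utf-8")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    return request
```

The request can then be sent with `urllib.request.urlopen(build_initiate_request("/home/alfresco/bulkimport", "/Company Home", "admin", "admin"))`. Building the request separately from sending it keeps the sketch easy to inspect before it touches a live repository.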

Opening the first web script in a browser displays a simple data migration form.

The form has simple fields such as import directory, target path, batch size, NodeRef path and number of threads. A point to note here is the NodeRef path: for those who are not familiar with it, it is an alternative to the target path and identifies the target NodeRef to load the content into.

Fill in the form and submit it to start the import. And that’s it. You have migrated your files successfully.

You can check the status of the import process with another simple web script: http://localhost:8080/alfresco/service/bulkfsimport/status

The script shows the status of any in-progress import and, if none is running, returns the status of the last import.
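A program that starts an import will usually want to wait for it to finish. The sketch below polls the status web script until a caller-supplied check says the import is idle. How an idle import is recognised in the response body is not covered in this article, so `is_idle` is left to the caller; authentication is also omitted, and all function names are hypothetical.

```python
import time
import urllib.request

STATUS_URL = "http://localhost:8080/alfresco/service/bulkfsimport/status"


def fetch_status(url=STATUS_URL, opener=urllib.request.urlopen):
    """Return the raw body of the status web script as text."""
    with opener(url) as response:
        return response.read().decode("utf-8", errors="replace")


def wait_until_idle(is_idle, fetch=fetch_status, interval=5.0,
                    max_polls=120, sleep=time.sleep):
    """Poll the status page until is_idle(body) is true; return the last body.

    Raises TimeoutError if the import is still running after max_polls checks.
    """
    for attempt in range(max_polls):
        body = fetch()
        if is_idle(body):
            return body
        if attempt < max_polls - 1:
            sleep(interval)  # wait before asking again
    raise TimeoutError(f"import still running after {max_polls} status checks")
```

Passing `fetch` and `sleep` as parameters is a design choice that lets the loop be exercised without a running Alfresco server.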

Some last words for now

Though the process above looks simple, it has its drawbacks. For example, the imported content keeps the same file structure and hierarchy as the source data. This is good for many use cases, but if you want to restructure the file hierarchy to suit your new Alfresco-based business process, it is not a very good choice. That’s where expert Alfresco developers come in: for more customized data migration, a custom program is the best way to migrate.
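One simple form such a custom program can take is a staging step that copies the source files into the desired hierarchy before the import runs. The sketch below is only an illustration of that idea, not a method prescribed by Alfresco; the function names and the grouping rule are hypothetical.

```python
import shutil
from pathlib import Path


def plan_restructure(source_root, target_root, rule):
    """Map every file under source_root to a new location under target_root.

    rule(relative_path) returns the new relative path for a file.
    Returns a list of (source, destination) pairs; nothing is copied here.
    """
    source_root, target_root = Path(source_root), Path(target_root)
    plan = []
    for path in sorted(source_root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(source_root)
            plan.append((path, target_root / rule(relative)))
    return plan


def apply_plan(plan):
    """Copy each file to its planned destination, creating folders as needed."""
    for source, destination in plan:
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, destination)


def by_extension(relative):
    """Hypothetical rule: group files by extension, e.g. pdf/report.pdf."""
    return Path(relative.suffix.lstrip(".").lower() or "other") / relative.name
```

Calling `apply_plan(plan_restructure("/data/legacy", "/data/staging", by_extension))` would fill the staging folder, which can then serve as the import directory. A real rule also has to deal with name collisions, since two source files with the same name would map to the same destination here.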

The bulk import tool link I shared above should be of some help. Watch this space around the same time next month, as my next blog on Alfresco will cover things to ignore and things to be cautious of, besides other details.

Pratyush Kumar

Co-Founder & President at Algoworks, Open-Source | Salesforce | ECM

Pratyush is Co-Founder and President at Algoworks. He is responsible for managing and growing the open source technologies team and has spearheaded more than 200 projects in Salesforce CRM alone. He provides consulting and advisory services to clients looking for help with CRM (Customer Relationship Management) and ECM (Enterprise Content Management). In the past, Pratyush has held consulting roles with various global technology leaders, such as Globallogic & HCL in India. He holds a graduate degree in Engineering from the Indian Institute of Technology, Roorkee.