Alfresco Data Migration: All you need to know – Part 2

We once had a project that required migrating more than 2TB of data from a legacy document store to the Alfresco document management system. The project was straightforward: we had a predefined folder structure, a simple set of metadata files, and good hardware for the migration. However, there was one small problem. The client had 15 separate servers that were independent of each other, and each server had multiple instances of Alfresco installed. So, to migrate the data from the legacy system to Alfresco, the person handling the migration had to run the bulk import user interface tool multiple times (76 times, to be exact), wasting valuable time repeating the same configuration and keeping track of every transfer.

And that was with a very simple file structure. Imagine the same use case with a complex file structure, or with an even larger amount of data spread over an even larger server cluster.

Wouldn’t the process have been much simpler if we had a program to transfer the data automatically?

Fortunately, we have that option in the Alfresco Bulk Import tool itself.

In our previous post we talked about the different ways you can migrate data to an Alfresco system. We covered third-party tools and discussed in detail how to migrate data using Alfresco’s Bulk Import Tool. In that post we focused on using the tool’s user interface for migration.

The user interface option of the migration tool is certainly user friendly. But as mentioned above, we frequently get cases where the user interface itself becomes the difficult option. For those cases it’s best to use the Bulk Import tool programmatically.

Bulk Import Tool – A Code Based Approach

The Alfresco bulk import tool comes with built-in classes that can help you with migration. They are Java based, so if you are using any other language for your framework, you may have to adjust accordingly. So let’s check out the simple code to initiate a data transfer programmatically. This code is for Streaming Bulk Import.

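// transactionService, ctx (the Spring application context), bulkImporter and
// folderNode (the target folder's NodeRef) are assumed to be available already.
UserTransaction txn = transactionService.getUserTransaction();
txn.begin();

// Run the import as a user with write access to the target folder.
AuthenticationUtil.setRunAsUser("USER_NAME");

// The streaming importer copies content from the source directory into the repository.
StreamingNodeImporterFactory streamingNodeImporterFactory = (StreamingNodeImporterFactory)ctx.getBean("streamingNodeImporterFactory");
NodeImporter nodeImporter = streamingNodeImporterFactory.getNodeImporter(new File("importdirectory"));

BulkImportParameters bulkImportParameters = new BulkImportParameters();
bulkImportParameters.setTarget(folderNode);        // target folder in Alfresco
bulkImportParameters.setReplaceExisting(true);     // replace files with the same name
bulkImportParameters.setBatchSize(40);             // files per batch
bulkImportParameters.setNumThreads(4);             // batches processed simultaneously
bulkImporter.bulkImport(bulkImportParameters, nodeImporter);
txn.commit();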

The important thing to note in this code is the set of bulkImportParameters values. Using this class you can set the target folder, the number of files to include per import batch, and the number of batch threads to process simultaneously. If a file with the same name is already present, you can have it replaced by calling setReplaceExisting(true) on bulkImportParameters.

For In-Place Bulk Import, use the following code:

UserTransaction txn = transactionService.getUserTransaction();
txn.begin();

AuthenticationUtil.setRunAsUser("USER_NAME");

// The in-place importer registers content that already sits in a content store.
// Assumption: the two arguments are the content store name and the source
// directory inside that store.
InPlaceNodeImporterFactory inPlaceNodeImporterFactory = (InPlaceNodeImporterFactory)ctx.getBean("inPlaceNodeImporterFactory");
NodeImporter nodeImporter = inPlaceNodeImporterFactory.getNodeImporter("default", "2015");

BulkImportParameters bulkImportParameters = new BulkImportParameters();
bulkImportParameters.setTarget(folderNode);
bulkImportParameters.setReplaceExisting(true);
bulkImportParameters.setBatchSize(150);
bulkImportParameters.setNumThreads(4);
bulkImporter.bulkImport(bulkImportParameters, nodeImporter);
txn.commit();
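
To avoid repeating the same configuration by hand for every server and instance, either snippet can be driven from a list of jobs. The following is only a minimal sketch in plain Java: ImportJob, ImportRunner and ImportBatch are hypothetical names invented for illustration and are not part of Alfresco. The real ImportRunner implementation would wrap the bulkImport call shown above.

```java
import java.util.ArrayList;
import java.util.List;

/** One planned transfer. All fields are illustrative, not an Alfresco type. */
final class ImportJob {
    final String sourceDirectory;
    final String targetFolder;
    final int batchSize;
    final int numThreads;

    ImportJob(String sourceDirectory, String targetFolder, int batchSize, int numThreads) {
        this.sourceDirectory = sourceDirectory;
        this.targetFolder = targetFolder;
        this.batchSize = batchSize;
        this.numThreads = numThreads;
    }
}

/** Hypothetical hook: a real implementation would call the bulk importer here. */
interface ImportRunner {
    void run(ImportJob job) throws Exception;
}

final class ImportBatch {
    /** Runs every job in order and returns the source directories of the jobs that failed. */
    static List<String> runAll(List<ImportJob> jobs, ImportRunner runner) {
        List<String> failed = new ArrayList<>();
        for (ImportJob job : jobs) {
            try {
                runner.run(job);
            } catch (Exception e) {
                // Record the failure and continue, so one bad transfer does not stop the rest.
                failed.add(job.sourceDirectory);
            }
        }
        return failed;
    }
}
```

Keeping the job list in one place also gives you a record of every transfer, which was the other pain point of doing 76 imports through the user interface.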

To know more about the values and fields that the bulk import tool uses, check out the link below:
http://docs.alfresco.com/4.1/references/bulk-import-table.html

Setting Up FileSystem
It was not mentioned in the previous post, but the Bulk Import Tool only works when the file system to be imported uses UTF-8 encoding. So if your file system uses any other encoding, you may have to convert it to UTF-8 first.
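
As a rough pre-flight check, you can test whether the bytes of a file (or of a file name) decode cleanly as UTF-8 before starting an import. This is a minimal sketch using only the Java standard library. The class name Utf8Check is hypothetical, and the conversion helper assumes the source data is ISO-8859-1, which may not match your legacy system.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    /** Returns true only if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting characters.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Re-encodes ISO-8859-1 bytes as UTF-8 (assumes the input really is ISO-8859-1). */
    static byte[] latin1ToUtf8(byte[] latin1Bytes) {
        return new String(latin1Bytes, StandardCharsets.ISO_8859_1)
                .getBytes(StandardCharsets.UTF_8);
    }
}
```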

Metadata
The bulk import tool can be used to transfer metadata files as well. You just have to take care of the filename syntax. The standard for a metadata filename is FILENAME.metadata.properties.xml. This syntax is the same for file and directory metadata.
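
Because the metadata file uses the Java XML properties format, it can be generated with java.util.Properties. The sketch below is an illustration under assumptions: MetadataFileWriter is a hypothetical helper, and the property values in sampleMetadata are placeholders that you would replace with values from your own content model.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

final class MetadataFileWriter {
    /** Builds the metadata file path that sits next to the given file or directory. */
    static Path metadataPathFor(Path contentFile) {
        return contentFile.resolveSibling(
                contentFile.getFileName() + ".metadata.properties.xml");
    }

    /** Writes the properties as UTF-8 XML and returns the path that was written. */
    static Path writeMetadata(Path contentFile, Properties metadata) throws IOException {
        Path target = metadataPathFor(contentFile);
        try (OutputStream out = Files.newOutputStream(target)) {
            metadata.storeToXML(out, null, "UTF-8");
        }
        return target;
    }

    /** Placeholder values for illustration only; adjust to your content model. */
    static Properties sampleMetadata() {
        Properties p = new Properties();
        p.setProperty("type", "cm:content");
        p.setProperty("aspects", "cm:versionable,cm:dublincore");
        p.setProperty("cm:title", "Example document");
        return p;
    }
}
```

Calling writeMetadata for example.pdf would produce example.pdf.metadata.properties.xml in the same directory.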

Version History
Just like metadata, version history can also be transferred automatically, and as with metadata you just have to keep the syntax of the version filenames correct. The syntax is FILENAME.v#.
For example, the versions of example.pdf would look like this:

example.pdf.v1
example.pdf.v2
example.pdf.v3

Alfresco automatically assumes that example.pdf is the latest version. The same goes for different versions of metadata. For the same example.pdf, the metadata versions would be:

example.pdf.metadata.properties.xml.v1
example.pdf.metadata.properties.xml.v2
example.pdf.metadata.properties.xml.v3
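
When preparing a source directory, it can help to check that every file follows this naming pattern before the import starts. The following is a small sketch, not part of the bulk import tool: VersionedName is a hypothetical helper that splits a file name into its base name and whole-number version, treating a name without a .v# suffix as the latest version.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class VersionedName {
    // Matches "<base>.v<digits>", for example "example.pdf.v2".
    private static final Pattern VERSION_SUFFIX = Pattern.compile("^(.+)\\.v(\\d{1,9})$");

    /** Marker used when the file name carries no version suffix. */
    static final int LATEST = -1;

    final String baseName;
    final int version;

    private VersionedName(String baseName, int version) {
        this.baseName = baseName;
        this.version = version;
    }

    /** True when the name has no .v# suffix, i.e. it is the latest version. */
    boolean isLatest() {
        return version == LATEST;
    }

    /** Splits a file name into base name and version number. */
    static VersionedName parse(String fileName) {
        Matcher m = VERSION_SUFFIX.matcher(fileName);
        if (m.matches()) {
            return new VersionedName(m.group(1), Integer.parseInt(m.group(2)));
        }
        return new VersionedName(fileName, LATEST);
    }
}
```

The same parser works for versioned metadata files, since the .v# suffix always comes last.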

A point to note here is that in Alfresco, unless you do some pretty complicated customization, the version is always a whole number, i.e. there are no versions such as 1.1, 1.2, or 2.3.
If you use a different kind of versioning than the above, do share in the comments. I’ll be curious!

Custom Programs To Import Complex Data
Importing large, repetitive, complex data is always easier with a programmatic approach. However, the program must be coded carefully. Errors in a migration driven by custom code can have time-consuming consequences, because you only find out about an error after the batch processes have run their course.

So we suggest taking expert guidance. We here at Algoworks have completed a lot of migration projects for both large scale and small scale clients. We are the foremost experts in this field. So if you are going to migrate your content, feel free to contact us for expert advice.

Reference
http://docs.alfresco.com/4.0/concepts/bulk-import-diagnostics.html
http://docs.alfresco.com/4.0/concepts/bulk-import-programmatically.html
https://wiki.alfresco.com/wiki/Bulk_Importer#In-Place_Bulk_Import_.28Enterprise_Only.29

Pratyush Kumar

Co-Founder & Director at Algoworks, Open-Source | Salesforce | ECM
Pratyush is Co-Founder and Director at Algoworks. He is responsible for managing, growing open source technologies team and has spearheaded more than 200 projects in Salesforce CRM alone. He provides consulting and advisory to clients looking for services relating to CRM(Customer Relationship Management) and ECM(Enterprise Content Management). In the past, Pratyush has held consulting roles with various global technology leaders, such as Globallogic & HCL in India. He holds an Engineering graduate degree from Indian Institute of Technology, Roorkee.

  • Gian Domenico Bonazzoli

    Well done. Among the tools you have talked about, is there any one that can set a correct CreationDate of the content you are uploading ?

  • http://www.algoworks.com/ Algoworks

    Hi Gian

    There is an easy workaround for the problem.
    Before migrating the data, disable the auditable aspect behaviors for the transaction by tweaking the policy behavior filters.

    Contact us if you are looking for code to do that for you.