I SIP, you ZIP, we DIP: Archivematica Camp Vancouver, 2019

Last week I had the pleasure of attending Artefactual’s Archivematica Camp held at the University of British Columbia. It was a three-day intensive introduction to the Archivematica software and I was really impressed by the learning outcomes. First things first – for those of you who might not be familiar with Archivematica…here’s an overview. Essentially, Archivematica “provides an integrated suite of free and open-source tools that allows users to process digital objects from ingest to archival storage and access in compliance with the ISO-OAIS functional model and other digital preservation standards and best practices.”

I learned A LOT over the three days so I wanted to share a taste of the experience. Many thanks to the organizers, ‘camp counselors,’ and the folks at UBC for hosting the event. The room was full of passionate people from at least three countries all trying to solve this ‘wicked problem’ we call digital preservation. The best thing about the camp was that we got to feel like we are all in this together, facing a challenge, and finding innovative solutions. A very close second for my favourite aspect of the camp was the acknowledgement that we all come to this with different skills and we are all at different stages in our digital preservation journey. Everyone was made to feel included regardless of skill level. I want to commend Artefactual for this because I don’t think it is easy to make people at so many different levels feel welcome and at ease. The camp felt like a safe space to explore and ask questions.

Kelly Stewart provides an introduction to Archivematica to start us off on Day One of camp

DAY ONE

Day one was designed to get us all relatively on the same page so we could then work with the software over the next few days. To start the day, Kelly Stewart provided us with an overview of how Archivematica fits into the digital preservation landscape. It is important to understand that there is no ‘one-size-fits-all’ tool for digital preservation and that Archivematica is meant to be part of a toolbox that incorporates a number of open-source products.

We started with the fundamentals and talked about what digital preservation actually means. The Library of Congress defines it as “active management of digital content over time to ensure ongoing access.” It is important to recognize that digital preservation is not a ‘one and done’ kind of practice. It is ongoing maintenance of digital content so that we can ensure readability, discoverability, and use over time. None of this digital dark age stuff. We want to be able to access content long after we’re gone. And that takes ongoing, active effort.

Archivematica was built on the framework of the OAIS model (“everybody drink”) and so we quickly went over the fundamentals of the OAIS model before discussing Archivematica’s core functionality. Archivematica is designed to create OAIS-compliant, system-agnostic Archival Information Packages (AIPs), save those AIPs to any storage system, and create Dissemination Information Packages (DIPs) that can populate access systems. The concept is fairly simple, the execution decidedly un-simple.

The second half of the morning focused on how Archivematica works. The system is built as a microservice architecture, so there is a whole series of scripts running different microservices that tick along, moving the digital information from one part of the process to the next and providing the opportunity for the user to have input along the way. Microservices are strung together as workflows and there are decision points that can be automated if desired.

A screenshot of Archivematica’s microservices architecture (Source: https://wiki.archivematica.org/Micro-services)
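To make the "microservices strung together with decision points" idea more concrete, here is a toy sketch in Python. It is not Archivematica's code: the step names, the `run_workflow` function, and the `decisions` dictionary are all hypothetical, and they only illustrate how a decision point can either be preconfigured (automated) or handed to the user.

```python
import uuid
from typing import Callable

# Each "microservice" here is just a function that takes a package
# (a plain dict) and returns it with something added. All names are
# hypothetical and exist only for illustration.
def assign_identifier(package: dict) -> dict:
    package["uuid"] = str(uuid.uuid4())
    return package

def count_files(package: dict) -> dict:
    package["file_count"] = len(package["files"])
    return package

def normalize(package: dict) -> dict:
    package["normalized"] = True
    return package

def run_workflow(package: dict, decisions: dict, ask: Callable[[str], str]) -> dict:
    """Run the steps in order, pausing at one decision point.

    `decisions` holds preconfigured answers (the automated case);
    `ask` is called only when no answer was preconfigured.
    """
    package = assign_identifier(package)
    package = count_files(package)

    # Decision point: should this package be normalized?
    choice = decisions.get("normalize")
    if choice is None:
        choice = ask("normalize")
    if choice == "yes":
        package = normalize(package)
    return package

# Automated run: the answer is preconfigured, so `ask` is never consulted.
result = run_workflow({"files": ["a.tif", "b.tif"]}, {"normalize": "yes"}, ask=lambda q: "no")
print(result["file_count"], result.get("normalized"))
```

Passing an empty `decisions` dictionary would route the same question to `ask` instead, which is roughly the difference between an automated and a manual decision point.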

In addition to being OAIS-compliant, Archivematica is standards-based. It would take WAY too much time to get into all of the tools and standards, but I thought it might be helpful to have them all listed with hyperlinks for anyone who wants to follow up: BagIt, METS, PREMIS, Dublin Core, PRONOM.

Archivematica has been designed to perform the following:

  • fixity checks (computing checksums)
  • identification of file formats (this examines the file signature not the file extension, as there is often no file extension)
  • characterization (producing technical metadata for an object)
  • validation (determining the level of compliance of a digital object to the specification for its purported format, or in other words…is it what it is supposed to be?)
  • normalization (converting digital materials into preservation-friendly formats)
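Two items in the list above, fixity checks and signature-based format identification, are easy to try at small scale. The following is a minimal sketch using only the Python standard library. It is an illustration, not how Archivematica does it: real identification tools consult a much larger registry such as PRONOM, and the `SIGNATURES` table, function names, and sample bytes here are assumptions made for the example.

```python
import hashlib

# A handful of well-known file signatures ("magic numbers"). This tiny
# table is illustrative only; real tools use far larger registries.
SIGNATURES = {
    b"%PDF-": "PDF document",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"II*\x00": "TIFF image (little-endian)",
    b"MM\x00*": "TIFF image (big-endian)",
}

def checksum(data: bytes) -> str:
    """Return the SHA-256 checksum of the content as a hex string."""
    return hashlib.sha256(data).hexdigest()

def identify(data: bytes) -> str:
    """Identify a format from its leading bytes, ignoring any file extension."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return "unknown"

# Hypothetical content: a PNG that arrived with no file extension at all.
sample = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
stored = checksum(sample)                    # recorded at ingest

print(identify(sample))                      # identified from the signature
print(checksum(sample) == stored)            # fixity check on unchanged content
print(checksum(sample + b"!") == stored)     # fixity check after a change
```

The point of storing the checksum at ingest is that any later change to the bytes, however small, produces a different checksum.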

The result of all of these microservices is the creation of an Archival Information Package (AIP), which contains information files, tool output logs, master and derivative digital objects, and metadata files that describe the AIP and its contents. All of this is to ensure that going forward we know what the objects are and what was done to them, with a view to ensuring authenticity, transferability, readability, and discoverability over time.

OK, so that was just the morning…wow. The afternoon continued the discussion on Archivematica’s core functionality and we looked at how to transfer content, how to create an AIP, and how to automate some of the transfer and ingest processes.

Archivematica can accept different transfer types, including zipped or unzipped bags, DSpace transfers, disk images, and standard transfers of items placed in a directory. Most recently, Archivematica has been programmed to accept Dataverse transfers (Archivematica 1.8+).

What I found most useful at the camp were the hands-on exercises put together by Sara Allain that allowed us to get stuck in and try the software for ourselves. We got to start a standard transfer, review the microservices, create an AIP, start a disk image transfer, automate the workflow through configuration, examine microservices failures, add descriptive and rights metadata, normalize a Submission Information Package (SIP) for preservation and access, add item-level metadata, and upload a Dissemination Information Package (DIP) with item-level metadata to AtoM. We were encouraged to be creative with our transfer names so I decided that mine would all be Simpsons characters.

These exercises were invaluable for translating the knowledge we gained in the first half of the day into practice. I have attended several presentations about Archivematica but this was really the first time I have been able to try it out for myself and I cannot overstate how useful this was for my understanding. I am not a naturally technical person and so I find all of this quite challenging. However, breaking it down into smaller, achievable chunks made it make SO much more sense to me. This was the value of the camp experience. What a first day!

DAY TWO

Day two catered to two different groups attending the camp. There was a stream that continued to explore the functionality of Archivematica and there was a second stream that explored the systems side (installing, supporting, understanding technical logs, etc.) for all the system administrators in the room. I think this was a great way of making the camp relevant for all levels. It was great to have everyone in the room – developers, administrators, and end users.

Not surprisingly, I was among the group that attended the first stream and we went further into understanding AIPs. We looked at the structure of the AIP, the structure of the accompanying METS files, and examined PREMIS objects. We learned about the BagIt specification, which defines both the structure and the contents of a package of material that is headed for storage. It literally puts everything into a virtual bag so that Archivematica understands what is being transferred.
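For anyone curious what that "virtual bag" looks like on disk, here is a minimal sketch of the basic layout the BagIt specification describes: a `bagit.txt` declaration, a `data/` directory holding the payload, and a manifest listing a checksum for each payload file. The `make_bag` and `verify_bag` functions and the bag name are hypothetical, written for illustration with the standard library only. This is not the tooling Archivematica uses, and it skips optional parts of the specification such as tag manifests and `bag-info.txt`.

```python
import hashlib
import tempfile
from pathlib import Path

def make_bag(bag_dir: Path, files: dict[str, bytes]) -> None:
    """Write payload files under data/ plus the two required tag files."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(files.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )

def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True

with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp) / "lisa_simpson"          # hypothetical transfer name
    make_bag(bag, {"saxophone.txt": b"jazz"})
    print(sorted(p.name for p in bag.iterdir()))
    print(verify_bag(bag))
```

Because the manifest travels with the payload, whoever receives the bag can confirm that nothing went missing or changed in transit.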

An example of xml metadata. What was once gobbledygook is now (sort of) understandable for me

XML files have always looked like pure gobbledygook (wow, never had to write that word before) but during this camp, I learned how to read a METS xml file. I am not going to say I am now an expert, but it now seems like something I can decipher. The important thing for me to learn was that the information I need is contained within METS sections that are indicated by tags in the xml file (e.g. the descriptive metadata will be found in this section: <dmdSec>). METS (Metadata Encoding and Transmission Standard) is a key component of Archivematica. It provides the wrapper for all the other metadata that is used to describe the digital objects in a transfer (like PREMIS and Dublin Core). This was the first time I had ever had xml broken down for me and I found this extremely helpful! We downloaded and reviewed the METS file for our demo transfer and it was really useful to go through the different parts of the xml file to understand its structure.
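The same "find the section by its tag" approach works in code. Below is a small sketch that uses Python's built-in XML parser to pull the descriptive metadata out of each `<dmdSec>`. The `SAMPLE_METS` document is a made-up, heavily simplified stand-in rather than real Archivematica output, and the `read_dmd_sections` function is a hypothetical helper. Only the METS and Dublin Core namespace URIs are the standard ones.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# A made-up, heavily simplified METS document for illustration only.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <mets:dmdSec ID="dmdSec_1">
    <mets:mdWrap MDTYPE="DC">
      <mets:xmlData>
        <dc:title>Camp demo transfer</dc:title>
        <dc:creator>Example Archive</dc:creator>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
</mets:mets>
"""

def read_dmd_sections(mets_xml: str) -> dict:
    """Map each dmdSec ID to the descriptive fields wrapped inside it."""
    root = ET.fromstring(mets_xml)
    sections = {}
    for dmd in root.findall(f"{{{METS_NS}}}dmdSec"):
        fields = {}
        for xml_data in dmd.iter(f"{{{METS_NS}}}xmlData"):
            for elem in xml_data.iter():
                # Keep only leaf elements, i.e. the actual metadata values.
                if elem is not xml_data and len(elem) == 0:
                    local_name = elem.tag.split("}", 1)[-1]  # drop the namespace
                    fields[local_name] = (elem.text or "").strip()
        sections[dmd.get("ID")] = fields
    return sections

for section_id, fields in read_dmd_sections(SAMPLE_METS).items():
    print(section_id, fields)
```

A real METS file is much longer and has several other sections, but the pattern is the same: locate the section tag, then read what is wrapped inside it.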

If reading xml files isn’t your thing, I also learned that Tim Walsh (formerly of the Canadian Centre for Architecture, now Concordia University) created a web application for human-friendly exploration of Archivematica METS files – yay Tim! Check out METSFlask if reading xml files makes you sleepy/dopey/grumpy/sneezy…or any of the other seven Disney dudes.

This doesn’t feel like the place to go through all of the metadata elements in depth, but needless to say, this was a brain-melting but very useful morning.

The morning continued with a presentation from Richard Dancy from Simon Fraser University (SFU), an early adopter of Archivematica and a contributor to its continued development. Richard provided an overview of SFU’s technical architecture for digital preservation. Most of the AIPs at SFU come from digitized materials but they have worked with transfers of born-digital materials as well. Richard explained that the SFU processes have developed over time based on the unique needs of the organization and he recommended thinking about all of the potential types of transfers an institution might encounter and creating procedures for those different scenarios.

In the afternoon, the group split once again into two streams. I learned about some of Archivematica’s specialized workflows, while the technical group examined Archivematica’s logs and performance evaluation.

We learned how you can perform manual normalization if you are working with a digitized collection where TIFFs and JPEGs have already been created. You can tell Archivematica not to redo that normalization work. By structuring the transfer in a certain way, you can make Archivematica recognize the work that was already completed. We also learned about the Format Policy Registry that is accessed through the Preservation Planning tab in Archivematica. It contains a canonical list of formats that Archivematica can recognize and act on. The formats are largely pulled from PRONOM (an online registry of technical information about formats created by the UK National Archives) but users can add additional formats if they encounter something out of the ordinary.

We finished the day with a presentation from three representatives from the University of British Columbia who are using Archivematica in a number of contexts. It was helpful to have these community presentations to hear about how Archivematica is being implemented. The universities seem to have quite complicated instances so I would have found it helpful to have a presentation from a smaller institution that is implementing Archivematica – something for next time…maybe it will be me! 🙂

DAY THREE

Now – full disclosure…I was in the middle of buying a house on the morning of Day Three so I was more than a little bit distracted and may not have fully absorbed all the content. But – it was really fun to share my stress and excitement about the house with my tablemates – thanks to them for being supportive and sharing in my joy once the deal went through.

The morning began with an excellent tour of the UBC Digitization Centre (so much tech jealousy!!) and the Rare Books and Special Collections reading room and storage area. The walls of the Digitization Centre are plastered with items from the collection, which creates a lively work space despite being nearly underground.

I can’t believe this was my first experience at RBSC. I loved touring the Chung Collection exhibit space and I fully intend to spend more time in there when I am next at UBC.  And what a beautiful reading room. Tech jealousy and reading room jealousy! The highlight of the RBSC tour was getting to see the automated retrieval system. Yes, it breaks down a lot, and yes, you can only store items with low retrieval rates, but HOW COOL is this?!

 

After the tours, we split again into the two streams. The stream I attended looked at Archivematica’s extended functionality, while the technical group learned how to access Archivematica’s data and other GitHub highlights.

We explored the functionality of the Backlog Tab, which was a sponsored addition to Archivematica that enables an institution to gain bit-level control over digital material when there isn’t necessarily time to do the appraisal or full transfer immediately. We got to explore this use case with some more hands-on exercises. We sent a transfer to backlog, arranged it into multiple SIPs, and examined the contents using Bulk Extractor.

We then looked at AIP re-ingest. Say you want to add some descriptive metadata or rights metadata to an existing AIP, or you didn’t originally create a DIP when you created the AIP but now you want to send the objects to your access system. This is when you might look to re-ingest the AIP. We got to try that out in one of the exercises.

In the afternoon of Day Three, we learned about the larger Archivematica and open source communities and how we might contribute to open source projects. It is important to realize that you do not have to be a developer to help. You can submit bugs or issues, suggest new features, prepare documentation, or perform translation. You can continue to learn about open source software and advocate for its use in your institution.

All of the Archivematica documentation is hosted on GitHub. Artefactual also hosts an active discussion board in the form of a Google Group where users can discuss issues, features, and events relating to Archivematica. There are also a number of regional groups that are community-driven forums for discussion and networking. This session really hit home for me that we are all in this together.

We finished the day with short presentations from a number of attendees who presented ideas or overviews of their implementation of Archivematica, or simply shared what they had learned over the three days. It was a nice way to wrap up a great three days.

Again, big thanks to the Artefactual folks (Justin Simpson, Kelly Stewart, Sara Allain) for such a great camp. Now I just have to get the digi pres ball rolling at work…