
How to preserve, cite, and design websites for the long-term future (part three)

Deborah Thorpe
June 8, 2023

Referencing, designing and building websites for the long-term future

In addition to formal publications in books and journals, scholarly sources also include works written by scholars in the press or online, in blogs, and uploaded to institutional repositories… when these sources are delivered by the same access mechanism, the web, the boundary becomes blurred. (Helen Hockx-Yu, 2014)

REFERENCING WEBSITES IN YOUR RESEARCH

Thinking about ‘reference rot’ when you cite websites: Perma.cc

When a researcher bases an argument or a statement on a certain source, they are expected to cite this source – whether it is a primary text, a journal article, or a blog. However, a web page poses a problem for citation. Unlike an academic article or an edition, a website is not stable. In fact, it is the opposite. As mentioned in part one, the web is expected to be dynamic – ‘perpetually a work in progress’.

In addition, there are unintentional changes – such as the ending of web hosting or ‘content drift’ where the content on a website changes (perhaps very gradually) ‘to such an extent as to cease to be representative of that originally referenced’. Your references may rot – not a good thing for the trustworthiness and citability of your research outputs.
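To make ‘content drift’ more concrete, the following is a minimal sketch (not a method used by Perma.cc or by any archive) of how you might measure how far the visible text of a page has moved from the version you originally cited. It assumes that you saved a copy of the page's HTML at citation time. The names visible_text and drift_ratio are hypothetical, chosen for illustration, and the score is only a rough indicator.

```python
import difflib
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the page's visible text with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def drift_ratio(html_then, html_now):
    """Return 0.0 if the visible words are identical, 1.0 if none match."""
    then_words = visible_text(html_then).split()
    now_words = visible_text(html_now).split()
    matcher = difflib.SequenceMatcher(None, then_words, now_words, autojunk=False)
    return 1.0 - matcher.ratio()
```

A score near 0 suggests that the page still says what it said when you cited it, even if its markup has changed; a score near 1 suggests that the URL now points to something quite different. Where to draw the line is a judgement call, which is one reason why citing a preserved snapshot is more dependable than checking for drift afterwards.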

When you link to a website in your publications, you have no way of knowing if and when it might disappear or change. So, how do you ensure that what the reader sees when they visit the web link in your article is what you saw when you cited it? UCC Library has a subscription to Perma.cc. You can provide Perma.cc with the URL that you want to cite, and it visits that URL and ‘captures’ what is there, depositing it into its own collection of preserved webpages. You then receive a new permanent link, which you can use in your citation.

Just as we use Persistent Identifiers (PIDs) such as DOIs to cite and avoid link rot for scholarly articles and datasets, services like Perma.cc allow us to do the same for information on the ‘open web’ where DOIs or other PIDs aren’t in common use. You can be confident that when your reader visits a Perma.cc link, it will be an accurate record of the source that you intended to reference, even if the original disappears or has been replaced with something else (e.g. if that domain is purchased by someone else).

As a UCC member of staff or student, you can become a user of Perma.cc for free: just contact aoife.coffey@ucc.ie or s.bowman@ucc.ie.

PLANNING FOR THE FUTURE WHEN DESIGNING PROJECT WEBSITES

Link rot stinks, but Websites Don’t Have to Wither. (Pollak Library, California State University, Fullerton)

The ‘messy end’ of a project is not the best time to decide what should be done about your project website. If you have left it until the very end, it’s not the end of the world: it is better to take action now, by exploring the options presented in this blog post, than not to act at all. However, there are many other tasks to complete at the end of a project, and web archiving is easily neglected. There is no ‘one size fits all’ solution to web archiving, so the available options deserve serious thought well before that point.

The ongoing life/afterlife of web content and its associated data is as important as its active life during the project – and is something that can be made easier with some forward planning. Some ideas to consider:

Consider planning to close your website, a certain time after the end of your project, with all of the content that you need or want to be preserved (could be WARC files, images, videos) ingested into a repository such as Zenodo and/or the Digital Repository of Ireland.

You could plan to replace the website with a ‘tombstone’, with an explanation that the project has ended and links out to the relevant repositories. This still depends on someone continuing to pay for hosting, but there will be no expectation from visitors that it is being actively maintained.

In the case of websites that serve primarily as an ‘advert’ for the project or a team workspace, think about separating ‘data’ from ‘display’, i.e. identify depositable outputs early (e.g. reports, presentations, datasets) and ideally link to their deposited versions via Persistent Identifiers from the website from day one. That way, you do not have to worry about the website going down or the content or contact details becoming out of date.

Where the website, interactive web resource or database IS the research output, ‘preservation by design’ means building the resource with long-term preservation at its heart. The C21 Editions project, a partnership between UCC, the University of Sheffield and the University of Glasgow, is working on this issue for born-digital scholarly editions that are preservable by design, including what role digital cultural heritage repositories should play as custodians of these web-based resources.

When you link to your website in your publications during the project, link to a snapshot of relevant pages of that website generated by Perma.cc. This will ensure that whatever was on that website at that time is preserved for your reader, rather than what might be at that URL in the future.

Think about what web crawlers will miss, such as streaming media and search and filtering tools. Think about how you will preserve this data separately if it is your own material, for instance by creating a collection of videos in a data repository.

Check the structure of your website. Broken links, and content that is not linked to from anywhere, make web content difficult or impossible to crawl.

Think about what you are putting online. When content is put online, or third parties give you permission to put something on your website, ‘the fact that it may be preserved for future study and dissemination is often not taken into account’. Because of the current fragility of web content, people are unlikely to see it as something that might still be around in 20 or 30 years. This is another reason to plan for the preservation of your web content, so that you can build it with consideration for these ethical and/or legal concerns.

Be ‘mindful of the power of the archive and the ability of its content to re-traumatise those who feature in it’. Therefore, there is a need to manage the expectations of those who contribute to your research, make it clear that this web content is going to be preserved for the long-term future, and take web archiving into consideration when designing your consent forms.

Make sure that you have included a license on your website, ideally an open license such as one of the more open Creative Commons licenses, that makes it clear how the content can be used and to which content that license applies.

Document your strategy. Write about the decisions that you have made, and include any descriptions and explanations in the metadata that you deposit in any repository. Your Data Management Plan is the first step, but the plan should be revisited frequently, and the process of documentation continues until the project end.
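The point above about broken links and unlinked content can be checked before a crawler ever visits your site. The following is a minimal sketch, assuming that you can export your site as a mapping from page path to HTML. The function names internal_links and audit_site, and the default start page /index.html, are hypothetical choices for illustration. Like a crawler, it follows links outwards from the start page, so it reports broken internal links on the pages it can reach, together with ‘orphan’ pages that nothing reachable links to.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_path, html):
    """Return the set of same-site paths that this page links to."""
    collector = _LinkCollector()
    collector.feed(html)
    links = set()
    for href in collector.hrefs:
        # Resolve relative links against the current page and drop #fragments.
        target = urlparse(urldefrag(urljoin(page_path, href)).url)
        # Anything with a scheme or host (https:, mailto:, ...) is external.
        if not target.scheme and not target.netloc:
            links.add(target.path)
    return links


def audit_site(pages, start="/index.html"):
    """Walk the site from `start`; `start` must be a key of `pages`.

    Returns (broken, orphans): broken is a set of (page, missing_target)
    pairs, orphans is the set of pages never reached from `start`.
    """
    broken = set()
    reachable = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in internal_links(current, pages[current]):
            if target not in pages:
                broken.add((current, target))
            elif target not in reachable:
                reachable.add(target)
                queue.append(target)
    return broken, set(pages) - reachable
```

Any page listed as an orphan is likely to be missing from a crawled archive unless you link to it or deposit it separately. This sketch only inspects ordinary links, so content reached through search boxes, filters or scripts still needs the separate consideration described above.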

There is so much more that could be written about web preservation, but hopefully this will provide some ideas and practical information about how to ensure the long-term availability of your web content. In addition, this is a fast-developing area, so this should be considered a ‘snapshot’ of the situation at the time of writing. A future blog post will look more specifically at preservation of digital scholarly content such as research articles; e-books; and open educational resources (OERs). For more information and support relating to research data management and sharing, visit the UCC Research Data Service LibGuide.

[Figure: ‘Link Rot problems’ meme, from the Internet Archive blog, reading ‘Link rot problems: the internet archive remembers’: https://perma.cc/D2QR-C6WA]

Useful resources

More detailed information about the steps to archiving web sites and making a site more suitable for archiving is outlined in Helen Cooper’s useful post, How to Archive your Website

Guidance on the importance of web archiving and tips on how to make websites archive-friendly from the Scottish government: Digital Web archiving – what you need to know

A PDF document with Basic Web Archiving Guidance from the UK National Archives: Basic Web Archiving Guidance (nationalarchives.gov.uk)

A series of best practices to keep in mind when designing websites, from the Library of Congress, USA: Creating Preservable Websites

By Deborah Thorpe, Research Data Steward, University College Cork

Acknowledgements

I would like to thank Eoghan Ó Carragáin, Joanna Finegan, Maria Ryan (The National Library of Ireland), Paul Davidson, Aoife Coffey, Elaine Harrington, and James Smith (University College Cork) for their comments and suggestions on the first draft of this blog post series. I have been fortunate to learn from their expertise and insight, and they have each given me ideas for future blog posts on related topics. Any errors or omissions in this piece remain my own. This is a huge, fast-moving and fascinating area, which I hope to continue to research and follow, in close collaboration with my colleagues, as it develops.