A Python tool to automatically archive web content (videos, images, social media) from Google Sheets and other sources in a secure, verifiable way.
Auto Archiver is a Python tool that automatically archives web content (videos, images, social media posts, and webpages) from sources such as Google Sheets or CSV files. It preserves online content securely and verifiably, so that archived material keeps its integrity and stays accessible for future reference.
It is aimed at journalists, researchers, archivists, and organizations that need to systematically preserve online content from social media and other web sources for verification or archival purposes.
Developers choose Auto Archiver for its automation, its support for multiple content types and storage backends, and its focus on secure, verifiable archiving, which suits it to critical data preservation workflows.
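As a rough illustration of what "verifiable" archiving can mean in practice, the sketch below records a SHA-256 digest of a saved file and later checks the file against it. This is not Auto Archiver's own code or API; the function names `sha256_of_file` and `verify_file` are hypothetical, and the choice of SHA-256 is an assumption made for the example.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks.

    Hypothetical helper for illustration; not part of Auto Archiver.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large media files are not loaded whole.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_hex):
    """Return True if the file at `path` still matches the recorded digest."""
    return sha256_of_file(path) == expected_hex
```

Storing the digest alongside the archived file at capture time lets anyone re-run the check later and detect whether the file has changed since it was archived.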
Automatically archive links to videos, images, and social media content from Google Sheets (and more).
Supports multiple input sources such as Google Sheets, CSV files, and the command line, enabling integration with various data pipelines and workflows.
Archives social media posts, videos, images, and webpages from URLs, covering most common web content types for comprehensive preservation.
Allows saving to remote storage backends such as S3 buckets and Google Drive, facilitating scalable and accessible archive management.
Appends archiving status back to the source spreadsheet or CSV report, providing built-in tracking and audit trails for batch processes.
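The CSV-in, status-out pattern described in the points above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Auto Archiver's implementation: `archive_url` is a hypothetical stand-in for whatever actually downloads and stores the content, and the column names `url`, `status`, and `archived_at` are invented for the example.

```python
import csv
from datetime import datetime, timezone


def archive_url(url):
    """Hypothetical placeholder for a real archiving step.

    It only checks the URL scheme; a real archiver would download and
    store the content.
    """
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"unsupported URL: {url!r}")
    return "archived"


def process_csv(in_path, out_path, archiver=archive_url):
    """Read URLs from a CSV and write a report with a status per row.

    Assumes the input has a header row containing a `url` column.
    """
    with open(in_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or [])
        # Add the tracking columns only if the input lacks them.
        for column in ("status", "archived_at"):
            if column not in fieldnames:
                fieldnames.append(column)
        writer = csv.DictWriter(dst, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for row in reader:
            try:
                row["status"] = archiver(row["url"])
            except Exception as exc:
                # Record the failure and keep going so one bad row
                # does not stop the batch.
                row["status"] = f"failed: {exc}"
            row["archived_at"] = datetime.now(timezone.utc).isoformat()
            writer.writerow(row)
```

Calling `process_csv("links.csv", "report.csv")` (both file names hypothetical) would produce a report that repeats each input row with its outcome and a UTC timestamp, which is the kind of audit trail the status column is meant to provide.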
Requires setup of configuration files (e.g., orchestration.yaml) and secrets management, which can be time-consuming and error-prone for new users.
Archiving is limited to supported content types and platforms; unsupported sites may require custom extensions or development effort.
Users must rely heavily on external documentation for installation and troubleshooting, which might not cover all edge cases or advanced scenarios.
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
⬛️ CLI tool and library for saving complete web pages as a single HTML file
💾 dn - offline full-text search and archiving for your Chromium-based browser.
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.