WEB SCRAPING · GOVERNMENT DATA

BORA Scraper Downloader

Automated extraction of 14,742 legal PDFs from Argentina's Official Gazette.

The Problem

Legal and compliance teams in Argentina need access to historical records from the BORA (Boletín Oficial de la República Argentina), but the official website has no bulk download option, blocks automated requests, and requires specific session handling to access PDFs. Manually downloading years of documents is not viable.

Technical Challenges

Session Priming

The server rejects direct PDF requests. You must first visit the page (GET), receive cookies, then POST to the download endpoint with exact parameters.
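The GET-then-POST sequence can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the URLs, the `fetch_pdf` name, and the `doc_params` form fields are hypothetical placeholders, since the real BORA endpoints and parameters are not listed here. The `session` argument is assumed to behave like a `requests.Session`, which stores cookies from the first response and sends them on later requests.

```python
from typing import Mapping

# Hypothetical endpoints: the real BORA URLs are not given in this write-up.
VIEW_URL = "https://example.invalid/view"
DOWNLOAD_URL = "https://example.invalid/download"


def fetch_pdf(session, doc_params: Mapping[str, str], timeout: float = 30.0) -> bytes:
    """Prime the session with a GET, then POST for the PDF bytes.

    `session` is expected to behave like requests.Session: cookies set by
    the GET response are kept and sent automatically with the POST.
    """
    # Step 1: visit the document page so the server issues its cookies.
    page = session.get(VIEW_URL, params=doc_params, timeout=timeout)
    page.raise_for_status()

    # Step 2: request the file through the same session (cookies included).
    resp = session.post(DOWNLOAD_URL, data=doc_params, timeout=timeout)
    resp.raise_for_status()
    return resp.content
```

With the `requests` library, the caller would pass `requests.Session()` as `session`; reusing one session object across both calls is what carries the cookies over.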

Anti-Bot Detection

Repeated requests trigger IP blocks. The server fingerprints User-Agent strings and request patterns.

Data Integrity

Some downloads fail silently, producing corrupted files smaller than 1 KB.
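A size check of this kind might look like the sketch below. The 1024-byte threshold comes from the observation above; the function name is hypothetical, and the extra `%PDF-` header check is an added assumption (valid PDF files begin with those bytes), not something the original workflow is stated to do.

```python
from pathlib import Path

MIN_PDF_BYTES = 1024  # threshold based on the "under 1 KB" failures described above


def looks_like_valid_pdf(path: Path, min_bytes: int = MIN_PDF_BYTES) -> bool:
    """Return True if the file exists, is large enough, and starts like a PDF."""
    if not path.is_file():
        return False
    # Silent failures show up as tiny files, so reject anything too small.
    if path.stat().st_size < min_bytes:
        return False
    # Assumed extra safeguard: a real PDF starts with the "%PDF-" marker.
    with path.open("rb") as fh:
        return fh.read(5) == b"%PDF-"
```

Files that fail the check can be added to a retry list instead of being counted as downloaded.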

My Solution

GET /view → Set Cookies → POST /download → Verify Size → Save PDF

Request flow per document

  • Replicate exact browser flow: GET page → set cookies → POST download
  • Rotate through 10+ modern User-Agents per session
  • Random delays between requests (10-30 seconds) to mimic human behavior
  • Destroy and recreate HTTP session every 5 days to avoid cookie accumulation
  • Verify file size post-download, flag corrupted files for retry
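The pacing and rotation rules above could be expressed with a few small helpers. This is a sketch under stated assumptions: the User-Agent strings are shortened placeholders (a real pool would hold 10+ current browser strings), and all function and constant names are hypothetical.

```python
import random
from datetime import datetime, timedelta

# Placeholder values: a real pool would contain 10+ full, current browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/2.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0",
]

SESSION_MAX_AGE = timedelta(days=5)  # recreate the HTTP session after this long


def pick_user_agent(rng=random) -> str:
    """Choose one User-Agent at random for a new session."""
    return rng.choice(USER_AGENTS)


def human_delay(rng=random, low: float = 10.0, high: float = 30.0) -> float:
    """Return a random pause, in seconds, within the 10-30 second window."""
    return rng.uniform(low, high)


def session_expired(created_at: datetime, now: datetime) -> bool:
    """Return True once the session has reached its maximum age."""
    return now - created_at >= SESSION_MAX_AGE
```

In a download loop, the caller would sleep for `human_delay()` seconds between documents and, whenever `session_expired(...)` returns True, discard the old session and build a new one with a fresh `pick_user_agent()` value in its headers.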

Results

14,742 PDFs extracted
0 IP bans
18 months of gazette data
5 min vs 20 hrs/week manual search

Stack

Python 3.10 · requests · datetime