5 Useful Python Scripts to Automate Boring PDF Tasks

# Introduction

PDF files are widely used in many workflows. You may need to merge reports, split large files, extract text or tables, add watermarks, or redact sensitive content. These are all routine tasks, but handling them manually across many files is slow and error-prone. These five Python scripts automate them. They run from the command line, support batch processing, and are easy to configure.

You can find all the scripts on GitHub.

# 1. Merging and Splitting PDF Files

// pain point

Combining multiple PDF files into one, or splitting a large PDF into separate files by page range, is one of the most common PDF tasks. Both are tedious to do manually, especially when dealing with many files or high page counts.

// what does the script do

Merges a folder of PDF files into a single output file in a configurable order, or splits each PDF into separate files by page range, every N pages, or at a list of specific page numbers. Both operations are controlled by the same script via a mode flag.

// how it works

The script uses pypdf for all page-level operations. In merge mode, it reads all the PDFs from an input folder, sorts them by file name (or by a custom order defined in a text file), and writes them sequentially into a single output PDF. In split mode, it accepts a list of page ranges, a fixed segment size, or a list of page numbers to split at. Each split segment is written to a numbered output file. Metadata from the first input file is preserved in merge mode.
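
The merge and split steps described above can be sketched roughly as follows. This is a minimal illustration rather than the script itself: the function names, the `"1-3,7"` range syntax, and the `_partNNN` output naming are assumptions made for the example, and metadata handling and custom ordering are left out.

```python
from pathlib import Path


def parse_page_ranges(spec: str, total_pages: int) -> list[range]:
    """Turn a 1-based spec such as "1-3,7" into 0-based page ranges (assumed syntax)."""
    ranges = []
    for part in spec.split(","):
        start_text, _, end_text = part.strip().partition("-")
        start = int(start_text)
        end = int(end_text) if end_text else start
        if not 1 <= start <= end <= total_pages:
            raise ValueError(f"invalid page range: {part!r}")
        ranges.append(range(start - 1, end))
    return ranges


def chunk_ranges(total_pages: int, size: int) -> list[range]:
    """Cut a document into consecutive segments of at most `size` pages."""
    return [range(i, min(i + size, total_pages)) for i in range(0, total_pages, size)]


def merge_folder(folder: Path, output: Path) -> None:
    """Append every PDF in `folder`, sorted by file name, into one output file."""
    # Imported here so the pure helpers above work without pypdf installed.
    from pypdf import PdfWriter  # third-party: pip install pypdf

    writer = PdfWriter()
    for pdf_path in sorted(folder.glob("*.pdf")):
        writer.append(pdf_path)
    with open(output, "wb") as fh:
        writer.write(fh)


def split_pdf(source: Path, ranges: list[range], out_dir: Path) -> list[Path]:
    """Write each page range of `source` to its own numbered file."""
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(source)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, page_range in enumerate(ranges, start=1):
        writer = PdfWriter()
        for index in page_range:
            writer.add_page(reader.pages[index])
        target = out_dir / f"{source.stem}_part{number:03d}.pdf"
        with open(target, "wb") as fh:
            writer.write(fh)
        written.append(target)
    return written
```

For a 10-page file, `split_pdf(path, chunk_ranges(10, 4), out_dir)` would produce three files with 4, 4, and 2 pages. Writing the merged output outside the input folder avoids picking it up again on the next run.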

⏩ Get PDF Merge and Split Script

# 2. Extracting Text and Tables from PDF

// pain point

Getting usable data from a PDF – whether it’s text from a report or tabular data from a statement – is something that needs to happen before further processing can begin. Copy-pasting from a PDF viewer is impractical for anything beyond a few pages, and the output is rarely clean.

// what does the script do

Extracts text and tables from one or more PDF files and writes the results to structured output files. Text is written to plain text or Markdown files. Tables are written to CSV or Excel, with one table per sheet. Supports both text-based PDFs and basic layout-preserving extraction.

// how it works

The script uses pypdf for raw text extraction and pdfplumber for layout-aware extraction and table detection. For each input file, it goes page by page, extracts text blocks, and locates table regions using pdfplumber's table finder. The extracted tables are normalized (empty rows removed, headers detected) and written to separate output files. A summary report lists how many pages and tables were found in each file, and flags any pages where the extraction produced no output.
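
A rough sketch of the page-by-page loop, using pdfplumber only, might look like this. The function names, the output file naming, and the shape of the summary dictionary are assumptions for illustration; header detection and Excel output are omitted.

```python
import csv
from pathlib import Path


def normalize_table(rows):
    """Strip whitespace from each cell and drop rows whose cells are all empty."""
    cleaned = []
    for row in rows:
        cells = [(cell or "").strip() for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def extract_pdf(pdf_path: Path, out_dir: Path) -> dict:
    """Write the text and tables of one PDF to files and return a small summary."""
    # Imported here so normalize_table can be used without pdfplumber installed.
    import pdfplumber  # third-party: pip install pdfplumber

    out_dir.mkdir(parents=True, exist_ok=True)
    summary = {"file": pdf_path.name, "pages": 0, "tables": 0, "empty_pages": []}
    page_texts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            summary["pages"] += 1
            text = page.extract_text() or ""
            tables = [t for t in map(normalize_table, page.extract_tables()) if t]
            if not text.strip() and not tables:
                # Nothing extractable: probably a scanned or blank page.
                summary["empty_pages"].append(page_number)
            page_texts.append(text)
            for table in tables:
                summary["tables"] += 1
                target = out_dir / f"{pdf_path.stem}_table{summary['tables']:03d}.csv"
                with open(target, "w", newline="", encoding="utf-8") as fh:
                    csv.writer(fh).writerows(table)
    text_file = out_dir / f"{pdf_path.stem}.txt"
    text_file.write_text("\n\n".join(page_texts), encoding="utf-8")
    return summary
```

Pages listed under `empty_pages` are the ones worth checking by hand, since they usually need OCR rather than text extraction.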

⏩ Get PDF Text and Table Extractor Script

# 3. Stamping, Watermarking, and Adding Page Numbers

// pain point

Adding a watermark, a stamp, or page numbers before distributing a batch of PDFs is simple in concept, but doing it one file at a time through a graphical user interface (GUI) is slow. When the batch is large or the requirement recurs frequently, it needs to be automated.

// what does the script do

Applies a text or image stamp to each page of one or more PDF files. Supports diagonal watermarks, header/footer text, page numbers, and image overlays. Position, font size, opacity and color are all configurable. Processes entire folders in batch.

// how it works

The script uses pypdf for page manipulation and reportlab to generate the stamp layer. For each input PDF, it creates a single-page stamp PDF in memory using ReportLab. It renders text at the configured position, angle, font, and opacity, or places an image at the specified coordinates. This stamp page is then merged onto each page of the source PDF using pypdf's page merging. The result is written to a new output file, leaving the original unchanged. Page numbers are handled as a special case, generating a unique stamp per page.
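
The overlay-and-merge approach can be sketched as below. This is an illustrative version under several assumptions: the function names, the `"Page {n} of {total}"` template, and the fixed fonts, angle, and footer position are all made up for the example. It builds one overlay per page so that the page number case is covered, and it ignores page rotation and non-zero media box origins.

```python
import io
from pathlib import Path


def page_label(number: int, total: int, template: str = "Page {n} of {total}") -> str:
    """Build the footer text for one page (the template is a hypothetical default)."""
    return template.format(n=number, total=total)


def make_overlay(width, height, watermark=None, footer=None, opacity=0.2) -> bytes:
    """Render a one-page PDF overlay in memory and return its bytes."""
    # Imported here so page_label can be used without reportlab installed.
    from reportlab.pdfgen import canvas  # third-party: pip install reportlab

    buffer = io.BytesIO()
    c = canvas.Canvas(buffer, pagesize=(width, height))
    if watermark:
        c.saveState()
        c.setFillAlpha(opacity)
        c.setFont("Helvetica-Bold", 48)
        c.translate(width / 2, height / 2)  # rotate around the page centre
        c.rotate(45)
        c.drawCentredString(0, 0, watermark)
        c.restoreState()
    if footer:
        c.setFont("Helvetica", 9)
        c.drawCentredString(width / 2, 20, footer)
    c.showPage()  # make sure a page exists even if nothing was drawn
    c.save()
    return buffer.getvalue()


def stamp_pdf(source: Path, output: Path, watermark: str) -> None:
    """Write a stamped copy of `source` to `output`; the original is not modified."""
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    writer = PdfWriter()
    writer.append(source)
    total = len(writer.pages)
    for number, page in enumerate(writer.pages, start=1):
        width = float(page.mediabox.width)
        height = float(page.mediabox.height)
        overlay = make_overlay(width, height, watermark, page_label(number, total))
        page.merge_page(PdfReader(io.BytesIO(overlay)).pages[0])
    with open(output, "wb") as fh:
        writer.write(fh)
```

When only a static watermark is needed and all pages share one size, the overlay can be rendered once and reused for every page, which is noticeably faster on long documents.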

⏩ Get PDF Marker Script

# 4. Redacting Sensitive Content

// pain point

Before a PDF can be shared externally, sensitive content – such as names, reference numbers, financial figures and addresses – often needs to be removed. Manually creating a black box over text in a PDF editor works, but doesn’t actually remove the underlying text in all tools, and it’s impractical for more than a few pages.

// what does the script do

Scans PDF pages for text matching patterns you define (regex patterns, exact strings, or predefined categories like email addresses and phone numbers) and permanently redacts the matching content by replacing it with black rectangles. Outputs a new PDF in which the underlying text is removed, not just visually obscured.

// how it works

The script uses PyMuPDF, which provides both text search with bounding box coordinates and the ability to draw redaction annotations that, when applied, permanently delete the underlying content. For each page, the script searches for all matches of each configured pattern, marks the bounding rectangles as redaction annotations, then applies them, which removes the text from the page content stream. A report is written listing each redaction made, including the page number, the text matched (before redaction), and the pattern that triggered it.
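
A minimal sketch of that search, mark, and apply sequence with PyMuPDF is shown below. The two regular expressions, the function names, and the report format are assumptions for illustration and are much looser than a production pattern set would be.

```python
import re
from pathlib import Path

# Hypothetical example patterns; real ones would need tuning for your documents.
PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "phone": r"\+?\d[\d\s().-]{7,}\d",
}


def find_matches(text, patterns=PATTERNS):
    """Return unique (pattern_name, matched_text) pairs in order of first appearance."""
    seen = []
    for name, pattern in patterns.items():
        for match in re.finditer(pattern, text):
            pair = (name, match.group())
            if pair not in seen:
                seen.append(pair)
    return seen


def redact_pdf(source: Path, output: Path, patterns=PATTERNS) -> list[dict]:
    """Redact every pattern match in `source`, save to `output`, return a report."""
    # Imported here so find_matches can be used without PyMuPDF installed.
    import pymupdf  # third-party: pip install pymupdf (older releases: import fitz)

    report = []
    with pymupdf.open(source) as doc:
        for page_number, page in enumerate(doc, start=1):
            for name, matched in find_matches(page.get_text(), patterns):
                # search_for returns one rectangle per occurrence on the page.
                for rect in page.search_for(matched):
                    page.add_redact_annot(rect, fill=(0, 0, 0))
                    report.append({"page": page_number, "pattern": name, "text": matched})
            # Applying the annotations removes the covered text from the page.
            page.apply_redactions()
        doc.save(output, garbage=4, deflate=True)
    return report
```

Because `search_for` looks up the matched string literally, a match that wraps across two lines may not be located, so it is worth re-extracting the text of the output file and re-running the patterns as a check.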

⏩ Get PDF Redaction Script

# 5. Extracting Metadata and Creating a PDF Inventory

// pain point

When working with large collections of PDF files, it's often useful to know the basic facts about each one: page count, file size, creation date, author, whether it's encrypted, and whether it contains text or scanned images. Checking each file individually through a viewer is not practical at scale.

// what does the script do

Scans a folder of PDF files and extracts metadata from each, including page count, file size, creation and modification dates, author, creator, encryption status, and whether the document contains searchable text or scanned images. Writes everything to a CSV or Excel inventory file.

// how it works

The script uses pypdf to read document metadata from the PDF information dictionary and pdfplumber to sample pages for text content. For each file, it tries to open the PDF and read the standard metadata fields. It samples the first few pages to determine whether the file contains extractable text as opposed to scanned image pages. Encrypted files that can't be opened are flagged instead of silently discarded. The output inventory includes one row per file with all extracted fields, plus a summary row at the bottom with totals and averages.
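
The per-file inspection could be sketched along these lines. The column names, the 20-character threshold for deciding that a file has extractable text, and the function names are assumptions for the example; the summary row and Excel output are left out.

```python
import csv
from pathlib import Path

FIELDS = ["file", "size_bytes", "pages", "author", "creator",
          "created", "modified", "encrypted", "content", "error"]


def classify_content(sample_texts, min_chars=20):
    """Label a file by how much text its sampled pages produced (assumed threshold)."""
    total = sum(len(text.strip()) for text in sample_texts)
    return "text" if total >= min_chars else "scanned_or_empty"


def inspect_pdf(path: Path, sample_pages: int = 3) -> dict:
    """Collect one inventory row for a single PDF file."""
    # Imported here so classify_content works without these packages installed.
    import pdfplumber  # third-party: pip install pdfplumber
    from pypdf import PdfReader  # third-party: pip install pypdf

    row = dict.fromkeys(FIELDS, "")
    row.update(file=path.name, size_bytes=path.stat().st_size)
    try:
        reader = PdfReader(path)
        row["encrypted"] = reader.is_encrypted
        if reader.is_encrypted:
            row["error"] = "encrypted, not opened"
            return row
        row["pages"] = len(reader.pages)
        info = reader.metadata  # may be None when the file has no info dictionary
        if info is not None:
            row.update(
                author=info.author or "",
                creator=info.creator or "",
                created=info.creation_date_raw or "",
                modified=info.modification_date_raw or "",
            )
        with pdfplumber.open(path) as pdf:
            samples = [page.extract_text() or "" for page in pdf.pages[:sample_pages]]
        row["content"] = classify_content(samples)
    except Exception as exc:  # broad on purpose: one bad file should not stop the batch
        row["error"] = str(exc)
    return row


def build_inventory(folder: Path, output_csv: Path) -> None:
    """Write one CSV row per PDF found in `folder`."""
    with open(output_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        for pdf_path in sorted(folder.glob("*.pdf")):
            writer.writerow(inspect_pdf(pdf_path))
```

The raw date strings are stored as found in the file because malformed dates are common in real-world PDFs; parsing them can be done later on the CSV.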

⏩ Get PDF Inventory Script

# Wrapping Up

These five Python scripts handle PDF tasks that typically turn into repetitive manual work: splitting files, extracting content, processing batches, and cleaning up document workflows. Each script is designed to work safely on single files or entire folders while generating new output rather than modifying the original files.

Start with a small batch, verify the output, then scale up to larger folders when everything looks right. Most setup involves simply installing the listed dependencies and adjusting the configuration section for your file paths and settings.

Bala Priya C is a developer and technical writer from India. She likes to work in the fields of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, Data Science, and Natural Language Processing. She loves reading, writing, coding, and coffee! Currently, she is working on learning and sharing her knowledge with the developer community by writing tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.
