How to delete a certain element present on all pages of a pdf?

happeningtofry99158@lemmy.world · 1 day ago

HelloRoot@lemy.lol · 1 day ago

To add to what Elvith wrote:

You can read the HTML-like structure inside a PDF, find out details about the elements you want to remove, and then remove them based on a property they all have in common.

This is very hard to do by hand. But if you are curious, you can download https://file-examples.com/wp-content/storage/2017/10/file-sample_150kB.pdf and open it with a text editor like Kate. You will see a lot of encoded content data, but also the “HTML-like” structure in plain text (in between the encoded parts, and more of it at the bottom).

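You might find that editing the PDF by hand breaks it completely, because the format is complicated. Iirc there is no checksum, but you'd need to fix the index (the cross-reference table, which stores the byte offset of every object) and the stream lengths, since any edit that changes the file size shifts those offsets. That is usually taken care of by the library.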
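
To make the "index" part a bit more concrete, here is a minimal sketch that uses only the Python standard library. It assembles a bare-bones one-page PDF by hand, writes the cross-reference table, and then shows how a small manual edit leaves the recorded offsets pointing at the wrong bytes. The helper names build_minimal_pdf and xref_is_consistent are made up for this illustration, and the checker only understands the simple classic table layout produced here, not arbitrary real-world files.

```python
def build_minimal_pdf() -> bytes:
    """Assemble an empty one-page PDF and its cross-reference table by hand."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_start = len(out)  # byte offset of the table itself
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is the head of the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


def xref_is_consistent(pdf: bytes) -> bool:
    """Check that every in-use table entry points at the start of its object."""
    # The number after the last "startxref" is the byte offset of the table.
    tail = pdf[pdf.rindex(b"startxref") + len(b"startxref"):]
    xref_start = int(tail.split()[0])
    if not pdf.startswith(b"xref", xref_start):
        return False
    lines = pdf[xref_start:].split(b"\n")
    count = int(lines[1].split()[1])  # subsection header: "first count"
    for number, entry in enumerate(lines[2:2 + count]):
        offset, _generation, flag = entry.split()
        if flag == b"n" and not pdf.startswith(b"%d 0 obj" % number, int(offset)):
            return False
    return True


pdf = build_minimal_pdf()
print(xref_is_consistent(pdf))  # the freshly built file is consistent

# A "hand edit" that adds a few bytes to object 2 shifts everything after it,
# so the offsets recorded in the table no longer match.
edited = pdf.replace(b"/Count 1", b"/Count 1 /Note (hand edit)")
print(xref_is_consistent(edited))
```

Many viewers try to repair such a file on their own, so a broken table does not always show up immediately, but it is the reason a library that rewrites the table on save is the safer route.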

Here is a shitty Python example for that demo PDF that redacts the first image on the last page. A redaction in PyMuPDF is more than a white rectangle drawn on top: apply_redactions() also removes the text and blanks the image pixels under the rectangle, so the content is really gone from the file. Iirc newer PyMuPDF versions also have page.delete_image(xref) for images, and other libraries might offer more direct ways to delete elements (the one I used a decade ago in Java could). This is just an example to point you in the general direction; I hope you can see from here how iterating over all the pages, picking the right element, and redacting it would work.

import pymupdf  # PyMuPDF

# Open the PDF
doc = pymupdf.open("./file-sample_150kB.pdf")

# Get the last page
page = doc[-1]

# Get all images on the page
images = page.get_images(full=True)

if images:
    # Get the xref of the first image
    xref = images[0][0]

    # Find all instances of the image and redact their bounding boxes
    for info in page.get_image_info(xrefs=True):
        if info["xref"] == xref:
            rect = pymupdf.Rect(info["bbox"])
            page.add_redact_annot(rect, fill=(1, 1, 1))  # white fill

    page.apply_redactions()

# Save the modified PDF
doc.save("./modified.pdf")
doc.close()
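
For the "present on all pages" case, here is a rough sketch of how the iteration could look. It assumes the unwanted element is an image (say, a logo) that sits at nearly the same position on every page, which may not match your file. boxes_on_every_page and redact_repeated_images are hypothetical helper names; the PyMuPDF calls are the same ones used in the example above.

```python
def round_box(box):
    """Round a bounding box to whole points so tiny float noise is ignored."""
    return tuple(round(value) for value in box)


def boxes_on_every_page(pages):
    """Return the (rounded) bounding boxes that occur on every page.

    `pages` holds one list per page; each list contains (x0, y0, x1, y1) tuples.
    """
    if not pages:
        return set()
    rounded = [{round_box(box) for box in page} for page in pages]
    return set.intersection(*rounded)


def redact_repeated_images(src, dst):
    """Redact every image whose position repeats on all pages of `src`."""
    # Imported here so the pure helpers above can be used without PyMuPDF.
    import pymupdf

    doc = pymupdf.open(src)
    # Collect the image bounding boxes of each page.
    per_page = [[info["bbox"] for info in page.get_image_info()] for page in doc]
    repeated = boxes_on_every_page(per_page)

    for page, boxes in zip(doc, per_page):
        for box in boxes:
            if round_box(box) in repeated:
                page.add_redact_annot(pymupdf.Rect(box), fill=(1, 1, 1))  # white fill
        page.apply_redactions()

    doc.save(dst)
    doc.close()
```

You would call it as redact_repeated_images("input.pdf", "output.pdf"). If the element is text rather than an image, the same idea should work with text block positions instead of image positions.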

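A way simpler approach, if the element sits at the bottom edge of every page, might be to crop all pages at the bottom. Note that cropping only hides the content; it stays in the file.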

import pymupdf  # PyMuPDF

doc = pymupdf.open("input.pdf")  # open the PDF

for page in doc:
    rect = page.rect  # original page size; origin is top-left, y grows downward
    # keep everything except the bottom 100 points (1 point = 1/72 inch)
    new_rect = pymupdf.Rect(rect.x0, rect.y0, rect.x1, rect.y1 - 100)
    page.set_cropbox(new_rect)

doc.save("output.pdf")  # save the cropped PDF
doc.close()

Here are the docs: https://pymupdf.readthedocs.io/en/latest/the-basics.html

happeningtofry99158@lemmy.world · 1 day ago

Thanks a lot!
