Python Remove Active Content From Pdf


I am using the terrific Python Requests library. I notice that the documentation has many examples of how to do something without explaining the why.

For instance, both r.text and r.content are shown as examples of how to get the server response. But where is it explained what these properties do? When would I choose one over the other? I see that r.text returns a unicode object sometimes, and I suppose that there would be a difference for a non-text response. But where is all this documented? Note that the linked document does state: "You can also access the response body as bytes, for non-text requests:" But then it goes on to show an example of a text response!

I can only suppose that the quote above means to say non-text responses instead of non-text requests, as a non-text request does not make sense in HTTP.

In short, where is the proper documentation of the library, as opposed to the (excellent) tutorial on the Python Requests site?
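As far as I can tell, the distinction comes down to bytes versus decoded text: r.content holds the raw response body as bytes, and r.text is that same body decoded with the encoding in r.encoding. The sketch below imitates that relationship with plain Python so it runs without a network connection or the library itself. The names raw_body and decode_body are made up for illustration and are not part of Requests.

```python
# Stand-in for a response body; in Requests this is what r.content would hold.
raw_body = "café".encode("utf-8")  # bytes

def decode_body(content: bytes, encoding: str) -> str:
    """Decode raw bytes into text, the step that separates r.text from r.content.

    errors="replace" is a choice made for this sketch so that a wrong
    encoding yields odd characters instead of an exception.
    """
    return content.decode(encoding, errors="replace")

as_utf8 = decode_body(raw_body, "utf-8")      # correct guess: readable text
as_latin1 = decode_body(raw_body, "latin-1")  # wrong guess: mojibake

print(type(raw_body).__name__, len(raw_body))
print(as_utf8)
print(as_latin1)
```

So for an image, a PDF or any other binary payload, the bytes are what you want, because decoding them as text would only corrupt them. For HTML or JSON, the decoded text is usually more convenient, provided the encoding is right.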

I'm creating a Python script to edit text from PDFs. I have this Python code which allows me to add text into specific positions of a PDF file.

A general purpose algorithm for replacing text in a PDF is a difficult problem. I'm not saying it can't ever be done, because I've demonstrated doing so with the Adobe PDF Library, albeit with a very simple input file with no complications, but I'm not sure that PyPDF2 has the facilities required to do so.

In part, just finding the text can be a challenge. You (or more realistically your PDF library) have to parse the page contents and keep track of the changes to the graphics state: changes to the current transformation matrix (in case the text is in a Form XObject), changes to the text matrix, and changes to the font. You also have to use the font resource to get character widths, so you can figure out where the text cursor will be positioned after a string is shown.
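To give a feel for the character-width bookkeeping, here is a minimal sketch of how far the text cursor moves after a string is shown. It assumes a simple single-byte font whose widths are given in thousandths of a text-space unit, and it ignores TJ kerning adjustments and the transformation matrices entirely. The WIDTHS table and the function name text_advance are illustrative assumptions, not something read from a real font resource or taken from any library.

```python
# Illustrative glyph widths in 1/1000 text-space units (assumed values,
# not read from a real font resource).
WIDTHS = {"H": 722, "e": 556, "l": 222, "o": 556, " ": 278, "i": 222}

def text_advance(text, widths, font_size, char_spacing=0.0,
                 word_spacing=0.0, horizontal_scale=1.0):
    """Return the horizontal distance the text cursor moves after showing text."""
    advance = 0.0
    for ch in text:
        # Scale the glyph width from 1/1000 units to the current font size.
        glyph = widths[ch] / 1000.0 * font_size
        # Character spacing applies to every glyph, word spacing only to spaces.
        extra = char_spacing + (word_spacing if ch == " " else 0.0)
        advance += (glyph + extra) * horizontal_scale
    return advance

print(text_advance("Hi", WIDTHS, font_size=10))
print(text_advance("H i", WIDTHS, font_size=10, word_spacing=2.0))
```

A real implementation would still have to map this distance through the text matrix and the current transformation matrix to get page coordinates, and would need the actual widths for whichever font is selected at that point in the content stream.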

You may need to handle the standard 14 fonts, which don't contain that information in their font resources (the application, that is, your program, is expected to know their metrics).

After all that, removing the text is easy if you don't need to break up a Tj or TJ (show text) instruction into different parts. Preventing the text that follows from shifting, if that's what's desired, may require inserting a new Tm instruction to reposition that text to where it would have been.

Inserting new text can be challenging. If you want to stay consistent with the font being used, and it is embedded and subset, it may not contain the glyphs you need for your text insertion. And after insertion, you then have to decide whether you need to reflow the text that comes after the text you inserted.

And lastly, you will need your PDF library to save all the changes. Quite frankly, using Adobe Acrobat's Redaction features would likely be a more cost-effective way of doing this than trying to program it from scratch.
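The "easy" removal case can be sketched as follows, assuming the page content has already been parsed into a list of (operands, operator) pairs and that strings have already been decoded to plain Python text. Both are big assumptions: real content streams hold encoded bytes whose meaning depends on the current font. The helper names shown_text and remove_text and the sample page_ops list are hypothetical.

```python
def shown_text(operands, operator):
    """Return the text a Tj/TJ operation shows, or None for other operators."""
    if operator == "Tj":
        return operands[0]
    if operator == "TJ":
        # TJ takes one array mixing strings and numeric position adjustments.
        return "".join(item for item in operands[0] if isinstance(item, str))
    return None

def remove_text(operations, target):
    """Drop every Tj/TJ operation whose whole shown text equals target."""
    return [(operands, operator) for operands, operator in operations
            if shown_text(operands, operator) != target]

# A made-up, already-parsed text object with two lines of text.
page_ops = [
    ([], "BT"),
    (["F1", 12], "Tf"),
    ([1, 0, 0, 1, 72, 700], "Tm"),
    (["Draft"], "Tj"),
    ([1, 0, 0, 1, 72, 680], "Tm"),
    ([["Fi", -20, "nal"]], "TJ"),
    ([], "ET"),
]

cleaned = remove_text(page_ops, "Draft")
for operands, operator in cleaned:
    print(operands, operator)
```

In this sample each line of text is positioned by its own Tm, so nothing shifts when "Draft" is dropped. When the following text relies on the cursor position left behind by the removed string, that is where the extra Tm mentioned above would have to be inserted.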


You are confusing ReportLab, which can be used to create new content, with PyPDF2, which has some limited functionality for manipulating existing PDFs. From my perusing of the PyPDF2 documentation, I don't see that you can remove existing content, but you may be able to cover it up with a white-filled path prior to adding text in that position. If you go this route, a user might still see the original text before it gets covered up, and text extraction would likely pull out both the original and new text for that area, and possibly intermingle them into something unintelligible. – Jul 21 '17 at 18:34

If you want to do a poor man's redaction with ReportLab and PyPDF2, you would create your replacement content with ReportLab. Given a Canvas c, a rectangle indicating an area, a text string, and a point where the text string would be inserted, you would then:

    # set the fill color to white
    c.setFillColorRGB(1, 1, 1)
    # draw a filled rectangle over the area (x, y, width, height are your rectangle)
    c.rect(x, y, width, height, fill=1)
    # change the fill color back to black
    c.setFillColorRGB(0, 0, 0)
    # textX, textY is the text insert position
    c.drawString(textX, textY, textString)

Save this PDF document you've created to a temporary file.

Open this PDF document and the document you want to modify using PyPDF2's PdfFileReader. Create a PdfFileWriter object, call it modifiedDoc. Get page 0 of the temporary PDF, call it updatePage. Get page n of the other document, call it toModifyPage.

    toModifyPage.mergePage(updatePage)

After you are done updating pages:

    modifiedDoc.cloneDocumentFromReader(srcDoc)
    modifiedDoc.write(outStream)

Again, if you go this route, a user might still see the original text before it gets covered up with the new content, and text extraction would likely pull out both the original and new text for that area, and possibly intermingle them into something unintelligible.