Download a PDF file such as https://www.irs.gov/pub/irs-pdf/f1040.pdf from your browser or with curl:
curl https://www.irs.gov/pub/irs-pdf/f1040.pdf > f1040.pdf
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  146k  100  146k    0     0   961k      0 --:--:-- --:--:-- --:--:--  963k
ls -l f1040.pdf
-rw-r--r-- 1 myname mygroup 149958 May 26 08:02 f1040.pdf
file f1040.pdf
f1040.pdf: PDF document, version 1.7
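The version reported by file comes from the header at the start of the PDF, which looks like %PDF-1.7. As a rough sketch, the same check can be made from Python. The function name pdf_version is made up for this example, and searching the first 1024 bytes (rather than only the very first line) is an assumption intended to tolerate files with leading junk.

```python
import re


def pdf_version(path):
    """Return the version from a PDF header such as %PDF-1.7, or None if absent."""
    with open(path, "rb") as pdf_file:
        header = pdf_file.read(1024)  # assumption: the header lies in the first 1024 bytes
    match = re.search(rb"%PDF-(\d+\.\d+)", header)
    if match is None:
        return None
    return match.group(1).decode("ascii")
```

For the file downloaded above, pdf_version("f1040.pdf") should agree with the version that file printed.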
Install pdfminer for Python 2, or pdfminer.six for Python 3.
pip3 install pdfminer.six
pip3 list
pip3 show pdfminer.six
which pdf2txt.py
/Library/Frameworks/Python.framework/Versions/3.7/bin/pdf2txt.py
pdf2txt.py --help
dumppdf.py
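The pip3 show and which checks above can also be made from inside Python. This is only a sketch: it assumes Python 3.8 or later (for importlib.metadata), and the function names installed_version and script_location are invented for the example.

```python
import shutil
from importlib import metadata


def installed_version(distribution="pdfminer.six"):
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def script_location(script="pdf2txt.py"):
    """Return the full path of a script on the PATH (like which), or None."""
    return shutil.which(script)
```

If installed_version() returns None, the pip3 install step has not taken effect for the Python you are running.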
In the directory that holds your downloaded f1040.pdf, create a new file f1040.txt and examine it with TextEdit.app.
pdf2txt.py -o f1040.txt f1040.pdf
ls -l f1040.txt
-rw-r--r-- 1 myname mygroup 5362 May 26 08:05 f1040.txt
file f1040.txt
f1040.txt: UTF-8 Unicode text
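If you would rather launch the same command from a Python script, something like the following should work. It is a minimal sketch: build_command and convert are hypothetical helper names, and the only pdf2txt.py option used is the -o shown above.

```python
import shutil
import subprocess


def build_command(pdf_path, output_path, script="pdf2txt.py"):
    """Return the argument list for: pdf2txt.py -o output_path pdf_path"""
    return [script, "-o", output_path, pdf_path]


def convert(pdf_path, output_path, script="pdf2txt.py"):
    """Run the conversion. Return the exit status, or None if the script is not on the PATH."""
    if shutil.which(script) is None:
        return None
    completed = subprocess.run(build_command(pdf_path, output_path, script))
    return completed.returncode
```

For example, convert("f1040.pdf", "f1040.txt") runs the command above and returns its exit status.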
In the directory that holds your downloaded f1040.pdf, create a new file f1040.html and examine it with your browser.
In my Chrome browser, I pulled down File → Open File…
pdf2txt.py -o f1040.html f1040.pdf
ls -l f1040.html
-rw-r--r-- 1 myname mygroup 121924 May 26 08:21 f1040.html
file f1040.html
f1040.html: HTML document text, UTF-8 Unicode text, with very long lines
You can also say -o f1040.xml instead of -o f1040.html.
f1040.xml: XML 1.0 document text, UTF-8 Unicode text, with very long lines
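The three commands above differ only in the extension of the -o filename, which suggests that pdf2txt.py picks the output format from that extension. The sketch below imitates the idea. The function name and the mapping table are illustrative assumptions, not the tool's actual code.

```python
import os

# Hypothetical table: filename extension -> output format.
FORMAT_BY_EXTENSION = {
    ".txt": "text",
    ".htm": "html",
    ".html": "html",
    ".xml": "xml",
}


def guess_output_type(filename, default="text"):
    """Guess an output format from the extension of filename."""
    extension = os.path.splitext(filename)[1].lower()
    return FORMAT_BY_EXTENSION.get(extension, default)
```

With this table, guess_output_type("f1040.xml") gives "xml", and an unrecognized extension falls back to plain text.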
Without the laparams argument to TextConverter in the program below, each page came out as one big line of text.
"Convert a PDF file to text and print it."
import io
import sys

import pdfminer.converter
import pdfminer.layout   #imported explicitly because LAParams is used below
import pdfminer.pdfinterp
import pdfminer.pdfpage

try:
    pdfFile = open("f1040.pdf", "rb")   #read binary
except OSError as error:
    print(error, file = sys.stderr)
    sys.exit(1)

resourceManager = pdfminer.pdfinterp.PDFResourceManager()
stringFile = io.StringIO()   #collects the extracted text in memory
#line_margin: how close two lines must be (relative to line height) to be grouped together
layoutParameters = pdfminer.layout.LAParams(line_margin = 0.1)
textConverter = pdfminer.converter.TextConverter(resourceManager, stringFile, laparams = layoutParameters)
pageInterpreter = pdfminer.pdfinterp.PDFPageInterpreter(resourceManager, textConverter)

try:
    pages = pdfminer.pdfpage.PDFPage.get_pages(pdfFile, caching = True, check_extractable = True)
    for page in pages:
        pageInterpreter.process_page(page)
    oneBigString = stringFile.getvalue()   #must be read before stringFile is closed
except Exception as error:
    print(error, file = sys.stderr)
    sys.exit(1)
finally:
    pdfFile.close()
    textConverter.close()
    stringFile.close()

if len(oneBigString) == 0:
    sys.exit(1)

print(oneBigString)

#Or print the text line by line:
#for i, line in enumerate(oneBigString.splitlines(), start = 1):
#    print(i, line)

sys.exit(0)
m1040 Department of the Treasury—Internal Revenue Service U.S. Individual Income Tax Return 2018 OMB No. 1545-0074 etc.
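Recent releases of pdfminer.six also provide a shortcut, pdfminer.high_level.extract_text, that wraps the resource manager, converter, and page interpreter used in the program above. The sketch below assumes your installed release includes it. The names pdf_to_text and numbered_lines are invented for this example, and numbered_lines mirrors the commented-out loop above.

```python
def pdf_to_text(path, line_margin=0.1):
    """Return the text of a PDF file as one string, with the same line_margin as above."""
    # Imported here so that this file can be loaded even before pdfminer.six is installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    return extract_text(path, laparams=LAParams(line_margin=line_margin))


def numbered_lines(text):
    """Return the lines of text, each prefixed with its line number starting at 1."""
    return [f"{i} {line}" for i, line in enumerate(text.splitlines(), start=1)]
```

For example, print("\n".join(numbered_lines(pdf_to_text("f1040.pdf")))) prints the form's text with line numbers.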