Of course. Here are the two sections of the log, from the OCR process and from the date detection:
[CODE]
-----------------------------------------------------------------------------------
| processing PDF @ OCRmyPDF: |
-----------------------------------------------------------------------------------
➜ OCRmyPDF-LOG:
reading file from standard input
Start processing 4 pages concurrently
1 page is facing ⇧, confidence 14.29 - rotation appears correct
4 page is facing ⇧, confidence 14.87 - rotation appears correct
2 page is facing ⇧, confidence 15.51 - rotation appears correct
3 page is facing ⇧, confidence 15.86 - rotation appears correct
4 [tesseract] lots of diacritics - possibly poor OCR
6 [tesseract] Too few characters. Skipping this page
6 [tesseract] Too few characters. Skipping this page
6 [tesseract] Error during processing.
6 page is facing ⇧, confidence 0.00 - no change
5 page is facing ⇧, confidence 13.73 - no change
6 [tesseract] Empty page!!
6 [tesseract] Empty page!!
Postprocessing...
Optimize ratio: 1.00 savings: -0.0%
Image optimization did not improve the file - optimizations will not be used
Output sent to stdout
← OCRmyPDF-LOG-END
target file (OK): /tmp/tmp.shl28WamYI/step1_tmp_1727708403/300982024165537.pdf
-----------------------------------------------------------------------------------
| search for a valid date in ocr text: |
-----------------------------------------------------------------------------------
2024-09-30 17:00:50,955 - Date scanning started
2024-09-30 17:00:50,955 - Version: 1.04
2024-09-30 17:00:50,955 - Parameter minYear = 0
2024-09-30 17:00:50,955 - Parameter maxYear = 0
2024-09-30 17:00:50,955 - Parameter searchnearest = off
2024-09-30 17:00:50,955 - set searchnearest = off
2024-09-30 17:00:50,955 - Parameter fileWithTextFindings = /tmp/tmp.shl28WamYI/step2_tmp_1727708446//synOCR.txt
2024-09-30 17:00:50,955 - Parameter dateBlackLIst = off
2024-09-30 17:00:50,955 - start checking blacklist
2024-09-30 17:00:51,077 - end checking blacklist
2024-09-30 17:00:51,078 - Start searching for alphanumerical and numerical dates......
2024-09-30 17:00:55,206 - finish searching for alphanumerical and numerical dates......
2024-09-30 17:00:55,207 - found 0 dates
2024-09-30 17:00:55,207 - no dates found
2024-09-30 17:00:55,207 - found date None
2024-09-30 17:00:55,207 - Date scanning ended
Date not found in OCR text - use file date:
day: 30
month:09
year: 2024[/CODE]
I have attached the upper part of the PDF. Oddly, the blank page is not removed either.