
Of course. Here are the two sections of the log, from the OCR process and from the date detection:


[CODE]


 -----------------------------------------------------------------------------------

  | processing PDF @ OCRmyPDF:                                                      |

  -----------------------------------------------------------------------------------


                ➜ OCRmyPDF-LOG:

                  reading file from standard input

                  Start processing 4 pages concurrently

                      1 page is facing ⇧, confidence 14.29 - rotation appears correct

                      4 page is facing ⇧, confidence 14.87 - rotation appears correct

                      2 page is facing ⇧, confidence 15.51 - rotation appears correct

                      3 page is facing ⇧, confidence 15.86 - rotation appears correct

                      4 [tesseract] lots of diacritics - possibly poor OCR

                      6 [tesseract] Too few characters. Skipping this page

                      6 [tesseract] Too few characters. Skipping this page

                      6 [tesseract] Error during processing.

                      6 page is facing ⇧, confidence 0.00 - no change

                      5 page is facing ⇧, confidence 13.73 - no change

                      6 [tesseract] Empty page!!

                      6 [tesseract] Empty page!!

                  Postprocessing...

                  Optimize ratio: 1.00 savings: -0.0%

                  Image optimization did not improve the file - optimizations will not be used

                  Output sent to stdout

                ← OCRmyPDF-LOG-END


                target file (OK): /tmp/tmp.shl28WamYI/step1_tmp_1727708403/300982024165537.pdf



-----------------------------------------------------------------------------------

  | search for a valid date in ocr text:                                            |

  -----------------------------------------------------------------------------------


2024-09-30 17:00:50,955 - Date scanning started

2024-09-30 17:00:50,955 - Version: 1.04

2024-09-30 17:00:50,955 - Parameter minYear = 0

2024-09-30 17:00:50,955 - Parameter maxYear = 0

2024-09-30 17:00:50,955 - Parameter searchnearest = off

2024-09-30 17:00:50,955 - set searchnearest = off

2024-09-30 17:00:50,955 - Parameter fileWithTextFindings = /tmp/tmp.shl28WamYI/step2_tmp_1727708446//synOCR.txt

2024-09-30 17:00:50,955 - Parameter dateBlackLIst = off

2024-09-30 17:00:50,955 - start checking blacklist

2024-09-30 17:00:51,077 - end checking blacklist

2024-09-30 17:00:51,078 - Start searching for alphanumerical and numerical dates......

2024-09-30 17:00:55,206 - finish searching for alphanumerical and numerical dates......

2024-09-30 17:00:55,207 - found 0 dates

2024-09-30 17:00:55,207 - no dates found

2024-09-30 17:00:55,207 - found date None

2024-09-30 17:00:55,207 - Date scanning ended

                  Date not found in OCR text - use file date:

                  day:  30

                  month:09

                  year: 2024[/CODE]


I have attached the top part of the PDF. Strangely, the blank page is not removed either.
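
In case it helps with narrowing this down: a small sketch of how the pages that tesseract reported as empty could be pulled out of the OCRmyPDF log above. The function name `empty_pages` and the sample text are my own for illustration, and the line format is only assumed from the log excerpt in this post; this is not how synOCR itself detects blank pages.

```python
import re

# Matches lines such as "6 [tesseract] Empty page!!"
# (format assumed from the log excerpt in this post).
EMPTY_PAGE = re.compile(r"^\s*(\d+)\s+\[tesseract\]\s+Empty page!!")


def empty_pages(log_text: str) -> list[int]:
    """Return the sorted, de-duplicated page numbers reported as empty."""
    pages = set()
    for line in log_text.splitlines():
        match = EMPTY_PAGE.match(line)
        if match:
            pages.add(int(match.group(1)))
    return sorted(pages)


# A few lines copied from the log above, used as sample input.
sample = """\
    4 [tesseract] lots of diacritics - possibly poor OCR
    6 [tesseract] Too few characters. Skipping this page
    6 [tesseract] Empty page!!
    6 [tesseract] Empty page!!
    5 page is facing ⇧, confidence 13.73 - no change
"""

print(empty_pages(sample))
```

On the excerpt above this flags only page 6, which matches the page that OCRmyPDF skipped.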

