
[USER=118999]@Gthorsten[/USER] I scanned about 40 very different documents to build and verify my YAML file. Along the way I tried out quite a few things, including switching the date search from python to regex.

The date 21.09.2022 from the document I used above is found by the python date search.


However, I also have a document containing the date "17/12/22", which the python search does not find but the regex search does.
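To make clear which pattern I mean, here is a minimal bash sketch of a dd mm [yy]yy match. It is not the regex synOCR actually uses; the function name `find_dmy_date` is made up for illustration, and mapping two-digit years to the 2000s is my own assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper, NOT part of synOCR: print the first dd<sep>mm<sep>[yy]yy
# date found in a text as yyyy-mm-dd, or return 1 if there is none.
find_dmy_date() {
    local text=$1
    # 1-2 digit day, separator (. / or -), 1-2 digit month, separator, 4 or 2 digit year.
    # The outer groups make sure the match is not part of a longer number.
    local re='(^|[^0-9])([0-9]{1,2})[./-]([0-9]{1,2})[./-]([0-9]{4}|[0-9]{2})($|[^0-9])'
    [[ $text =~ $re ]] || return 1

    # 10# forces base 10 so that "08" and "09" are not treated as octal.
    local dd=$((10#${BASH_REMATCH[2]}))
    local mm=$((10#${BASH_REMATCH[3]}))
    local yy=${BASH_REMATCH[4]}

    # Assumption: a two-digit year belongs to the 2000s.
    if [ "${#yy}" -eq 2 ]; then
        yy="20${yy}"
    fi

    # Rough plausibility check only; it does not know about month lengths.
    (( dd >= 1 && dd <= 31 && mm >= 1 && mm <= 12 )) || return 1

    printf '%04d-%02d-%02d\n' "$((10#$yy))" "$mm" "$dd"
}

find_dmy_date "Rechnung vom 17/12/22"
```

The sketch only looks at the first match in the text, so it is meant to show the pattern and nothing more.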

With the regex search, the log looks like this:

[CODE]                rename tag is: "#MoveDance"



  -----------------------------------------------------------------------------------

  | search for a valid date in ocr text:                                            |

  -----------------------------------------------------------------------------------


                run RegEx date search - search for date format: 1 (1 = dd mm [yy]yy; 2 = [yy]yy mm dd; 3 = mm dd [yy]yy)

                  Dates found: 2

                  check date (dd mm [yy]yy): 17/12/22

                  ➜ valid

                      day:  17

                      month:12

                      year: 2022


 [/CODE]


With the python search, it looks like this:

[CODE]                rename tag is: "#MoveDance"



  -----------------------------------------------------------------------------------

  | search for a valid date in ocr text:                                            |

  -----------------------------------------------------------------------------------


2023-06-28 15:36:35,273 - Date scanning started

2023-06-28 15:36:35,273 - Version: 1.04

2023-06-28 15:36:35,273 - Parameter minYear = 0

2023-06-28 15:36:35,273 - Parameter maxYear = 0

2023-06-28 15:36:35,273 - Parameter searchnearest = on

2023-06-28 15:36:35,273 - set searchnearest = on

2023-06-28 15:36:35,273 - Parameter fileWithTextFindings = /tmp/tmp.dTjOYDuFju/step2_tmp_1687959386//synOCR.txt

2023-06-28 15:36:35,273 - Parameter dateBlackLIst = 2021-07-02

2023-06-28 15:36:35,274 - start checking blacklist

2023-06-28 15:36:35,420 - end checking blacklist

2023-06-28 15:36:35,420 - Start searching for alphanumerical and numerical dates......

2023-06-28 15:36:35,448 - finish searching for alphanumerical and numerical dates......

2023-06-28 15:36:35,448 - found 0 dates

2023-06-28 15:36:35,448 - no dates found

2023-06-28 15:36:35,448 - found date None

2023-06-28 15:36:35,448 - Date scanning ended

                  Dates found: 1

                  check date ([yy]yy mm dd): None

./synOCR.sh: line 1215: printf: ERROR: invalid number

./synOCR.sh: line 1215: printf: at: invalid number

./synOCR.sh: line 1215: printf: line: invalid number

./synOCR.sh: line 1215: printf: 1215:: invalid number

./synOCR.sh: line 1215: printf: let: invalid number

./synOCR.sh: line 1215: printf: |: invalid number

./synOCR.sh: line 1215: printf: awk: invalid number

./synOCR.sh: line 1215: printf: -F'[./-]': invalid number

./synOCR.sh: line 1215: printf: $3}': invalid number

./synOCR.sh: line 1215: printf: |: invalid number

./synOCR.sh: line 1215: printf: grep: invalid number

./synOCR.sh: line 1215: printf: -o: invalid number

ERROR at line 1215: date_dd=$(printf '%02d' $(let "n=10#$(echo "${currentFoundDate}" | awk -F'[./-]' '{print $3}' | grep -o '[0-9]*')"; echo $((n)) ) )

./synOCR.sh: line 1218: printf: ERROR: invalid number

./synOCR.sh: line 1218: printf: at: invalid number

./synOCR.sh: line 1218: printf: line: invalid number

./synOCR.sh: line 1218: printf: 1218:: invalid number

./synOCR.sh: line 1218: printf: let: invalid number

./synOCR.sh: line 1218: printf: |: invalid number

./synOCR.sh: line 1218: printf: awk: invalid number

./synOCR.sh: line 1218: printf: -F'[./-]': invalid number

./synOCR.sh: line 1218: printf: $2}')": invalid number

ERROR at line 1218: date_mm=$(printf '%02d' $(let "n=10#$(echo "${currentFoundDate}" | awk -F'[./-]' '{print $2}')"; echo $((n)) ) )

ERROR at line 1240: date "+%d/%m/%Y" -d "${date_mm}"/"${date_dd}"/"${date_yy}" > /dev/null 2>&1

                  ➜ invalid format

                  Date not found in OCR text - use file date:

                  day:  28

                  month:06

                  year: 2023

[/CODE]


Unfortunately, the fallback to regex does not seem to work correctly either, so with the python search no date is found in the end and the file date is used instead.
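As far as I can tell from the log, the python search returns the literal string "None", and that string is then passed on to the splitting into day, month and year (lines 1215 and 1218), which is where the printf errors start. A minimal sketch of a guard in front of that step could look like this. The function name `split_ymd` is hypothetical and the code is not taken from synOCR.sh.

```bash
#!/usr/bin/env bash
# Hypothetical helper, NOT synOCR code: split a [yy]yy<sep>mm<sep>dd string
# into "year month day" with zero-padded month and day.
split_ymd() {
    local found=$1
    local re='^([0-9]{4}|[0-9]{2})[./-]([0-9]{1,2})[./-]([0-9]{1,2})$'
    # Reject "None", empty strings and anything else that is not a date
    # before any arithmetic or printf '%02d' is attempted.
    [[ $found =~ $re ]] || return 1
    # 10# forces base 10 so that "08" and "09" are not read as invalid octal.
    printf '%s %02d %02d\n' "${BASH_REMATCH[1]}" \
        "$((10#${BASH_REMATCH[2]}))" "$((10#${BASH_REMATCH[3]}))"
}

if parts=$(split_ymd "None"); then
    echo "parsed: ${parts}"
else
    echo "no usable date, fall back to regex search"
fi
```

With a check like this, a "None" result would never reach the date arithmetic, and the caller can decide cleanly whether to fall back.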


I have a suggested fix for this as well:

[CODE]bash-4.4# diff synOCR.sh synOCR.sh.org
1174,1180d1173
<         # enable full fallback to regex in case nothing was found with python
<         if [ "$founddatestr" = "None" ]; then
<             format=1
<             tmp_date_search_method="regex"
<         fi
<     fi
<
1183c1176
<     if [ "${tmp_date_search_method}" = "regex" ]; then
---
>     elif [ "${tmp_date_search_method}" = "regex" ]; then[/CODE]
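The important part of the change is turning the `elif` into a separate `if`: only then can the python branch switch the method to regex and have the regex branch run in the same pass. Here is a self-contained sketch of that control flow. `python_search`, `regex_search` and `find_date` are stand-ins I made up for the demo, not functions from synOCR.sh.

```bash
#!/usr/bin/env bash
# Stand-ins for the real searches (hypothetical names, NOT synOCR functions).
python_search() { echo "None"; }      # pretend the python search found nothing
regex_search()  { echo "17/12/22"; }  # pretend the regex search found a date

find_date() {
    local method=$1 found=""
    if [ "$method" = "python" ]; then
        found=$(python_search)
        # Nothing found: switch the method so the next block acts as fallback.
        if [ "$found" = "None" ]; then
            method="regex"
        fi
    fi
    # A separate "if" instead of "elif" is what allows falling through to here.
    if [ "$method" = "regex" ]; then
        found=$(regex_search)
    fi
    echo "$found"
}

find_date python
```

With `elif` in place of the second `if`, the regex branch would be skipped whenever the python branch had been entered, no matter what the python search returned.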


With this change the fallback works, see this log:

[CODE]                rename tag is: "#MoveDance"



  -----------------------------------------------------------------------------------

  | search for a valid date in ocr text:                                            |

  -----------------------------------------------------------------------------------


2023-06-28 16:06:12,929 - Date scanning started

2023-06-28 16:06:12,930 - Version: 1.04

2023-06-28 16:06:12,930 - Parameter minYear = 0

2023-06-28 16:06:12,930 - Parameter maxYear = 0

2023-06-28 16:06:12,930 - Parameter searchnearest = on

2023-06-28 16:06:12,930 - set searchnearest = on

2023-06-28 16:06:12,930 - Parameter fileWithTextFindings = /tmp/tmp.Uu1E7LvFIc/step2_tmp_1687961163//synOCR.txt

2023-06-28 16:06:12,930 - Parameter dateBlackLIst = 2021-07-02

2023-06-28 16:06:12,930 - start checking blacklist

2023-06-28 16:06:13,077 - end checking blacklist

2023-06-28 16:06:13,077 - Start searching for alphanumerical and numerical dates......

2023-06-28 16:06:13,105 - finish searching for alphanumerical and numerical dates......

2023-06-28 16:06:13,105 - found 0 dates

2023-06-28 16:06:13,105 - no dates found

2023-06-28 16:06:13,105 - found date None

2023-06-28 16:06:13,105 - Date scanning ended

                run RegEx date search - search for date format: 1 (1 = dd mm [yy]yy; 2 = [yy]yy mm dd; 3 = mm dd [yy]yy)

                  Dates found: 2

                  check date (dd mm [yy]yy): 17/12/22

                  ➜ valid

                      day:  17

                      month:12

                      year: 2022[/CODE]

