[USER=118999]@Gthorsten[/USER] I scanned about 40 very different documents to build and verify my YAML file. In the process I experimented quite a bit; among other things, I switched the date search from python to regex.
The date 21.09.2022 from the document I used above is found by the python date search.
However, I have a document with the date "17/12/22" that the python search does not find, but the regex search does.
With the regex search, the log looks like this:
[CODE] rename tag is: "#MoveDance"
-----------------------------------------------------------------------------------
| search for a valid date in ocr text: |
-----------------------------------------------------------------------------------
run RegEx date search - search for date format: 1 (1 = dd mm [yy]yy; 2 = [yy]yy mm dd; 3 = mm dd [yy]yy)
Dates found: 2
check date (dd mm [yy]yy): 17/12/22
➜ valid
day: 17
month:12
year: 2022
[/CODE]
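To illustrate why a regex search picks up both notations, here is a minimal sketch for format 1 (dd mm [yy]yy). The pattern and the helper name `find_dates` are my own assumptions for illustration, not the expression synOCR actually uses; the only thing taken from the log is the separator set `[./-]`.

```bash
#!/usr/bin/env bash
# Hypothetical helper, NOT synOCR's real pattern: prints every
# "dd mm [yy]yy" date found on stdin, one per line.
# Accepted separators are ".", "/" and "-" (as in the awk call in the log).
find_dates() {
    grep -oE '(0?[1-9]|[12][0-9]|3[01])[./-](0?[1-9]|1[0-2])[./-]([0-9]{4}|[0-9]{2})'
}

# Both the dotted 4-digit-year form and the slashed 2-digit-year form match.
printf '%s\n' 'Invoice 21.09.2022, due 17/12/22' | find_dates
```

The sketch does not check that both separators are the same or that the day exists in the given month; in synOCR that validation happens in a later step (the "check date" lines in the log).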
With the python search, the log looks like this:
[CODE] rename tag is: "#MoveDance"
-----------------------------------------------------------------------------------
| search for a valid date in ocr text: |
-----------------------------------------------------------------------------------
2023-06-28 15:36:35,273 - Date scanning started
2023-06-28 15:36:35,273 - Version: 1.04
2023-06-28 15:36:35,273 - Parameter minYear = 0
2023-06-28 15:36:35,273 - Parameter maxYear = 0
2023-06-28 15:36:35,273 - Parameter searchnearest = on
2023-06-28 15:36:35,273 - set searchnearest = on
2023-06-28 15:36:35,273 - Parameter fileWithTextFindings = /tmp/tmp.dTjOYDuFju/step2_tmp_1687959386//synOCR.txt
2023-06-28 15:36:35,273 - Parameter dateBlackLIst = 2021-07-02
2023-06-28 15:36:35,274 - start checking blacklist
2023-06-28 15:36:35,420 - end checking blacklist
2023-06-28 15:36:35,420 - Start searching for alphanumerical and numerical dates......
2023-06-28 15:36:35,448 - finish searching for alphanumerical and numerical dates......
2023-06-28 15:36:35,448 - found 0 dates
2023-06-28 15:36:35,448 - no dates found
2023-06-28 15:36:35,448 - found date None
2023-06-28 15:36:35,448 - Date scanning ended
Dates found: 1
check date ([yy]yy mm dd): None
./synOCR.sh: line 1215: printf: ERROR: invalid number
./synOCR.sh: line 1215: printf: at: invalid number
./synOCR.sh: line 1215: printf: line: invalid number
./synOCR.sh: line 1215: printf: 1215:: invalid number
./synOCR.sh: line 1215: printf: let: invalid number
./synOCR.sh: line 1215: printf: |: invalid number
./synOCR.sh: line 1215: printf: awk: invalid number
./synOCR.sh: line 1215: printf: -F'[./-]': invalid number
./synOCR.sh: line 1215: printf: $3}': invalid number
./synOCR.sh: line 1215: printf: |: invalid number
./synOCR.sh: line 1215: printf: grep: invalid number
./synOCR.sh: line 1215: printf: -o: invalid number
ERROR at line 1215: date_dd=$(printf '%02d' $(let "n=10#$(echo "${currentFoundDate}" | awk -F'[./-]' '{print $3}' | grep -o '[0-9]*')"; echo $((n)) ) )
./synOCR.sh: line 1218: printf: ERROR: invalid number
./synOCR.sh: line 1218: printf: at: invalid number
./synOCR.sh: line 1218: printf: line: invalid number
./synOCR.sh: line 1218: printf: 1218:: invalid number
./synOCR.sh: line 1218: printf: let: invalid number
./synOCR.sh: line 1218: printf: |: invalid number
./synOCR.sh: line 1218: printf: awk: invalid number
./synOCR.sh: line 1218: printf: -F'[./-]': invalid number
./synOCR.sh: line 1218: printf: $2}')": invalid number
ERROR at line 1218: date_mm=$(printf '%02d' $(let "n=10#$(echo "${currentFoundDate}" | awk -F'[./-]' '{print $2}')"; echo $((n)) ) )
ERROR at line 1240: date "+%d/%m/%Y" -d "${date_mm}"/"${date_dd}"/"${date_yy}" > /dev/null 2>&1
➜ invalid format
Date not found in OCR text - use file date:
day: 28
month:06
year: 2023
[/CODE]
Unfortunately, the fallback to regex does not seem to work correctly either, so with python no date is found in the end and the file date is used instead.
I have a proposed fix for this as well:
[CODE]bash-4.4# diff synOCR.sh synOCR.sh.org
1174,1180d1173
< # enable full fallback to regex in case nothing was found with python
< if [ "$founddatestr" = "None" ]; then
< format=1
< tmp_date_search_method="regex"
< fi
< fi
<
1183c1176
< if [ "${tmp_date_search_method}" = "regex" ]; then
---
> elif [ "${tmp_date_search_method}" = "regex" ]; then[/CODE]
With this change the fallback works, as this log shows:
[CODE] rename tag is: "#MoveDance"
-----------------------------------------------------------------------------------
| search for a valid date in ocr text: |
-----------------------------------------------------------------------------------
2023-06-28 16:06:12,929 - Date scanning started
2023-06-28 16:06:12,930 - Version: 1.04
2023-06-28 16:06:12,930 - Parameter minYear = 0
2023-06-28 16:06:12,930 - Parameter maxYear = 0
2023-06-28 16:06:12,930 - Parameter searchnearest = on
2023-06-28 16:06:12,930 - set searchnearest = on
2023-06-28 16:06:12,930 - Parameter fileWithTextFindings = /tmp/tmp.Uu1E7LvFIc/step2_tmp_1687961163//synOCR.txt
2023-06-28 16:06:12,930 - Parameter dateBlackLIst = 2021-07-02
2023-06-28 16:06:12,930 - start checking blacklist
2023-06-28 16:06:13,077 - end checking blacklist
2023-06-28 16:06:13,077 - Start searching for alphanumerical and numerical dates......
2023-06-28 16:06:13,105 - finish searching for alphanumerical and numerical dates......
2023-06-28 16:06:13,105 - found 0 dates
2023-06-28 16:06:13,105 - no dates found
2023-06-28 16:06:13,105 - found date None
2023-06-28 16:06:13,105 - Date scanning ended
run RegEx date search - search for date format: 1 (1 = dd mm [yy]yy; 2 = [yy]yy mm dd; 3 = mm dd [yy]yy)
Dates found: 2
check date (dd mm [yy]yy): 17/12/22
➜ valid
day: 17
month:12
year: 2022[/CODE]
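The control flow of the proposed fix can be sketched in isolation as follows. The variable name `founddatestr` is taken from the diff; the functions `python_date_search`, `regex_date_search` and `find_date` are hypothetical stand-ins that only mimic the results from the logs, so this is a simplified sketch of the idea and not the real synOCR.sh code.

```bash
#!/usr/bin/env bash
# Hypothetical stand-ins for the two search back ends (assumptions):
# the python search finds nothing, the regex search finds the date.
python_date_search() { echo "None"; }
regex_date_search()  { echo "17/12/22"; }

find_date() {
    local method="$1"
    local founddatestr=""

    if [ "${method}" = "python" ]; then
        founddatestr=$(python_date_search)
        # Full fallback: switch to regex if python found nothing.
        if [ "${founddatestr}" = "None" ]; then
            method="regex"
        fi
    fi

    # A separate `if` instead of `elif`, so this branch can still run
    # after the python branch has switched the method.
    if [ "${method}" = "regex" ]; then
        founddatestr=$(regex_date_search)
    fi

    echo "${founddatestr}"
}

find_date python
```

The key point is the second `if`: with `elif`, the regex branch is skipped whenever the python branch was entered, no matter what python returned.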