{"id":118,"date":"2007-10-27T16:46:00","date_gmt":"2007-10-27T16:46:00","guid":{"rendered":"http:\/\/catalin.francu.com\/blog\/?p=118"},"modified":"2007-10-27T16:46:00","modified_gmt":"2007-10-27T16:46:00","slug":"ocr-gtk-si-alte-acronime","status":"publish","type":"post","link":"https:\/\/catalin.francu.com\/blog\/2007\/10\/ocr-gtk-si-alte-acronime\/","title":{"rendered":"OCR, GTK and other acronyms"},"content":{"rendered":"<p>Here is the kind of thing that brings me professional joy.<\/p>\n<p>For a few days now I have been looking for a good OCR program (OCR is the generic name for a program that takes a scanned page and identifies the text printed on it by analyzing the pixels of each letter). Naturally, some OCR programs are better than others, judged by how many errors they make during recognition. An error can mean that the OCR reads a 1 (one) or an i (lowercase i) where an l (lowercase L) is printed, that it does not understand Romanian diacritics, or that it loses formatting information such as bold, italic or underline.<\/p>\n<p>Smart OCR programs can be trained on a particular font. That is, if the scanned document has 100 pages, you sit down and train the OCR on the first 10-20 pages, telling it &#8220;this is an a, this is a b&#8221;, letter by letter. It learns the patterns and makes far fewer errors on the remaining pages. The advantage of training is that the OCR learns to cope with the specific font and paper the document is printed on, where it would otherwise make systematic errors (for example, if the contrast is poor, <span>h<\/span> is easily mistaken for <span>li<\/span>).<\/p>\n<p>I already have an OCR program for Windows which, like all Windows software, is one big black box. 
If it does not work perfectly, you have no way to improve it. It does reasonably well with Romanian, but it has limitations. So I decided to look at some OCR programs for Linux and see what state they are in these days. The last time I tried them was in 2003, and it was grim: they were practically unusable. Now, however, I have discovered <a href=\"http:\/\/code.google.com\/p\/tesseract-ocr\/\">Tesseract<\/a>, an OCR engine that people say is good. So I set about training it on my document, and in the process discovered that Tesseract is devilishly hard to train. To train it on a page, you have to create a file in which you tell it &#8220;between pixels <span>x1<\/span> and <span>x2<\/span> and <span>y1<\/span> and <span>y2<\/span> there is a rectangle containing the letter <span>m<\/span>&#8221;, and you do this for every letter on the scanned page. Tesseract identifies these rectangles itself, but it often puts two letters into a single rectangle (for example, <span>m<\/span> instead of <span>ni<\/span>). To correct it, you have to edit a text file full of numbers and letters. I would have liked a visual tool where I could see the rectangles and drag them or split them in two with the mouse.<\/p>\n<p>And, as is the way in Linux, if you cannot find a program, you write it yourself \ud83d\ude42 The trouble is that, although I have been programming under Linux for 8 years, I have no idea how to build an application with a graphical interface (buttons, windows, menus). So I decided it was time to fix this gap, and I started learning <a href=\"http:\/\/www.gtk.org\/\">GTK<\/a>. 
GTK is a toolkit, with bindings available for just about any programming language, that makes it very easy to build graphical applications. For example, you create a main window and a button, and you specify what should happen when the user clicks that button. GTK underlies many Linux applications, such as GIMP and Firefox, as well as the GNOME desktop.<\/p>\n<p>For now I am at the stage where, in about 80 lines of code, my application can display an image and let you scroll left, right, up and down through it. \ud83d\ude42 It is fun to learn a new language when you have a well-written manual. All you have to do is copy pieces of the examples and everything seems to work. In time, you discover that you do not actually understand why the code works the way it does, and that is where it gets hard. It is fascinating how you set out with the attitude &#8220;I need to learn this language, just enough to get my job done with it&#8221; and then discover that the learning itself is fun and that it is worth understanding in detail what is going on.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Here is the kind of thing that brings me professional joy. For a few days now I have been looking for a good OCR program (OCR is the generic name for a program that takes a scanned page and identifies the text printed on it by analyzing the pixels of each letter). 
Naturally, some OCR programs are better than others, judged by how many errors they make during [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[1],"tags":[],"class_list":["post-118","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/catalin.francu.com\/blog\/wp-json\/wp\/v2\/posts\/118","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/catalin.francu.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/catalin.francu.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/catalin.francu.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/catalin.francu.com\/blog\/wp-json\/wp\/v2\/comments?post=118"}],"version-history":[{"count":0,"href":"https:\/\/catalin.francu.com\/blog\/wp-json\/wp\/v2\/posts\/118\/revisions"}],"wp:attachment":[{"href":"https:\/\/catalin.francu.com\/blog\/wp-json\/wp\/v2\/media?parent=118"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/catalin.francu.com\/blog\/wp-json\/wp\/v2\/categories?post=118"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/catalin.francu.com\/blog\/wp-json\/wp\/v2\/tags?post=118"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}