Digital case report forms generation from pathology reports by ARGO, automatic record generator for onco-hematology

Data collection

Overall, 332 paper-based histopathology reports were collected between 2014 and 2020 at the Pathology Unit of the IRCCS Istituto Tumori ‘Giovanni Paolo II’ in Bari, Italy (239 reports) and from six other Italian centers (93 reports): the Unit of Hematology, Azienda Ospedaliero-Universitaria Policlinico Umberto I in Rome; Hematology, AUSL/IRCCS of Reggio Emilia in Reggio Emilia; the Division of Hematology 1, AOU “Città della Salute e della Scienza di Torino” in Turin; the Division of Hematology, Azienda Ospedaliero-Universitaria Maggiore della Carità di Novara in Novara; the Department of Medicine, Section of Hematology, University of Verona in Verona; and the Division of Diagnostic Haematopathology, IRCCS European Institute of Oncology in Milan. The internal series included 106 DLBCL, 79 FL, and 54 MCL cases, whereas the external series comprised 49 DLBCL, 24 FL, and 20 MCL cases.

A unique ID code was assigned to each report. According to the diagnostic criteria for each lymphoma subtype, reports included IHC results obtained from LN, EN, BM, or PB specimens. Qualitative and quantitative information was reported for IHC markers including MYC, BCL2, BCL6, CD10, CD20, and Cyclin-D1. Some reports also included molecular data from FISH analysis, whereas others included either FISH results or the extent of tumor cell infiltration as an addendum. For DLBCL, the molecular classification according to the COO, estimated by the Hans algorithm, was also included [24]. The Ki-67 proliferation index was also reported as a quantitative value ranging from 5 to 100%.

The work was approved by the Institutional Review Board of the IRCCS Istituto Tumori “Giovanni Paolo II” hospital in Bari, Italy. All methods were carried out in accordance with relevant local regulations and after obtaining dedicated informed consent.

Automatic detection of relevant words in paper-based reports

This step of the workflow was aimed at automating the detection of the relevant words to be extracted from the text fields of paper-based reports. ARGO exploits OCR [25] and NLP [26] techniques to convert images of reports into text and to detect relevant words in the text based on an “ad-hoc” thesaurus.

The conversion from image to text was carried out in Tesseract OCR© (version 4.1.1-rc2-20-g01fb). To improve conversion performance, each pathology report was first converted from PDF to image through the Poppler library (version 0.26.5). Then, the image was converted to an 8-bit grayscale (gray levels from 0 to 255).

Image transformation was implemented in Python with OpenCV© software (version 4.2.0).
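The grayscale step can be illustrated without any imaging library. The sketch below assumes RGB input and the standard luminance weights (0.299, 0.587, 0.114) that OpenCV documents for its RGB-to-gray conversion. The function name `to_gray8` and the nested-list image layout are illustrative assumptions; the original pipeline relied on OpenCV rather than on hand-written code.

```python
def to_gray8(rgb_rows):
    """Convert rows of (R, G, B) tuples into rows of 8-bit gray levels (0-255)."""
    gray_rows = []
    for row in rgb_rows:
        gray_row = []
        for r, g, b in row:
            # Weighted sum of the three channels, rounded to the nearest level.
            level = round(0.299 * r + 0.587 * g + 0.114 * b)
            # Clamp so the result always fits in 8 bits.
            gray_row.append(max(0, min(255, level)))
        gray_rows.append(gray_row)
    return gray_rows


# A made-up 2x2 "page": white, black, pure red, pure blue.
page = [[(255, 255, 255), (0, 0, 0)],
        [(255, 0, 0), (0, 0, 255)]]
gray = to_gray8(page)
```

With OpenCV itself, the equivalent call on a BGR image array is `cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)`.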

In ARGO, NLP techniques were adopted to automatically extract the words relevant to the disease diagnosis, to be transferred into the digitalized eCRFs. To this end, a set of NLP regular expressions was applied to extract information about the diagnosis, date of the report, report ID, type of specimen, execution of BM biopsy, IHC, and FISH analyses, as well as quantitative and qualitative data on selected IHC markers (MYC, BCL2, BCL6, CD10, CD20, Cyclin-D1), COO subtypes, and the Ki-67 proliferation index (section “ARGO functions and NLP rules”).

The disease nomenclature was assigned based on the best match between the pattern of biomarkers detected in each report and a reference pattern, as reported in the “Hematopoietic and Lymphoid Neoplasm Coding Manual guidelines” from the “Surveillance, Epidemiology, and End Results (SEER) program” of the National Institutes of Health [27]. The final diagnosis nomenclature was referred to the ICD10 classification [23]. Communication between ARGO and the official SEER servers was handled flexibly via API.
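One simple way to picture the best-match step is to score each candidate diagnosis by the share of its reference markers that agree with the markers detected in the report. The sketch below is an assumption about how such a comparison could work, not the scoring used by ARGO or by SEER. The profiles in `REFERENCE_PROFILES` are simplified illustrations and not the SEER reference patterns, and all names are hypothetical.

```python
# Illustrative reference profiles only; NOT the SEER reference patterns.
REFERENCE_PROFILES = {
    "DLBCL": {"CD20": "+", "BCL6": "+", "Cyclin-D1": "-"},
    "FL": {"CD20": "+", "CD10": "+", "BCL2": "+", "BCL6": "+", "Cyclin-D1": "-"},
    "MCL": {"CD20": "+", "Cyclin-D1": "+", "CD10": "-"},
}


def match_score(detected, profile):
    """Fraction of the profile's markers whose status matches the detected one."""
    agreeing = sum(1 for marker, status in profile.items()
                   if detected.get(marker) == status)
    return agreeing / len(profile)


def best_match(detected, profiles=REFERENCE_PROFILES):
    """Return (label, score) for the profile with the highest agreement."""
    scores = {label: match_score(detected, p) for label, p in profiles.items()}
    label = max(scores, key=scores.get)
    return label, scores[label]


# Made-up marker statuses, as they might come out of the NLP step.
report_markers = {"CD20": "+", "Cyclin-D1": "+", "CD10": "-", "BCL2": "+"}
label, score = best_match(report_markers)
```

Markers missing from the report simply count as disagreement here; a production rule set would need to treat "not tested" differently from "negative".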

ARGO was developed in Flask© (version 1.1.2); the web server was an Oracle© Linux Server 7.8 with kernel 4.14.35–1902.303.5.3.el7uek.x86_64. We used MariaDB© 5.5.68 as the database. NLP algorithms were developed in Python 3.6.8. Translation from English to Italian was handled via the MyMemory© API tool (version 3.5.0). To increase the detectability of biomarkers in the reports, we also built three thesauri in Python with NLP regular expressions (Supplementary Appendix Source code S1 and Table S2). Despite the domain specificity of these thesauri, the ability to extract information by flexibly introducing a new thesaurus is a general feature of ARGO.
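A thesaurus of this kind can be thought of as a mapping from a canonical marker name to a regular expression that covers its spelling variants. The following minimal sketch assumes that shape; the entries in `THESAURUS` and the function `find_markers` are hypothetical and are not taken from the supplementary source code.

```python
import re

# Hypothetical thesaurus: canonical marker name -> regex for spelling variants.
THESAURUS = {
    "Ki-67": r"\bki[\s-]?67\b|\bmib[\s-]?1\b",
    "Cyclin-D1": r"\bcyclin[\s-]?d1\b|\bciclina[\s-]?d1\b",
    "BCL2": r"\bbcl[\s-]?2\b",
}


def find_markers(text, thesaurus=THESAURUS):
    """Return the canonical names of all thesaurus entries found in the text."""
    found = []
    for canonical, pattern in thesaurus.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(canonical)
    return found


# Made-up OCR output mixing Italian and English spellings.
markers = find_markers("Ciclina D1 positiva; Ki67 30%; bcl-2 +")

# Swapping in a different thesaurus needs no change to the function.
other = find_markers("CD20 +", {"CD20": r"\bcd[\s-]?20\b"})
```

Passing the thesaurus as an argument is what makes the extraction reusable for another domain: only the dictionary changes.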

ARGO functions and NLP rules

ARGO was developed around three functions. The main function included (1) the call to the function that recognizes the report template given as input, (2) the set of NLP expressions to identify each biomarker and diagnosis description, and (3) the call to the function that holds two API tokens, the first to retrieve data on biomarkers and diagnosis from the SEER database and the second, provided by the REDCap project ID, to allow automatic data entry. Supplementary Fig. S2A details the pseudocode to process a pathology report. ARGO embedded two main actions, namely (i) the recognition of the template from the header section, including the fields “BIOPSY DATE” and “ID NUMBER”, the patient demographic information (“NAME”, “SURNAME”, “DATE OF BIRTH”, “PLACE OF BIRTH”, “SEX”, and “SSN” [Social Security Number]), and the “SPECIMEN TYPE”, and (ii) the recognition of the “IHC MARKERS” (“POSITIVITY/NEGATIVITY” or “QUANTITY”) from the biological samples and of the fields “FISH”, “DIAGNOSIS”, and “CELL OF ORIGIN” from the disease section. Supplementary Fig. S2B shows an example of NLP input from the internal series. The regular expressions used to automatically recognize the header section of internal reports are reported in Table 4. Those for the external reports are detailed in Supplementary Table S3.

Table 4 Set of NLP regular expressions used to recognize the header section of the internal reports.
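To give a feel for header recognition, the sketch below applies one regular expression per header field to OCR text. It assumes a "FIELD: value" layout with one field per line, which may differ from the real templates. The patterns in `HEADER_PATTERNS`, the function `parse_header`, and the sample text are invented for illustration and are not the expressions listed in Table 4.

```python
import re

# Hypothetical patterns; the real expressions are those in Table 4.
HEADER_PATTERNS = {
    "biopsy_date": r"BIOPSY DATE\s*:\s*(\d{2}/\d{2}/\d{4})",
    "id_number": r"ID NUMBER\s*:\s*(\S+)",
    "sex": r"SEX\s*:\s*([MF])\b",
    "specimen_type": r"SPECIMEN TYPE\s*:\s*(LN|EN|BM|PB)\b",
}


def parse_header(text):
    """Return a dict with one entry per header field (None when not found)."""
    fields = {}
    for name, pattern in HEADER_PATTERNS.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        fields[name] = match.group(1) if match else None
    return fields


# Made-up header text, not taken from any real report.
sample = """ID NUMBER: R-0001
BIOPSY DATE: 12/03/2018
SEX: F
SPECIMEN TYPE: LN"""
header = parse_header(sample)
```

Returning `None` for a missing field, rather than raising, lets the caller decide whether an incomplete header means an unknown template or a poor scan.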

Regarding the disease section, we identified the set of pathological description patterns according to the following four scenarios:

  1. description of qualitative markers by symbolic qualifiers in free-text form (e.g. “+” for positivity and “-” for negativity);

  2. description of qualitative markers by textual qualifiers in free-text form (e.g. “positive”, “reactive”, or “immunoreactive” for positivity and “negative” or “immunonegative” for negativity);

  3. description of both qualitative and quantitative markers by symbolic or textual qualifiers in bullet form;

  4. description of purely quantitative markers (such as Ki-67).

Table 5 shows three representative patterns of description with their related NLP pseudocode and expected results. The complete set of patterns is detailed in Supplementary Table S4.

Table 5 Representative sets of NLP rules for patterns 1.1, 3.2, and 4.1.
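Scenarios 1, 2, and 4 above lend themselves to a compact illustration with Python's `re` module. The sketch below is an assumption about how such rules might look, not a reproduction of the rules in Table 5 or Supplementary Table S4. The function names, the word lists beyond those quoted in the scenarios, and the sample sentence are hypothetical.

```python
import re

POSITIVE_WORDS = ("positive", "reactive", "immunoreactive")
NEGATIVE_WORDS = ("negative", "immunonegative")


def parse_qualitative(marker, text):
    """Return '+', '-' or None for a marker described by a symbol or a word."""
    name = r"(?<!\w)" + re.escape(marker)
    # Scenario 1: symbolic qualifier right after the marker, e.g. "CD20 +".
    symbolic = re.search(name + r"\s*([+-])(?!\w)", text, flags=re.IGNORECASE)
    if symbolic:
        return symbolic.group(1)
    # Scenario 2: textual qualifier, e.g. "CD10 negative".
    words = "|".join(POSITIVE_WORDS + NEGATIVE_WORDS)
    textual = re.search(name + r"\s*:?\s*(" + words + r")\b", text,
                        flags=re.IGNORECASE)
    if textual:
        return "+" if textual.group(1).lower() in POSITIVE_WORDS else "-"
    return None


def parse_ki67(text):
    """Scenario 4: return the Ki-67 index as an integer percentage, or None."""
    match = re.search(r"ki[\s-]?67\D{0,20}?(\d{1,3})\s*%", text,
                      flags=re.IGNORECASE)
    if not match:
        return None
    value = int(match.group(1))
    return value if 0 <= value <= 100 else None


# Made-up free-text description.
text = "CD20 +, CD10 negative, BCL6 immunoreactive, Ki-67: 40%"
results = {m: parse_qualitative(m, text) for m in ("CD20", "CD10", "BCL6", "BCL2")}
ki67 = parse_ki67(text)
```

A marker that is never mentioned yields `None`, which keeps "not reported" distinct from "negative" when the values are later transferred to the eCRF.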

Data-mapping and automatic population of eCRFs

For a systematic collection of the diagnostic variables in this study, we designed dedicated eCRFs on REDCap [17,18]. The eCRFs were adapted to the synoptic templates provided and approved by the CAP. We referred to the DLBCL, FL, and MCL templates [28,29]. The data-mapping between ARGO and the eCRFs was performed by providing the relevant data fields from the REDCap dictionary as a flexible input to the application (Supplementary Table S5). Finally, we used API technology for the automatic data entry and the final upload of the information of interest into the eCRFs.
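The mapping and upload steps can be sketched as a rename of extracted variables followed by the construction of a REDCap record-import request. The example below only builds the request body and sends nothing. The names in `FIELD_MAP`, the helper functions, and the token value are hypothetical, whereas `token`, `content`, `format`, `type`, and `data` are standard parameters of the REDCap record import API.

```python
import json
from urllib.parse import urlencode, parse_qs

# Hypothetical mapping from extracted variable names to REDCap field names;
# in ARGO this role is played by the REDCap dictionary (Supplementary Table S5).
FIELD_MAP = {
    "id_number": "record_id",
    "diagnosis": "diagnosis",
    "ki67": "ki67_percent",
    "cd20": "ihc_cd20",
}


def to_redcap_record(extracted, field_map=FIELD_MAP):
    """Rename extracted variables, dropping unmapped or empty ones."""
    return {field_map[k]: str(v) for k, v in extracted.items()
            if k in field_map and v is not None}


def build_import_request(token, records):
    """Build the form-encoded body of a REDCap record-import call."""
    payload = {
        "token": token,
        "content": "record",
        "format": "json",
        "type": "flat",
        "data": json.dumps(records),
    }
    return urlencode(payload).encode("utf-8")


# Made-up extraction result; "fish" is empty and "extra" is unmapped.
extracted = {"id_number": "R-0001", "ki67": 40, "cd20": "+",
             "fish": None, "extra": 1}
record = to_redcap_record(extracted)
body = build_import_request("DUMMY_TOKEN", [record])
decoded = parse_qs(body.decode("utf-8"))
```

Sending the request would be an HTTP POST of `body` to the site-specific REDCap API URL, for instance with `urllib.request.urlopen(url, data=body)`.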

Validation metrics

ARGO performance, defined as the level of consistency between the data included in the original pathology reports and those automatically transferred into the eCRFs, was assessed in terms of accuracy, precision, recall, and F1 score [30]. To calculate each measure, we defined the cases as follows: (1) true-positive: cases in which ARGO correctly detected the expected variables; (2) false-positive: cases in which ARGO detected variables that were not present in the original report; (3) true-negative: cases in which ARGO did not detect a variable that was not present in the original report; and (4) false-negative: cases in which ARGO failed to detect a variable present in the original report.
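Given the four counts defined above, the metrics follow from their standard definitions. The sketch below uses made-up counts chosen only to show the arithmetic; they are not results from this study, and the function name is illustrative.

```python
def validation_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from the four confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts for one data field.
m = validation_metrics(tp=90, fp=5, tn=3, fn=2)
```

The guards against empty denominators matter for data fields that never occur in a series, where precision and recall would otherwise be undefined.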

Results for each data field of the internal and external series were statistically compared by a chi-square test.
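For a single data field, such a comparison reduces to a 2x2 table (series by correct or incorrect detection). The sketch below computes the Pearson chi-square statistic without continuity correction and its p-value for one degree of freedom. Whether the original analysis applied a correction is not stated, so this is an assumption, and the counts are invented for illustration.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) on [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up counts: rows are the two series, columns are correct / incorrect.
stat, p = chi_square_2x2(40, 10, 30, 20)
```

For tables with small expected counts, an exact test would usually be preferred over this approximation.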
