Ergobyte Success Story: Extracting Semi-Structured Data Using Large Language Models (LLMs)
We are happy to present the successful collaboration between Ergobyte and smartHEALTH. Through this partnership, Ergobyte addressed its digital challenges by exploring approaches based on artificial intelligence and large language models (LLMs) to extract semi-structured data more effectively and efficiently.
Key Information
Ergobyte develops and commercialises software solutions that empower health professionals by assisting them with clinical and pharmaceutical information at the point of care. Its vision is to foster eHealth innovation in order to support decision making, lower costs, increase automation and, as a result, make healthcare simpler.
- Technologies Used: Large language models (LLMs)
- Collaboration Period: October 2024 – May 2025
- Services Provided: IT Services and IT Consulting

The Challenge
One of the most reliable and essential sources of pharmaceutical information for Ergobyte’s services is the Summary of Product Characteristics (SPC), as published by the European Medicines Agency (EMA). SPCs are regulatory documents that accompany every medicinal product approved in the European Union and contain critical information regarding a drug’s composition, indications, contraindications, side effects, pharmacodynamics, and pharmacokinetics.
However, SPCs are only available in PDF format, which makes them extremely difficult to process programmatically or extract data from automatically. Their unstructured and inconsistent layout adds further complexity, as formatting and content organization can vary significantly across products and manufacturers.
Ergobyte needed to convert the valuable content of these SPCs into structured formats so that the data could be integrated seamlessly into the Galinos database. This would enable more efficient search, linking, and analysis within its platform.
Until now, this process was largely manual, time-consuming, and prone to human error. Given the growing volume of SPCs and the increasing demand for up-to-date and accurate drug information, automating this pipeline became a pressing challenge. To address it, Ergobyte turned to smartHEALTH EDIH and the eHealth Lab of INAB | CERTH to explore approaches based on artificial intelligence and large language models (LLMs) for extracting semi-structured data more effectively and efficiently.
The Solutions
Initially, classic natural language processing (NLP) methods were explored to extract structured information from SPC documents. However, these approaches yielded limited results due to the complex and heterogeneous layout of the documents—particularly the presence of intricate tables containing merged cells, special characters, and footnotes that were not accessible as regular text. In many cases, the tables lacked a consistent semantic structure, making it difficult to accurately capture relationships between entities such as active substances, dosage forms, and strengths.
This led to the exploration of LLMs as a more promising approach. The smartHEALTH EDIH, through the eHealth Lab of INAB | CERTH, collaborated with Ergobyte to assess whether advanced LLMs could semantically interpret and convert semi-structured SPC content—especially tables and enumerations—into machine-readable data.
The evaluation included state-of-the-art models such as OpenAI’s ChatGPT and Google DeepMind’s Gemini, both of which support multimodal input (text and images). Their ability to process complex table layouts and extract relevant content was tested using carefully designed prompts and real SPC examples.
The Implementation
The evaluation was carried out by the research team of the eHealth Lab at INAB | CERTH, in collaboration with Ergobyte, with the objective of identifying the most suitable technological solution for the semi-automated extraction of critical pharmaceutical information from SPC documents.
During the testing phase, the models were prompted with screenshots of SPC PDFs accompanied by carefully phrased questions. One such example was: “Can you provide a list of adverse event terms along with their corresponding MedDRA codes and frequencies?” The goal was to assess the models’ ability to interpret visual and semi-structured content and return it in a structured, usable format.
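To make "a structured, usable format" concrete, the following is a minimal sketch of how a reply to the example question could be parsed and checked. It assumes the model was asked to answer with a JSON array; the field names (term, meddra_code, frequency), the AdverseEvent class, and the sample reply are illustrative assumptions rather than the format used in the evaluation, and the codes shown are placeholders, not real MedDRA codes.

```python
import json
from dataclasses import dataclass

# Frequency categories conventionally used in SPC adverse-reaction tables.
FREQUENCIES = {"very common", "common", "uncommon", "rare", "very rare", "not known"}


@dataclass(frozen=True)
class AdverseEvent:
    term: str
    meddra_code: str
    frequency: str


def parse_adverse_events(reply: str) -> list[AdverseEvent]:
    """Parse a model reply (assumed to be a JSON array) into validated records."""
    events = []
    for row in json.loads(reply):
        code = str(row["meddra_code"]).strip()
        frequency = str(row["frequency"]).strip().lower()
        # MedDRA codes are eight-digit numbers; reject anything else.
        if not (code.isdigit() and len(code) == 8):
            raise ValueError(f"unexpected MedDRA code: {code!r}")
        if frequency not in FREQUENCIES:
            raise ValueError(f"unexpected frequency: {frequency!r}")
        events.append(AdverseEvent(row["term"].strip(), code, frequency))
    return events


# Hypothetical reply; the codes are placeholders, not real MedDRA codes.
sample_reply = """[
  {"term": "Headache", "meddra_code": "10000001", "frequency": "Common"},
  {"term": "Nausea", "meddra_code": "10000002", "frequency": "Very common"}
]"""

for event in parse_adverse_events(sample_reply):
    print(event)
```

Rejecting malformed rows rather than silently accepting them keeps a human reviewer in the loop, which fits the semi-automated nature of the extraction described here.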
Among the models evaluated, Gemini demonstrated the most accurate and consistent performance, particularly in interpreting complex tables and extracting relevant information in a way that supports integration into structured environments. Its ability to process visual inputs combined with natural language queries was especially valuable for the use case at hand.
Based on these findings, the two partners proceeded with the design and development of a functional prototype. The tool allows end users, including those without technical expertise, to upload screenshots from SPC PDFs and receive the information they contain (e.g., adverse reactions, contraindications, frequencies) in structured form. The application uses the LLM’s API to communicate with the model and return the processed data.
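The flow of such a tool can be sketched roughly as follows. This is not the prototype's actual code: the payload layout, the prompt wording, and the function names (build_request, strip_fence, extract, fake_send) are assumptions made for illustration, and the model call is deliberately left as an injected function so that no particular provider API is implied.

```python
import base64
import json
from typing import Callable

PROMPT = (
    "Extract the adverse reactions shown in this screenshot. "
    "Reply with a JSON array of objects with the keys "
    '"term", "meddra_code" and "frequency".'
)

FENCE = "`" * 3  # models often wrap JSON replies in a Markdown code fence


def build_request(image_bytes: bytes, prompt: str = PROMPT) -> dict:
    """Package a screenshot and a prompt into a provider-neutral payload."""
    return {
        "prompt": prompt,
        "image_base64": base64.b64encode(image_bytes).decode("ascii"),
    }


def strip_fence(reply: str) -> str:
    """Remove a surrounding Markdown code fence, if the reply has one."""
    text = reply.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        text = text[len(FENCE):-len(FENCE)]
        # Drop an optional language tag such as "json" on the first line.
        first_line, _, rest = text.partition("\n")
        if first_line.strip().isalpha() or not first_line.strip():
            text = rest
    return text.strip()


def extract(image_bytes: bytes, send: Callable[[dict], str]) -> list[dict]:
    """Send one screenshot through `send` and return the parsed rows."""
    reply = send(build_request(image_bytes))
    rows = json.loads(strip_fence(reply))
    if not isinstance(rows, list):
        raise ValueError("expected a JSON array of rows")
    return rows


def fake_send(request: dict) -> str:
    """Stand-in for a real model call; returns a canned, fenced reply."""
    assert base64.b64decode(request["image_base64"])  # payload decodes cleanly
    row = '[{"term": "Headache", "meddra_code": "10000001", "frequency": "common"}]'
    return FENCE + "json\n" + row + "\n" + FENCE


print(extract(b"\x89PNG placeholder bytes", fake_send))
```

In a real deployment, fake_send would be replaced by a function that forwards the payload to the chosen model's API. Keeping that call behind a single function makes it straightforward to switch models, which matters when several candidates are being compared.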
The Benefits
As a result of the collaboration between the eHealth Lab of INAB | CERTH and Ergobyte, a prototype was developed that leverages LLMs for the semi-automated extraction of critical pharmaceutical information from SPCs. This functional prototype allows end users to convert screenshots of semi-structured documents into structured data, significantly accelerating the integration of that data into the company’s database. At the same time, we supported the definition of technical specifications for the semantic representation of the extracted data, aiming for its inclusion in Knowledge Graphs and alignment with the FAIR principles. Ergobyte’s participation in the project enhanced its research activities through access to the advanced know-how transfer services provided by the smartHEALTH hub, facilitating the experimental integration of cutting-edge AI technologies. Furthermore, the acquired expertise and the tools developed lay the foundation for the technological advancement of the company’s products and services.
In summary, the main benefits include:
- Development of a prototype for extracting semi-structured pharmaceutical information into structured data.
- Definition of specifications for semantic representation of the information.
- Strengthening of the company’s research capacity through access to advanced knowledge and smartHEALTH support services.
- Establishment of a foundation for new value-added services and participation in highly specialized research projects.
Lessons Learned
Throughout the collaboration, the importance of consistent and meaningful communication between teams became clear. Regular meetings enabled the provider’s team to develop a deep understanding of the company’s needs—not only what was required, but also why it was important and how it would integrate into the company’s daily operations.
Equally important was the early and active involvement of real end users in the design process. Their participation ensured that the technical solutions were aligned with actual workflows and user expectations, ultimately enhancing adoption and usability.
During the planning phase, it also proved valuable to take into account future steps and possible extensions of the solution. Anticipating technical requirements for scalability, interoperability, or integration with other systems early on supported the long-term sustainability of the results.
Finally, a key lesson learned was the importance of remaining open-minded and flexible when it comes to technical approaches. Although the initial plan involved using traditional natural language processing techniques to extract information from SPCs, this approach did not deliver the expected results. Promptly shifting to multimodal LLMs and adapting the methodology accordingly was a successful and impactful decision.