Current Bachelor Thesis Topics
Bachelor Topics SS 2025
1. Text Mining and Machine Learning
Supervisor: Johann Mitlöhner
Text mining aims to turn written natural language into structured data that allow for types of analysis which are hard or impossible on the text itself; machine learning aims to automate this process using a variety of adaptive methods, such as artificial neural nets that learn from training data. Typical goals of text mining are classification, sentiment detection, and other types of information extraction, e.g. named entity recognition (identifying people, places, organizations) and relation extraction (e.g. locations of organizations).
Connectionist methods, and deep learning in particular, have attracted much attention and success recently; these methods tend to work well on large training datasets, which in turn demand ample computing power. Our institute has recently acquired high-performance GPU units which are available for student use in thesis projects. It is highly recommended to use a framework such as PyTorch or TensorFlow/Keras for developing your deep learning application; the changes required to go from CPU to GPU computing will then be minimal. This means that you can start developing on your PC or notebook, or on the department's Jupyter notebook server, with a small subset of the training data; when you later transition to the GPU server, the added performance will make larger datasets feasible.
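To illustrate how small the CPU-to-GPU change is, here is a minimal, hypothetical PyTorch sketch (vocabulary size, dimensions, and data are placeholders, not from any particular dataset); selecting the device is essentially the only line that differs between the two settings:

```python
import torch
import torch.nn as nn

# Pick the GPU if one is available, otherwise fall back to the CPU;
# the rest of the code is identical in both cases.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A tiny text classifier over averaged word embeddings.
model = nn.Sequential(
    nn.EmbeddingBag(num_embeddings=20000, embedding_dim=64),
    nn.Linear(64, 2),  # e.g., positive/negative sentiment
).to(device)

tokens = torch.randint(0, 20000, (32, 100)).to(device)  # dummy batch of token ids
logits = model(tokens)  # the forward pass runs on CPU or GPU alike
print(logits.shape)  # torch.Size([32, 2])
```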
On text mining, see e.g.: Minqing Hu, Bing Liu: Mining and summarizing customer reviews. KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168-177, ACM, 2004.
For a more recent overview, see e.g.: Percha B. Modern Clinical Text Mining: A Guide and Review. Annu Rev Biomed Data Sci. 2021 Jul 20;4:165-187. doi: 10.1146/annurev-biodatasci-030421-030931. Epub 2021 May 26. PMID: 34465177.
Datasets can be found e.g. on Hugging Face and Kaggle.
keywords: artificial neural networks, machine learning, text mining
2. Visualizing Data in Virtual and Augmented Reality
Supervisor: Johann Mitlöhner
How can AR and VR be used to improve the exploration of data? Developing new methods for exploring and analyzing data in virtual and augmented reality presents many opportunities and challenges, both in terms of software development and design inspiration. Hardware options range from Google Cardboard to more sophisticated and expensive devices such as the Rift and Quest. Taking part in this challenge demands programming skills as well as creativity. The student will develop a basic VR or AR application for exploring a specific type of (open) data. The use of a platform-independent kit such as A-Frame is essential, as the application will be compared in a small user study to its non-VR version in order to identify advantages and disadvantages of the visualization method implemented. Details will be discussed with the supervisor.
Some References:
Butcher, Peter WS, and Panagiotis D. Ritsos. "Building Immersive Data Visualizations for the Web." Proceedings of International Conference on Cyberworlds (CW'17), Chester, UK. 2017.
Teo, Theophilus, et al. "Data fragment: Virtual reality for viewing and querying large image sets." Virtual Reality (VR), 2017 IEEE. IEEE, 2017.
Millais, Patrick, Simon L. Jones, and Ryan Kelly. "Exploring Data in Virtual Reality: Comparisons with 2D Data Visualizations." Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 2018.
Yu Shu, Yen-Zhang Huang, Shu-Hsuan Chang, and Mu-Yen Chen (2019). Do virtual reality head-mounted displays make a difference? a comparison of presence and self-efficacy between head-mounted displays and desktop computer-facilitated virtual environments. Virtual Reality, 23(4):437-446.
Korkut, E. H., and Surer, E. (2023). Visualization in virtual reality: a systematic review. Virtual Reality, 27(2), 1447-1480.
keywords: virtual reality, augmented reality, data visualization, data exploration
3. The Evolution of Digital Humanism
Supervisors: Jennifer-Marieclaire Sturlese, Marta Sabou
Abstract:
Technology shapes our world, and Digital Humanism focuses on developing digital technologies and policies rooted in human rights, democracy, inclusion, and diversity [1, 2]. In 2019, researchers from around the world gathered to write the Vienna Manifesto on Digital Humanism [3] which highlights the lack of responsibility in the development of technologies that deeply impact our lives [4, 5].
The aim of this bachelor thesis is to explore the current state of the art in research related to Digital Humanism. For this, a bibliometric analysis with Python is to be conducted. Academic databases (for example, Scopus, IEEE Xplore, ACM Digital Library) are parsed to identify relevant literature on the topic. Then, an XML file of the metadata (including abstracts, keywords, and publication information) will be extracted and loaded into a Python environment (Jupyter Notebook or Google Colab, for instance). Finally, a bibliometric analysis is to be conducted, aiming to identify patterns in published research dedicated to Digital Humanism. The results include visual representations of the findings (using packages such as matplotlib) along with tabular statistical output.
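As a rough illustration of the analysis step, the sketch below counts publications per year and plots them with matplotlib; it assumes a hypothetical XML export with one <record> element per publication containing a <year> element (actual Scopus or IEEE exports use different field names and would need to be adapted):

```python
import xml.etree.ElementTree as ET
from collections import Counter
import matplotlib.pyplot as plt

# Hypothetical export: one <record> per publication with a <year> child.
root = ET.parse("dh_metadata.xml").getroot()
years = Counter(int(r.findtext("year")) for r in root.iter("record"))

xs = sorted(years)
plt.bar(xs, [years[x] for x in xs])
plt.xlabel("Publication year")
plt.ylabel("Publications on Digital Humanism")
plt.tight_layout()
plt.savefig("publications_per_year.png")
```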
In order to receive this topic, you are required to demonstrate: knowledge of exploratory data analysis with Python, an interest in Digital Humanism, and a completed Course 5 of your specialization (Knowledge Management and Data Science are both welcome; the latter is preferred).
Keywords: Digital Humanism, Bibliometric Analysis
Sources:
[1] https://informatics.tuwien.ac.at/digital-humanism/
[2] Werthner, H., Ghezzi, C., Kramer, J., Nida-Rümelin, J., Nuseibeh, B., Prem, E., & Stanger, A. (2024). Introduction to Digital Humanism: A Textbook (p. 637). Springer Nature.
[3] Werthner, H. (2020). The Vienna manifesto on digital humanism. In Digital transformation and ethics (pp. 338-357). Ecowin.
[4] www.wu.ac.at/mobile-first/news/details/detail/unser-ziel-ist-eine-community-von-digitalen-humanistinnen
[5] Werthner, H., Prem, E., Lee, E. A., & Ghezzi, C. (2022). Perspectives on digital humanism (p. 342). Springer Nature.
4. Modeling Ethical Bias into Normative Semantic Web
Supervisors: Jennifer-Marieclaire Sturlese, Marta Sabou
Abstract:
Bias is often associated with negative outcomes, and rightfully so in many contexts: bias may lead to unfair outcomes, reinforcing inequalities and excluding marginalized groups [1]. In the context of developing a normative Semantic Web based on humanistic values, however, bias can play an effective role by guiding technology to integrate fairness, transparency, and democracy [2]. Instead of being harmful, an ethical bias may be used to create a more accountable Semantic Web, ensuring that humanistic values are embedded in technological decision-making, reflecting the foundations of the Digital Humanism initiative [3].
In the theoretical part of the thesis, you will conduct a literature review on the evolving role of bias in digital technology and link it to humanistic values, focusing specifically on inclusion and democracy. In the empirical part of your thesis, you will develop a conceptual model that shows to what extent Semantic Web technologies (for example, Linked Data, Wikidata, recommender systems) may act upon an ethical bias grounded in humanistic values, including inclusivity, fairness, and democracy. In your conceptual prototype, you discuss features that address this ethical bias through processing algorithms, data selection, and data representation. The aim of this thesis is to demonstrate how ethical bias can create a more inclusive, transparent, and fair Semantic Web, by modeling concrete solutions to the pressing issues related to it.
In order to receive this topic, you are required to demonstrate knowledge in: Semantic Web (for instance, Linked Data, Wikidata, Recommender Systems), Completed Courses 2, 3, 5 of the specialization Knowledge Management.
Keywords: Ethical Bias, Normative Technology, Semantic Web, Digital Humanism
Preliminary References:
[1] Hanna, M., Pantanowitz, L., Jackson, B., Palmer, O., Visweswaran, S., Pantanowitz, J., ... & Rashidi, H. (2024). Ethical and Bias Considerations in Artificial Intelligence (AI)/Machine Learning. Modern Pathology, 100686.
[2] Reyero Lobo, P., Daga, E., Alani, H., & Fernandez, M. (2023). Semantic Web technologies and bias in artificial intelligence: A systematic literature review. Semantic Web, 14(4), 745-770.
[3] Werthner, H., Ghezzi, C., Kramer, J., Nida-Rümelin, J., Nuseibeh, B., Prem, E., & Stanger, A. (2024). Introduction to Digital Humanism: A Textbook (p. 637). Springer Nature.
Further Reading:
S. Tsaneva, S. Vasic, and M. Sabou, “LLM-driven Ontology Evaluation: Verifying Ontology Restrictions with ChatGPT,” in The Semantic Web: ESWC Satellite Events, 2024, 2024.
G. B. Herwanto, F. J. Ekaputra, G. Quirchmayr, and M. A. Tjoa, “Towards a Holistic Privacy Requirements Engineering Process: Insights from a Systematic Literature Review,” IEEE Access, 2024.
5. An extended analysis of user requirements for explainable smart energy systems
Supervisors: Katrin Schreiberhuber, Marta Sabou
Keywords: user requirements, explainability, smart energy systems, statistical data analysis
Context: Smart energy systems have emerged as a promising solution for optimizing energy consumption, reducing costs, and minimizing environmental impacts. These systems leverage advanced technologies such as IoT sensors, data analytics, and automation to efficiently manage energy resources. However, for the successful adoption and acceptance of these systems, it is crucial to understand the requirements and concerns of the end users, experts, and technicians who interact with them. One critical aspect that needs investigation is the importance of explainability in smart energy systems, as it directly impacts user trust and decision-making.
Problem: The research problem revolves around comprehending user requirements for smart energy systems and evaluating the significance of explainability to different types of end users, based on the results of a user survey.
Goal/expected results of the thesis
The primary objective of this thesis is to perform a detailed analysis of a user survey that has already been conducted by a previous student. The analysis should be complemented by a literature review on the importance of user-centered explainability in smart (energy) systems. The outcomes should provide insights into how different user groups perceive explainability and how it influences their interaction with smart energy systems.
Potential Research Questions:
What are the specific needs and expectations of users when interacting with smart energy systems in real-world scenarios?
How critical is explainability in fostering user trust and acceptance of smart energy systems? Does the importance of explainability vary among different user groups?
How does a user's background affect their need for explainability or the types of explanations they prefer?
Methodology:
Literature Review: Investigate existing research on explainable systems, user-centered explanations, and the role of explainability in enhancing user acceptance and trust in smart systems.
Statistical Analysis: Conduct a comprehensive statistical analysis of the survey results to validate hypotheses related to the importance of explainability for different user groups; a minimal sketch of one such test follows below.
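For instance, whether the perceived importance of explainability differs between user groups could be checked with a chi-square test of independence; the sketch below is a minimal illustration and assumes column names that the actual survey export will likely not have:

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("survey_results.csv")  # one row per respondent (hypothetical file)
table = pd.crosstab(df["user_group"],   # e.g., end user / expert / technician
                    df["explainability_importance"])  # e.g., Likert rating 1-5

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.4f}")
# A small p-value would suggest that the importance of explainability
# differs between user groups.
```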
Required Skills:
Good understanding of statistical analysis methods and implementation tools (preferably R or Python)
Literature review skills, including the ability to critically analyse and synthesize existing research.
References:
O’Dwyer, Edward, Indranil Pan, Salvador Acha, and Nilay Shah. Smart Energy Systems for Sustainable Smart Cities: Current Developments, Trends and Future Directions. Applied Energy 237 (March 1, 2019): 581–97. https://doi.org/10.1016/j.apenergy.2019.01.024.
Maguire, M., Bevan, N. (2002). User Requirements Analysis. In: Hammond, J., Gross, T., Wesson, J. (eds) Usability. IFIP WCC TC13 2002. IFIP — The International Federation for Information Processing, vol 99. Springer, Boston, MA. doi.org/10.1007/978-0-387-35610-5_9
Jha, S. S., Mayer, S., & García, K. (2021, November). Poster: Towards explaining the effects of contextual influences on cyber-physical systems. In Proceedings of the 11th International Conference on the Internet of Things (pp. 203-206).
6. Mapping the Unseen: Creating a Data Catalogue for Restricted Datasets in Austria
Supervisors: Hannah Schuster, Amin Anjomshoaa
Background and Motivation:
Austria's public administration, as well as private institutions, hosts a diverse range of data sources relevant to research across multiple disciplines. However, many datasets are not openly accessible, making it difficult for researchers to identify available data, understand access requirements, and navigate bureaucratic hurdles. Unlike open data platforms, which provide direct access to datasets, non-open datasets with potential research impact often lack centralized documentation, requiring individual researchers to invest significant time and effort in discovering and obtaining data.
This thesis aims to address this gap by developing a data catalog that lists non-open datasets, specifies their locations and sovereignty requirements, and outlines the necessary steps for access. By conducting interviews with researchers and data providers, this study will explore the challenges associated with data discovery and develop a structured framework to improve accessibility; a second part involves devising processes by which such a data catalogue could be sustained. A recent study on the data ecosystem and strategy in Austria (Leo et al., 2024) may provide starting points for identifying potential stakeholders to approach.
Research Question:
How can a data catalog be designed to effectively document non-open datasets in Austria, providing researchers with a comprehensive resource to locate and access relevant data?
Objectives:
Data Collection & Interviews:
Conduct interviews with researchers and data providers (mostly within public administration) to identify key datasets that are not publicly accessible.
Analyze the current challenges researchers face in finding and accessing these datasets.
Collect metadata about non-open datasets, including dataset owners, access requirements, and usage restrictions.
Catalogue Design & Development:
Design a structured data catalog that organizes information in a researcher-friendly manner.
Define a classification system to categorize datasets based on research domains, data sensitivity, and access conditions.
Explore potential technological solutions (e.g., knowledge bases, web-based platforms) to implement the catalog; a minimal metadata sketch follows this list.
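To make the idea of a structured, machine-readable catalogue entry concrete, here is a minimal sketch of one record using the DCAT vocabulary via rdflib; the dataset, publisher, and access conditions are invented placeholders rather than an agreed schema:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
ds = URIRef("https://example.org/catalog/hospital-admissions")  # placeholder URI
g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.title, Literal("Hospital admissions, Austria (restricted)")))
g.add((ds, DCTERMS.publisher, Literal("Hypothetical Federal Health Agency")))
g.add((ds, DCTERMS.accessRights, Literal("Restricted: research use upon application")))
g.add((ds, DCAT.contactPoint, URIRef("mailto:data-office@example.org")))

print(g.serialize(format="turtle"))  # one catalogue entry in Turtle
```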
References:
Bruno Oliveira, A., Duarte, A., & Oliveira, Ó. (2024). Towards a data catalog for data analytics. Procedia Computer Science, 237, 691-700. https://doi.org/10.1016/j.procs.2024.05.155
Conde, J., Pozo, A., Muñoz-Arcentales, A., Choque, J., & Alonso, Á. (2024). Fostering the integration of European open data into data spaces through high-quality metadata. arXiv. https://arxiv.org/pdf/2402.06693
Maali, F., Cyganiak, R., & Peristeras, V. (2010). Enabling interoperability of government data catalogues. In M. A. Wimmer, J. L. Chappelet, M. Janssen, & H. J. Scholl (Eds.), Electronic Government: EGOV 2010. Lecture Notes in Computer Science (Vol. 6228). Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14799-9_29
Sheridan, H., Dellureficio, A. J., Ratajeski, M. A., Mannheimer, S., & Wheeler, T. R. (2021). Data curation through catalogs: A repository-independent model for data discovery. Journal of eScience Librarianship, 10(3), 4. https://doi.org/10.7191/jeslib.2021.1203
Stillerman, J., Fredian, T., Greenwald, M., & Manduchi, G. (2016). Data catalog project—A browsable, searchable, metadata system. Fusion Engineering and Design, 112, 995-998. https://doi.org/10.1016/j.fusengdes.2016.05.004
Hannes Leo, Axel Polleres, Tobias Polzer. Umfeldanalyse zur österreichischen Datenstrategie: Konsolidierter Abschlussbericht zur Begleitforschung (Studie im Auftrag des BMF), 2024. https://research.wu.ac.at/de/publications/umfeldanalyse-zur-%C3%B6sterreichischen-datenstrategie-konsolidierter-
7. Explanation Interface Design for the SENSE Explainability Framework
Supervisors: Katrin Schreiberhuber, Fajar Ekaputra
Keywords: explainability, smart energy systems, user interface design
Context: Smart energy systems have emerged as a promising solution for optimizing energy consumption, reducing costs, and minimizing environmental impacts. These systems leverage advanced technologies such as IoT sensors, data analytics, and automation to efficiently manage energy resources. However, for the successful adoption and acceptance of these systems, it is crucial to provide explainability for these systems. The SENSE project works on this issue, providing an explainability framework for the users of such a system. In a final step to make this framework more user-centered, a user interface is needed to show the derived explanations in an understandable and intuitive way.
Problem: The research problem revolves around user interface design: explanations in a machine-readable form (i.e., JSON) need to be converted and shown in an intuitive user interface. It is important to focus on the usability and understandability of the explanations for users of this interface.
Goal/expected results of the thesis.
The student will be expected to design a user interface to show explanations about anomalies in a smart grid. We can provide a similar interface for inspiration and guidance.
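As a starting point, the sketch below turns a machine-readable explanation into a simple human-readable causal view; the JSON structure is a guess at what such an explanation might look like, not the actual SENSE format:

```python
import json

# Assumed shape: a causal tree where each node has a human-readable
# "event" and a list of contributing "causes".
explanation = json.loads("""
{"event": "Voltage anomaly at substation A",
 "causes": [{"event": "PV feed-in spike", "causes": []},
            {"event": "Tap changer failure", "causes": []}]}
""")

def render(node, depth=0):
    # Indentation mirrors causal depth: a minimal textual stand-in
    # for the interactive causal-tree view the thesis would design.
    prefix = "  " * depth + ("because: " if depth else "")
    print(prefix + node["event"])
    for cause in node["causes"]:
        render(cause, depth + 1)

render(explanation)
```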
Potential Research Questions:
What are the requirements to design a user-friendly interface for conveying explanations of system anomalies in the smart grid domain?
How can a causal tree be shown to users to provide an intuitive and understandable explanation?
How can the usability of an explanation user interface be evaluated in the context of the smart grid domain?
Methodology:
Literature Review: Investigate existing research on explanation design and interfaces, explainability of smart grids and cyber physical systems, user-centered explanations, and the role of explainability in smart systems.
User interface design: Investigate potential user interface design and come up with a prototype for showing explanations to a user.
Required Skills:
Programming skills for user interface design (in Python or Java)
Literature review skills, including the ability to critically analyse and synthesize existing research.
References:
O’Dwyer, Edward, Indranil Pan, Salvador Acha, and Nilay Shah. Smart Energy Systems for Sustainable Smart Cities: Current Developments, Trends and Future Directions. Applied Energy 237 (March 1, 2019): 581–97. https://doi.org/10.1016/j.apenergy.2019.01.024.
Jha, S. S., Mayer, S., & García, K. (2021, November). Poster: Towards explaining the effects of contextual influences on cyber-physical systems. In Proceedings of the 11th International Conference on the Internet of Things (pp. 203-206).
Stone, Debbie & Jarrett, Caroline & Woodroffe, Mark & Minocha, Shailey. (2014). User Interface Design and Evaluation.
8. Generating Domain-Specific Ontologies with Controlled Characteristics Using LLMs
Supervisors: Majlinda Llugiqi, Marta Sabou
Main idea: The goal of this research is to systematically generate a benchmark consisting of multiple ontologies within the same domain while maintaining specific structural characteristics (e.g., hierarchy depth, degree distribution). The process starts from a predefined list of seed concepts and structural characteristics, ensuring consistency across the generated ontologies while allowing controlled variations in their structural properties.
Motivation: Manually constructing ontologies requires significant human expertise and time, making it a costly and resource-intensive process. There is a growing need for systematic benchmarks to evaluate knowledge graphs (KGs) on different tasks. By automating ontology generation with controlled characteristics, we aim to create scalable and reproducible benchmarks, reducing expert involvement while maintaining semantic coherence.
Research Questions:
How can we systematically generate multiple ontologies within the same domain while controlling key structural characteristics based on a predefined concept seed list?
To what extent can large language models (LLMs) generate ontologies that align with human-created ontologies in terms of structural integrity and semantic fidelity?
How do different LLMs (e.g., GPT, DeepSeek, Llama) compare in their ability to generate ontologies, and which model demonstrates the highest structural consistency, semantic accuracy, and alignment with human-created ontologies?
Expected Tasks:
Read literature on ontology generation techniques, focusing on methods that allow control over structural characteristics and integration of predefined seed concepts.
Explore and evaluate different approaches, including LLMs, to generate ontologies with controlled characteristics (a minimal prompting sketch follows this list).
Evaluate the generated ontologies against a human-created ontology (e.g., using competency questions, human evaluation, statistical evaluation, ...).
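One way the generation step might look is sketched below, using the OpenAI chat API purely as a stand-in for any capable LLM; the seed concepts, constraint values, and model name are illustrative assumptions:

```python
from openai import OpenAI  # any chat-capable LLM client would do

seeds = ["Solar panel", "Inverter", "Battery", "Smart meter"]  # placeholder seeds
constraints = {"max_depth": 3, "min_subclasses": 2}            # placeholder controls

prompt = (
    "Generate an OWL ontology in Turtle syntax for the energy domain. "
    f"Use these seed concepts: {', '.join(seeds)}. "
    f"Keep the class hierarchy at most {constraints['max_depth']} levels deep "
    f"and give each non-leaf class at least {constraints['min_subclasses']} subclasses."
)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o", messages=[{"role": "user", "content": prompt}]
)
print(response.choices[0].message.content)  # candidate ontology in Turtle
```

The generated Turtle can then be parsed (e.g., with rdflib) to check whether the requested structural constraints actually hold.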
Prior-Knowledge and Skills:
Familiarity with ontology structures, knowledge graph representations, and their applications.
Familiarity with Large Language Models (e.g., GPT, DeepSeek, Llama)
Data analysis and evaluation skills.
References:
[1] Babaei Giglou, Hamed, Jennifer D’Souza, and Sören Auer. "LLMs4OL: Large language models for ontology learning." International Semantic Web Conference. Cham: Springer Nature Switzerland, 2023.
[2] García Fernández, J. "Ontology Engineering with Large Language Models." (2024).
[3] Brank, Janez, Marko Grobelnik, and Dunja Mladenic. "A survey of ontology evaluation techniques." Proceedings of the conference on data mining and data warehouses (SiKDD 2005). 2005.
Keywords: Ontology Generation, Large Language Models, Graph Structures
9. A Literature Review of Knowledge Graph Characteristics and Their Effect on Embedding Methods
Supervisors: Majlinda Llugiqi, Marta Sabou
Main idea: Investigate how different knowledge graph (KG) characteristics influence the performance of embedding algorithms across various tasks by conducting a structured literature review.
Motivation: KG embeddings are widely used in ML applications, but the effect of KG structural properties on embedding performance remains underexplored. A systematic review can provide insights into key factors influencing embedding effectiveness.
Research Questions:
How do different KG characteristics (e.g., density, hierarchy depth, degree distribution; see the sketch after these questions) impact embedding algorithms?
How do embedding performance trends vary across different tasks (e.g., link prediction, node classification)?
What methodologies are used to assess the influence of KG properties on embeddings?
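To make the characteristics in the first question concrete, here is a toy sketch of how such structural properties can be computed with networkx; the edge list is invented, and real experiments would load an actual KG:

```python
import networkx as nx

# Toy subclass hierarchy as a directed (child -> parent) edge list.
edges = [("person", "agent"), ("organization", "agent"),
         ("employee", "person"), ("manager", "employee")]
g = nx.DiGraph(edges)

density = nx.density(g)
max_degree = max(d for _, d in g.degree())
depth = nx.dag_longest_path_length(g)  # hierarchy depth of the acyclic graph
print(f"density={density:.3f}, max_degree={max_degree}, depth={depth}")
```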
Expected Tasks:
Conduct a systematic literature review to identify key papers and works on the impact of KG characteristics on embedding algorithms.
Read ~20 research papers (provided in a structured sheet) and extract relevant information such as the KG characteristics used, the downstream task, and the domain.
Analyze and summarize key findings and patterns in a structured format.
Prior-Knowledge and Skills:
Basic understanding of knowledge graphs, their structure and embeddings (preferred but not required).
Ability to systematically analyze and summarize academic papers.
Knowledge of data analysis techniques.
References:
[1] Lame, Guillaume. "Systematic literature reviews: An introduction." Proceedings of the Design Society: International Conference on Engineering Design. Vol. 1. No. 1. Cambridge University Press, 2019.
[2] Sardina, Jeffrey, John D. Kelleher, and Declan O'Sullivan. "A Survey on Knowledge Graph Structure and Knowledge Graph Embeddings." arXiv preprint arXiv:2412.10092 (2024).
[3] Rossi, Andrea, and Antonio Matinata. "Knowledge graph embeddings: Are relation-learning models learning relations?." EDBT/ICDT Workshops. Vol. 2578. 2020.
Keywords: Knowledge Graphs, Graph Structure and Properties, Knowledge Graph Embeddings
10. LLM-based verification of ontology restrictions
Supervisors: Stefani Tsaneva, Marta Sabou
Keywords: semantic web, ontology evaluation, large language models
Context: The knowledge corpus of AI systems typically relies on ontologies: conceptual knowledge structures representing a domain of interest. Low-quality ontologies that include incorrectly represented information, or controversial concepts modeled from a single viewpoint only, can lead to invalid or biased system outputs, thus negatively impacting the trustworthiness of the enabled AI system.
To avoid such cases, intense work has been performed in the last decades in the area of ontology evaluation leading to a variety of automatic techniques (e.g., for the detection of syntax errors, hierarchy cycles, logical inconsistencies) as well as the realization that several quality aspects (e.g., unintended use of modeling elements, incorrect domain knowledge, viewpoints) can only be tested by involving a human-in-the-loop (HiL).
One particular example is the verification of ontology restrictions defined with universal and existential quantifiers. The use of these quantifiers is not trivial and often leads to ontology defects. Currently, such defects can only be detected and repaired by involving a human curator. Although HiL approaches achieve high accuracy for this task, they are typically time-consuming and resource-intensive.
Recently, there have been impressive advancements in AI-powered chatbots, including ChatGPT, which has demonstrated remarkable abilities in language processing and response generation. Thus, the question arises of whether ChatGPT can support ontology verification tasks.
Problem: There is currently limited experimental investigation of how large language models, such as ChatGPT, can support the verification of ontology restrictions.
Goal/expected results of the thesis.
The thesis is expected to provide insights into the effectiveness of ChatGPT in ontology restriction verification. The results of the study will help us understand the advantages and limitations of ChatGPT compared to a traditional HiL approach.
Research Question: How effective is ChatGPT in verifying ontology restrictions when provided with enough instructions and context? Does the performance vary based on the modeled domain?
Methodology:
Experiment A: Replication of a previous LLM-based investigation of ontology restrictions [2,3] (a minimal prompt sketch follows this list)
Collection of additional ontology axioms
Experiment B: Differentiated replication of the first experiment with the new dataset
Comparison between the results obtained in Experiment A and B
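To give a flavour of the verification step, the sketch below asks a chat model about a classic restriction defect in the style of Rector et al. [1]; the axiom, prompt wording, and model name are illustrative assumptions, not the actual experiment setup:

```python
from openai import OpenAI  # stand-in for whichever chat model is used

axiom = "Pizza SubClassOf hasTopping only CheeseTopping"  # illustrative axiom
question = (
    "You are verifying OWL ontology restrictions. Given the axiom:\n"
    f"  {axiom}\n"
    "Does the universal quantifier ('only') capture the intended meaning "
    "'every pizza has at least one cheese topping'? Answer 'correct' or "
    "'defect', and explain the difference between 'only' and 'some'."
)

client = OpenAI()  # assumes OPENAI_API_KEY is set
reply = client.chat.completions.create(
    model="gpt-4o", messages=[{"role": "user", "content": question}]
)
print(reply.choices[0].message.content)  # to be compared against human verdicts
```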
Required Skills:
Good understanding of ontologies, especially ontology restrictions (completed K2 of SBWL Knowledge Management is a must!)
Some basic understanding of how large language models work
References:
[1] Rector, A. et al. (2004). OWL Pizzas: Practical Experience of Teaching OWL-DL: Common Errors & Common Patterns. In: Motta, E., Shadbolt, N.R., Stutt, A., Gibbins, N. (eds) Engineering Knowledge in the Age of the Semantic Web. EKAW 2004. Lecture Notes in Computer Science(), vol 3257. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30202-5_5
[2] S. Tsaneva, S. Vasic, and M. Sabou, “LLM-driven Ontology Evaluation: Verifying Ontology Restrictions with ChatGPT,” in The Semantic Web: ESWC Satellite Events, 2024, 2024. https://dqmlkg.github.io/assets/paper_1.pdf
[3] S. Vasic, "ChatGPT vs Human-in-the-loop: An approach towards automated verification of ontology restrictions", Bachelor Thesis, Vienna University of Economics and Business, 2023. https://drive.google.com/file/d/1mvKmTS3dcOe_nbZzn5FP1EDAaH6UgM8X/view
[4] B. P. Allen, P. T. Groth, Evaluating class membership relations in knowledge graphs using large language models, in: The Semantic Web: ESWC Satellite Events, 2024. https://2024.eswc-conferences.org/wp-content/uploads/2024/05/77770011.pdf
[5] N. Fathallah, A. Das, S. De Giorgis, A. Poltronieri, P. Haase, L. Kovriguina, NeOn-GPT: A large language model-powered pipeline for ontology learning, in: The Semantic Web: ESWC Satellite Events, 2024. https://2024.eswc-conferences.org/wp-content/uploads/2024/05/77770034.pdf
[6] C.-H. Chiang, H.-y. Lee, Can large language models be an alternative to human evaluations?, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.acl-long.870
11. Investigating community repairs in Wikidata
Supervisors: Nicolas Ferranti, Axel Polleres
Background
Knowledge graphs (KGs) are nowadays the main structured data representation model on the web, representing interconnected knowledge of different domains. There are several methods to model a KG. For instance, they can be extracted from semi-structured web data, like DBpedia, or edited collaboratively by a community, like Wikidata. Since there is no perfect method and knowledge about the world is constantly changing, regular updates in the KGs are required.
Knowledge graph refinement is the process of improving the quality and accuracy of a knowledge graph by adding, modifying, or deleting entities, relationships, or attributes based on new information or corrections. This process is crucial for ensuring that a knowledge graph reflects the current state of knowledge in a particular domain and that it can be used effectively for applications such as search, recommendation, and decision-making.
Wikidata has different constraint mechanisms to identify possibly inconsistent data; however, it relies exclusively on its user community to fix inconsistencies.
Overall, knowledge graph refinement is an important and ongoing process that is essential for ensuring that knowledge graphs remain up-to-date, accurate, and useful for a range of applications. As new information becomes available and our understanding of the world evolves, it will be necessary to continue refining and improving knowledge graphs to ensure that they reflect the current state of knowledge in a particular domain.
The goal of the thesis
The goal of this thesis is to extend an existing dataset of historical Wikidata repairs by including the user behind each repair, and to analyze user behavior. The student would have to: (1) work with the dataset and extract the users; (2) analyze the results with respect to the role of humans and bots, the specificities of different constraint types, and the domain knowledge involved.
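For step (1), the user behind an edit can be recovered from an item's revision history via the MediaWiki API; below is a minimal sketch (the entity Q42 and the name-based bot heuristic are placeholders for illustration):

```python
import requests

params = {
    "action": "query", "format": "json", "prop": "revisions",
    "titles": "Q42",  # placeholder: the item touched by a repair
    "rvprop": "user|timestamp|comment", "rvlimit": 50,
}
r = requests.get("https://www.wikidata.org/w/api.php", params=params, timeout=30)

for page in r.json()["query"]["pages"].values():
    for rev in page.get("revisions", []):
        # Crude heuristic: many (not all) bot accounts end in "bot".
        is_bot = rev["user"].lower().endswith("bot")
        print(rev["timestamp"], rev["user"], "(bot?)" if is_bot else "")
```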
Requirements
Pro-activity and self-organization. Programming skills.
Initial references
● To learn about RDF KGs: HOGAN, Aidan et al. Knowledge graphs. ACM Computing Surveys (CSUR), v. 54, n. 4, p. 1-37, 2021.
● To learn about Wikidata property constraints: Ferranti, N., De Souza, J. F., Ahmetaj, S., & Polleres, A. (2024). Formalizing and validating Wikidata’s property constraints using SHACL and SPARQL. Semantic Web, 15(6), 2333-2380.
● Data quality in Wikidata: Shenoy, K., Ilievski, F., Garijo, D., Schwabe, D., & Szekely, P. (2021). A Study of the Quality of Wikidata. arXiv preprint arXiv:2107.00156.
12. LLM-based Analysis of Scientific Charts
Supervisor: Amin Anjomshoaa
The charts included in scientific papers serve as a critical tool for conveying complex concepts, ideas, and research findings in a clear and accessible manner. They play a significant role by visually summarizing key results, trends, or relationships within a dataset, allowing readers to quickly grasp the essence of the research. In most cases, charts not only visualize raw data but also provide insights into the relationships between variables, statistical trends, or outcomes of experimental work. Because of this, the interpretation of charts is crucial for understanding the full scope of a scientific paper.
The goal of this research is to develop methods for extracting detailed metadata from the charts included in scientific papers. This metadata includes essential information such as the source of the data used to generate the chart, the specific variables that are represented, and the relevant descriptions or explanations of the chart provided in the text. This process will involve using Large Language Model (LLM) techniques for analyzing the text surrounding the charts, identifying the variables and their relationships, and linking the data presented in the charts to the broader context of the paper's narrative.
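A deliberately simple sketch of such an extraction step is shown below: it feeds a caption and surrounding text to a chat model and asks for structured metadata. The model name and JSON field names are stand-ins, and analyzing the chart image itself would additionally require a multimodal model:

```python
import json
from openai import OpenAI

caption = "Figure 3: Accuracy of models A and B over training epochs."  # dummy input
context = "As Figure 3 shows, model B overtakes model A after epoch 10."

prompt = (
    "Extract chart metadata as JSON with keys 'variables', 'chart_type', "
    "'data_source' (null if not stated), and 'summary'.\n"
    f"Caption: {caption}\nSurrounding text: {context}"
)
client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},  # ask for parseable JSON
)
metadata = json.loads(resp.choices[0].message.content)
print(metadata["variables"], "|", metadata["summary"])
```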
References:
[1] Mukhopadhyay, S., Qidwai, A., Garimella, A., Ramu, P., Gupta, V., & Roth, D. (2024). Unraveling the Truth: Do LLMs Really Understand Charts? A Deep Dive into Consistency and Robustness. arXiv preprint arXiv:2407.11229.
[2] Li, S., & Tajbakhsh, N. (2023). Scigraphqa: A large-scale synthetic multi-turn question-answering dataset for scientific graphs. arXiv preprint arXiv:2308.03349.
[3] Masry, A., Long, D. X., Tan, J. Q., Joty, S., & Hoque, E. (2022). Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244.
Keywords: Scientific Papers, LLM, Data Extraction
13. A Generic Data Space Architecture for Construction of Knowledge Graphs
Supervisor: Amin Anjomshoaa
The concept of a Data Space refers to a flexible, integrated framework that allows different types of data from diverse sources to be accessed, managed, and linked without needing to be fully integrated into a single schema or database. It enables data transactions between different data ecosystem parties based on the governance framework of that data space [1]. The Web of Data can therefore be seen as a realization of the data spaces concept [2] on a global scale, relying on a specific set of web standards. As such, it provides an incremental approach to data management, where the degree of data integration can evolve over time as needed, rather than requiring full integration from the outset.

In this context, the evolution of industrial data platforms (considered key enablers of overall industrial digitization) and personal data platforms (services that use personal data, subject to privacy preservation, for value creation) has continued to follow different paths. On the industrial data platform front, initiatives like the GAIA-X reference architecture [3] and the International Data Spaces Association (IDSA) [4] have emerged, both aiming to promote decentralized data sharing and data sovereignty. While they share this common goal, their scope and focus differ. GAIA-X is primarily centered on building a European data infrastructure, emphasizing data sovereignty, security, and interoperability within Europe. In contrast, IDSA takes a more global approach, focusing on developing standards and frameworks for secure data sharing across various industries and domains worldwide. Though both initiatives advance decentralized data ecosystems, their priorities and strategies reflect their distinct objectives.

On the personal data platform front, the Solid (Social Linked Data) project [5], initiated by Sir Tim Berners-Lee, is closely aligned with the concept of Data Spaces, particularly in the context of personal data management and the semantic web. Solid's main goal is to give individuals control over their own data by enabling the decentralized storage and management of personal information.

Both industrial and personal data platforms aim to create decentralized data ecosystems, but they differ in scope and approach. A platform such as Gaia-X is primarily focused on building a European data infrastructure, while a personal platform such as Solid follows a more general framework for decentralized data control and sharing. Gaia-X takes a top-down approach, driven by a consortium of organizations, while Solid is bottom-up, driven by individual users and developers.
This research aims to create a decentralized, collaborative data architecture that draws on both industrial and personal data space approaches. This hybrid system will enable the on-demand generation of knowledge graphs by combining data from multiple sources and empowering real-time collaboration between independent data spaces, based on semantic technologies.
References:
[1] Dataspaces Support Center, https://dssc.eu/space/Glossary/176554052/2.+Core+Concepts
[2] Halevy, A., Franklin, M., & Maier, D. (2006, June). Principles of dataspace systems. In Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (pp. 1-9).
[3] Braud, A., Fromentoux, G., Radier, B., & Le Grand, O. (2021). The road to European digital sovereignty with Gaia-X and IDSA. IEEE network, 35(2), 4-5.
[4] Otto, B., Lohmann, S., Steinbuss, S., Teuscher, A., Auer, S., Boehmer, M., ... & Woerner, H. (2018). Ids reference architecture model. industrial data space. version 2.0.
[5] Sambra, A. V., Mansour, E., Hawke, S., Zereba, M., Greco, N., Ghanem, A., ... & Berners-Lee, T. (2016). Solid: A platform for decentralized social applications based on linked data. MIT CSAIL & Qatar Computing Research Institute, Tech. Rep. https://solidproject.org/
Keywords: Data Spaces, Linked Data, Knowledge Graph
14. Readiness Assessment of Health Data Space in Austria
Supervisor: Amin Anjomshoaa
The European Health Data Space (EHDS) [1] represents the initial proposal of the European Data Strategy [2] to establish domain-specific European data spaces as the foundation for a European Health Union. This initiative is designed to tackle health-specific challenges related to electronic health data access and sharing. It aims to enable individuals to control their electronic health data while providing researchers, innovators, and policymakers with the means to utilize this data in a trusted and secure manner that preserves privacy.
The Austrian healthcare sector has significant potential for enhancing healthcare through digitalization and optimizing data utilization via health data spaces. Austrian health authorities have already taken the initial steps toward achieving this overarching objective [3, 4]. Nevertheless, the conceptual ambiguity and synonymous usage of the term in both research and industry pose significant challenges to achieving a precise conceptualization and meaningful utilization of data spaces [5, 6].
The primary goal of this research is to investigate the current status of data space implementations in the healthcare sector across Europe.
The study aims to deliver a comprehensive analysis, offering insights into the adoption and utilization of data spaces in the context of EHDS proposal. Additionally, the research seeks to establish a benchmark for assessing the readiness of health data spaces in Austria, considering various perspectives such as technical aspects and policy frameworks.
Keywords: Data Space, Health Industry, European Data Strategy
References:
[1] European Health Data Space,
https://health.ec.europa.eu/ehealth-digital-health-and-care/european-health-data-space_en
[2] European Commission. European Data Strategy (2020).
https://ec.europa.eu/info/strategy/priorities-2019-2024/europe-fit-digital-age/european-data-strategy_en
[3] Health Data Space in Österreich, https://www.sozialversicherung.at/cdscontent/load?contentid=10008.748742&version=1623841403
[4] Wiener eHealth Strategie, https://www.wien.gv.at/spezial/ehealth-strategie/files/e-health-strategie.pdf
[5] Hutterer, A., Krumay, B., & Mühlburger, M. (2023). What Constitutes a Dataspace? Conceptual Clarity beyond Technical Aspects.
[6] Hussein, R., Scherdel, L., Nicolet, F., & Martin-Sanchez, F. (2023). Towards the European Health Data Space (EHDS) ecosystem: A survey research on future health data scenarios. International Journal of Medical Informatics, 170, 104949.
16. Personas and Narratives in Innovation Processes - A Knowledge Perspective
Supervisors: Susanne Ahmad, Florian Kragulj
Innovation processes require organizations to overcome existing constraints. These limitations encompass not only technological and social boundaries but, most importantly, conventional approaches to knowledge transfer and future envisioning. Narratives have emerged as powerful tools that can effectively convey (tacit) knowledge that would otherwise be difficult to communicate through traditional means.
In this bachelor thesis, you will examine the concepts of personas as an approach to narration in innovation processes from a knowledge perspective. You will explore what characterizes effective personas, identify best practices for their implementation in innovation processes, and analyze their relationship with the role of narratives in knowledge management. After reviewing the literature (structured literature review) and researching industry best practices, you will conclude with recommendations for the effective design and utilization of personas to enhance knowledge transfer and innovation outcomes.
For further questions, please contact Susanne Ahmad (susanne.ahmad@wu.ac.at).
Keywords: Narrative knowledge management, personas, innovation, structured literature review
Initial References:
Linde, C. (2001). Narrative and social tacit knowledge. Journal of knowledge management, 5(2), 160-171.
Mikhlina, A., & Saukkonen, J. (2024). Using Personas in Human Capital Management: A Novel Concept Affected by Legal, Ethical and Technological Considerations. Electronic Journal of Knowledge Management, 22(2), 50-62.
Sergeeva, N., & Trifilova, A. (2018). The role of storytelling in the innovation process. Creativity and Innovation Management, 27(4), 489-498.
17. Developing a Conversational Interface for the CRISP Knowledge Graph (KG)
Supervisors: Hannah Schuster, Amin Anjomshoaa
Background and Motivation:
The CRISP Knowledge Graph (CRISP-KG) integrates data from a variety of open data sources to support crisis response and intervention efforts. By consolidating datasets from various sources with differing geographical granularities, our approach enables the integration of these datasets onto planar shapes, where the smallest unit is a one square kilometer tile, extending to larger administrative units like regions or states (Bundesländer). This allows for seamless connections and analysis across datasets with varying spatial resolutions. One challenge we face is making this data accessible to individuals who lack knowledge of CRISP-KG data structures and schema, or the programming expertise required to formulate and execute queries using structured query languages like SPARQL.
While we provide some example queries to assist non-programmers, it remains difficult for those without a technical background to compose complex queries on their own. As a result, we aim to develop an interface that will allow users to query the data using natural language, translating user input into SPARQL queries. This will empower a wider range of users to access and utilize the data in the CRISP-KG, facilitating better decision-making in crisis scenarios.
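A minimal sketch of the intended translation step is shown below; the schema excerpt, prefix, and model name are invented stand-ins, since the actual CRISP-KG vocabulary would be supplied as context in practice:

```python
from openai import OpenAI  # any instruction-following LLM could be swapped in

# Hypothetical excerpt of the CRISP-KG schema handed to the model as context.
SCHEMA_HINT = """
PREFIX crisp: <https://example.org/crisp/>
# crisp:Tile crisp:population ?n ; crisp:withinRegion ?region .
"""

def nl_to_sparql(question: str) -> str:
    prompt = (
        "Translate the question into a SPARQL query over this schema:\n"
        f"{SCHEMA_HINT}\nQuestion: {question}\nReturn only the query."
    )
    client = OpenAI()  # assumes OPENAI_API_KEY is set
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

print(nl_to_sparql("Which regions have tiles with more than 5000 inhabitants?"))
```

Generated queries should of course be validated, and their results checked, before being shown to users.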
Research Question:
How can LLMs improve access and facilitate communication between non-programmers and knowledge graphs in general, and more specifically with the CRISP-KG?
Objectives:
Investigate existing methods and tools for natural language processing (NLP) and query translation into SPARQL.
Design an intuitive user interface that enables users to input queries in natural language and receive corresponding SPARQL queries.
Evaluate the usability and effectiveness of the interface for non-technical users.
Implement the interface and integrate it with the CRISP-KG.
Explore the potential impact of the interface on data accessibility and decision-making in crisis response.
References:
Ngomo, A.-C. N., Bühmann, L., Unger, C., Lehmann, J., & Gerber, D. (2013). Sorry, I don't speak SPARQL: Translating SPARQL queries into natural language. In Proceedings of the 22nd International Conference on World Wide Web (pp. 977–988). https://doi.org/10.1145/2488388.2488473
Steinmetz, N., Arning, A.-K., & Sattler, K.-U. (2019). From natural language questions to SPARQL queries: A pattern-based approach. In BTW 2019 (pp. 289–308). Gesellschaft für Informatik. https://doi.org/10.18420/btw2019-18
18. European Train Travel Made Easy: Creating an Integrated Digital Rail Network – THE DATA
Supervisors: Shahrom Sohi, Axel Polleres
Travelling by train across Europe often involves navigating complex information systems. Different countries have different rail systems, and crossing borders can create challenges in the passenger experience. This is where innovative aggregator platforms come into play—such as Trainline, Omio, and even Uber—which have successfully integrated train information and ticketing services.
At the heart of these systems lies timetable data, which provides real-time departures and arrivals for trains across Europe. However, combining this data from various sources is a challenging task due to several factors:
Information retrieval: Accessing accurate and up-to-date timetable information from multiple operators.
Data quality control: Ensuring consistency and reliability in the data.
Regular updates: Keeping the timetable data current to reflect any changes or disruptions.
Disruptions in service: Handling delays, cancellations, or track alterations.
These challenges can lead to an unpleasant passenger experience, causing travelers to opt for alternative modes of transportation.
We are seeking students with a passion for railways and cross-border train travel who want to "get their hands dirty" with GTFS (General Transit Feed Specification), one of the many data standards used in scheduling.
Examples of how the thesis could be developed:
Connecting real time timetable updates with static timetable data.
This topic consists of an analysis of real-time timetable data formats (GTFS-RT, SIRI, proprietary formats) and the implementation of an integration adapter that maps the gathered data onto static timetable information.
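A minimal sketch of the core join such an adapter performs is shown below, matching a GTFS-RT TripUpdates feed against a static GTFS trips.txt on trip_id; the feed URL is a placeholder, and real-world matching is usually harder because identifiers differ between sources:

```python
import csv
import requests
from google.transit import gtfs_realtime_pb2  # pip install gtfs-realtime-bindings

# Static side: map trip_id -> route_id from a GTFS trips.txt file.
with open("gtfs/trips.txt", newline="", encoding="utf-8") as f:
    static_trips = {row["trip_id"]: row["route_id"] for row in csv.DictReader(f)}

# Real-time side: a GTFS-RT TripUpdates feed (placeholder URL).
feed = gtfs_realtime_pb2.FeedMessage()
feed.ParseFromString(
    requests.get("https://example.org/gtfs-rt/tripupdates", timeout=30).content
)

for entity in feed.entity:
    if entity.HasField("trip_update"):
        trip_id = entity.trip_update.trip.trip_id
        matched = trip_id in static_trips  # naive join on trip_id
        print(trip_id, "matched" if matched else "no static counterpart")
```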
Integrating open timetable data
The EU MMTIS regulation obliges all member states to provide rail timetables on national access points. The data covers the rail connections within each country and sometimes (parts of) international connections. This leads to duplicated trips that may overlap completely, partially, or not at all.
The aim of this topic would be to analyse the different data variants, and to develop and test different methods to match the broken-up journeys. This can involve classical heuristics or machine learning. Test datasets will be provided. The results will be utilized by OpenTimetable.eu.
Meet Your Supervisor: Shahrom Sohi (shahrom.hosseinisohi@pv.oebb.at)
Shahrom Sohi, a transport engineer and digital transportation enthusiast working with ÖBB, will be your main point of contact throughout this research project. This thesis offers a unique opportunity to collaborate with ÖBB and other European Mobility Digital players, contributing to the development of more efficient and user-friendly rail travel experiences across Europe.
References:
van Overhagen, L. (2021) ‘A design vision towards seamless European train journeys: Making the train the default option to travel within Europe’. Available at: https://repository.tudelft.nl/islandora/object/uuid%3A01a0e501-2e1a-469d-b1c3-03df7abae737 (Accessed: 13 May 2024).
CER Ticketing Roadmap (no date). Available at: https://www.cer.be/cer-eu-projects-initiatives/cer-ticketing-roadmap (Accessed: 13 May 2024).
European Union Agency for Railways (2024) Analysis of distribution rules in TAP, OSDM, and recent competition cases. Available at: https://www.era.europa.eu/content/analysis-distribution-rules-tap-osdm-and-recent-competition-cases.
19. European Train Travel Made Easy: Creating an Integrated Digital Rail Network – THE PASSENGER
Supervisors: Shahrom Sohi, Axel Polleres
Travelling by train across Europe often involves navigating complex information systems. Different countries have different rail systems, and crossing borders can create challenges in the passenger experience. This is where innovative aggregator platforms come into play—such as Trainline, Omio, and even Uber—which have successfully integrated train information and ticketing services.
We would like to investigate the role of UX in European train travel. The topic is vast; some areas to focus on:
Ticketing & Booking: How seamless is the process across different rail operators? Does OSDM, NeTEx, or another standard improve the experience?
Real-time Information & Accessibility: Availability of SIRI, GTFS-RT, or proprietary APIs. (see THE DATA – thesis proposal)
Multimodal Integration: How well do trains connect to urban transport?
Passenger Navigation & Station UX: Ease of use of wayfinding systems, multilingual support.
Onboard Experience: Wi-Fi, seating comfort, real-time updates, accessibility for persons with reduced mobility (PRM).
Surveys & Interviews: Ask passengers about pain points in ticketing, travel disruptions, and station UX.
Usability Testing: Testing apps/websites of rail operators (ÖBB, DB, SNCF, etc.) to find UX inconsistencies.
Data-Driven Analysis
APIs & Open Data: Use GTFS, NeTEx, or OSDM data to analyze the availability of route/ticketing information. (see THE DATA – Thesis)
Complaints & Reviews: Scrape social media or review platforms to identify UX trends.
Delays & Journey Disruptions: Compare planned vs. actual arrival times to analyze how UX is affected by disruptions.
Expert Interviews: UX Designers from rail companies (ÖBB, DB, SNCF). Policy experts on passenger rights & EU accessibility regulations. Academic researchers working on transport UX.
Meet Your Supervisor: Shahrom Sohi (shahrom.hosseinisohi@pv.oebb.at)
Shahrom Sohi, a transport engineer and digital transportation enthusiast working with ÖBB, will be your main point of contact throughout this research project. This thesis offers a unique opportunity to collaborate with ÖBB and other European Mobility Digital players, contributing to the development of more efficient and user-friendly rail travel experiences across Europe.
References
European Union Agency for Railways (2024) Analysis of distribution rules in TAP, OSDM, and recent competition cases. Available at: https://www.era.europa.eu/content/analysis-distribution-rules-tap-osdm-and-recent-competition-cases.
van Overhagen, L. (2021) ‘A design vision towards seamless European train journeys: Making the train the default option to travel within Europe’. Available at: https://repository.tudelft.nl/islandora/object/uuid%3A01a0e501-2e1a-469d-b1c3-03df7abae737 (Accessed: 13 May 2024).
CER Ticketing Roadmap (no date). Available at: https://www.cer.be/cer-eu-projects-initiatives/cer-ticketing-roadmap (Accessed: 13 May 2024).
20. European Train Travel Made Easy: Creating an Integrated Digital Rail Network – THE NETWORK
Supervisors: Shahrom Sohi, Axel Polleres
The European railway sector is undergoing a digital transformation aimed at improving data interoperability and infrastructure management. The European Union Agency for Railways (ERA) has mandated the use of Linked Data technologies for databases and registries, such as the Railway Infrastructure Register (RINF), to enhance data exchange and system harmonization.
Examples of what the thesis could cover:
Optimization of Rail Path Finding: based on the OpenRailRouting project, the path a train takes through a network should be determined as accurately as possible. The thesis should cover an analysis of the weak points of the current implementation and propose improvements that allow more accurate paths based on a limited set of input points or stations. A test setup will be provided.
Large Rail Network Visualization: creating a schematic rail network visualization that is partially geographically correct is hard. Different methods for creating such visualizations should be evaluated in order to visualize large networks (continent scale) efficiently.
The source data will be provided.
Building the European Infrastructure Knowledge Graph: connecting the different pieces of information from RINF (the Register of Infrastructure), showing where they differ from the data published directly by the European Railway Agency, and measuring the level of compliance at this point.
Meet Your Supervisor: Shahrom Sohi (shahrom.hosseinisohi@pv.oebb.at)
Shahrom Sohi, a transport engineer and digital transportation enthusiast working with ÖBB, will be your main point of contact throughout this research project. This thesis offers a unique opportunity to collaborate with ÖBB and other European Mobility Digital players, contributing to the development of more efficient and user-friendly rail travel experiences across Europe.
Reference:
Rojas, J.A. et al. (2021) ‘Leveraging Semantic Technologies for Digital Interoperability in the European Railway Domain’, in A. Hotho et al. (eds) The Semantic Web – ISWC 2021. Cham: Springer International Publishing, pp. 648–664. Available at: https://doi.org/10.1007/978-3-030-88361-4_38.
21. Analyzing Mobility Patterns and Utilization of Park & Ride Facilities in Austria
Supervisors: Shahrom Sohi, Axel Polleres
Intermodal transport systems, which incorporate multiple modes of transportation within a single journey, are essential for sustainable urban mobility (Riley et al., 2010). Among such systems, Park and Ride (P&R) facilities reduce car congestion, improve public transport accessibility, and enhance overall travel efficiency. These facilities, strategically positioned near railway stations or bus stops, serve as key nodes for commuters transitioning from private vehicles to public transport (Litman, 2011; Pitsiava-Latinopoulou & Iordanopoulos, 2012). ÖBB INFRA operates numerous P&R stations across Austria, facilitating seamless multimodal trips.
Research Objectives:
Usage Patterns: How do commuters utilize P&R facilities in Austria?
Accessibility and Travel Behavior: What are the key factors influencing mode choice at these locations?
Forecasting Commuter Flows: How can mobility data be leveraged to predict where people travel after using P&R facilities?
Optimization Strategies: How can ÖBB INFRA enhance the efficiency of P&R facilities based on mobility insights?
This research could employ a mixed-methods approach combining:
Quantitative Data Analysis: Processing and visualizing mobility datasets from ÖBB INFRA to identify peak usage times, parking turnover rates, and travel flows (a minimal sketch follows this list).
Survey Data Collection: Conducting on-site commuter surveys to understand behavioral factors affecting P&R usage.
GIS & Spatial Analysis: Mapping station accessibility and evaluating the relationship between facility location and utilization rates.
Predictive Modeling: Utilizing historical data to develop forecasting models for commuter flows and demand prediction.
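As a flavour of the quantitative step, the sketch below derives the busiest entry hour per facility from a hypothetical occupancy log; the column names and file layout are invented, and the actual ÖBB INFRA data will differ:

```python
import pandas as pd

# Hypothetical log: one row per car entry, with timestamp and facility id.
df = pd.read_csv("pr_occupancy.csv", parse_dates=["entry_time"])

peak = (df.assign(hour=df["entry_time"].dt.hour)
          .groupby(["facility_id", "hour"]).size()
          .rename("entries")
          .reset_index())

# Busiest hour per facility: a first cut at identifying peak usage times.
print(peak.loc[peak.groupby("facility_id")["entries"].idxmax()])
```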
By integrating mobility data analytics with user insights, this thesis will contribute to improving ÖBB INFRA’s intermodal strategies. The findings will support evidence-based decision-making for future infrastructure planning, ensuring more sustainable and efficient mobility solutions in Austria.
Meet Your Supervisor: Shahrom Sohi (shahrom.hosseinisohi@pv.oebb.at)
Shahrom Sohi, a transport engineer and digital transportation enthusiast working with ÖBB, will be your main point of contact throughout this research project. This thesis offers a unique opportunity to collaborate with ÖBB and other European Mobility Digital players, contributing to the development of more efficient and user-friendly rail travel experiences across Europe.
References:
Riley, P. et al. (2010) Intermodal Passenger Transport in Europe: Passenger Intermodality from A to Z. The European Forum on Intermodal Passenger Travel. Available at: https://www.academia.edu/5074766/P_Intermodal_Passenger_Transport_in_Europe_PASSENGER_INTERMODALITY_FROM_A_TO_Z_the_european_forum_on_intermodal_passenger_travel_Link_is_funded_by_the_European_Commissions_Directorate_General_for_Mobility_and_Transport_DG_MOVE.
Litman, T. (2007) ‘Evaluating rail transit benefits: A comment’, Transport Policy, 14(1), pp. 94–97. Available at: https://doi.org/10.1016/j.tranpol.2006.09.003.
Pitsiava-Latinopoulou, M. and Iordanopoulos, P. (2012) ‘Intermodal Passengers Terminals: Design Standards for Better Level of Service’, Procedia - Social and Behavioral Sciences, 48, pp. 3297–3306. Available at: https://doi.org/10.1016/j.sbspro.2012.06.1295.
22. A Data Engineering Perspective of AI System Developments
Supervisor: Fajar J. Ekaputra
Main idea: Developing reliable and robust Machine Learning models as part of AI system development requires much more than a good training algorithm. It is necessary to build the model using high-quality training data [1]. Subsequently, in the deployment stage, it is also crucial to prepare the input data to maximize the quality of the results.
In this thesis, we are looking to investigate existing frameworks for data engineering in the context of AI systems development to identify key steps, their definitions and characteristics, relevant methods and tools, and the challenges related to these steps. Furthermore, we are planning to use the result to extend the current classification of process steps defined by van Bekkum et al. [3] as part of boxology notation.
Research Questions:
· What are the key data engineering concepts, their definitions, relevant methods and tools, in the context of AI system developments?
· (optional) How do we represent these data engineering concepts in visual and formal notations?
Expected Tasks:
· Conduct a literature review to identify key papers on the data engineering process in the context of AI system development.
· Read selected research papers (provided in a structured sheet) and extract relevant information such as the key steps, their definitions, and the methods/tools used for each step.
· Analyze and summarize key findings and patterns in a structured format.
Prior Knowledge and Skills:
· Basic understanding of Machine Learning
· Ability to systematically analyze and summarize academic papers.
· Knowledge of data analysis techniques.
References:
· [1] N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich, “Data Lifecycle Challenges in Production Machine Learning: A Survey,” SIGMOD Rec., vol. 47, no. 2, pp. 17–28, Dec. 2018, doi: 10.1145/3299887.3299891.
· [2] Y. Roh, G. Heo, and S. E. Whang, “A Survey on Data Collection for Machine Learning: A Big Data - AI Integration Perspective,” IEEE Trans. Knowl. Data Eng., vol. 33, no. 4, pp. 1328–1347, Apr. 2021, doi: 10.1109/TKDE.2019.2946162.
· [3] M. Van Bekkum, M. De Boer, F. Van Harmelen, A. Meyer-Vitali, and A. T. Teije, “Modular design patterns for hybrid learning and reasoning systems: a taxonomy, patterns and use cases,” Appl Intell, vol. 51, no. 9, pp. 6528–6546, Sep. 2021, doi: 10.1007/s10489-021-02394-3.
Keywords: Literature Study, Data Engineering, Machine Learning lifecycle
23. Automated Support for Semantic Toolhub Data Acquisition
Supervisor: Fajar J. Ekaputra
Main idea: The open-source landscape is increasingly crowded. While much good software is being developed and made available, finding the right software for a given use case is increasingly complex. To combat this challenge in the knowledge graph domain, the semantic toolhub (https://w3id.org/toolhub/) aggregates existing initiatives together with their main characteristics. It allows users to select the desired functionality and get applications that fulfill the requirements.
So far, the underlying data for the semantic toolhub has been manually curated. That makes the approach labor-intensive and challenging to maintain and scale. In this thesis, you aim to automate the data elicitation task. Using the readme of a repository or the abstract of a paper (or both), we want to automatically identify a software's main attributes and categorize it along a predetermined taxonomy.
Research Questions:
· To what extent can we support automatic acquisition of semantic tools’ descriptions through ML approaches, e.g., BERT or LLMs?
Expected Tasks:
· 1. Gather state of the art regarding automated software analysis
· 2. Based on your literature review, select at least two appropriate extraction approaches, e.g., an LLM and an encoder model like BERT
· 3. Develop the pipeline for the two approaches (a minimal zero-shot sketch follows this list), thus among others:
o a. automatic download of the readme or abstract
o b. data preparation
o c. analysis and feature extraction
· 4. Evaluate both approaches using the existing data as a gold standard
· 5. Apply the best approach to categorize new data
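As an illustration of the encoder-based branch, the sketch below uses an off-the-shelf zero-shot classifier over a README snippet; the candidate labels are placeholders for the toolhub taxonomy, and the README text is invented:

```python
from transformers import pipeline  # encoder-based baseline; an LLM is analogous

labels = ["ontology editor", "triple store", "SPARQL client", "reasoner"]
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

readme = "GraphDB-lite stores RDF triples and answers SPARQL queries."  # dummy
result = classifier(readme, candidate_labels=labels)
print(result["labels"][0], round(result["scores"][0], 3))  # top predicted category
```

Evaluating both this and an LLM-based extractor against the manually curated toolhub entries (task 4) would then show which approach generalizes better.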
Prior Knowledge and Skills:
· Programming capabilities
· Familiarity with Large Language Models (e.g., GPT, DeepSeek, Llama) and/or other ML approaches
· Data analysis and evaluation skills.
References:
Reiz, Achim, Fajar J. Ekaputra, and Nandana Mihindukulasooriya. "Semantic Tool Hub: Towards a Sustainable Community-Driven Documentation of Semantic Web Tools." In European Semantic Web Conference, pp. 311-315. Cham: Springer Nature Switzerland, 2024., link.springer.com/chapter/10.1007/978-3-031-78952-6_48
Keywords: Information Extraction, LLMs, Machine Learning, Semantic Tool Hub