Turning PDFs into Research Data (2026)

Applications for this course open in January 2026.
The application deadline will be February 27, 2026.

Do you ever find that the data you need for your research is accessible, but locked in documents such as company reports or building plans rather than in a convenient table?

Perhaps the information you need is spread out across many different documents?

If only we could read and extract structured data from thousands of written documents.

In this course, we explore how to accomplish this task by combining web scraping, Optical Character Recognition (OCR), and Natural Language Processing (NLP). Over four weeks, online lessons and interactive sessions will teach you the fundamentals of these key technologies.

Application

The deadline to apply for a seat in this free course will be February 27, 2026.

As the number of participants is limited, and to ensure the best fit between the course’s content and its participants, we will ask you to specify in the application form why you would like to participate in this course. We will review all applications after the deadline and notify you of the outcome by mid-March. You will get access to the course materials no later than one week before the first meeting, as you are expected to review the videos and materials of the first unit beforehand.

The course is open to all researchers. However, as part of our role within the consortium for business, economics, and related data, priority will be given to researchers working in these fields.

The course is completely free of charge. As part of this opportunity, we kindly ask all participants to actively contribute to our evaluation process. This helps us to continuously improve the course.

Topics

Methods for extracting text and files from websites using tools such as Selenium, and ways to avoid common pitfalls.
Methods for extracting text from images, such as scans of written documents.
Technologies that can help automate data extraction from harvested text, including Retrieval-Augmented Generation (RAG), and a critical review of common data quality issues.
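
To give a flavour of the last step, here is a minimal sketch of turning harvested text into table rows. It is not taken from the course materials: the file names, the sample text, and the regular expressions are all hypothetical, and the scraping and OCR steps are assumed to have already produced plain text. It uses only the Python standard library.

```python
import csv
import io
import re

# Hypothetical text, standing in for what OCR might return for two reports.
PAGES = {
    "report_a.pdf": "Annual Report 2023\nTotal revenue: EUR 1,250,000\nEmployees: 48",
    "report_b.pdf": "Annual Report 2024\nTotal revenue: EUR 980,500\nEmployees: 51",
}

# Illustrative patterns (assumptions); real documents usually need more tolerant rules.
PATTERNS = {
    "year": re.compile(r"Annual Report (\d{4})"),
    "revenue_eur": re.compile(r"Total revenue: EUR ([\d,]+)"),
    "employees": re.compile(r"Employees: (\d+)"),
}


def extract_row(filename, text):
    """Return one table row for a document; fields that are not found stay None."""
    row = {"file": filename}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        # Strip thousands separators before converting the captured digits.
        row[field] = int(match.group(1).replace(",", "")) if match else None
    return row


def to_csv(rows):
    """Serialise the rows as CSV text with one column per extracted field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["file", *PATTERNS])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [extract_row(name, text) for name, text in PAGES.items()]
print(to_csv(rows))
```

Fields that cannot be found are left empty rather than guessed, so gaps stay visible when the data quality is reviewed later.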

Format

This is an online course.

Week 1: Watch pre-prepared video lectures covering the relevant theory and demonstrations of example exercises. The topics are web scraping and OCR (~45 min). Interactive Online Session (~60 min).
Week 2: Apply last week’s lessons to the example coding exercise or your own project (~30 min). Interactive Online Session (~60 min).
Week 3: Watch pre-prepared video lectures covering the relevant theory and demonstrations of example exercises. The topics are NLP and common data extraction issues (~30 min). Interactive Online Session (~60 min).
Week 4: Apply last week’s lessons to the example coding exercise or your own project (~30 min). Interactive Online Session (~60 min).

Weekly Meetings

The course includes four live online meetings, in which you will discuss the week’s contents with the instructor and fellow participants:

Meeting 1: April 14, 3:00pm – 4:00pm CEST
Meeting 2: April 21, 3:00pm – 4:00pm CEST
Meeting 3: May 5, 3:00pm – 4:00pm CEST
Meeting 4: May 12, 3:00pm – 4:00pm CEST

Prerequisites

Basic programming knowledge (R, Python, …)
- Note that the course will be taught in Python, but it is fine if you only know R. The code examples are simple and will run entirely on Google Colab, meaning you will not have to install anything. This course is a good opportunity to try Python for the first time, and you can also take the free, self-paced BERD Academy introduction to Python course beforehand.
Willingness to learn new technical skills
A Google Account
A Zoom account to participate in the online meetings.

About the Instructor

John ‘Jack’ Collins is a PhD student in Sociology at the Graduate School of Economic and Social Sciences. He holds a Bachelor of Sociology with Honours from the Australian National University and a Master’s degree in Data Science from James Cook University. His Master’s project concerned predictive modelling of student attrition from sub-tertiary courses in Australia. During his Master’s studies, he also assisted in research projects on social attitudes and voting behaviour in Australia. Before starting his PhD, Jack was a Senior IT Consultant specialising in data engineering, analytics, and software development. Jack is interested in applying Data Science and IT to sociological research, particularly with regard to machine learning, analytics, and web applications.