Archives

Turning PDFs into Research Data

Do you ever find that the data you need for your research is accessible, but locked in documents such as company reports or building plans rather than in a convenient table?

Perhaps the information you need is spread out across many different documents?

If only we could read and extract structured data from thousands of written documents. 

In this course, we explore how to accomplish this task by combining web scraping, Optical Character Recognition (OCR), and Natural Language Processing (NLP). Over four weeks, online lessons and interactive sessions teach the fundamentals of these key technologies.

Topics

  • Methods for extracting text and files from websites using tools such as Selenium and how to avoid common pitfalls.
  • Methods for extracting text from images, such as scans of written documents. 
  • Exploring technologies that can help automate data extraction from harvested text and a critical review of common data quality issues. 
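
As a rough illustration of the last topic, the sketch below turns a block of harvested text into a structured record using only Python's standard library. It is not taken from the course materials: the sample text, the "Label: value" layout, and the helper names `extract_fields` and `parse_amount` are assumptions made for this example.

```python
import re

# Hypothetical OCR output from a scanned company report (invented for illustration).
ocr_text = """
Company: Example GmbH
Year: 2021
Revenue: 1,250,000 EUR
Employees: 48
"""

# Assumed layout: one "Label: value" pair per line.
FIELD_PATTERN = re.compile(r"^(?P<label>[A-Za-z ]+):\s*(?P<value>.+)$", re.MULTILINE)


def extract_fields(text):
    """Return a dict mapping lower-cased labels to their raw string values."""
    return {
        m.group("label").strip().lower(): m.group("value").strip()
        for m in FIELD_PATTERN.finditer(text)
    }


def parse_amount(raw):
    """Keep only the digits of a string such as '1,250,000 EUR'; None if there are none."""
    digits = re.sub(r"[^\d]", "", raw)
    return int(digits) if digits else None


record = extract_fields(ocr_text)
record["revenue"] = parse_amount(record["revenue"])
record["employees"] = parse_amount(record["employees"])
print(record)
```

Note that `parse_amount` simply discards every non-digit character, so a value with a decimal point would be misread. Real scans rarely follow one clean layout, which is the kind of data quality issue the third topic refers to.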

Format

This is an online course. 

  • Week 1: Watch pre-prepared video lectures covering the relevant theory and demonstrations of example exercises. The topic is web scraping and OCR (~45 min). Interactive Online Session (~60 min).
  • Week 2: Apply last week’s lessons to the example coding exercise or your own project (~30 min). Interactive Online Session (~60 min).
  • Week 3: Watch pre-prepared video lectures covering the relevant theory and demonstrations of example exercises. The topic is NLP and common data extraction issues (~30 min). Interactive Online Session (~60 min).
  • Week 4: Apply last week’s lessons to the example coding exercise or your own project (~30 min). Interactive Online Session (~60 min).

Weekly Meetings

The course includes 4 live Online Meetings, in which you will discuss the week’s contents with the instructor and fellow participants:

Meeting 1: Aug 27, 2024, 4:30pm – 5:30pm CEST
Meeting 2: Sep 03, 2024, 4:30pm – 5:30pm CEST
Meeting 3: Sep 10, 2024, 4:30pm – 5:30pm CEST
Meeting 4: Sep 17, 2024, 4:30pm – 5:30pm CEST

Prerequisites

  • Basic programming knowledge (R, Python, …)
    • Note that the course will be taught in Python, but it is fine if you only know R. The code examples are simple and run entirely on Google Colab, so you will not have to install anything. This course is a good opportunity to try Python for the first time, and you can also try the self-paced BERD introduction to Python course.
  • Willingness to learn new technical skills
  • A Google Account

About the Instructor

John ‘Jack’ Collins is a PhD student in Sociology at the Graduate School of Economic and Social Sciences. He holds a Bachelor’s degree in Sociology with Honours from the Australian National University and a Master’s degree in Data Science from James Cook University. His Master’s project concerned predictive modelling of student attrition from sub-tertiary courses in Australia. During his Master’s studies, he also assisted in research projects on social attitudes and voting behaviour in Australia. Before starting his PhD, Jack was a Senior IT Consultant specialising in data engineering, analytics and software development. He is interested in applying Data Science and IT to sociological research, particularly with regard to machine learning, analytics, and web applications.

Data Science with Python

Do you want to learn Python for Data Science, but can’t find the time to attend synchronous courses regularly?

Register for this free self-paced course and learn all you need to start with Python on your own schedule!

Introductory Tutorials

  • Python Introduction
  • Basic Scripting in Python
  • Functions and Packages

Advanced: Data Management and Visualisation

  • Introduction to Pandas
  • Data Exploration in Pandas
  • Visualisation with Matplotlib
  • Advanced Plotting

Advanced: Working with Libraries

  • Numpy/Scipy
  • Working with Web Documents
  • Machine Learning with scikit-learn

About the Instructors

Sven Hertling and Nicolas Heist both work at the University of Mannheim as researchers or scientific staff. Hertling holds a Master’s degree in Computer Science and primarily researches semantic technologies/semantic web, linked data, and knowledge graphs. Heist’s research interests primarily focus on Semantic Web technologies, Knowledge Graphs, and Linked Data.

Data Science with R

Do you want to learn R for Data Science, but can’t find the time to attend synchronous courses regularly?

Register for this free self-paced course and learn all you need to start with R on your own schedule!

Introductory Tutorials

  • The True Basics of R
  • Data Manipulation in R
  • Data Visualization in R

Advanced Data Manipulation in R

  • Data Management
  • Subsets & Aggregation
  • Advanced Programming
  • Testing in R

Advanced Data Visualization in R

  • The Concept of Visualization & Advanced base R Graphs
  • Introduction to ggplot2
  • Advanced ggplot2

About the Instructor

Leonie Gehrmann is a doctoral student at the University of Mannheim in the field of Marketing. Her research interests primarily focus on machine learning applications in marketing, economics of data, and consumer psychology.

Data Literacy Essentials: Processing Data (R)

In this session we will introduce you to R, a programming language that allows you to edit, visualize and analyze data. We will give you an introduction to the programming environment RStudio, teach you the basics of syntax and show you examples of the possibilities that open up with the use of R. No previous knowledge is assumed.

Data Literacy Essentials: Processing Data (Stata)

In this session we will introduce you to Stata, a widely used statistical program that allows you to process, visualize and analyze data. We will give you an introduction to the most important functions and the basics of operation and use examples to show you the possibilities that open up when using Stata. No previous knowledge is assumed.