Archives

Turning PDFs into Research Data

Do you ever feel that the data you need for your research is accessible but not available in a convenient table, for example because it is locked in company reports or building plans? Perhaps the information you need is spread across many different documents? If only we could read and extract structured data from thousands of written documents.

In this course, we explore how to accomplish this task by combining web scraping, Optical Character Recognition (OCR), and Natural Language Processing (NLP). Over four weeks, we provide online lessons and interactive sessions to learn the fundamentals of these key technologies.

Topics

  • Methods for extracting text and files from websites using tools such as rvest and how to avoid common pitfalls. 
  • Methods for extracting text from images, such as scans of written documents. 
  • Exploring technologies that can help automate data extraction from harvested text and a critical review of common data quality issues. 
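To make these topics concrete, here is a minimal sketch in Python using only the standard library. It is not the course material (the course mentions rvest, an R package); the sample HTML, the sample text, and the `revenue of EUR … million` pattern are invented for illustration. The first part collects links to PDF files from a page, the second pulls two fields out of text such as OCR output.

```python
import re
from html.parser import HTMLParser


class PdfLinkCollector(HTMLParser):
    """Collect href values of <a> tags that point to PDF files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(".pdf"):
                self.links.append(value)


def extract_fields(text):
    """Pull a year and a revenue figure out of free text (hypothetical patterns)."""
    year = re.search(r"\b(19|20)\d{2}\b", text)
    revenue = re.search(r"revenue of EUR\s*([\d.,]+)\s*million", text, re.IGNORECASE)
    return {
        "year": int(year.group(0)) if year else None,
        "revenue_million_eur": float(revenue.group(1).replace(",", "")) if revenue else None,
    }


# Hypothetical page snippet and OCR output, invented for illustration.
SAMPLE_HTML = (
    '<ul><li><a href="/reports/2021.pdf">Report 2021</a></li>'
    '<li><a href="/about.html">About</a></li></ul>'
)
SAMPLE_TEXT = "Annual Report 2021. The company reported revenue of EUR 12.5 million."

collector = PdfLinkCollector()
collector.feed(SAMPLE_HTML)
pdf_links = collector.links
record = extract_fields(SAMPLE_TEXT)
print(pdf_links)
print(record)
```

Real documents rarely follow one fixed pattern, so rules like these usually need checking against a hand-coded sample before the extracted table is used for analysis.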

Format

This is an online course. 

  • Week 1: Watch pre-prepared video lectures about relevant theory and demonstration of example exercises. The topic is web scraping and OCR (~60 min). Interactive Online Session (~60 min).
  • Week 2: Applying last week’s lessons to the example coding exercise or your own project (~60 min). Interactive Online Session (~60 min).
  • Week 3: Watch pre-prepared video lectures about relevant theory and demonstration of example exercises. The topic is NLP and common data extraction issues (~60 min). Interactive Online Session (~60 min).
  • Week 4: Applying last week’s lessons to the example coding exercise or your own project (~60 min). Interactive Online Session (~60 min).

Weekly Meetings

The course includes 4 live Online Meetings, in which you will discuss the week’s contents with the instructor and fellow participants:

Meeting 1: Aug 27, 2024, 4:30pm – 5:30pm CEST
Meeting 2: Sep 03, 2024, 4:30pm – 5:30pm CEST
Meeting 3: Sep 10, 2024, 4:30pm – 5:30pm CEST
Meeting 4: Sep 17, 2024, 4:30pm – 5:30pm CEST

Prerequisites

  • Basic programming knowledge (R, Python, …)
  • Willingness to learn new technical skills

About the Instructor

John ‘Jack’ Collins is a PhD student in Sociology at the Graduate School of Economic and Social Sciences. He holds a Bachelor of Sociology with Honours from the Australian National University and a Master’s degree in Data Science from James Cook University. His Master’s project focused on predictive modelling of student attrition from sub-tertiary courses in Australia. During his Master’s studies, he also assisted in research projects on social attitudes and voting behaviour in Australia. Before starting his PhD, Jack was a Senior IT Consultant specialising in data engineering, analytics and software development. Jack is interested in applying Data Science and IT to sociological research, particularly with regard to machine learning, analytics, and web applications.

Data Science with Telecommunication Data

More details and registration available soon

Telecommunication data are ubiquitous today, as almost everyone carries a mobile device in their pocket. This generates huge data trails that provide almost unbiased information about people’s mobility and location. Such data have been used, for example, in the COVID-19 years to assess the impact of non-pharmaceutical interventions on mobility patterns.

The workshop will focus on the analysis of telecommunication data. This includes the pre-processing of massive data, the restructuring of data into mobility data such as flows, or the detection of local clusters, to name just a few possible applications.
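As a rough illustration of what restructuring into flows can mean, the following Python sketch counts origin-destination transitions between consecutive location records of each device. The record layout `(device_id, timestamp, cell_id)` and the sample values are assumptions made for this example, not a format used in the workshop.

```python
from collections import Counter

# Hypothetical records: (device_id, timestamp, cell_id), invented for illustration.
PINGS = [
    ("a", 1, "Munich"),
    ("a", 2, "Munich"),
    ("a", 3, "Augsburg"),
    ("b", 1, "Munich"),
    ("b", 2, "Augsburg"),
    ("b", 3, "Munich"),
]


def count_flows(pings):
    """Count origin-destination transitions between consecutive pings per device."""
    flows = Counter()
    last_cell = {}
    # Sort by device, then by time, so consecutive records belong together.
    for device, _, cell in sorted(pings, key=lambda p: (p[0], p[1])):
        previous = last_cell.get(device)
        # Only a change of cell counts as a movement.
        if previous is not None and previous != cell:
            flows[(previous, cell)] += 1
        last_cell[device] = cell
    return flows


flows = count_flows(PINGS)
for (origin, destination), n in sorted(flows.items()):
    print(origin, "->", destination, n)
```

With data at realistic scale, the same aggregation would typically run in a database or a distributed framework rather than in a single in-memory loop.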

Under the umbrella of the BERD Academy, the workshop aims to bring together data scientists who work with or on telecommunication data and/or are interested in flow and mobility data to address applied research questions from sociology, economics or related fields. The workshop provides an open forum for current research and new ideas on data analysis.

Date: Nov 12, 2024
Location: LMU Munich

About the Instructor

Göran Kauermann has been a full professor of Statistics at LMU Munich since 2011 and heads the Chair of Statistics for Economics, Business, and Social Sciences there. Additionally, he is the chairman of the German Data Science Society (GDS). His research interests focus on semi- and nonparametric analysis, generalized linear and mixed models, and network data analysis.

AI based Methods for Using Text as Data in the Social Sciences

Register now for this in-person workshop in Leipzig to get to know state-of-the-art AI applications for using text as data in social science research!

Moderated by Prof. Dr. Gerhard Heyer (University of Leipzig), the workshop will introduce you to various examples. It is aimed both at researchers who are just starting with text-as-data and at those looking for advanced practical applications.

Date: Dec 12, 2023
Time: 1:00 pm – 5:15 pm
Location: Seminargebäude Uni Leipzig

Program

1) iLCM – an interactive text mining environment for social and economic scientists

by Dr. Christian Kahmann (University of Leipzig)

Slides

iLCM is an integrated research environment for the analysis of structured and unstructured data in a ‘Software as a Service’ architecture (SaaS), and has been designed to address the needs of researchers with little experience in working with text mining tools as well as experienced researchers with substantial knowledge of the R language. It supports the quantitative evaluation of large amounts of qualitative data using text mining methods, including organising data into subcorpora, annotating and classifying data with active learning, and representing data and topics over time. We introduce the software and present a real application use case.

2) Working with English News Corpora in the Leipzig Corpora Collection

by Dr. Thomas Eckart, Erik Körner and Felix Helfer (Sächsische Akademie der Wissenschaften)

Slides

The Leipzig Corpora Collection (LCC) contains up-to-date and time-stamped crawled news data for more than 900 corpora in more than 250 languages. We shall present the available data for corpora in English using the (No)Sketch Engine search interface, including metadata such as publication date or subject area, and demonstrate by way of example how the corpora can be used for an application in the social sciences using a complex search environment enhanced by linguistic pre-processing.

3) Active Learning with (L)LMs: State of the Art and Practical Challenges

by Christopher Schröder (Center for Scalable Data Analytics and Artificial Intelligence, ScaDS.AI, Dresden/Leipzig)

Slides

Following a brief introduction to Active Learning, we shall demonstrate how the Active Learning library Small-Text can be applied to a “Words-of-the-Day” corpus of the Leipzig Corpora Collection. In addition, we shall discuss how LLMs can be used for social science research, and how they can be optimised with respect to performance and memory requirements.

About the Host

Gerhard Heyer is a professor of Natural Language Processing at the Institute of Computer Science at the University of Leipzig. His research primarily focuses on research data infrastructures, automatic semantic processing of text, and applications of text mining, including in the Digital Humanities.

Data Science for Social Good

Registration ended (Feb 15, 2023)

This 2-month full-time program brings together aspiring data science talents in small groups to work on projects with a positive societal impact.

The program is designed for two teams of 4-5 fellows, each working on a separate project for the social good. Both teams will be assisted by a Technical Mentor and a Project Manager. The participants receive a fellowship that covers living expenses for the duration of the program, from August 1st to September 30th.

The program is aimed at students, recent graduates and PhD students from diverse scientific and geographical backgrounds. We therefore encourage applications from the fields of data science, computer science and statistics, but also from the social and natural sciences in general.

If you are interested in joining the program or have students, friends or colleagues in mind who might be interested, check out the website and apply by February 15th: https://sites.google.com/view/dssgx-munich-2023/startseite

Please also share this call in your network and in classes you teach, or approach people directly.

If you have any questions, please feel free to reach out via dssg2023@stat.uni-muenchen.de

About the Host and the Organizing Institution

This event is hosted by the Chair of Frauke Kreuter, who is a professor of Statistics and Data Science in the Social and Behavioral Sciences at LMU Munich. In her research, she focuses on statistical methods related to labor market and occupational research, as well as data science. In addition to her academic work, she is the founder or co-founder of several programs that address evolving data environments and data-driven research. The Munich Center for Machine Learning (MCML) is one of six national AI Competence Centers and brings together the leading ML researchers from LMU, TUM and associated institutions.

[IAB-SMART 3] How to tidy and anonymize raw smartphone geolocation data: Code and Practitioner’s Examples from the IAB-SMART Project

Click here to download the slides:

In addition to in-app survey data, the IAB-SMART project collected 1.3 million location observations, based on GPS and mobile network data, from participants recruited among Android users in the Panel Study Labor Market and Social Security (PASS). Financed by BERD@NFDI, these raw geolocations were aggregated and published as IAB-SMART-Activity at the Research Data Center of the Institute for Employment Research (IAB). The dataset contains geolocation indicators that can be used by researchers and linked to PASS survey data and PASS-ADIAB administrative employment data.

In this webinar, we will provide you with an overview of the preparation and anonymization tasks necessary to turn raw geolocations into meaningful indicators and to publish these indicators as a Scientific Use File (SUF). Using R code examples, we will guide you step-by-step through the data editing process and provide you with tips & tricks for preparing smartphone location data. We will also highlight helpful packages, papers and tutorials.
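To give a flavour of such a preparation step, here is a small Python sketch (the webinar itself uses R). It shows one generic anonymization technique, not the procedure used in the IAB-SMART project: coordinates are coarsened to a grid by rounding, and grid cells visited by fewer than a minimum number of distinct participants are suppressed. The sample observations, the rounding precision and the threshold of three participants are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical raw observations: (participant_id, latitude, longitude).
RAW = [
    ("p1", 49.4521, 11.0767),
    ("p2", 49.4538, 11.0791),
    ("p3", 49.4502, 11.0744),
    ("p1", 48.1374, 11.5755),
]


def aggregate_cells(observations, decimals=1, min_participants=3):
    """Coarsen coordinates to a grid and keep only sufficiently populated cells."""
    participants = defaultdict(set)
    for pid, lat, lon in observations:
        # Rounding to one decimal place maps each point to a coarse grid cell.
        cell = (round(lat, decimals), round(lon, decimals))
        participants[cell].add(pid)
    # Suppress cells with too few distinct participants to limit re-identification.
    return {
        cell: len(pids)
        for cell, pids in participants.items()
        if len(pids) >= min_participants
    }


cells = aggregate_cells(RAW)
print(cells)
```

Both parameters trade detail against disclosure risk: a finer grid or a lower threshold keeps more information but makes individual participants easier to single out.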

The webinar’s analytical challenges, questions and open discussions support you directly in your current and/or future work with (smartphone) geolocation data.

This webinar is part of the IAB-SMART Webinar Series with three sessions. To get the whole picture, we recommend participating in the whole series, but the sessions can also be attended individually.

About the Instructor

Andreas Filser, who holds a doctoral degree in the field of Social Sciences, is a research associate at the Research Data Centre at the Institute for Employment Research of the German Federal Employment Agency (FDZ-IAB). His research interests primarily focus on demography, family sociology, and labor market issues.

[IAB-SMART 2] What do geolocation smartphone data add to a survey panel? – Available indicators from the IAB-SMART Project

Click here to download the slides of the webinar:

Smartphones are ubiquitous in everyday life, and researchers capture behavioral data, such as geolocations, using sensors built into smartphones. IAB-SMART-Activity is a new dataset with 398 participants and activity indicators based on geolocation data. The data were collected from early January 2018 to the end of August 2018. The data from IAB-SMART-Activity can be linked to the Panel Study Labor Market and Social Security (PASS) survey and administrative employment histories (PASS-ADIAB), providing a unique opportunity to study activity and labor market participation.

This webinar describes the IAB-SMART-Activity dataset and its research potential. In addition to describing the variables in the IAB-SMART-Activity data module, possible applications are outlined. Finally, the possibilities for accessing the data via the Research Data Center at the Institute for Employment Research (IAB) are discussed.

This webinar is part of the IAB-SMART Webinar Series with three sessions. To get the whole picture, we recommend participating in the whole series, but the sessions can also be attended individually.

About the Instructor


Dr. Florian Zimmermann is a staff member at the Research Data Centre of the Institute for Employment Research. Furthermore, he is involved in several projects related to data access and data exploration. His research focuses on gender studies, immigration, corporations, and social inequality.

[IAB-SMART 1] The IAB-SMART Study: Collecting Behavioral Smartphone Sensor Data for Social Research

Click here to download the slides of the webinar:

Do you own a smartphone? Probably. Smartphones have become a ubiquitous tool of our daily life, always with us and hard to imagine living without. You are probably aware that smartphones collect various kinds of personal information about you, such as geolocations, call and text message log data, and app usage data. While your smartphone collects sensitive data about you, it also holds as yet unknown potential for understanding human behavior and society in the social sciences. However, to use these data, researchers have to design data collections that comply with current legal regulations such as the GDPR, are as transparent as possible, and protect the data of participants.

In this webinar, I will give you an overview of how to collect smartphone data ethically and transparently with an Android app. The app we used is called IAB-SMART and collected various sensor data over a period of six months. Besides showing how to design the recruitment process, I will provide an overview of the data collected and its various hidden error sources along the Total Survey Error Framework (TSE).

This webinar is part of the IAB-SMART Webinar Series with three sessions. To get the whole picture, we recommend participating in the whole series, but the sessions can also be attended individually.

About the Instructor

Georg-Christoph Haas, who holds a doctoral degree in Sociology, is a Data Scientist and Research Consultant at the Institute for Employment Research in Nuremberg. His research focuses primarily on survey methodology, UX research, and data science. Additionally, he leads and plans several data collection projects and supports researchers in designing their data collection efforts.