Do you ever find that the data you need for your research is available, but not in a convenient table — hidden in company reports or building plans, perhaps? Maybe the information you need is spread across many different documents? If only we could read and extract structured data from thousands of written documents.
In this course, we explore how to accomplish this task by combining web scraping, Optical Character Recognition (OCR), and Natural Language Processing (NLP). Over four weeks, we provide online lessons and interactive sessions to learn the fundamentals of these key technologies.
Topics
Methods for extracting text and files from websites using tools such as rvest and how to avoid common pitfalls.
Methods for extracting text from images, such as scans of written documents.
Exploring technologies that can help automate data extraction from harvested text and a critical review of common data quality issues.
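The course itself teaches these steps with tools such as rvest in R. As a rough flavour of what "extracting text from a web page" means, here is a minimal sketch using only Python's standard library; the example HTML, class name, and output are invented for illustration and are not course material:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping script/style blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth counter for script/style nesting

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep text only outside script/style, ignoring pure whitespace.
        if not self._skip and data.strip():
            self.parts.append(data.strip())

html = "<html><body><h1>Report</h1><script>var x=1;</script><p>Revenue rose.</p></body></html>"
p = TextExtractor()
p.feed(html)
print(" ".join(p.parts))  # → "Report Revenue rose."
```

Real scraping additionally involves fetching pages politely (rate limits, robots.txt) — one of the "common pitfalls" the course covers.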
Format
This is an online course.
Week 1: Watch pre-prepared video lectures covering relevant theory and demonstrations of example exercises. The topic is web scraping and OCR (~60 min). Interactive Online Session (~60 min).
Week 2: Apply last week’s lessons to the example coding exercise or your own project (~60 min). Interactive Online Session (~60 min).
Week 3: Watch pre-prepared video lectures covering relevant theory and demonstrations of example exercises. The topic is NLP and common data extraction issues (~60 min). Interactive Online Session (~60 min).
Week 4: Apply last week’s lessons to the example coding exercise or your own project (~60 min). Interactive Online Session (~60 min).
Weekly Meetings
The course includes 4 live Online Meetings, in which you will discuss the week’s contents with the instructor and fellow participants:
Meeting 1: Aug 27, 2024, 4:30pm – 5:30pm CEST
Meeting 2: Sep 03, 2024, 4:30pm – 5:30pm CEST
Meeting 3: Sep 10, 2024, 4:30pm – 5:30pm CEST
Meeting 4: Sep 17, 2024, 4:30pm – 5:30pm CEST
Prerequisites
Basic programming knowledge (R, Python, …)
Willingness to learn new technical skills
About the Instructor
John ‘Jack’ Collins is a PhD student in Sociology at the Graduate School of Economic and Social Sciences. He holds a Bachelor of Sociology with Honours from the Australian National University and a Master’s degree in Data Science from James Cook University. His Master’s project concerned predictive modelling of student attrition from sub-tertiary courses in Australia. During his Master’s studies, he also assisted in research projects on social attitudes and voting behaviour in Australia. Before starting his PhD, Jack was a Senior IT Consultant specialising in data engineering, analytics and software development. He is interested in applying data science and IT to sociological research, particularly machine learning, analytics, and web applications.
Telecommunication data are ubiquitous today, as almost everyone carries a mobile device in their pocket. This generates huge data trails that provide almost unbiased information about people’s mobility and location. Such data have been used, for example, in the COVID-19 years to assess the impact of non-pharmaceutical interventions on mobility patterns.
The workshop will focus on the analysis of telecommunication data. This includes the pre-processing of massive data, the restructuring of raw records into mobility data such as flows, and the detection of local clusters, to name just a few possible applications.
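As a hedged illustration of what "restructuring records into flows" can mean, the following minimal Python sketch counts distinct users per origin–destination pair; the toy records and field names are invented for illustration, not the workshop's actual data model:

```python
from collections import Counter

# Hypothetical toy records: (user_id, origin_area, destination_area).
records = [
    ("u1", "A", "B"), ("u2", "A", "B"), ("u3", "B", "C"),
    ("u1", "A", "B"), ("u4", "C", "C"),
]

# Count distinct users per origin–destination pair, ignoring stays
# (origin == destination) and repeated observations of the same user.
flows = Counter()
seen = set()
for user, origin, dest in records:
    if origin != dest and (user, origin, dest) not in seen:
        seen.add((user, origin, dest))
        flows[(origin, dest)] += 1

print(dict(flows))  # → {('A', 'B'): 2, ('B', 'C'): 1}
```

Real mobile network data requires the same logic at vastly larger scale, plus decisions about time windows and spatial aggregation units.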
Under the umbrella of the BERD Academy, the workshop aims to bring together data scientists who work with or on telecommunication data and/or are interested in flow and mobility data to address applied research questions from sociology, economics or related fields. The workshop provides an open forum for current research and new ideas on data analysis.
Register now for this in-person workshop in Leipzig to get to know state-of-the-art AI applications for using text as data in social science research!
Moderated by Prof. Dr. Gerhard Heyer (University of Leipzig), you will be introduced to various examples. This workshop is for both researchers who are just starting with text-as-data and those looking for advanced practical applications.
Dec 12, 2023, 1:00 pm – 5:15 pm, Seminargebäude, Uni Leipzig
Program
1) iLCM – an interactive text mining environment for social and economic scientists
iLCM is an integrated research environment for the analysis of structured and unstructured data in a ‘Software as a Service’ (SaaS) architecture. It has been designed to address the needs of researchers with little experience in working with text mining tools, as well as experienced researchers with substantial knowledge of the R language. It supports the quantitative evaluation of large amounts of qualitative data using text mining methods, including organising data into subcorpora, annotating and classifying data with active learning, and representing data and topics over time. We introduce the software and present a real application use case.
2) Working with English News Corpora in the Leipzig Corpora Collection
by Dr. Thomas Eckart, Erik Körner and Felix Helfer (Sächsische Akademie der Wissenschaften)
The Leipzig Corpora Collection (LCC) contains up-to-date and time-stamped crawled news data for more than 900 corpora in more than 250 languages. We shall present the available data for corpora in English using the (No)Sketch Engine, including metadata such as publication date or subject area, and demonstrate by way of example how the corpora can be used for an application in the social sciences using a complex search environment enhanced by linguistic pre-processing.
3) Active Learning with (L)LMs: State of the Art and Practical Challenges
by Christopher Schröder (Center for Scalable Data Analytics and Artificial Intelligence, ScaDS.AI, Dresden/Leipzig)
Following a brief introduction to Active Learning, we shall demonstrate how the Active Learning library Small-Text can be applied to a “Words-of-the-Day” corpus of the Leipzig Corpora Collection. In addition, we shall discuss how LLMs can be used for social science research, and how they can be optimised with respect to performance and memory requirements.
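Small-Text's actual API is not reproduced here; as a generic illustration of the uncertainty-sampling idea behind active learning, this pure-Python toy (document names and probabilities are invented) picks the documents a model is least confident about for human annotation:

```python
import random

random.seed(0)

# Toy pool of unlabeled documents with a model's predicted probability
# of the "positive" class. In practice these come from a classifier;
# here they are randomly generated stand-ins.
pool = {f"doc{i}": random.random() for i in range(10)}

def uncertainty(p):
    """Least-confidence score: highest when the model is most unsure (p near 0.5)."""
    return 1 - max(p, 1 - p)

# One active-learning step: select the 3 documents with the highest
# uncertainty — these would be sent to a human annotator, and the
# model retrained on the newly labeled examples.
batch = sorted(pool, key=lambda d: uncertainty(pool[d]), reverse=True)[:3]
print(batch)
```

Repeating this select-label-retrain loop is what lets active learning reach good accuracy with far fewer labels than random annotation.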
Hosted by the Chair of Frauke Kreuter and the Munich Center for Machine Learning (MCML), this 2-month full-time program brings together aspiring Data Science talents in small groups to work on projects with a positive societal impact.
The program is designed for two teams of 4-5 fellows, each working on a separate project for the social good. Both teams will be assisted by a Technical Mentor and a Project Manager. The participants receive a fellowship that covers living expenses for the time of the program from August 1st – September 30th.
The program is aimed at students, recent graduates and PhD students from diverse scientific and geographical backgrounds. We therefore encourage applications from the fields of data science, computer science and statistics, but also from the social and natural sciences in general.
In addition to in-app survey data, the IAB-SMART project collected 1.3 million location observations from GPS and mobile network data on participants recruited among Android users in the Panel Study Labor Market and Social Security (PASS). With funding from BERD@NFDI, the raw geolocations were aggregated and published as IAB-SMART-Activity at the Research Data Center of the Institute for Employment Research (IAB); the dataset contains geolocation indicators that researchers can link to PASS survey data and PASS-ADIAB administrative employment data.
In this webinar, we will provide you with an overview of the preparation and anonymization tasks necessary to turn raw geolocations into meaningful indicators and to publish these indicators as a Scientific Use File (SUF). Using R code examples, we will guide you step-by-step through the data editing process and provide you with tips & tricks for preparing smartphone location data. We will also highlight helpful packages, papers and tutorials.
The webinar’s analytical challenges, questions and open discussions support you directly in your current and/or future work with (smartphone) geolocation data.
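The webinar itself uses R code examples. As a rough, hedged sketch of one kind of indicator that can be derived from raw geolocations, the following Python example (coordinates and timestamps are invented, not IAB-SMART data) sums great-circle distances over a participant-day:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# Hypothetical raw geolocations for one participant-day: (timestamp, lat, lon).
points = [
    ("2018-01-05 08:00", 52.5200, 13.4050),
    ("2018-01-05 12:00", 52.5205, 13.4500),
    ("2018-01-05 18:00", 52.5200, 13.4050),
]
points.sort()  # editing step: order fixes chronologically before summing segments

# Indicator: total kilometres travelled that day, summed over consecutive fixes.
daily_km = sum(
    haversine_km(points[i][1], points[i][2], points[i + 1][1], points[i + 1][2])
    for i in range(len(points) - 1)
)
print(round(daily_km, 1))  # kilometres travelled that day
```

A published SUF would additionally coarsen such indicators (e.g. binning or spatial aggregation) so that individual whereabouts cannot be reconstructed — the anonymization step the webinar walks through.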
This webinar is part of the IAB-SMART Webinar Series with three sessions. To get the whole picture, we recommend participating in the whole series, but the sessions can also be attended individually.
Smartphones are ubiquitous in everyday life, and researchers capture behavioral data using sensors built into smartphones, such as geolocations. IAB-SMART-Activity is a new dataset with 398 participants and activity indicators based on geolocation data. This data was collected from early January 2018 to the end of August 2018. The data from IAB-SMART-Activity can be linked to the Panel Study Labor Market and Social Security (PASS) survey and administrative employment histories (PASS-ADIAB), providing a unique opportunity to study activity and labor market participation.
This webinar describes the IAB-SMART-Activity dataset and its research potential. In addition to describing the variables in the IAB-SMART-Activity data module, possible applications are outlined. Finally, the possibilities for accessing the data via the Research Data Center at the Institute for Employment Research (IAB) are discussed.
Do you own a smartphone? Probably. Smartphones have become a ubiquitous tool of daily life, always with us and hard to imagine life without. You are probably aware that smartphones collect various kinds of personal information about you, such as geolocations, call and text message log data, and app usage data. While your smartphone collects sensitive data about you, it also holds untapped potential for understanding human behavior and society in the social sciences. However, to use this data, researchers have to design data collections that comply with current legal regulations such as the GDPR, are as transparent as possible, and protect participants’ data.
In this webinar, I will give you an overview of how to collect smartphone data ethically and transparently with an Android app. The app we used is called IAB-SMART and collected various sensor data over a period of six months. Besides showing how to design the recruitment process, I will provide an overview of the data collected and its various hidden error sources along the lines of the Total Survey Error (TSE) framework.