Doctors Data Analysis Using Web Scraping

Aims
Publicly available data that the Ärztekammern (medical associations) of the nine Austrian states publish on the web is scraped, processed, and saved in a database on the DEXHELPP research server for further analysis. The data is processed so that questions about the supply effectiveness of doctors in Austria can be answered as conveniently as possible.

Questions that can be answered by database queries include, but are not restricted to:

  • How many general practitioners are there in each district/state/Austria? How many of them have a contract with one of the health insurers, and how many do not?
  • How many doctors with a certain specialty are there in each district/state/Austria? How many of them have a contract with one of the health insurers, and how many do not?
  • What is the female/male ratio of doctors?
  • Which foreign languages, diplomas, additional fields etc. do Austria’s doctors offer?
  • What opening hours do Austrian doctors publish? Where is it possible to consult a doctor on Friday afternoons or on the weekend?
  • How many contracts exist with each health insurer?
  • What proportion of doctors has a contract with any health insurer, and what proportion has none? How do the published opening hours of the two groups compare?
  • Where are shared offices most common?
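
As an illustration of the first question, the following minimal sketch counts general practitioners per district, split by contract status. The table name doctor, its columns, and the sample rows are hypothetical, since the actual database schema is not described here. An in-memory SQLite database stands in for the project database so that the example is self-contained.

```python
import sqlite3

# Hypothetical schema and sample rows; the real database layout is not specified.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE doctor ("
    "doctor_id INTEGER, district TEXT, specialty TEXT, has_contract INTEGER)"
)
conn.executemany(
    "INSERT INTO doctor VALUES (?, ?, ?, ?)",
    [
        (1, "Baden", "general practitioner", 1),
        (2, "Baden", "general practitioner", 0),
        (3, "Baden", "dermatology", 1),
        (4, "Melk", "general practitioner", 1),
    ],
)


def gp_counts_by_district(connection):
    """Return {district: (with_contract, without_contract)} for general practitioners."""
    rows = connection.execute(
        """
        SELECT district,
               SUM(CASE WHEN has_contract = 1 THEN 1 ELSE 0 END) AS with_contract,
               SUM(CASE WHEN has_contract = 0 THEN 1 ELSE 0 END) AS without_contract
        FROM doctor
        WHERE specialty = 'general practitioner'
        GROUP BY district
        ORDER BY district
        """
    ).fetchall()
    return {district: (with_c, without_c) for district, with_c, without_c in rows}


gp_counts = gp_counts_by_district(conn)
```

The CASE expressions are used instead of a database-specific shortcut so that the same query text should also be usable on other SQL databases, assuming a comparable table exists there.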

Methods
The relevant content of each web page of the Ärztekammern (medical associations) of the nine Austrian states is scraped using Java and Selenium and saved in a hierarchically structured XML file. The separate XML files are preprocessed and joined into one Austria-wide file in Python, and important information is then added and/or standardized. By using allocation maps

  • offices can be assigned to a district by means of the scraped postal code
  • offices and doctors can be related to a certain specialty
  • the scraped textual information about existing contracts with health insurers can be mapped to contracts with the relevant insurers BVA, GKK, SVA, SVB and/or VAEB
  • scraped information about whether an office in Lower or Upper Austria is shared can be processed
  • shared offices can be detected automatically for the other states
  • opening hours per day and per week can be calculated when the data is available on the web.
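
The joining and enrichment steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the element and attribute names (state, office, postal_code, contracts), the entries of both allocation maps, and the format of the opening hours are all hypothetical, because the structure of the scraped XML files is not described here.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical allocation maps; the real maps are far larger.
POSTAL_CODE_TO_DISTRICT = {"2500": "Baden", "4020": "Linz"}
INSURERS = {"BVA", "GKK", "SVA", "SVB", "VAEB"}
CONTRACT_TEXT_TO_INSURERS = {"alle kassen": INSURERS, "keine kassen": set()}


def merge_state_files(xml_strings):
    """Join the per-state XML documents under a single <austria> root."""
    root = ET.Element("austria")
    for xml_string in xml_strings:
        root.append(ET.fromstring(xml_string))
    return root


def insurers_from_text(text):
    """Map free-text contract information to a set of insurer codes."""
    key = text.strip().lower()
    if key in CONTRACT_TEXT_TO_INSURERS:
        return set(CONTRACT_TEXT_TO_INSURERS[key])
    # Fallback: pick out insurer codes that are named directly in the text.
    return set(re.findall(r"[A-Z]+", text.upper())) & INSURERS


def enrich(root):
    """Add district and standardized insurer codes to every <office> element."""
    for office in root.iter("office"):
        district = POSTAL_CODE_TO_DISTRICT.get(office.get("postal_code"), "unknown")
        office.set("district", district)
        insurers = insurers_from_text(office.get("contracts", ""))
        office.set("insurers", ",".join(sorted(insurers)))
    return root


def weekly_hours(opening_hours):
    """Sum weekly opening hours from {day: [(start, end), ...]} with 'HH:MM' times."""
    def to_minutes(hhmm):
        hours, minutes = hhmm.split(":")
        return int(hours) * 60 + int(minutes)

    total = sum(
        to_minutes(end) - to_minutes(start)
        for intervals in opening_hours.values()
        for start, end in intervals
    )
    return total / 60


# Invented sample input for two states.
lower = "<state name='Lower Austria'><office postal_code='2500' contracts='Alle Kassen'/></state>"
upper = "<state name='Upper Austria'><office postal_code='4020' contracts='BVA, SVA'/></state>"
austria = enrich(merge_state_files([lower, upper]))
offices = list(austria.iter("office"))
hours = weekly_hours({"Mon": [("08:00", "12:00")], "Fri": [("14:00", "16:30")]})
```

Keeping the maps as plain dictionaries separate from the processing functions means that a new postal code or a new wording of the contract text only requires a new map entry, not a code change.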

Finally, all the data is saved in a PostgreSQL database on the research server.

This process is repeated periodically, so that changes over time in the answers to the relevant questions can be investigated. The results are visualized in the Versorgungsatlas.
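
Because the process is repeated, each run can be treated as a dated snapshot. The following minimal sketch shows how a time series per district might be built from such snapshots. The record layout (scrape date, district, count) and all numbers are invented for illustration and do not come from the project.

```python
from collections import defaultdict

# Invented snapshot records: (scrape_date, district, number_of_general_practitioners).
records = [
    ("2016-07-01", "Baden", 42),
    ("2016-01-01", "Baden", 40),
    ("2017-01-01", "Baden", 41),
]


def series_by_district(snapshot_records):
    """Group snapshot values by district, ordered by scrape date."""
    series = defaultdict(list)
    # ISO dates (YYYY-MM-DD) sort chronologically when sorted as strings.
    for scrape_date, district, count in sorted(snapshot_records):
        series[district].append((scrape_date, count))
    return dict(series)


def changes(series):
    """Return the change in value between consecutive snapshots."""
    return [
        (later_date, later_count - earlier_count)
        for (_, earlier_count), (later_date, later_count) in zip(series, series[1:])
    ]


baden_series = series_by_district(records)["Baden"]
baden_changes = changes(baden_series)
```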

