Last week the Digital Methods Initiative (DMI) at the University of Amsterdam held its 2013 Winter School. The theme of this 2013 session was “Data Sprint: The New Logistics of Short-form Method”, a reference to carrying out a research project with a digital component over a very short period. In a way, this theme is nothing new for the DMI in 2013, since it is the form that most of their events take, following these steps:

  1. Forming groups, usually at an event such as the Winter School or at meetings with other researchers
  2. Searching for data / reusing existing data
  3. Refining research questions / letting new ones emerge
  4. Developing tools / using and adapting existing tools
  5. Producing results and visuals, and presenting them to the public

At the DMI Winter and Summer Schools, these steps are usually completed within a few days (2 days for the 2013 session). The advantages are that the mix of skills and research interests makes many projects feasible, and that participants benefit from an environment that readily brings out new research perspectives. Finally, the excitement of the approaching deadline makes many things possible, sometimes unexpectedly. The drawbacks, however, are that very ambitious hypotheses cannot be tested, for fear of having no results for the collective presentation, and that the work relies very heavily on existing data, tools and processes (even though many tools are developed ad hoc, precisely in order to extract new data sources).

As always, the 2013 edition was extremely prolific: I will let you browse the various projects that came out of this “datasprint”. For my part, I spent these few days working with brilliant colleagues on a very interesting project on alternative metrics to publication for visualizing scientific activity, an emerging field that goes by the sweet name of “post-scientometrics”. Stay tuned, a blog post is coming soon!

Anne Helmond (PhD Candidate), Professor Jill Rettberg, Dr. David M. Berry, and Dr. Jean-Christophe Plantin. Not pictured: Erik Borra, PhD Candidate (Amsterdam January, 2013). Source

These meetings were also an opportunity to draw a parallel with a similar form of knowledge creation: the booksprint. The idea is to bring together, for a short period, a small number of people who share knowledge of the same subject and have them produce a book, from writing to printing. Although the venture seems impossible in such a short time, it has proven itself for several years in the free software community (for example the FLOSS manuals) as a way to write the technical manuals, tutorials and other documentation that developers are generally reluctant to produce. Beyond software, there are also examples of booksprints applied to a book on new media aesthetics or to a handbook on drafting oil industry contracts.

“New Aesthetics, New Anxieties”, written during a 5-day booksprint from 17 to 21 June 2012.

David Berry, a Digital Media researcher at Swansea University, who recently came to France for a seminar as part of the SACRED research programme, shared his experience on the subject. The steps of a booksprint are as follows:

  1. Brainstorming and drafting the outline
  2. Structuring the book: dividing it into chapters and allocating tasks. Ideally, these first two steps should be completed on the first day, to leave as much time as possible for writing
  3. The writing itself. Software designed to facilitate collaborative work can be used
  4. Layout and possible revisions to the content: once a substantial amount of content has been written, the final structure of the book usually needs to be revisited, because the outline has generally shifted since the first day. This step ends with an overall proofreading, usually collaborative, followed by corrections and layout
  5. Publishing the book (usually a PDF, but it is apparently easier to motivate the troops by proposing to write a book in 5 days rather than a digital file)

David Berry situating the booksprint within contemporary developments in academic production (credit: Anne Helmond)

David Berry’s account was complemented by a video interview with Adam Hyde, a book sprint “facilitator” with more than fifty of them to his credit (he details his methodology here and in a video here). He shared several insights into the key role the facilitator must play during a booksprint: above all, the facilitator is there to encourage everyone to collaborate, to manage participants’ stress and possible disagreements between authors over the content or the form of the collaborative work, and to keep intrusions from the outside world to a minimum. He also described the work needed on the material conditions: a quiet place without too many temptations (for example a castle or a country house), a large supply of coffee, and appealing food (he mentioned one booksprint that had hired a private cook).

A particularly difficult point is getting authors to take risks and to dare to engage in collective discussion and reflection so that new ideas can emerge. Both are especially hard to manage under such a pressing deadline, when everyone’s reflex may be to fill in their own part as quickly as possible. Collaborative production in a limited time can run up against the lack of a culture of collaboration, but also against the difficulty of committing to a production whose whole process one does not feel in control of, which tends to frighten academics in particular.


Hal Varian, chief economist at Google, was already saying it in 2009:

I keep saying the sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s?

This post aims to present the profile of the data scientist, a term encountered more and more often at the crossroads of practices around open data, big data and the digital humanities.

Data what?

Faced with the large masses of data available online, the term data science attempts to cover the set of skills needed to acquire, process and analyse data.

As Drew Conway puts it in answer to the question “what is data science” on Quora:

(…) data science most often refers to the tools and methods used to analyze large amounts of data.  As such, the discipline is an amalgamation of many bits from other areas of research.  For tools, the influence primarily comes from computer science, where issues of algorithmic efficiency and storage scalability form the main focus.  For analysis, however, the influences are much more varied. Modern methods are borrowed from both the so-called hard sciences (physics, statistics, graph theory) and the social sciences (economics, sociology, political sciences, etc).  Specific classes of techniques that are naturally interdisciplinary are also very popular, such as machine learning.

What skills do data scientists need?

Hal Varian also details his view of the data scientist’s work:

The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.

Similarly, the blog of the company Dataspora highlights three tasks that make up data science work:

  • Statistics: being able to analyse a large data set statistically;
  • Data munging: aka “the painful process of cleaning, parsing, and proofing one’s data before it’s suitable for analysis”;
  • Data visualization: the visual rendering of the work on the data, using programming languages such as R for static visualizations, or dynamic visualization tools such as Processing.

One could object that analysis runs through the whole process rather than being confined to the first step: acquiring and cleaning the data (step 2) and visualizing them are precisely meant to make the data easier to analyse once they have been made more readable.

 

Finally, DJ Patil, who coined the term data science with Jeff Hammerbacher, sums up the steps of the data scientist’s work in an interview for the O’Reilly website:

  • Finding rich data sources.
  • Working with large volumes of data despite hardware, software, and bandwidth constraints.
  • Cleaning the data and making sure that data is consistent.
  • Melding multiple datasets together.
  • Visualizing that data.
  • Building rich tooling that enables others to work with data effectively.

Data science thus calls on a multitude of skills, usually split between statistician, designer and programmer, as Nathan Yau notes on FlowingData:

Statisticians should know APIs, databases, and how to scrape data; designers should learn to do things programmatically; and computer scientists should know how to analyze and find meaning in data.

This shift in professional skills is also mischievously noted by Sébastien Heymann, who leads the Gephi project:

Criticisms of the term

Many reactions concern the very term data science as a name for this practice. Several commentators point out the absurdity of the term: data are a material of scientific practice and cannot become its focal point, as Drew Conway recalls:

First, the term “data science” is a misnomer with respect to what most people consider endeavors classified as such.  Fundamentally, “science” is about formalizing a hypothesis given a reasonable set of observations and assumptions, designing an experiment around that hypothesis, testings it and analyzing the data generated through that process to either confirm or falsify the hypothesis.  Therefore, “data” is simply a natural byproduct of science.  Very (very) rarely are things labeled as data science actually scientific.

In addition, other views tend to restrict the term to acquiring and cleaning data, excluding the practice of data visualization, as Flip Kromer puts it:

A set of tools to expose insight or make predictions by drawing on the data’s structure rather than primarily its content.

Third, Jérôme Denis stresses that data, whatever their origin, are never raw. He develops this point in a comment on a post on the site Internet actu about an article on the opening of public data; the observation nevertheless also holds for data from the Web and for big data. Drawing on the contributions of STS, he points out that:

Data are always addressed: they answer questions and equip specific activities.

Finally, Harlan Harris, in a presentation entitled “what is data science anyway?”, also disputes the novelty of the term: he cites the existence of a Journal of Data Science dating from 2003. On this point, Gil Press also offers an archaeology of data science practices.

The programmer Pete Warden agrees with many of the criticisms levelled at the term data science: it is not a real science, the term is incongruous, and it covers a diversity of practices and points of view. However, he argues in O’Reilly Radar that the term, with all its flaws, constitutes a “boundary object” that allows a set of disparate professions to communicate and act together:

We need a term to describe this movement, so we can create job ads, conferences, training and books that reach the right people. Those goals might sound very mundane, but without an agreed-upon term we just can’t communicate.

PS. A pearltree of online resources on data science is available here.


At the conference Homeland Connections: E-Diasporas Atlas / A century of transnationalism, which closed the research project TIC et Migration, several points were raised about building and visualizing corpora of websites, placing the replicability of corpora at the centre of the discussion. The theme was judiciously addressed both at the level of the crawl that builds the corpus of websites and at the level of the research questions specific to each researcher. The discussion reported here concerns the analysis of online diaspora networks; however, the comments and methodological considerations go beyond this case and apply to a range of research topics that rely on Web cartographies.

“Do different crawlers produce different results on the same corpus?”

Mathieu Jacomy (Médialab Sciences-po) presented an experiment he carried out with Erik Borra (Digital Methods Initiative, University of Amsterdam) to find out whether different crawlers produce similar results on the same corpus. The three crawlers tested were the navicrawler from Webtlas, the issuecrawler from the Digital Methods Initiative and the Linkfluence crawler. There were also three sites: a static site, a dynamic site and a site “in between”.

Several iterations of the crawl were launched at various intervals, from one day to two weeks. The 54 crawls carried out brought out an important fact: different crawlers do not produce the same results with their default settings; conversely, crawlers configured according to the nature of the sites in the corpus produce similar results. Mathieu and Erik found that the crawls differ even more between iterations of the same unconfigured crawler than between different crawlers: in other words, an unconfigured crawler disagrees with itself even more than with another crawler. These differences stem from the characteristics of the crawled sites: for example, for a corpus containing many portals, the crawler’s settings (distance and depth) have to be adjusted in order to get past this “barrier”.

This small experiment on the characteristics of different crawlers highlights the need for researchers to adapt the settings of their crawler to the sites that make up their corpus, thereby reducing the risk of biasing the results.
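One way to put a number on how much two crawls agree is the Jaccard similarity between the sets of URLs they return. The sketch below is only an illustration: the crawl names and URLs are hypothetical, and nothing here says this is the measure Mathieu Jacomy and Erik Borra actually used.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of URLs common to two crawls (1.0 = identical, 0.0 = disjoint)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_overlap(crawls):
    """Jaccard similarity for every pair of named crawls."""
    return {
        (name_a, name_b): jaccard(crawls[name_a], crawls[name_b])
        for name_a, name_b in combinations(sorted(crawls), 2)
    }


# Hypothetical crawl results (sets of discovered URLs), not the experiment's data.
crawls = {
    "crawler_a_run1": {"site.org/", "site.org/about", "site.org/news"},
    "crawler_a_run2": {"site.org/", "site.org/news", "site.org/archive"},
    "crawler_b_run1": {"site.org/", "site.org/about", "site.org/news"},
}

for pair, score in pairwise_overlap(crawls).items():
    print(pair, round(score, 2))
```

Comparing the scores between runs of the same crawler with the scores between different crawlers gives a rough picture of which source of variation dominates.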

“Do different researchers using the same method produce different corpora?”

The question of replicability was also addressed by Anat Ben David (Bar-Ilan University) and Priya Kumar (University of London), this time from the angle of the influence of the research question on the corpus. The two researchers had each worked independently on the online Palestinian diaspora: learning of each other’s existence at the end of their respective projects, they were able to compare their corpora, not without the apprehension of seeing one’s work invalidated, as Anat Ben David admitted.

The two corpora share a substantial number of URLs, which offers a way to check the validity of the corpora, both of which were built by manual exploration. However, each corpus tends to favour particular categories of actors, because of the specific research questions: Anat Ben David was interested in the emergence of a Palestinian Web sphere, and more precisely in the geography of this diaspora, whose distinctive feature is to have no State of reference. Priya Kumar, meanwhile, focused more on the types of online activities of the members of this diaspora.

The specificity of the research questions is then reflected in the variables chosen to analyse the corpora. First, those of Anat Ben David:

Priya Kumar’s categories:

This adaptation of the corpus to the researcher’s research questions is also reflected in the way the actors making up the corpus are divided up. First, Anat Ben David:

Priya Kumar’s categorisations of actors:

Adapting each corpus to its researcher’s research questions tends to make it difficult to compare the different corpora of diaspora websites. The e-diaspora event was accompanied by the release of a printed atlas bringing together all the site corpora of the different diasporas, which almost intuitively invites comparison. However, the criteria for selecting the websites of a corpus can vary greatly between researchers: the researcher Emmanuel Ma Mung Kuang (CNRS), for example, restricted his corpus to websites by and for overseas Chinese, thus excluding official sites, sites in China or Taiwan, sites for overseas Chinese but not made by them, and Chinese-language sites; another way of selecting sites is to set a number of links pointing to a site, taken as an indication of its importance, and use it to select the sites that make up the corpus (for example, excluding all sites with fewer than five inbound links). These two ways of building a corpus differ between researchers, which makes it difficult to compare the different online diasporas. Note, however, that common criteria were put forward in the atlas, such as “regional components”, “activism” and “incipient diaspora”.
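The inbound-link criterion mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the crawl is available as a list of (source, target) pairs; the function name `filter_by_inlinks`, the choice to count distinct linking sites and the sample edges are all hypothetical.

```python
from collections import defaultdict


def filter_by_inlinks(edges, min_inlinks=5):
    """Keep sites that receive links from at least `min_inlinks` distinct other sites."""
    sources = defaultdict(set)
    for source, target in edges:
        if source != target:  # a site linking to itself does not count
            sources[target].add(source)
    return {site for site, linking in sources.items() if len(linking) >= min_inlinks}


# Hypothetical edge list: (linking site, linked site).
edges = [
    ("a", "hub"), ("b", "hub"), ("c", "hub"),
    ("a", "b"),
    ("hub", "hub"),
]

print(filter_by_inlinks(edges, min_inlinks=2))
```

Changing `min_inlinks` changes which sites survive, which is one concrete reason why two corpora built on the same topic with different thresholds are hard to compare.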

In any case, maps of websites, like geographical maps (as I have tried to show in this article), rest on a principle of excluding elements for the sake of feasibility, visibility and analysis. As Franck Ghitalla (who moderated this panel and has developed this point here) reminded us, the map does not represent reality but the cut a researcher makes in a material in order to carry out an analysis, even though, paradoxically, the map tends to suggest that the phenomenon represented is covered exhaustively.


The release of nuclear radiation after the explosions at several reactors of the Fukushima I Power Plant triggered various actions from civil society: some actors scraped and refined official data in order to publish them in a structured format (as did the German designer Marian Steinbach or the ad hoc group radmonitor311); other actors decided to use Geiger counters in order to provide alternative radiation readings (such as Safecast); various data feeds were plugged into platforms such as Pachube, triggering various remixes of these data; finally, several maps were created, displaying either official sources (such as the Institute for Information Design Japan) or both official and alternative sources, e.g. by using Pachube feeds which aggregated multiple data sources (such as Spurs or Failedrobot). (I have described this process of map making extensively here and in French here).

Do these alternative or scraped data and radiation mapping mashups occupy a specific position within the online debate about the location and level of nuclear radiation after Fukushima? Where were these actors located in the online issue-network (Rogers & Marres, 2008) about post-Fukushima radiation? Did they appear as specific data and information providers for the other actors involved in the debate?

Methodology

In order to analyse the position of alternative voices online, it was first necessary to have a big picture of the online debate. What was the geography of the online debate about radiation? The graph below gives such a big picture: every node is a website taking part in the radiation issue, i.e. producing, using or debating the nature, the location and the level of radiation. This graph was built using the Firefox plugin Navicrawler, which crawls the links between websites; the websites were categorized manually after exploring them and reading their content. The result of this exploration is visualised with the software Gephi.

Figure 1. The size of the nodes shows their authority score, calculated using the HITS algorithm (Kleinberg, 1999). The spatialization algorithm is Force Atlas 2. The color of the edges shows the source of the link.

Categories of actors involved

The graph highlights the presence of various communities of actors. A civil society sphere forms a cluster on the bottom left side of the graph, and is composed as follows:

  • Mother and children defence organizations, providing information about how to protect this particularly affected population;
  • Citizen and neighbour defence groups, for whom the debate usually revolves around local issues such as food consumption and the evacuation of people;
  • Anti-nuclear activists, a subgroup of ecological associations calling for an end to all nuclear production;
  • Independent bloggers who aggregate various websites and resources and comment on the news.

On the other side of the graph (top right) is what can be described as the official sources sphere. It consists of:

  • The various ministries, with the MEXT as the main hub, which is responsible for radiation monitoring;
  • Prefectures, producing local readings or referring to the government;
  • Industries, such as TEPCO, the company operating the Fukushima Dai-ichi power plant;
  • International organizations such as the IAEA;
  • Japanese Universities.

In the middle and top left stands the “Geiger sphere”, which includes the various actors who produced Geiger counter data or refined official data; the mapping mashup sphere consists of a selection of 16 maps which act as means to visualize the radiation data.

Graph analysis

1. Territorial occupations

Looking at the issue topology, the debate appears to be highly polarized between the official actors on the top right side and civil society members on the other side. As the table below shows, official sources have a much higher clustering coefficient than civil society members: the three most clustered communities are the three main official source providers, i.e. industries, ministries and prefectures. These websites constitute tight communities with abundant internal linking.

    Categories                          Average clustering coefficient
 1. Industries                          0.612
 2. Governmental sources                0.473
 3. Prefectures                         0.349
 4. Anti-nuclear activists              0.246
 5. Universities                        0.213
 6. Mapmaking sphere                    0.193
 7. Mapping mashup                      0.178
 8. Children and Mother defence group   0.155
 9. Neighbour group                     0.113
10. International organizations         0.0
11. Independent bloggers                0.0
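For readers who want to see what such a per-category figure measures, here is a minimal sketch. It assumes that links are treated as undirected and that the local clustering coefficient of a node is the fraction of pairs of its neighbours that are themselves linked; the statistic reported by Gephi may be computed differently, and the node names and categories below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def local_clustering(adjacency, node):
    """Fraction of pairs of `node`'s neighbours that are themselves linked."""
    neighbours = adjacency[node] - {node}
    k = len(neighbours)
    if k < 2:
        return 0.0
    linked = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return linked / (k * (k - 1) / 2)


def clustering_by_category(edges, category):
    """Average local clustering coefficient of the nodes in each category."""
    adjacency = defaultdict(set)
    for u, v in edges:  # assumption: links are treated as undirected
        adjacency[u].add(v)
        adjacency[v].add(u)
    scores = defaultdict(list)
    for node, cat in category.items():
        scores[cat].append(local_clustering(adjacency, node))
    return {cat: sum(vals) / len(vals) for cat, vals in scores.items()}


# Hypothetical toy graph: three ministries linking to each other, one blogger.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
category = {"a": "ministries", "b": "ministries", "c": "ministries", "d": "bloggers"}

print(clustering_by_category(edges, category))
```

A category whose members mostly link to each other, like the toy ministries here, ends up with a high average, while isolated nodes such as a lone blogger score zero.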

While the online debate about radiation appears to be highly polarized between two opposite poles, it also appears to be dominated by official sources. A ranking analysis was performed using the HITS algorithm. The following table displays the top 20 authority scores returned by the algorithm, starting from the highest score:

  URL category Authority score
1. http://www.mext.go.jp government 0.029769959
2. http://www.meti.go.jp government 0.019485792
3. http://www.kantei.go.jp government 0.01894452
4. http://www.tepco.co.jp industries 0.018132612
5. http://www.mhlw.go.jp government 0.017861975
6. http://www.maff.go.jp government 0.015426252
7. http://www.nirs.go.jp government 0.013531799
8. http://www.mofa.go.jp government 0.013261164
9. http://www.soumu.go.jp government 0.012719892
10. http://www.mlit.go.jp government 0.012449256
11. http://www.nsc.go.jp government 0.012178619
12. http://www.env.go.jp government 0.011907984
13. http://www.nisa.meti.go.jp government 0.011637348
14. http://wwwcms.pref.fukushima.jp prefectures 0.011366712
15. http://www.pref.miyagi.jp prefectures 0.011366712
16. http://www.jaea.go.jp government 0.010825439
17. http://www.enecho.meti.go.jp government 0.010013532
18. http://www.cao.go.jp government 0.00947226
19. http://radioactivity.mext.go.jp government 0.0092016235
20. http://www.chuden.co.jp industries 0.008930988
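As a reminder of what an authority score is, the following sketch implements the basic HITS iteration in plain Python: a page is a good authority if good hubs link to it, and a good hub if it links to good authorities. It is an illustration only: scores are normalised here so that they sum to 1 (an assumption; Kleinberg's paper and individual tools may normalise differently), and the edge list is hypothetical.

```python
def normalise(scores):
    """Scale scores so that they sum to 1 (left unchanged if all are zero)."""
    total = sum(scores.values())
    if total == 0:
        return scores
    return {node: value / total for node, value in scores.items()}


def hits(edges, iterations=50):
    """Return (authority, hub) scores for a directed edge list."""
    nodes = {node for edge in edges for node in edge}
    auth = {node: 1.0 for node in nodes}
    hub = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages linking to the node.
        new_auth = {node: 0.0 for node in nodes}
        for source, target in edges:
            new_auth[target] += hub[source]
        # Hub: sum of the updated authority scores of the pages the node links to.
        new_hub = {node: 0.0 for node in nodes}
        for source, target in edges:
            new_hub[source] += new_auth[target]
        auth = normalise(new_auth)
        hub = normalise(new_hub)
    return auth, hub


# Hypothetical links: (linking site, linked site).
edges = [
    ("blog1", "ministry"), ("blog2", "ministry"),
    ("blog1", "map"), ("ministry", "agency"),
]

authority, hub = hits(edges)
for site in sorted(authority, key=authority.get, reverse=True):
    print(site, round(authority[site], 3))
```

In this toy example the site that receives links from both blogs comes out as the top authority, while the blogs, which nobody links to, have an authority score of zero.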

2. Maps in the graph

Let us now have a closer look at the authorities within the selection of maps:

  Label Authority
1. http://atmc.jp 0.00405954
2. http://japan.failedrobot.com 0.003518268
3. http://www.nnistar.com 0.002976996
4. http://www.naver.jp/radiation 0.0018944519
5. http://blog.safecast.org 0.0018944519
6. http://fukushima-radioactivity.jp 0.0018944519
7. https://www.targetmap.com/viewer.aspx?reportId=6329 0.001082544
8. http://radiation.yahoo.co.jp 0.001082544
9. http://radiation.goo.ne.jp 0.000811908
10. http://www.earthspiral.jp 0.000811908
11. http://labs.geigermaps.jp 0.000541272
12. http://jciv.iidj.net 0.000541272
13. http://arch.inc-pc.jp 0.000541272
14. http://radiation.crowdmap.com 0.000541272
16. http://www.spurs.jp 0.0

How can one explain such a ranking? It is noticeable that maps created by companies are not in the top positions, even though they could be expected to benefit from the high rank of their hosting website: Naver appears in 4th position, and Yahoo and Goo in 8th and 9th position. Furthermore, the maps at the top do not have the biggest data sets: for example, the map from the Institute for Information Design Japan has a very extensive data set and is only in 12th position; conversely, ATMC has only one source of data (MEXT), with poor granularity, since it provides only one measurement per prefecture. Nor does it appear to be a matter of the type of data: the top 5 includes maps with each of the three types of sources, i.e. only official data, only Geiger data, and both. The sole unifying criterion that explains such a ranking is the updating frequency: the 4 maps at the top update their data daily, getting closer to real-time maps. On the other hand, the maps with a low updating frequency appear at the bottom of the list.

3. Linking practices between spheres of actors

What about the relations between the categories of actors? On the graph, the Geiger sphere actors are represented in yellow and the map websites in red. They occupy a central place in the graph, between the two sides of the debate. As we saw earlier, in addition to producing original readings, they also scrape official data: can they be considered as mediating actors between civil society and official actors? Are they creating a dialogue between them, taking data from one side to bring it to the other? The next figure shows alternative views of the URLs in figure 1:

Figure 2. The top figure contains all the URLs of the issue, grouped by category; the size of a node depends on the number of websites it contains; the size of an edge shows the number of links from one node to the other. The color of the edge shows the source of the link. The three figures below specifically highlight relations among groups of actors: official sources, civil society websites and the mapmaker sphere. Outgoing links are shown in red, incoming links in blue and reciprocal links in yellow.

It is possible to draw various conclusions from these circular views:
  • Figure 2 shows a triangle between ministries, prefectures and industries, and to a lesser extent international organizations: official sources link abundantly to each other, but also exclusively: they do not link to any websites other than their own.
  • The civil society members, on graph 3, also show intensive reciprocal linking practices between their spheres, as highlighted in yellow; but they also link abundantly to the official sources, and to a lesser extent to the Geiger sphere (in yellow). The reference data source for civil society does not appear to be alternative or scraped radiation data, but mostly official sources.
  • The mapmaker sphere naturally links to the mapping mashups, as they constitute its means of visualization; it also links to the official sources to acknowledge the data sources when scraping them, and links to a lesser extent to the civil society actors. This open linking policy explains why it occupies a central position within the graph. However, this openness does not appear to be reciprocated: these actors do not appear to constitute authorities within the debate, capable of competing with the official sources that occupy the top positions (cf. the Authorities table). More importantly, they do not appear to constitute an alternative data provider that civil society members could use to contradict statements from official sources. These organizations still mostly link to official sources as data providers: looking simultaneously at these cross-sphere links and at the issue network graph, it appears that civil society members literally step over the mapmaker sphere in order to link to the official data providers; the Geiger sphere therefore remains an autonomous sphere, relatively independent from the two sides of the debate.
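The kind of reading made from these circular views can be approximated by counting links between categories. The sketch below is a hypothetical illustration, not the procedure used to draw the figures: it assumes an edge list of (source, target) sites and a manual category for each site, and the sample data are invented.

```python
from collections import Counter


def links_between_categories(edges, category):
    """Count hyperlinks from each source category to each target category."""
    counts = Counter()
    for source, target in edges:
        counts[(category[source], category[target])] += 1
    return counts


def internal_share(counts, group):
    """Share of a category's outgoing links that stay inside the category."""
    outgoing = sum(n for (src, _), n in counts.items() if src == group)
    if outgoing == 0:
        return 0.0
    return counts[(group, group)] / outgoing


# Hypothetical sites and categories.
category = {
    "mext": "official", "tepco": "official",
    "blog": "civil society", "map": "mapmakers",
}
edges = [
    ("mext", "tepco"), ("tepco", "mext"),
    ("blog", "mext"), ("blog", "tepco"),
    ("map", "mext"), ("map", "blog"),
]

counts = links_between_categories(edges, category)
for (src, dst), n in sorted(counts.items()):
    print(f"{src} -> {dst}: {n}")
print("official links staying internal:", internal_share(counts, "official"))
```

A category that links only to itself, as the official sources do in the analysis above, has an internal share of 1, whereas a more open category spreads its links across several others.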

Conclusion

The crawl of this issue online was performed in August 2011: as the Web is a very fast-moving environment, the online topology of this issue is most likely different today; furthermore, Fukushima is still a very “hot controversy”, as Venturini (2010) would say, and the Japanese public debate about the Fukushima aftermath is still going on: this graph must therefore be considered as a snapshot of the online situation at the moment it was made, not as a definitive picture.

However, it is still possible to see that the online debate about post-Fukushima radiation appears to be dominated by official sources, which constitute dense communities linking exclusively to other official sources; on the other side of the controversy, civil society members also interlink in order to occupy the web territory. The Geiger data sphere and the maps, on the other hand, stay in their own sphere. There are close relationships between data providers, refiners and mappers: they can even be the same entity, as with Safecast. In other words, data providers are the first “consumers” of their own data when creating maps, which reinforces closed-circuit relationships and hyperlinking practices. Furthermore, within the websites considered, these radiation maps did not appear to sustain an “empowerment” process, feeding an alternative position within the controversy and enabling the people using them to gain a voice loud enough to compete with the official sources, which still have the loudest voice in the debate.

However, another limitation of this work is that it only takes hyperlinking practices into account, while individuals and groups have other ways of communicating and publishing information and data that do not appear on such a graph (e.g. participating in a Google group); furthermore, much of the effort to spread different accounts of the post-Fukushima radiation situation takes place offline, and would therefore need to be studied through fieldwork.

You can download the full list of URLs as a CSV file and the graph in figure 1 as a GEXF file.

PS. I would like to thank Franck Ghitalla and his IC05 students, as well as the Digital Methods Initiative members, especially Richard Rogers, Bernhard Rieder and Erik Borra for their insightful comments throughout this work.

Update 2 (17/03/2012): a shorter version of this blog post was published in the online journal Berliner Gazette as part of their great theme Fukushima / 11 März; they also organized a very insightful symposium called Learning from Fukushima in October 2011, where I had the chance to present earlier versions of this graph. Another article was published in the online and print versions of the German weekly newspaper der Freitag (15/03/2012 issue), with another visualisation and a website list.

 

 

This post is freshly written after another three-day escapade I had the pleasure of making to the Digital Methods Initiative at the University of Amsterdam. The title comes from a remark by Bernhard Rieder, stating that the Internet is becoming more and more like a piece of software and less and less like a documentary system. As a consequence, it is becoming less and less user-readable: one needs to dig deeper than HTML pages in order to see the mechanics of the internet. This remark is the building block of the present blog post, where I present APIs as a resource for doing online research. This post is largely based on the paper presentations, talks and workshops at the 2012 DMI winter school, where this year’s sexy theme was interfaces for the cloud.

Let us start with the various conceptions that were successively put forward to describe the Web:

  1. The Web as a navigational space: the Web is something where you can navigate from one page to another. It has a topology made of websites and pages, linked together by hyperlinks. As a user, you surf from one page to another within this document space. This conception of the Web triggered various spatial conceptions of the Web as an alternative and mappable space, i.e. cyberspace.
  2. The Web as a platform: that was the big promise of the Web 2.0 age: the Web is not only a succession of static pages, but a succession of platforms that you can build upon. The emphasis here is on user participation, where one can add pages, create links between pages, add content to platforms, and so on.
  3. The end of the Web? Much has been said about the decline of the Web: other information distribution models, e.g. those based on applications, offer other ways to find information or access a website without navigating between pages or starting from a browser.

Another way to look at the end of the Web is by looking at how the hyperlink evolved in its functions and structure. On this matter, Anne Helmond’s presentation clearly described the fact that hyperlinks are no longer manual links: linking as a manual practice, with a webmaster linking one page to another, is no longer at the centre of the stage. On the contrary, a further algorithmisation of the hyperlink is taking place with the increasing presence of social buttons and widgets, which pre-configure hyperlinking practices: if the act of creating a link (by liking or digging) between two pages still keeps its share of user participation, it is encapsulated in applications that take care of the destination of the link. To put it differently, and borrowing from a blog post by Olivier Ertzscheid, the like will kill the link.

Another way to look at the algorithmization of the Web is by taking a look at the increasing place of APIs. APIs are not a new invention (Bernhard Rieder, in his archeology of APIs, pointed at the Google SOAP API); however, they were mostly business-to-business solutions: for this reason they had strong integrity requirements, which made them not really handy. By contrast, the web APIs developed online today are much lighter and look like “APIs for the masses” compared with the previous versions.

In the same talk, Bernhard opened up with various conceptions of APIs and the kinds of research perspectives they could bring:

In this perspective, as Esther Weltevrede clearly showed in her presentation (based on collaborative work with Noortje Marres), APIs appear as the nice and polite way of doing research, as opposed to the “punkish” way that scraping is. Scraping raises various obvious legal issues, but also technical ones, as detailed by Dick Hall, Infochimps business development manager, interviewed by Audrey Watters: the acceptable use of a website (i.e. the number of times a scraper can visit it) is defined by its Terms of Service, but these differ from one service to another and may change unilaterally from time to time, making scraping more difficult. Even with these difficulties in mind, APIs and scraping open new possibilities for reaching a real-time sociology.

If APIs provide cleaner access to data, they are not free from criticism: as Boyd and Crawford said in their paper about big data, the representativeness of the set is not always easy to assess. Getting data from a public API does not give you access to the whole firehose, but usually to a very small fraction of all the data. For Twitter research, one has the choice between what they call a spritzer (roughly 1% of public tweets) or a gardenhose (roughly 10% of public tweets).
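To make the representativeness concern more concrete, here is a minimal Python sketch run on purely synthetic data. It assumes, for the sake of illustration, that a sampled stream behaves like a uniform random sample, which is an assumption and not a documented property of Twitter's streams. The function names, hashtags and numbers are all made up.

```python
import random
from collections import Counter

def sample_stream(items, rate, seed=0):
    """Keep each item independently with probability `rate`.

    This is only a stand-in for a sampled API stream (assumed uniform).
    """
    rng = random.Random(seed)
    return [item for item in items if rng.random() < rate]

def estimate_counts(sample, rate):
    """Scale the counts observed in the sample back up to the full stream."""
    return {tag: round(n / rate) for tag, n in Counter(sample).items()}

# Synthetic "full stream": one popular hashtag and one rare one (invented numbers).
full_stream = ["#popular"] * 50_000 + ["#rare"] * 40
random.Random(1).shuffle(full_stream)

for name, rate in [("spritzer", 0.01), ("gardenhose", 0.10)]:
    sample = sample_stream(full_stream, rate)
    print(name, len(sample), estimate_counts(sample, rate))
```

With a low sampling rate, the estimate for the popular hashtag tends to stay close to its true count, while the rare one can be badly over- or under-estimated, or missing altogether. This is one way to see why small or marginal phenomena are hard to study from a sampled stream.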

In parallel with research practices, scraping can be a way of accessing data in a situation of data scarcity. For instance, in the wake of the Fukushima power plant accident, numerous developers and hackers started scraping the official websites in charge of monitoring, as these were not releasing data in a structured format: scraping was then a way to aggregate information that was distributed across many websites, but also to publish structured data feeds (like the one set up by Marian Steinbach, generating a CSV every 10 minutes from the official SPEEDI monitoring data) that could eventually be used to create maps and other visualizations. If access to the data does not create an immediate empowerment situation, let us just say that the black box was opened without asking for the key: can we talk about tactical scraping?
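As an illustration of what turning an unstructured page into a structured feed can look like, here is a minimal Python sketch that converts an HTML table into CSV using only the standard library. It is not Marian Steinbach's actual scraper: the HTML fragment, station names and values are invented, and a real monitoring page would need its own parsing logic and a scheduled download step.

```python
import csv
import io
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text chunks of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def html_table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()

# Hypothetical page fragment: station names and values are invented.
SAMPLE_HTML = """
<table>
  <tr><th>station</th><th>time</th><th>value</th></tr>
  <tr><td>Station A</td><td>2011-03-20 10:00</td><td>0.12</td></tr>
  <tr><td>Station B</td><td>2011-03-20 10:00</td><td>0.08</td></tr>
</table>
"""

print(html_table_to_csv(SAMPLE_HTML))
```

Running such a conversion at a fixed interval and publishing the result is, in spirit, what makes the data reusable by map makers who never have to touch the original pages.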

  • APIs to study the evolution of the Web: this is probably the least common use of APIs: they can act as a piece of evidence while investigating the Web itself: what can APIs tell us about the contemporary Web in terms of properties, modus operandi and uses? That was the question underlying the various projects of the last DMI winter school.

Some project members took a look under the Facebook API rug: a first project tried to identify various possible information gaps in Facebook by cross-checking the FB API against other, more engaged websites, like Opensecrets.org or factcheck.org: Mitt Romney’s FB page was taken as an example of a cross-sourced FB page. Following the same motivation, a second project tried to see the differences between the elements available in the FB application (on the user side) and the data you access through the API (on the developer side), in order to develop various “validity checking” possibilities.

Based on Jean Tinguely’s machines, a third project aimed to show APIs for themselves, in all their nakedness: they used the ready-made API platform IFTTT (standing for If This Then That), which aims to put the internet to work for you, to let various APIs play with each other and joyfully intertwine. After setting up a profile for Jean Tinguely in various online apps and services, they happily let the various APIs talk to each other and trigger snowballing actions: if Twitter message, then (empty) picture on Flickr; if email, then Flickr; and my favorite, if Twitter then Twitter. Funnily enough, according to one of the project members, one of the hardest parts was not to add any content and to keep the APIs working among themselves.

Finally, the last project tried to track the trackers present online in different spheres: to do so, the Firefox plugin Ghostery was repurposed in order to get a list of common online trackers, e.g. widgets, social buttons, analytics, etc. An ad hoc tool was built by Koen Martens and Erik Borra which could identify the trackers present in lists of URLs. We tried to compare the “trackness” of various spheres, e.g. national spheres, adult entertainment websites vs. children’s entertainment websites, technology blogs vs. news blogs, etc. The results were converted to GDF and visualized with Gephi.
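The general idea of detecting trackers in pages and exporting a site-to-tracker network can be sketched as follows. This is not the tool built by Koen Martens and Erik Borra: the pattern list below is a tiny illustrative stand-in (not Ghostery's list), the site names are hypothetical, and the page sources are passed in as strings rather than downloaded.

```python
# Illustrative patterns only: a stand-in, NOT Ghostery's actual tracker list.
TRACKER_PATTERNS = {
    "google_analytics": "google-analytics.com",
    "facebook_widget": "connect.facebook.net",
    "twitter_button": "platform.twitter.com",
}

def find_trackers(html):
    """Return the names of the known trackers whose pattern appears in the page source."""
    return sorted(name for name, pattern in TRACKER_PATTERNS.items() if pattern in html)

def to_gdf(pages):
    """Build a GDF text (bipartite site -> tracker graph) from a {site: html} mapping."""
    sites = sorted(pages)
    edges = [(site, tracker) for site in sites for tracker in find_trackers(pages[site])]
    trackers = sorted({tracker for _, tracker in edges})
    lines = ["nodedef>name VARCHAR,kind VARCHAR"]
    lines += [f"{site},site" for site in sites]
    lines += [f"{tracker},tracker" for tracker in trackers]
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR")
    lines += [f"{site},{tracker}" for site, tracker in edges]
    return "\n".join(lines) + "\n"

# Hypothetical pages, keyed by made-up site names.
PAGES = {
    "news.example": '<script src="//www.google-analytics.com/ga.js"></script>'
                    '<script src="//platform.twitter.com/widgets.js"></script>',
    "blog.example": '<script src="//connect.facebook.net/en_US/all.js"></script>',
    "plain.example": "<p>No third-party scripts here.</p>",
}

print(to_gdf(PAGES))
```

Once such a file is opened in Gephi, the number of tracker nodes attached to the sites of a given sphere gives a rough measure of how tracked that sphere is. Simple substring matching will of course miss trackers that are loaded dynamically.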

And the last one, added by Bernhard Rieder, for which he scraped the ProgrammableWeb mashup repository and created this graph: one can see which APIs are most widely used and which ones are combined. Here again, made with Gephi.
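The counting behind such a graph can be sketched in a few lines of Python: how often each API is used, and how often two APIs appear in the same mashup. This is only an assumed reconstruction of the principle, not Bernhard Rieder's code, and the mashups and API names below are hypothetical.

```python
from collections import Counter
from itertools import combinations

def api_cooccurrence(mashups):
    """Count API usage (node weights) and API pairs used together (edge weights).

    `mashups` maps a mashup name to the list of APIs it combines.
    """
    usage = Counter()
    pairs = Counter()
    for apis in mashups.values():
        unique = sorted(set(apis))  # ignore duplicates, fix the order of each pair
        usage.update(unique)
        pairs.update(combinations(unique, 2))
    return usage, pairs

# Hypothetical mashups: names and API lists are invented for illustration.
MASHUPS = {
    "mashup_1": ["maps", "photos"],
    "mashup_2": ["maps", "microblog"],
    "mashup_3": ["maps", "photos", "microblog"],
}

usage, pairs = api_cooccurrence(MASHUPS)
print(usage.most_common())
print(pairs.most_common())
```

In the resulting graph, the usage counts would typically drive node size and the pair counts edge weight.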

 

Update: Anne Helmond, PhD candidate at the University of Amsterdam and member of the Digital Methods Initiative, published her great introduction to API critique based on the presentation she gave at the winter school, as well as a summary of the collaborative notes about this workshop.

 

I’ve just finished my article for the OII Decade in Internet Time conference, which I’m really happy to be attending next week in Oxford, UK. My paper is about the role maps played in the Fukushima radiation debate. To spice things up a bit, I decided not to focus on the consequences of maps on the debate (pretty hard to measure anyway), but rather on the way mapping mashups were giving shape to public involvement in this topic. I guess that is what happens when you read lots of Foucault and Deleuze about ‘dispositif’!

I started working on this case study 2 months ago, so it is still fresh and needs quite a lot of work, but at least I had a lot of pleasure writing it. It results from many different experiences, thoughts and meetings I had during the summer. The topic of Fukushima started scratching at my mind when I attended the 2011 DMI summer school, where I had the chance to talk a lot about maps, geolocation and web studies: it also gave me quite a lot of tools to tinker with during July and August. Later in the summer, I met Noortje Marres at Goldsmiths College, London, who kindly gave me several interesting hints to think about and authors to read. As I dug into the Fukushima radiation case study, I had the chance to interview over Skype various very interesting people directly involved in the mapmaking or Geiger counter readings: Ed Borden from Pachube, data scrapers Marian Steinbach and Prof. Haruhiko Okumura, data visualiser Andreas Schneider from the Institute for Information Design Japan, and Shunnosuke Shimizu with his sexy map. Finally, to mix pleasure and work, the guys from Safecast at the Tokyo hackerspace were kind enough to invite me to one of their weekly meetings during my vacation in Tokyo, where I could get an inside perspective on their amazing work.

Here is the abstract:

“This article explores how mapping mashup can trigger and shape the public involvement during an online controversy. It suggests to depart from merely focusing on the maps’ impact to rather analyse their specific affordances in shaping the public engagement. This interrelation between maps and public can be analysed in the online debate about radiation levels in Japan after the Fukushima nuclear event in March 2011: various maps were created in order to provide alternative views of the radiation situation. This case study provides the opportunity to analyse how citizens gathered online to fulfill the various steps in radiation mapmaking; the various steps in this process were reconstituted by interviewing the main actors and by using a Web Crawler. Once this interrelation between the public and maps is clarified, the affordances of mapping mashup are characterized by their belonging to digital culture”.

And here is the whole thing. Enjoy!

Meanwhile, I keep collecting and publishing the various Japan radiation maps I can find using Scoop.it, so feel free to contact me if I missed one of them.

© 2012 Cartonomics: Space, Web and Society