[SemPub17]: Call for Challenge (Semantic Publishing Challenge 2017)
Sahar Vahdati
vahdati at iai.uni-bonn.de
Wed Feb 1 23:02:30 CET 2017
ESWC 2017 Call for Challenge: Semantic Publishing Challenge 2017
==== Call for Challenge: Semantic Publishing ====
Challenge Website: https://github.com/ceurws/lod/wiki/SemPub2017
Challenge hashtag: #SemPub2017, #SemPub
Challenge Chairs:
* Angelo Di Iorio (University of Bologna, IT)
* Anastasia Dimou (Ghent University / imec, BE)
* Christoph Lange (EIS, University of Bonn / Fraunhofer IAIS, DE)
* Sahar Vahdati (EIS, University of Bonn, DE)
Challenge Coordinators:
* Mauro Dragoni (Fondazione Bruno Kessler, IT)
* Monica Solanki (University of Oxford, UK)
14th Extended Semantic Web Conference (ESWC) 2017
Dates: May 28th - June 1st, 2017
Venue: Portorož, Slovenia
Hashtag: #eswc2017
Feed: @eswc_conf
Site: http://2017.eswc-conferences.org
General Chair: Eva Blomqvist (Linköping University, SE)
MOTIVATION AND OBJECTIVES
As in 2016, 2015 and 2014, the goal is to facilitate measuring the
excellence of papers, people and scientific venues by data analysis.
Instead of considering publication venues as single and independent
units, we focus on their explicit and implicit connections, interlinking
and evolution. We achieve that thanks to the primary data source we are
using, which is highly relevant for computer science: the CEUR-WS.org
workshop proceedings, which have accumulated 1,800 proceedings volumes
with around 30,000 papers over 20 years and thus cover the majority of
workshops in computer science. We go beyond the tasks of the 2016
challenge in two ways: (1) refining and extending the set of
quality-related data to be extracted and (2) linking and exploiting
existing Linked Open Data sources about authors, publications, topics,
events and communities. The best data produced in the 2017 challenge
will be published at CEUR-WS.org or as a separate Linked Open Dataset,
interlinked with the official CEUR-WS.org LOD and with the rest of the
Linked Open Data Cloud.
DATASET
The primary dataset is the Linked Open Dataset that has been extracted
from the CEUR-WS.org workshop proceedings (HTML tables of contents and
PDF papers) using the extraction tools that won the previous
challenges, plus the full original PDF source documents (for extracting
further information). The most recent workshop proceedings metadata
have explicitly been released under the CC0 open data license; for the
older proceedings, CEUR-WS.org has permission to make that data
accessible.
In addition to the primary dataset, we use (as linking targets)
existing Linked Open Datasets containing related information: the
computer science proceedings LOD recently announced by Springer, the
brand-new LOD of OpenAIRE covering all EU-funded open access
publications, Springer LD, DBLP, ScholarlyData (a refactoring of the
Semantic Web Dog Food), COLINDA, and further datasets available under
open licenses.
The evaluation dataset will comprise around 100 selected PDF full-text
papers from these workshops. As last year, the training dataset,
together with the expected results of queries against it, will be
distinct from the evaluation dataset. Both datasets will respect the
diversity of the CEUR-WS.org workshop proceedings volumes with regard
to content structure and quality.
TASKS
Our challenge invites submissions addressing one or more of three
tasks, which are independent of each other but conceptually connected
by taking into account increasingly more contextual information. Some
tasks include sub-tasks, but participants compete in a task as a whole.
They are encouraged to address all sub-tasks (even partially) to
increase their chances of winning.
Task 1: Extracting information from the tables in papers
Participants are required to extract information from the tables of the
papers (in PDF). Extracting content from tables is a difficult task,
which has been tackled by different researchers in the past. Our focus
is on tables in scientific papers and on solutions for re-publishing
structured data as LOD. Tables will be collected from CEUR-WS.org
publications, and participants will be required to identify their
structure and content. The task thus requires PDF mining and data
processing techniques.
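Purely as an illustration of the PDF mining involved, the sketch below
pulls raw table cells out of a single paper; the pdfplumber library and
the file name are assumptions, not a prescribed approach.

    # Illustrative Task 1 starting point: extract raw table cells from one
    # CEUR-WS paper PDF so they can later be cleaned and re-published as LOD.
    # pdfplumber and the file name are assumptions, not requirements.
    import pdfplumber

    with pdfplumber.open("Vol-1234-paper1.pdf") as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            for table in page.extract_tables():
                header, *rows = table  # first row often holds column labels
                print(f"page {page_no}: columns = {header}")
                for row in rows:
                    print(dict(zip(header, row)))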
Task 2: Extracting information from the full text of the papers
Participants are required to extract information from the textual
content of the papers (in PDF). That information should describe the
organization of the paper and should provide a deeper understanding of
the content and the context in which it was written. In particular, the
extracted information is expected to answer queries about the internal
organization of sections, tables, figures and about the authors’
affiliations and research institutions. The task mainly requires PDF
mining techniques and some NLP processing.
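As a rough, non-prescriptive illustration of this kind of text mining,
the following sketch recovers numbered section headings from a paper's
extracted text; the heading pattern, the pdfplumber library and the
file name are assumptions, and real tools would combine layout analysis
with NLP.

    # Illustrative Task 2 sketch: list numbered section headings found in
    # the extracted text of a paper. The regex and file name are assumptions.
    import re
    import pdfplumber

    heading = re.compile(r"^(\d+(?:\.\d+)*)\s+([A-Z].{2,80})$")

    with pdfplumber.open("Vol-1234-paper1.pdf") as pdf:
        for page in pdf.pages:
            for line in (page.extract_text() or "").splitlines():
                match = heading.match(line.strip())
                if match:
                    print(match.group(1), match.group(2))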
Task 3: Interlinking
Participants are required to interlink the CEUR-WS.org linked dataset
with relevant datasets already existing in the LOD Cloud. Task 3 can be
approached as an entity interlinking/instance matching task that covers
both interlinking data from the output of the other tasks and
interlinking the CEUR-WS.org linked dataset – as produced in previous
editions of this challenge – with external datasets. Moreover, as
triples are generated from different sources and through different
activities, tracking provenance information becomes increasingly
important.
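For illustration only, the sketch below links papers across two
datasets by exact title match and records owl:sameAs links with rdflib;
the file names, the dcterms:title property and the matching heuristic
are assumptions, and real submissions would use more robust matching
and also attach provenance to the generated links.

    # Illustrative Task 3 sketch: link CEUR-WS papers to an external dataset
    # by exact (case-insensitive) title match and record owl:sameAs links.
    # File names, the title property and the heuristic are assumptions.
    from rdflib import Graph
    from rdflib.namespace import DCTERMS, OWL

    ceur = Graph().parse("ceur-ws.ttl", format="turtle")
    external = Graph().parse("external.ttl", format="turtle")

    def titles(graph):
        # map normalized title -> paper URI
        return {str(title).strip().lower(): paper
                for paper, title in graph.subject_objects(DCTERMS.title)}

    links = Graph()
    ceur_titles, ext_titles = titles(ceur), titles(external)
    for title, paper in ceur_titles.items():
        if title in ext_titles:
            links.add((paper, OWL.sameAs, ext_titles[title]))

    links.serialize("links.ttl", format="turtle")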
EVALUATION
In each task, the participants will be asked to refine and extend the
initial CEUR-WS.org Linked Open Dataset, by information extraction or
link discovery, i.e. they will produce an RDF graph. To validate the RDF
graphs produced, a number of natural language queries will be
specified, together with their expected results in CSV format.
Participants are asked to submit both their dataset and SPARQL
translations of the input natural language queries that work on that
dataset. A few days before the deadline, a set of queries will be
specified and used for the final evaluation. Participants are then
asked to run these queries on their dataset and to submit the produced
output in CSV format. Precision,
recall and F-measure will be calculated by comparing each query’s result
set with the expected query result from a gold standard built manually.
Participants’ overall performance in a task will be defined as the
average F-measure over all queries of the task, with all queries having
equal weight. For computing precision and recall, the same automated
tool as for previous SemPub challenges will be used; this tool will be
publicly available during the training phase. We reserve the right to
disqualify participants whose dataset dumps are different from what
their information extraction tools create from the source data, who are
not using the core vocabulary, or whose SPARQL queries implement
something different from the natural language queries given in the task
definitions. The winners of each task will receive awards, as in
previous years.
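To make the scoring concrete, the following simplified sketch treats
each query's submitted CSV rows and gold-standard rows as sets,
computes precision, recall and F-measure per query, and averages the
F-measures; it is not the official evaluation tool, and the query
identifiers and file names are illustrative only.

    # Simplified sketch of the scoring scheme described above.
    import csv

    def rows(path):
        with open(path, newline="") as f:
            return {tuple(cell.strip() for cell in row) for row in csv.reader(f)}

    def f_measure(submitted, gold):
        correct = len(submitted & gold)
        precision = correct / len(submitted) if submitted else 0.0
        recall = correct / len(gold) if gold else 0.0
        return (2 * precision * recall / (precision + recall)
                if precision + recall else 0.0)

    queries = ["Q1", "Q2", "Q3"]  # placeholder query identifiers
    scores = [f_measure(rows(f"{q}-submitted.csv"), rows(f"{q}-gold.csv"))
              for q in queries]
    print("average F-measure:", sum(scores) / len(scores))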
TARGET AUDIENCE
The Challenge is open to people from industry and academia with
diverse expertise, who can participate in all tasks or focus on
specific ones. Tasks 1 and 2 address an audience with a background in
mapping, information extraction, information retrieval and NLP, and
invite previous years' participants to refine their tools as well as
new teams. Task 3 additionally addresses the wider interlinking
audience, without at the same time excluding other participants from
the challenge. Task 3 invites new participants as well as participants
from Tasks 1 and 2.
FEEDBACK AND DISCUSSION
A discussion group is open for participants to ask questions and to
receive updates about the challenge: sempub-challenge at
googlegroups.com. Participants are invited to
subscribe to this group as soon as possible and to communicate their
intention to participate. They are also invited to use this channel to
discuss problems in the input dataset and to suggest changes.
HOW TO PARTICIPATE
Participants are first required to submit:
* Abstract: no more than 200 words.
* Description: It should explain the details of the automated annotation
system, including why the system is innovative, how it uses Semantic Web
technology, what features or functions the system provides, what design
choices were made and what lessons were learned. The description should
also summarize how participants have addressed the evaluation tasks. An
outlook towards how the data could be consumed is appreciated but not
strictly required. The description should be submitted as a 5-page
document.
If accepted, the participants are invited to submit their task results.
In this second phase they are required to submit:
* The Linked Open Dataset produced by their tool on the evaluation
dataset (as a file or as a URL, in Turtle or RDF/XML).
* A set of SPARQL queries that work on that LOD and correspond to the
natural language queries provided as input (see the sketch after this
list).
* The output of these SPARQL queries on the evaluation dataset (in CSV
format).
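For illustration, here is a minimal sketch of how the last two
deliverables fit together, assuming Python with rdflib; the file names,
the prefix and the query itself are placeholders, since the actual
natural language queries and vocabulary are given in the task
definitions.

    # Minimal sketch: load the produced Turtle dataset, run one SPARQL query
    # (the translation of a natural language query), and write its result as
    # CSV. File names, the prefix and the query are placeholders.
    import csv
    from rdflib import Graph

    graph = Graph().parse("ceur-ws-output.ttl", format="turtle")

    query = """
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?paper ?title WHERE { ?paper dcterms:title ?title . }
    """

    with open("query1-output.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["paper", "title"])
        for row in graph.query(query):
            writer.writerow([str(value) for value in row])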
Accepted papers will be included on the conference USB stick. After
the conference, participants will be able to add data about the
evaluation and to finalize the camera-ready version for the final
proceedings.
The final papers must not exceed 15 pages in length.
Papers must be submitted in PDF format, following the style of
Springer's Lecture Notes in Computer Science (LNCS) series
(http://www.springer.com/computer/lncs/lncs+authors). Submissions in
semantically structured HTML, e.g. in the RASH
(http://cs.unibo.it/save-sd/rash/documentation/index.html), or dokieli
(https://github.com/linkeddata/dokieli) formats are also accepted as
long as the final camera-ready version conforms to Springer's
requirements (LaTeX/Word + PDF).
Further submission instructions will be published on the challenge wiki
if required.
All submissions should be provided via the submission system
https://easychair.org/conferences/?conf=sempub17.
NOTE: At least one author per accepted submission will have to register
for the ESWC Conference, in order to be eligible for the prizes and to
have the paper included in the proceedings.
JUDGING AND PRIZES
After the first round of review, the Program Committee and the chairs
will select a number of submissions conforming to the challenge
requirements, whose authors will be invited to present their work.
Submissions accepted for presentation will receive constructive
reviews from the Program Committee and will be included in the
Springer post-proceedings of ESWC.
Six winners will be selected from those teams who participate in the
challenge at ESWC. For each task we will select:
* best performing tool, awarded to the submission that achieves the
highest score in the evaluation
* best paper, selected by the Program and Challenge Committee
IMPORTANT DATES
* January 29, 2017: Publication of tasks, rules and queries description
* January 29, 2017: Publication of the training dataset
* February 10, 2017: Publication of the evaluation tool
* March 10, 2017: Paper submission (5-page document)
* April 7, 2017: Notification and invitation to submit task results
* April 7, 2017: Test data (and other participation tools) published
* April 23, 2017: Conference camera-ready paper submission (5-page
document)
* May 11, 2017: Publication of the evaluation dataset details
* May 13, 2017: Results submission
* May 30 - June 1: Challenge days
* June 30, 2017: Camera-ready paper for the challenge post-proceedings
(12-page document)