Thursday, 26 May 2011

WIMS'11 - day 2

Peter Mika, Yahoo! Research
2001: The birth of the Semantic Web
- Scientific American article
- Semantic Web Working Symposium at Stanford
- Semantic Web standard
- EU funding (OnToKnowledge) and DAML funding
- the Semantic Web starts a career.... and so do I

2004-2006: Reality sets in
- engineers are not logicians
- humans will have to do most of the job
- no funding

2007: A Second Chance
- data first, schema second... logic third
- Linked Data

Why Semantic Search?
Improvements in IR are harder and harder to come by
- machine learning using hundreds of features
- heavy investment in computational power

Remaining challenges are not computational, but in modeling and user recognition

Poorly solved information needs
- multiple interpretations ('Paris Hilton')
- Long tail queries (george bush - the beer brewer in Arizona)
- multimedia search
- imprecise or overly precise searches
- searches for descriptions (probably the most important)

Searching for 'roi blanco' [colleague of Peter] gives fairly good organic results, but the advertisements attached are quite bad

Don't solve the sparsity problem where it doesn't exist!

Why Semantic Search? (Part II)
Swoogle - the first semantic web search engine (2007)
- The Semantic Web is now a reality (not in 2007)
- end users use keyword queries, not SPARQL

Novel search tasks
- aggregation of search results (e.g. price comparison across websites)
- analysis and prediction (e.g. world temperature by 2020)
- semantic profiling
- semantic log analysis
- support for complex tasks (search apps)

Why Semantic Search? (Part III)
There is a model
- publishers are (increasingly) interested in making their content searchable, linkable and easier to aggregate and reuse

rNews: RDFa vocabulary for news articles
Facebook's Like and the Open Graph Protocol (also mentioned by Jim H.)

Semantic Search
- makes use of the structure of the data
- exploits this understanding at some part of the search process

Data on the web
Two solutions:
- extraction using Information Extraction techniques
   - NLP
   - extraction of triples
   - filling in web forms automatically
   - extraction from HTML tables
   - wrapper induction

You have to teach your search engine how to crawl the Semantic Web
- linked data
- RDFa
- SPARQL endpoints (problem of discovering the endpoints)

Data fusion
- ontology matching
- entity resolution
- blending

Query interpretation
- provide a higher level representation of queries in some conceptual space
- interpretation happens before the query is executed

Query interpretation in Semantic Search
- "Snap to grid"
- larger user involvement
   - guiding the user in constructing the queries (e.g. Freebase Suggest)
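
One possible reading of "guiding the user" is a suggest box that completes what the user types into labels of known entities. The sketch below is only an illustration of that idea and not how Freebase Suggest works internally; the entity labels and the function names are made up.

```python
from bisect import bisect_left

def build_suggester(labels):
    """Return a function that suggests known entity labels for a typed prefix."""
    # Sort case-folded labels once so each lookup starts with a binary search.
    index = sorted((label.casefold(), label) for label in labels)
    keys = [key for key, _ in index]

    def suggest(prefix, limit=5):
        p = prefix.casefold()
        start = bisect_left(keys, p)
        out = []
        # Labels sharing the prefix are contiguous in the sorted index.
        for key, label in index[start:]:
            if not key.startswith(p) or len(out) == limit:
                break
            out.append(label)
        return out

    return suggest

# Hypothetical entity labels, for illustration only.
suggest = build_suggester(["Paris", "Paris Hilton", "Parma", "Peter Mika"])
print(suggest("par"))
```

The user then picks one of the suggested entities, so the query arrives at the engine already tied to a known entity rather than as a bare string.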

- matching and ranking
- indexing for speeding up matching
- type of index depends on the query language to support
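
For plain keyword queries, one simple index type is an inverted index from tokens to the subjects whose literals contain them. This is a minimal sketch under that assumption; the triples and the ex: identifiers are invented, and a structured query language such as SPARQL would need different indexes.

```python
from collections import defaultdict

# Toy triples (subject, predicate, literal); all identifiers are made up.
TRIPLES = [
    ("ex:ParisHilton", "rdfs:label", "Paris Hilton"),
    ("ex:Paris", "rdfs:label", "Paris"),
    ("ex:Paris", "ex:description", "capital of France"),
    ("ex:HiltonParis", "rdfs:label", "Hilton Paris Opera hotel"),
]

def build_index(triples):
    """Map each lower-cased token of a literal to the subjects that mention it."""
    index = defaultdict(set)
    for subject, _predicate, literal in triples:
        for token in literal.lower().split():
            index[token].add(subject)
    return index

def search(index, query):
    """Return the subjects that match every query token (boolean AND), sorted."""
    tokens = query.lower().split()
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)

index = build_index(TRIPLES)
print(search(index, "paris hilton"))
```

The query "paris hilton" matches both the person and the hotel, which is the multiple-interpretations problem in miniature; ranking would have to decide between them.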

Semantic Search evaluation
- critical component in developing IR systems
- keyword search over RDF data
- focus on relevance, not efficiency
- real queries and real data
- TREC style evaluation
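
TREC-style evaluation compares a ranked result list against human relevance judgments. The notes do not say which measures were used, so the sketch below simply shows two common ones, precision at k and average precision, on invented result IDs.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def average_precision(ranked, relevant):
    """Sum of precision@i at each rank i holding a relevant result,
    divided by the total number of relevant results."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)

# Hypothetical ranking and judgments for one query.
ranked = ["e1", "e7", "e3", "e9"]
relevant = {"e1", "e3", "e5"}
print(precision_at_k(ranked, relevant, 2))
print(average_precision(ranked, relevant))
```

Averaging the second measure over all queries in the test set gives mean average precision (MAP).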

Search interface
- snippet generation (enriched search results for pages that contain microformats or RDFa)
- adaptive and interactive presentation
- aggregated search
- query and task templates - Semantic Information Mashup (DERI)
Time Explorer

Future work in Semantic Web Search
- semi-automated ways of metadata creation (how to go from 5 % metadata to 95 %?)
- data quality (how to assess the quality of data?)
- reasoning (automatic ontology mapping, instance mapping and blending)
- scalability
- ontology reuse (how to get people to reuse ontologies?)
- display (how to automatically generate effective displays for data we don't understand, or only partially understand?)

In 2011
- the Semantic Web is still evolving (finally, a JSON syntax for RDF!)
- leaner and meaner
   - bottom-up approaches
- we get some credit in other fields
   - RDF data management is now a topic at VLDB, others

The Semantic Web has finally grown up - and so have I!

Sören Auer (Universität Leipzig): Creating Knowledge out of Interlinked Data
Based on the EU FP7 project LOD2

Why the Semantic Web won't work
- reasoning does not scale on the web (web-scalable DL reasoning is out of sight)

"What is the only former Yugoslav republic in the EU?"
- this question still cannot be answered by IBM's Watson

But we can do what works already now:

A global, distributed platform for data, information and knowledge integration

Linked Data Lifecycle:
...-> Interlinking/Fusing -> Classification/Enrichment -> Quality analysis -> Evolution/Repair -> Search/Browse/Explore -> Extraction -> Storage/Querying -> Manual revision/authoring ->...

- recently launched DBpedia Live (the most important part of Wikipedia is the infoboxes, which are highly interlinked in the original source)
- DBpedia Live is constantly updated, while the "official" DBpedia SPARQL endpoint is only updated about twice a year
Mappings Wiki -

DBpedia inline: use the typed links approach to provide more information about internal links
Semantic wikis: currently do not scale to Wikipedia's needs, but are perfectly OK for smaller wikis

LinkedGeoData - revealing the data behind OpenStreetMap
OpenStreetMap: "Wikipedia for GeoData"
- extremely rich source of data
- LinkedGeoData will try to exploit this rich information source with Linked Data technology

An important ongoing effort is the standardization of how to map RDB to RDF
- W3C RDB2RDF Working Group
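
To make the idea of mapping RDB to RDF concrete: each row becomes a subject, each column a predicate, each cell an object. The sketch below is a deliberately naive illustration and not the working group's specification; the base URI, the URI patterns and the function name are assumptions, and every value is emitted as a plain string literal without datatypes.

```python
from urllib.parse import quote

BASE = "http://example.org/db/"  # hypothetical base URI

def row_to_triples(table, key_column, row):
    """Turn one relational row (a dict) into N-Triples-style lines."""
    subject = f"<{BASE}{table}/{quote(str(row[key_column]), safe='')}>"
    triples = []
    for column, value in row.items():
        if value is None:
            continue  # NULL cells produce no triple
        predicate = f"<{BASE}{table}#{quote(column, safe='')}>"
        # Minimal escaping of backslashes, quotes and newlines in the literal.
        literal = (str(value).replace("\\", "\\\\")
                             .replace('"', '\\"')
                             .replace("\n", "\\n"))
        triples.append(f'{subject} {predicate} "{literal}" .')
    return triples

# Hypothetical row from a "person" table.
for line in row_to_triples("person", "id", {"id": 7, "name": "Peter", "city": None}):
    print(line)
```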

From unstructured data:
- deploy existing NLP approaches (OpenCalais, Ontos API) - NLP2RDF

From semi-structured sources
- efficient bi-directional synchronization

From structured sources
- declarative syntax and semantics

RDF Data Management
- still 5-50 times slower than RDBMS
- performance increases steadily
- a little performance decrease is acceptable, but not too much

DBpedia Benchmark
- uses DBpedia as data and a selection of 25 frequently executed queries
- the ranking of the different systems was as expected, but the differences were bigger than in other benchmarks

The performance gap between RDB and RDF must be reduced
More realistic benchmarks

Two kinds of Semantic Wikis:
- semantic (text) wikis (edit the text)
- semantic data wikis (edit triples)

OntoWiki - a semantic data wiki
- serves as Linked Data/SPARQL endpoint

RDFaCE - RDFa Content Editor
- especially targeted for rNews

LOD Linking
- automatic
- semi-automatic (SILK, LIMES)
- manual
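
The three linking modes can be pictured as two similarity thresholds: pairs above the upper one are linked automatically, pairs between the two go to a human for review, and everything else is left to manual linking. This is only a sketch of that idea using label similarity from Python's standard library; it is not how SILK or LIMES are implemented, and the thresholds, URIs and labels are invented.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

def propose_links(source, target, accept=0.95, review=0.75):
    """Compare labels pairwise; return (automatic links, pairs needing review)."""
    auto, pending = [], []
    for s_uri, s_label in source.items():
        for t_uri, t_label in target.items():
            score = similarity(s_label, t_label)
            if score >= accept:
                auto.append((s_uri, "owl:sameAs", t_uri))
            elif score >= review:
                pending.append((s_uri, t_uri, round(score, 2)))
    return auto, pending

# Hypothetical datasets: URI -> label.
source = {"a:Leipzig": "Leipzig"}
target = {"b:1": "leipzig", "b:2": "Leipzig Zoo", "b:3": "Galway"}
auto, pending = propose_links(source, target)
print(auto)
print(pending)
```

The pairwise loop is quadratic in the number of resources, so a real tool needs some way to avoid comparing everything with everything.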

Interlinking challenges: Only 5 % of the information on the Data Web is actually linked

Linked Data is mainly instance data

Quality Analysis
Challenges: Establish measures for assessing the authority, provenance, reliability of Data Web resources



LOD Lifecycle supported by Debian based LOD2 stack (to be released in September)
- but will also be available through a web interface directly accessible from the web (without downloading or pre-installing)

Use cases:
- especially suited for governmental data (open data)

Michael Hausenblas (DERI, Univ. of Galway): Utilising Linked Open Data in Applications
Six steps for utilising LOD in applications:

1. Data awareness
- LOD cloud

2. Modeling
- Neologism
- Data Cube
- (established and controlled by DERI)

3. Publishing
- Google Refine plug-in

4. Discovery
- VoID/DCAT
- Sindice


6. Use cases
- DERI in-house
- CSO/schools pilot

Workshop part II (Sören Auer)
Linked Data provides a global data-space with a uniform API (due to RDF as the data model)

Access methods:
- dereference URIs via HTTP GET
- data dumps (RDF/XML)
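
Dereferencing a URI means sending an ordinary HTTP GET and asking, via the Accept header, for an RDF serialization instead of HTML. A minimal sketch with Python's standard library follows; the DBpedia URI is just an example resource, the helper names are made up, and only dereference() touches the network.

```python
from urllib.request import Request, urlopen

def build_rdf_request(uri, accept="application/rdf+xml"):
    """Prepare an HTTP GET that asks the server for an RDF serialization."""
    return Request(uri, headers={"Accept": accept}, method="GET")

def dereference(uri, timeout=10):
    """Fetch the description behind a URI (performs a network call).

    urlopen follows HTTP redirects on its own.
    """
    with urlopen(build_rdf_request(uri), timeout=timeout) as response:
        return response.headers.get_content_type(), response.read()

# Building the request does not contact the server.
request = build_rdf_request("http://dbpedia.org/resource/Leipzig")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Calling dereference(...) on the same URI would return the content type the server chose together with the raw bytes, which can then be handed to an RDF parser.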

When to use RDB/SQL and when to use RDF/SPARQL:

RDB/SQL:
- well-defined RDB schema that won't change very much
- performance is an important issue

RDF/SPARQL:
- getting more information out of your data
- highly dynamic data, with frequent changes in structure/schema

All in all, the two technologies should be seen as complementing each other, not competing.
