2020 LDP Conference on Linked Data

A Machine Learning Approach for Classifying Sinopia's RDF

https://ld4p.github.io/classify-rdf-2020/

Pandas DataFrame

Pandas is a widely used open-source Python library for data analysis and manipulation that is commonly used with two very popular machine-learning projects, TensorFlow and PyTorch.

Pandas provides a two-dimensional labeled data structure, with columns of potentially different types, called a Data Frame (the DataFrame class). A Data Frame is similar in concept to a spreadsheet, with significant added functionality. The Pandas equivalent of a spreadsheet row is a Series made up of labels and values.
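As a minimal sketch of the relationship between a Data Frame and a Series, the example below builds a small frame with two columns of different types and pulls out one row. The column names, row labels, and values are made up for illustration and are not taken from Sinopia.

```python
import pandas as pd

# A small Data Frame with columns of different types (illustrative values).
frame = pd.DataFrame(
    {
        "title": ["Example A", "Example B"],
        "pages": [120, 48],
    },
    index=["row-1", "row-2"],
)

# Selecting a row by its index label returns a Series whose labels are
# the column names and whose values are the cells of that row.
row = frame.loc["row-1"]

print(type(row).__name__)
print(list(row.index))
print(row["pages"])
```

Here each column keeps its own type in the frame, while the extracted row is a single Series labeled by the column names.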

Sinopia RDF Data Frame

To construct a Data Frame from Sinopia's RDF data, each resource is represented by a Series that starts with three values:

  • resource URI or skolemized blank node
  • group string naming the group the resource belongs to; in Sinopia this is the name of an institution or organization
  • resource template

For the remaining labels in the Pandas Series, all of the predicate propertyTemplate URIs in each resource template are used, with each value being a count of how often that predicate is found in the resource. For labeled predicates that are not present in a specific resource, a count of 0 is used.
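The construction described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: triples are modeled as plain Python tuples rather than parsed RDF, and the label names ("uri", "group", "template"), the helper resource_series, the example.org URIs, the group name, and the template identifier are all hypothetical.

```python
from collections import Counter

import pandas as pd

# Hypothetical predicate URIs standing in for the propertyTemplate URIs
# of a resource template.
TITLE = "http://example.org/vocab/title"
NOTE = "http://example.org/vocab/note"
SUBJECT = "http://example.org/vocab/subject"
TEMPLATE_PREDICATES = [TITLE, NOTE, SUBJECT]


def resource_series(uri, group, template, triples, predicates):
    """Build one Series: three identifying values, then a count per predicate."""
    # Count how often each predicate occurs on this resource.
    counts = Counter(p for s, p, _ in triples if s == uri)
    data = {"uri": uri, "group": group, "template": template}
    for predicate in predicates:
        # A Counter returns 0 for predicates that never occur.
        data[predicate] = counts[predicate]
    return pd.Series(data)


# Illustrative (subject, predicate, object) triples; the second subject
# imitates a skolemized blank node.
R1 = "http://example.org/resource/1"
R2 = "http://example.org/.well-known/genid/b0"
triples = [
    (R1, TITLE, "A title"),
    (R1, TITLE, "An alternate title"),
    (R1, SUBJECT, "Maps"),
    (R2, NOTE, "A note"),
]

rows = [
    resource_series(uri, "example-group", "example:template", triples,
                    TEMPLATE_PREDICATES)
    for uri in (R1, R2)
]

# Each Series becomes one row of the Data Frame.
df = pd.DataFrame(rows)
print(df)
```

Because every Series carries the same labels in the same order, the rows line up into columns, and predicates missing from a resource appear as 0 rather than as empty cells.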

Production Pandas Dataframe