Analyse de données

Assignment Queries SQL et MapReduce

Introduction

In this set of exercises is composed of two parts:

A) a Relational conceptual model

B) a set of SQL queries for which you will continue using the database installed during the classes.

C) a set of MapReduce exercises

#A - Tourist tours

We need a database for a company that offers sightseeing experiences around the world. Each tourist attraction (e.g., a museum) has a name, a description and a location (city, country and geographical coordinates). The opening hours and the entry fee must be stored as well, when applicable. There might be some facilities at a tourist attraction, in which case the type of each (e.g., cafe, restroom, parking) must be specified. Tourists can book as many guided tours as they want; a tour takes place on a specific date and time, it has a fixed duration and a price. A tour might include one or several attractions. For example, a tourist may book a tour that includes a visit to the Eiffel Tower and to the Louvre. A tour is typically led by one or multiple tour guides, of which the company keeps their first and last names, contact information (phone and email address), and the languages they speak. When a tourist books a tour, the company needs to store the tourist personal information: first and last name, phone, email address, nationality, and the languages spoken. Optionally, tourists can write a review on the tours that they attended.

Exercise A.1 - Conceptual model

(a) Draw a conceptual model (entity-relationship diagram) of this database.

(b) Derive a logical model (collection of relational tables) from your conceptual model. Specify the name of each table, as well as the names of its columns, the primary key and the foreign keys.

#B - SQL queries

For each question return the query and 10 lines of the result (when the result has multiple lines).

B.1 Show the customers sorted by last name

B.2 Return the countries where there is at least one store

B.3 Return the city having the max number of customers

B.4 Return the first name and last name of the customers that live in Valencia

B.5 Return the customers (if any) that have never rented

B.6 Return the films that are most present in the stores

In [0]:

Select * from Customer
order by Last_name asc



select country 
from Store
join address on store.address_id=address.address_id
join city on address.city_id=city.city_id
join country on city.country_id= country.country_id

#résultat, australie et canada

select city , COUNT(*)
from customer
join address on customer.address_id=address.address_id
join city on address.city_id=city.city_id
group by city
order by count(*) desc;

#on a donc la liste des villes et le nombre de clients associé, on trouve que les villes comptant le plus de clients
#sont londres et aurora avec 2 clients chacune

SELECT customer.first_name, customer.last_name
FROM customer
JOIN address ON customer.address_id = address.address_id
JOIN city ON address.city_id = city.city_id
WHERE city.city = 'Valencia';

#On trouve Mathew Bolin


select first_name,last_name
from customer
left join rental on customer.customer_id =rental.customer_id

Where rental.customer_id is null

#on ne trouve pas de personnes n'ayant pas louée de film


select title, count(*)
from Store
join inventory on store.store_id=inventory.store_id
join film on inventory.film_id=film.film_id
group by film.title
order by count(*) desc;

#On toruve une liste de film avec le nombre de fois qu'il sont dans le stock. On trouve plusieurs film qui sont 
#autant de fois dans le staock, le max est 8.

C. Count world repeated in the same sentence

Suppose that we have a text (one example is provided in the data .zip file) and we want to develop an algorithm that focuses on the most repeated worlds in the sentences.

At first we define what is (for us) the most repeated word.

We want to ignore the stop words (the, and, etc.). A file that contains the words that are considered stop words is provided.

Look at this text:

The cat is on the cat table.
The table is red.
The wooden table is broken.
The cat and the other cat are with a cat.
The cat is white.
The apple has an apple color.

For this set of sentences the most repeated word is cat. The times the world is repeated in the sentences is 5 (2 times in the first sentence and 3 times in the 4th sentence).

Exercise C.1 - Count repeated word

Provide an implementation of an algorithm following the MapReduce paradigm that counts the repeated words ans shows them. The result must be presented in decreasing order (at first the words that have the most repetitions).

Describe in detail each step of the Map-Reduce algorithm.

In [3]:

################# COMPLETE HERE your code  ##################
################## Detail each Map-Reduce step and how do you represent and tread data ##################



################# COMPLETE HERE your code  ##################
################## Detail each Map-Reduce step and how do you represent and tread data ##################



def mysplit(text,separateur1,separateur2):
    """ Fonction permettant de séparer un texte (chaine de caractère) en 
        plusieurs chaine de caractères que l'on stocke dans une liste nommée
        'Liste_finale'.
    """
    
    Liste_finale=[]
    phrase=""
    
    for lettre in text:
        
        if lettre not in [separateur1,separateur2]:
            phrase += lettre
            
        else:
            Liste_finale.append(phrase)
            phrase="" # On commence une nouvelle phrase
            
    if len(phrase) != 0 :
        Liste_finale.append(phrase)
        
    return Liste_finale




def liste_phrase(path):
    """ Fonction permettant de renvoyer une liste contenant des chaines de caractères
        correspondant aux différentes phrases du texte. Puis de supprimer et remplacer
        par d'autres tous les symboles non désirés.
        
        Arguments :
            
        - path => chemin où est stocké le fichier texte à traiter
    """
    file = open(path,'r')
    texte = file.read() # On stocke le contenu du fichier dans la variable texte
    texte_min = texte.lower() # On enlève toutes les majuscules du texte

    liste_phrase = mysplit(texte_min,'.','?') # On utilise la fonction précedente

    for i in range(len(liste_phrase)) : # On remplace tous les symboles non désirés
        
        liste_phrase[i] = liste_phrase[i].replace("\n", " ")
        liste_phrase[i] = liste_phrase[i].replace("/", "")
        liste_phrase[i] = liste_phrase[i].replace(";", "")
        liste_phrase[i] = liste_phrase[i].replace(",", "")
        liste_phrase[i] = liste_phrase[i].replace("\\", "")
        liste_phrase[i] = liste_phrase[i].replace("--", " ")

    return liste_phrase

def map(path):
    """ Fonction renvoyant une liste 'list_map' contenant autant de listes que
        de phrases dans le texte. Chaque sous-liste contient les mots de la phrase
        sous forme (mot,1).
        
        Arguments :
            
        - path => chemin où est stocké le fichier texte à traiter
    """
    
    liste_phrases = liste_phrase(path) # On utilise la fonction précédente
    list_map = [] 
    
    for phrase in liste_phrases:
        
        L=[]
        liste_mots = mysplit(phrase, " "," ") # On créer une liste de mot à partir de la phrase
        
        for mot in liste_mots:
            
            if mot != '' :
                
                L.append((mot,1)) # Chaque mot est compté une fois
                
        list_map.append(L)
        
    return list_map


def shuffle(list_map):
    '''
    cette fonction prend en entrée la liste renvoyé par la fonction map et renvoie un dictionnaire ayant
    pour clé les mots du texte et en valeurs une liste contenant toute les occurences de ce mots
    '''
    D={}
    for i in list_map:
        for couple in i:
            if couple[0] not in D:
                D[couple[0]]=[]
                D[couple[0]].append(couple)
            else:
                D[couple[0]].append(couple)
    return D

def reduce(dico_shuffled):
    '''
    cette fonction prend en entrée le dictionnaire  triée, et le réduit ( pour chaque mot on donne 
    le nombre d'occurence totale)
    '''
    for i in dico_shuffled:
        dico_shuffled[i]=len(dico_shuffled[i])

    return dico_shuffled


def decroissant(dico_reduced):
    '''
    cette fonction prend en entrée le dictionnaire réduit et renvoie une liste contenant des couples
    (mot,occurences) triées dans l'ordre decroissant
    '''
    L=[]
    while len(dico_reduced)>0:  # tant que le dictionnaire n'est pas vide
        occ_max=0                           #on trouve le mots qui apparait le plsu de fois
        mot_max=None
        for mot,occ in dico_reduced.items():
            if occ > occ_max:
                occ_max,mot_max=occ,mot 

        L.append((mot_max,occ_max)) # on l'ajoute à la liste

        del dico_reduced[mot_max]# on le supprime du dictionnaire initiale

    return L

dico_reduced=reduce(shuffle(map('inserer le chemin vers le fichier texte')))
print(decroissant(dico_reduced))
###### EXPECTED RESULT for the little example

#(cat, 5)
#(apple 2)


###### EXPECTED RESULT for the little example

#(cat, 5)
#(apple 2)

Exercise C.2 Repetition rate

Show, using a MapRedice algorithm the repetition rate of the words. For our example the the repetition rate of the word cat is 2,5:

(2 repetitions + 3 repetitions) / number of sentences where repetition is present

(2 + 3) / 2 = 2,5

2 comes from sentence 1 and 3 from sentence 4. Two sentences contain repetitions for cat then the division is by 2.

In [4]:

# Write here your code and the detailed description of the map-reduce algorithm.


path = 'mettre le chemin vers le fichier texte'


dico_reduced=reduce(shuffle(map(path)))


def moyenne(liste_phrase,dico_reduced):

    for mot in dico_reduced.keys():       #pour chaque mots, on parcours les phrases du texte et on compte combien comporte me mot
        nb_phrase=0
        for i in liste_phrase:
            if mot in i:
                nb_phrase+=1
        dico_reduced[mot]=dico_reduced[mot]/float(nb_phrase) #on effectue ensuite a dision pour chaque mot

    return dico_reduced


print(moyenne(liste_phrase(path),dico_reduced))

In [0]: