CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutSign UpSign In

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

| Download
Project: francesca
Views: 28
Image: ubuntu2204
Kernel: Python 3 (system-wide)

Analyse de données


Assignment Queries SQL et MapReduce

Introduction

In this set of exercises is composed of two parts:

A) a Relational conceptual model

B) a set of SQL queries for which you will continue using the database installed during the classes.

C) a set of MapReduce exercises

#A - Tourist tours

We need a database for a company that offers sightseeing experiences around the world. Each tourist attraction (e.g., a museum) has a name, a description and a location (city, country and geographical coordinates). The opening hours and the entry fee must be stored as well, when applicable. There might be some facilities at a tourist attraction, in which case the type of each (e.g., cafe, restroom, parking) must be specified. Tourists can book as many guided tours as they want; a tour takes place on a specific date and time, it has a fixed duration and a price. A tour might include one or several attractions. For example, a tourist may book a tour that includes a visit to the Eiffel Tower and to the Louvre. A tour is typically led by one or multiple tour guides, of which the company keeps their first and last names, contact information (phone and email address), and the languages they speak. When a tourist books a tour, the company needs to store the tourist personal information: first and last name, phone, email address, nationality, and the languages spoken. Optionally, tourists can write a review on the tours that they attended.

Exercise A.1 - Conceptual model

(a) Draw a conceptual model (entity-relationship diagram) of this database.

(b) Derive a logical model (collection of relational tables) from your conceptual model. Specify the name of each table, as well as the names of its columns, the primary key and the foreign keys.

#B - SQL queries

For each question return the query and 10 lines of the result (when the result has multiple lines).

B.1 Show the customers sorted by last name

B.2 Return the countries where there is at least one store

B.3 Return the city having the max number of customers

B.4 Return the first name and last name of the customers that live in Valencia

B.5 Return the customers (if any) that have never rented

B.6 Return the films that are most present in the stores

Select * from Customer order by Last_name asc select country from Store join address on store.address_id=address.address_id join city on address.city_id=city.city_id join country on city.country_id= country.country_id #résultat, australie et canada select city , COUNT(*) from customer join address on customer.address_id=address.address_id join city on address.city_id=city.city_id group by city order by count(*) desc; #on a donc la liste des villes et le nombre de clients associé, on trouve que les villes comptant le plus de clients #sont londres et aurora avec 2 clients chacune SELECT customer.first_name, customer.last_name FROM customer JOIN address ON customer.address_id = address.address_id JOIN city ON address.city_id = city.city_id WHERE city.city = 'Valencia'; #On trouve Mathew Bolin select first_name,last_name from customer left join rental on customer.customer_id =rental.customer_id Where rental.customer_id is null #on ne trouve pas de personnes n'ayant pas louée de film select title, count(*) from Store join inventory on store.store_id=inventory.store_id join film on inventory.film_id=film.film_id group by film.title order by count(*) desc; #On toruve une liste de film avec le nombre de fois qu'il sont dans le stock. On trouve plusieurs film qui sont #autant de fois dans le staock, le max est 8.

C. Count world repeated in the same sentence

Suppose that we have a text (one example is provided in the data .zip file) and we want to develop an algorithm that focuses on the most repeated worlds in the sentences.

At first we define what is (for us) the most repeated word.

We want to ignore the stop words (the, and, etc.). A file that contains the words that are considered stop words is provided.

Look at this text:

  • The cat is on the cat table.

  • The table is red.

  • The wooden table is broken.

  • The cat and the other cat are with a cat.

  • The cat is white.

  • The apple has an apple color.

For this set of sentences the most repeated word is cat. The times the world is repeated in the sentences is 5 (2 times in the first sentence and 3 times in the 4th sentence).

Exercise C.1 - Count repeated word

Provide an implementation of an algorithm following the MapReduce paradigm that counts the repeated words ans shows them. The result must be presented in decreasing order (at first the words that have the most repetitions).

Describe in detail each step of the Map-Reduce algorithm.

################# COMPLETE HERE your code ################## ################## Detail each Map-Reduce step and how do you represent and tread data ################## ################# COMPLETE HERE your code ################## ################## Detail each Map-Reduce step and how do you represent and tread data ################## def mysplit(text,separateur1,separateur2): """ Fonction permettant de séparer un texte (chaine de caractère) en plusieurs chaine de caractères que l'on stocke dans une liste nommée 'Liste_finale'. """ Liste_finale=[] phrase="" for lettre in text: if lettre not in [separateur1,separateur2]: phrase += lettre else: Liste_finale.append(phrase) phrase="" # On commence une nouvelle phrase if len(phrase) != 0 : Liste_finale.append(phrase) return Liste_finale def liste_phrase(path): """ Fonction permettant de renvoyer une liste contenant des chaines de caractères correspondant aux différentes phrases du texte. Puis de supprimer et remplacer par d'autres tous les symboles non désirés. Arguments : - path => chemin où est stocké le fichier texte à traiter """ file = open(path,'r') texte = file.read() # On stocke le contenu du fichier dans la variable texte texte_min = texte.lower() # On enlève toutes les majuscules du texte liste_phrase = mysplit(texte_min,'.','?') # On utilise la fonction précedente for i in range(len(liste_phrase)) : # On remplace tous les symboles non désirés liste_phrase[i] = liste_phrase[i].replace("\n", " ") liste_phrase[i] = liste_phrase[i].replace("/", "") liste_phrase[i] = liste_phrase[i].replace(";", "") liste_phrase[i] = liste_phrase[i].replace(",", "") liste_phrase[i] = liste_phrase[i].replace("\\", "") liste_phrase[i] = liste_phrase[i].replace("--", " ") return liste_phrase def map(path): """ Fonction renvoyant une liste 'list_map' contenant autant de listes que de phrases dans le texte. Chaque sous-liste contient les mots de la phrase sous forme (mot,1). Arguments : - path => chemin où est stocké le fichier texte à traiter """ liste_phrases = liste_phrase(path) # On utilise la fonction précédente list_map = [] for phrase in liste_phrases: L=[] liste_mots = mysplit(phrase, " "," ") # On créer une liste de mot à partir de la phrase for mot in liste_mots: if mot != '' : L.append((mot,1)) # Chaque mot est compté une fois list_map.append(L) return list_map def shuffle(list_map): ''' cette fonction prend en entrée la liste renvoyé par la fonction map et renvoie un dictionnaire ayant pour clé les mots du texte et en valeurs une liste contenant toute les occurences de ce mots ''' D={} for i in list_map: for couple in i: if couple[0] not in D: D[couple[0]]=[] D[couple[0]].append(couple) else: D[couple[0]].append(couple) return D def reduce(dico_shuffled): ''' cette fonction prend en entrée le dictionnaire triée, et le réduit ( pour chaque mot on donne le nombre d'occurence totale) ''' for i in dico_shuffled: dico_shuffled[i]=len(dico_shuffled[i]) return dico_shuffled def decroissant(dico_reduced): ''' cette fonction prend en entrée le dictionnaire réduit et renvoie une liste contenant des couples (mot,occurences) triées dans l'ordre decroissant ''' L=[] while len(dico_reduced)>0: # tant que le dictionnaire n'est pas vide occ_max=0 #on trouve le mots qui apparait le plsu de fois mot_max=None for mot,occ in dico_reduced.items(): if occ > occ_max: occ_max,mot_max=occ,mot L.append((mot_max,occ_max)) # on l'ajoute à la liste del dico_reduced[mot_max]# on le supprime du dictionnaire initiale return L dico_reduced=reduce(shuffle(map('inserer le chemin vers le fichier texte'))) print(decroissant(dico_reduced)) ###### EXPECTED RESULT for the little example #(cat, 5) #(apple 2) ###### EXPECTED RESULT for the little example #(cat, 5) #(apple 2)

Exercise C.2 Repetition rate

Show, using a MapRedice algorithm the repetition rate of the words. For our example the the repetition rate of the word cat is 2,5:

(2 repetitions + 3 repetitions) / number of sentences where repetition is present

(2 + 3) / 2 = 2,5

2 comes from sentence 1 and 3 from sentence 4. Two sentences contain repetitions for cat then the division is by 2.

# Write here your code and the detailed description of the map-reduce algorithm. path = 'mettre le chemin vers le fichier texte' dico_reduced=reduce(shuffle(map(path))) def moyenne(liste_phrase,dico_reduced): for mot in dico_reduced.keys(): #pour chaque mots, on parcours les phrases du texte et on compte combien comporte me mot nb_phrase=0 for i in liste_phrase: if mot in i: nb_phrase+=1 dico_reduced[mot]=dico_reduced[mot]/float(nb_phrase) #on effectue ensuite a dision pour chaque mot return dico_reduced print(moyenne(liste_phrase(path),dico_reduced))