Popularity-Based Recommender: PearsonR Correlation

Simple Approaches to Recommender Systems

Making Recommendations Based on Correlation

import numpy as np
import pandas as pd

These datasets are hosted on: https://archive.ics.uci.edu/ml/datasets/Restaurant+%26+consumer+data

They were originally published by: Blanca Vargas-Govea, Juan Gabriel González-Serna, Rafael Ponce-Medellín. Effects of relevant contextual features in the performance of a restaurant recommender system. In RecSys’11: Workshop on Context Aware Recommender Systems (CARS-2011), Chicago, IL, USA, October 23, 2011.

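# Ratings, cuisine labels, and restaurant metadata from the UCI archive
frame = pd.read_csv('rating_final.csv')
cuisine = pd.read_csv('chefmozcuisine.csv')
# The original call used encoding='mbcs', which exists only on Windows;
# 'latin-1' is assumed here as a portable stand-in for the accented names
geodata = pd.read_csv('geoplaces2.csv', encoding='latin-1')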
frame.head()
  userID  placeID  rating  food_rating  service_rating
0  U1077   135085       2            2               2
1  U1077   135038       2            2               1
2  U1077   132825       2            2               2
3  U1077   135060       1            2               2
4  U1068   135104       1            1               2
geodata.head()
placeID latitude longitude the_geom_meter name address city state country fax ... alcohol smoking_area dress_code accessibility price url Rambience franchise area other_services
0 134999 18.915421 -99.184871 0101000020957F000088568DE356715AC138C0A525FC46... Kiku Cuernavaca Revolucion Cuernavaca Morelos Mexico ? ... No_Alcohol_Served none informal no_accessibility medium kikucuernavaca.com.mx familiar f closed none
1 132825 22.147392 -100.983092 0101000020957F00001AD016568C4858C1243261274BA5... puesto de tacos esquina santos degollado y leon guzman s.l.p. s.l.p. mexico ? ... No_Alcohol_Served none informal completely low ? familiar f open none
2 135106 22.149709 -100.976093 0101000020957F0000649D6F21634858C119AE9BF528A3... El Rincón de San Francisco Universidad 169 San Luis Potosi San Luis Potosi Mexico ? ... Wine-Beer only at bar informal partially medium ? familiar f open none
3 132667 23.752697 -99.163359 0101000020957F00005D67BCDDED8157C1222A2DC8D84D... little pizza Emilio Portes Gil calle emilio portes gil victoria tamaulipas ? ? ... No_Alcohol_Served none informal completely low ? familiar t closed none
4 132613 23.752903 -99.165076 0101000020957F00008EBA2D06DC8157C194E03B7B504E... carnitas_mata lic. Emilio portes gil victoria Tamaulipas Mexico ? ... No_Alcohol_Served permitted informal completely medium ? familiar t closed none

5 rows × 21 columns

places = geodata[['placeID', 'name']]
places.head()
   placeID                            name
0   134999                 Kiku Cuernavaca
1   132825                 puesto de tacos
2   135106      El Rincón de San Francisco
3   132667  little pizza Emilio Portes Gil
4   132613                   carnitas_mata
cuisine.head()
   placeID        Rcuisine
0   135110         Spanish
1   135109         Italian
2   135107  Latin_American
3   135106         Mexican
4   135105       Fast_Food

Grouping and Ranking Data

rating = pd.DataFrame(frame.groupby('placeID')['rating'].mean())
rating.head()
         rating
placeID
132560     0.50
132561     0.75
132564     1.25
132572     1.00
132583     1.00
rating['rating_count'] = frame.groupby('placeID')['rating'].count()
rating.head()
         rating  rating_count
placeID
132560     0.50             4
132561     0.75             4
132564     1.25             4
132572     1.00            15
132583     1.00             4
rating.describe()
           rating  rating_count
count  130.000000    130.000000
mean     1.179622      8.930769
std      0.349354      6.124279
min      0.250000      3.000000
25%      1.000000      5.000000
50%      1.181818      7.000000
75%      1.400000     11.000000
max      2.000000     36.000000
rating.sort_values('rating_count', ascending=False).head()
           rating  rating_count
placeID
135085   1.333333            36
132825   1.281250            32
135032   1.178571            28
135052   1.280000            25
132834   1.000000            25
places[places['placeID']==135085]
     placeID                    name
121   135085  Tortas Locas Hipocampo
cuisine[cuisine['placeID']==135085]
    placeID   Rcuisine
44   135085  Fast_Food
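
The grouping and ranking steps above can be collected into one small helper. What follows is only a sketch on made-up toy data: `most_rated`, `toy_ratings`, and `toy_places` are hypothetical names that do not appear in the original notebook, and the toy values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for rating_final.csv and geoplaces2.csv (values are made up)
toy_ratings = pd.DataFrame({
    'userID':  ['U1', 'U2', 'U3', 'U1', 'U2', 'U3'],
    'placeID': [10, 10, 10, 20, 20, 30],
    'rating':  [2, 1, 2, 0, 1, 2],
})
toy_places = pd.DataFrame({
    'placeID': [10, 20, 30],
    'name': ['Place A', 'Place B', 'Place C'],
})

def most_rated(ratings, places, n=5):
    """Return the n most-rated places with mean rating, rating count, and name."""
    stats = (ratings.groupby('placeID')['rating']
                    .agg(rating='mean', rating_count='count')   # one row per place
                    .sort_values('rating_count', ascending=False)
                    .head(n)
                    .reset_index())
    # A left merge keeps every ranked place even if its name is missing
    return stats.merge(places, on='placeID', how='left')

top = most_rated(toy_ratings, toy_places, n=2)
print(top)
```

With the real tables, the call would presumably look like `most_rated(frame, places)`, which should reproduce the ranking by `rating_count` shown above with the restaurant names attached.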

Preparing Data For Analysis

places_crosstab = pd.pivot_table(data=frame, values='rating', index='userID', columns='placeID')
places_crosstab.head()
placeID 132560 132561 132564 132572 132583 132584 132594 132608 132609 132613 ... 135080 135081 135082 135085 135086 135088 135104 135106 135108 135109
userID
U1001 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 0.0 NaN NaN NaN NaN NaN NaN
U1002 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN 1.0 NaN NaN NaN 1.0 NaN NaN
U1003 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
U1004 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 2.0 NaN NaN
U1005 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 130 columns

Tortas_ratings = places_crosstab[135085]
Tortas_ratings[Tortas_ratings>=0]
userID
U1001    0.0
U1002    1.0
U1007    1.0
U1013    1.0
U1016    2.0
U1027    1.0
U1029    1.0
U1032    1.0
U1033    2.0
U1036    2.0
U1045    2.0
U1046    1.0
U1049    0.0
U1056    2.0
U1059    2.0
U1062    0.0
U1077    2.0
U1081    1.0
U1084    2.0
U1086    2.0
U1089    1.0
U1090    2.0
U1092    0.0
U1098    1.0
U1104    2.0
U1106    2.0
U1108    1.0
U1109    2.0
U1113    1.0
U1116    2.0
U1120    0.0
U1122    2.0
U1132    2.0
U1134    2.0
U1135    0.0
U1137    2.0
Name: 135085, dtype: float64

Evaluating Similarity Based on Correlation

similar_to_Tortas = places_crosstab.corrwith(Tortas_ratings)

corr_Tortas = pd.DataFrame(similar_to_Tortas, columns=['PearsonR'])
corr_Tortas.dropna(inplace=True)
corr_Tortas.head()
C:\Users\piers\Anaconda3\lib\site-packages\numpy\lib\function_base.py:2995: RuntimeWarning: Degrees of freedom <= 0 for slice
  c = cov(x, y, rowvar)
C:\Users\piers\Anaconda3\lib\site-packages\numpy\lib\function_base.py:2929: RuntimeWarning: divide by zero encountered in double_scalars
  c *= 1. / np.float64(fact)
         PearsonR
placeID
132572  -0.428571
132723   0.301511
132754   0.930261
132825   0.700745
132834   0.814823
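
To make the `corrwith` step less of a black box, here is a small sketch on invented data. For each restaurant column, pandas computes Pearson's r against the target column using only the users who rated both places. When a column shares fewer than two raters with the target, the result is NaN, which is likely what the warnings above are about and why `dropna` is needed. The names `toy`, `toy_crosstab`, `toy_corr`, and `manual` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up ratings on the same 0-2 scale as rating_final.csv
toy = pd.DataFrame({
    'userID':  ['U1', 'U2', 'U3', 'U4', 'U1', 'U2', 'U3', 'U4', 'U1'],
    'placeID': [10, 10, 10, 10, 20, 20, 20, 20, 30],
    'rating':  [0, 1, 2, 2, 0, 1, 2, 1, 2],
})

# Users as rows, places as columns, NaN where a user never rated a place
toy_crosstab = toy.pivot_table(values='rating', index='userID', columns='placeID')

# Correlate every place with place 10 (the toy "target")
toy_corr = toy_crosstab.corrwith(toy_crosstab[10])

# The same number for place 20, computed by hand from the Pearson formula
both = toy_crosstab[[10, 20]].dropna()          # users who rated both places
x, y = both[10], both[20]
manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum())

print(toy_corr)   # place 30 has a single shared rater, so its value is NaN
print(manual)
```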
Tortas_corr_summary = corr_Tortas.join(rating['rating_count'])
Tortas_corr_summary[Tortas_corr_summary['rating_count']>=10].sort_values('PearsonR', ascending=False).head(10)
         PearsonR  rating_count
placeID
135076   1.000000            13
135085   1.000000            36
135066   1.000000            12
132754   0.930261            13
135045   0.912871            13
135062   0.898933            21
135028   0.892218            15
135042   0.881409            20
135046   0.867722            11
132872   0.840168            12
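
One thing worth noting about the table above: `rating_count` is the total number of ratings a place received, not the number of users it shares with Tortas Locas Hipocampo. A perfect 1.0 for a place other than the target itself may simply come from a handful of shared raters. The sketch below is one possible way to filter on the overlap instead. It is an assumption on my part rather than something the original analysis does, and `similar_places`, `min_common`, and `toy_crosstab` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up user-by-place matrix; NaN means "not rated"
toy_crosstab = pd.DataFrame(
    {10: [0, 1, 2, 2, 1],
     20: [0, 1, 2, 1, np.nan],
     30: [0, 2, np.nan, np.nan, np.nan]},
    index=['U1', 'U2', 'U3', 'U4', 'U5'], dtype=float)

def similar_places(crosstab, place_id, min_common=3):
    """Rank places by Pearson's r with place_id, requiring min_common shared raters."""
    target = crosstab[place_id]
    pearson = crosstab.corrwith(target)
    # For each place, count the users who rated both it and the target
    n_common = crosstab.loc[target.notna()].count()
    out = pd.DataFrame({'PearsonR': pearson, 'n_common': n_common})
    out = out.drop(index=place_id).dropna()      # drop the target itself and NaN rows
    out = out[out['n_common'] >= min_common]
    return out.sort_values('PearsonR', ascending=False)

strict = similar_places(toy_crosstab, 10, min_common=3)
loose = similar_places(toy_crosstab, 10, min_common=2)
print(strict)
print(loose)
```

In the toy data, place 30 correlates perfectly with place 10, but only because two users rated both. Raising `min_common` removes it from the ranking.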
places_corr_Tortas = pd.DataFrame([135085, 132754, 135045, 135062, 135028, 135042, 135046], index = np.arange(7), columns=['placeID'])
summary = pd.merge(places_corr_Tortas, cuisine,on='placeID')
summary
   placeID   Rcuisine
0   135085  Fast_Food
1   132754    Mexican
2   135028    Mexican
3   135042    Chinese
4   135046  Fast_Food
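
Seven place IDs went into the merge but only five rows came out. `pd.merge` defaults to an inner join, so any place without a row in the cuisine table is dropped without notice. A minimal sketch of the difference, on invented data (`candidates` and `toy_cuisine` are hypothetical names):

```python
import pandas as pd

candidates = pd.DataFrame({'placeID': [1, 2, 3]})
# Place 2 has no cuisine entry in this toy table
toy_cuisine = pd.DataFrame({'placeID': [1, 3],
                            'Rcuisine': ['Fast_Food', 'Mexican']})

# Default inner join: place 2 disappears
inner = pd.merge(candidates, toy_cuisine, on='placeID')

# Left join: place 2 is kept, with NaN as its cuisine
left = pd.merge(candidates, toy_cuisine, on='placeID', how='left')

print(inner)
print(left)
```

Passing `how='left'` to the merge above would keep all seven candidates and make the missing cuisine labels visible.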
places[places['placeID']==135046]
    placeID                     name
42   135046  Restaurante El Reyecito
cuisine['Rcuisine'].describe()
count         916
unique         59
top       Mexican
freq          239
Name: Rcuisine, dtype: object