Anime Recommendation

Overview¶

Anime is a Japanese colloquialism for an abbreviated term of “animation”. In Japan, “anime” is usually synonymous with all animated works from any location. However, in other parts of the world the term “anime” primarily refers to Japanese animation exclusively. What makes anime different from other animation is primarily its cultural contexts. Anime originated about a hundred year ago. The first anime was made in 1917 and was called “Nakamura Gatana” and was only 4 minutes long. Anime has come a long way with different artists. There are endless amounts of genres from comedy to romance and horror. There are even unique terms to describe specific genres of anime like “shonen” which is anime for young boys or “isekais” that are about being transported to other worlds. Animes are typically adapted from source material like “manga” which are Japanese comics or from visual novels. There are some animes that are original. Anime is made like other animated works by being written, storyboarded, workshopped, taken into an animatic, voiced, and animated. It takes months and often years to finish. It is made by a studio of artists led by a director. Some popular animes are Dragon Ball Z, Demon Slayer, Astro Boy, Pokémon, Death Note, Akira, and Spirited Away. We will ultimately be exploring content based filtering and collaborative based filtering recommendation algorithms for anime.

Database¶

The database chosen came from Kaggle, a reliable online website where data scientists find and publish datasets. It can be downloaded here: https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database The data includes recommendation data from 73,516 users from myanimelist.net, one of the world's most active online anime and manga community and database, on 12,294 anime. The data comes in two tables: anime.csv and rating.csv. By looking into the anime.csv table, we see that each entry contains 7 attributes:

  • anime_id: Uniquely identifies an anime
  • name: Name of anime
  • genre: List of genres an anime belongs to
  • type: Type of anime like movie, TV, OVA, etc.
  • episodes: Number of episodes the anime has (1 if type is a movie)
  • rating: Average rating of anime on myanimelist
  • members: Number of community members in the anime's "group"
    For the rating.csv table, each entry contains 3 attributes:
  • user_id: Uniquely identifies a user
  • anime_id: Uniquely identifies an anime. It is a foreign key referencing anime_id from anime.csv
  • rating: User's rating of the anime. Default is -1 if the user watched it but has not rated it yet
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import linregress
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KDTree
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

data = pd.read_csv('anime.csv')
data2 = pd.read_csv('rating.csv')
data
Out[1]:
anime_id name genre type episodes rating members
0 32281 Kimi no Na wa. Drama, Romance, School, Supernatural Movie 1 9.37 200630
1 5114 Fullmetal Alchemist: Brotherhood Action, Adventure, Drama, Fantasy, Magic, Mili... TV 64 9.26 793665
2 28977 Gintama° Action, Comedy, Historical, Parody, Samurai, S... TV 51 9.25 114262
3 9253 Steins;Gate Sci-Fi, Thriller TV 24 9.17 673572
4 9969 Gintama' Action, Comedy, Historical, Parody, Samurai, S... TV 51 9.16 151266
... ... ... ... ... ... ... ...
12289 9316 Toushindai My Lover: Minami tai Mecha-Minami Hentai OVA 1 4.15 211
12290 5543 Under World Hentai OVA 1 4.28 183
12291 5621 Violence Gekiga David no Hoshi Hentai OVA 4 4.88 219
12292 6133 Violence Gekiga Shin David no Hoshi: Inma Dens... Hentai OVA 1 4.98 175
12293 26081 Yasuji no Pornorama: Yacchimae!! Hentai Movie 1 5.46 142

12294 rows × 7 columns

In [2]:
data2
Out[2]:
user_id anime_id rating
0 1 20 -1
1 1 24 -1
2 1 79 -1
3 1 226 -1
4 1 241 -1
... ... ... ...
7813732 73515 16512 7
7813733 73515 17187 9
7813734 73515 22145 10
7813735 73516 790 9
7813736 73516 8074 9

7813737 rows × 3 columns

Preprocessing and Analyzing Anime Table¶

Before we can use the table for recommendation, it is necessary to clean the data first. By dropping NaN values in the table, it will set us up to extract the features from it and perform nearest neighbor analysis for the content based filtering recommendation. Although there should not be any duplicates of anime_id, it is good to take a precaution and to only take the newest anime_id if a duplicate is found.

The Rating Distribution¶

Before continuing to clean and prepare the data for content based filtering, we can do some exploratory analysis to better understand the dataset. When checking the distribution of average anime ratings on anime.csv, we see that it is normally distributed with a mean of 6.46 and a standard deviation of roughly 1. Because of this, a vast majority of animes had a rating of 5 or above. The max rating was a 10, meaning everybody who rated the anime gave it a 10, and the lowest rating was a 1.67. The graph is included below:

In [3]:
data = data.dropna() # Cleaning Data
data = data.drop_duplicates(subset=['anime_id'], keep= 'last')
plt.hist(data['rating'], edgecolor = 'black')
plt.title('Histogram of Anime Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')
data['rating'].describe()
Out[3]:
count    12017.000000
mean         6.478264
std          1.023857
min          1.670000
25%          5.890000
50%          6.570000
75%          7.180000
max         10.000000
Name: rating, dtype: float64
No description has been provided for this image

Anime Types¶

Next we should consider the distribution of anime types. There are 6 types in total: movie, music, ONA, OVA, Special, and TV. The most popular anime types are TV followed by OVA followed by Movie (Note that OVA is a follow up of a TV series that can go anywhere from 1-3 episodes). When looking at the average rating of each type, the popular types received a noticeable higher rating than the less popular types, with the exception of Specials, which has a higher average rating than OVA despite OVA being more popular.

In [4]:
x = data.groupby('type')
x.size().plot(kind = 'bar', xlabel = 'Type', ylabel = 'Frequency', title = 'Anime Type Count')
plt.figure(figsize=(5,5))
plt.bar(x.groups.keys(),x.mean()['rating'])
plt.xlabel('Type')
plt.ylabel('Average Rating')
plt.title('Average Rating Across Anime Types')
plt.show()
No description has been provided for this image
No description has been provided for this image

Anime Genres¶

For anime genres, I created a dictionary to count how many of each genre there are as well as the average rating of each genre. The results seem to show that there are certain genres that are more popular and certain genres that are less popular, but the ratings are not that much drastically different. The average rating for Dementia is much lower than the others, but it is an outliar. Genres like josei and thriller have some of the highest ratings, but the difference is not too significant. Next, I wanted to see if popular genres, i.e. genres that show up the most, correlate with a higher average rating. At first glance, there is no significant correlation between genre count and rating. Comedy having over 4000 works but having a relatively low average rating, and thriller and josei being some of the least works but having some of the highest average ratings seem to suggest that there is a negative correlation between rating and count, so I plotted Anime Rating vs. Count and tried to use a linear regression to see how much they are correlated. The result was that I failed to see a correlation between the two variables, and average rating of the genre likely does not depend on the count of the genre. The very low r value of -0.0398 suggested low correlation and the very high p-value of 0.8 suggest that the data is not significant and there is no clear relationship between the variables.

In [5]:
frequency = dict() #Counts each genre. If an anime has more than one genre, it is double counted
scores = dict() #Stores average rating
for index, row in data.iterrows():
    genres = row['genre'].split(', ')
    for genre in genres:
        frequency[genre] = frequency.get(genre, 0) + 1
        scores[genre] = scores.get(genre, 0) + row['rating']
for key, value in frequency.items():
    scores[key] = scores.get(key,0)/frequency.get(key,0)
    
plt.figure(figsize=(40,5))
plt.bar(range(len(scores)), list(scores.values()), tick_label = list(scores.keys()))
plt.xlabel('Anime Genre')
plt.ylabel('Average Rating')
plt.title('Average Rating Across Anime Genres')
plt.show()
plt.figure(figsize=(40,5))
plt.xlabel('Anime Genre')
plt.ylabel('Frequency')
plt.title('Number of Anime per Anime Genre')
plt.bar(range(len(frequency)), list(frequency.values()), tick_label = list(frequency.keys()))
No description has been provided for this image
Out[5]:
<BarContainer object of 43 artists>
No description has been provided for this image
In [6]:
x = list(frequency.values())
y = list(scores.values())
plt.scatter(x,y)
a, b = np.polyfit(x, y, 1)
plt.xlabel('Number of Anime')
plt.ylabel('Rating')
plt.title('Anime Genre Average Rating vs. Count')
plt.plot(x, (np.array(x)*a)+b)
linregress(x, y)
Out[6]:
LinregressResult(slope=-1.7915622030846685e-05, intercept=6.749665182852247, rvalue=-0.039817495681292954, pvalue=0.7998761622286875, stderr=7.021363800846438e-05, intercept_stderr=0.08647336310841422)
No description has been provided for this image

The Content Based Filtering Recommendation¶

Content based filtering uses features of an item to recommend similar items to users. If the system recognized that you liked Iron Man 1, for example, it will see that Iron Man 2 is similar to Iron Man 1 and recommend you Iron Man 2. The benefits of a content based filtering system is that it does not require data about other users to make a recommendation, and can then recommend niche items to users even if others are not interested. The products recommended can be highly relevant, and the process is very transparent. The system does not need to collect user information and protect user privacy if that is a concern for the user, and the system is a good start for companies that are just starting and do not have much user data to work with. On the otherhand, content based filtering can lack diversity, and can lead to difficulty selling new products as it takes effort to assign it attributes again. To create such a system, I first need to define the features I will be using to represent each anime. It is intuitive to start with genres, episodes, rating, members, and type. If a user likes TV series, it is natural to suggest another TV series, hence the type feature. If an anime has 12 episodes, it is ideal to suggest another anime with a similar length, and not an anime with 60+ episodes, hence the episode feature. If the user loves watching romance anime, it is reasonable to suggest other romance anime, hence the genre feature. Adding rating and members as a feature can be experimented. If they are not added as a feature, then animes will be suggested purely based on length and genre, but if they are added then higher rating and more popular anime may sometimes overrule the genre similarity of two anime. This may not be a bad feature since the majority of users are likely to enjoy anime that are popular and have a high rating even if the genre does not always match, so these two are added as features here. The following attempt is a simple attempt at content based filtering.
I first used one hot encoding to replace the genre column. These new columns will be used as features. When I performed the fitting, I realized that some episodes were tagged as 'unknown'. For simplicity, I simply deleted these entries and converted the whole dataframe's values into floats. Alternatively, I could have searched the episode count of the anime and manually input them in. If there were too many unknown entries, I could have filled them with a default value as well, while only manually inputtting into the animes with high rating and high members. I used the MinMaxScaler to create my features. This process scales each column from values 0-1. This alternatively could have been done manually. I then input the features into a kd tree. Each time I input an anime, it will return the 10 animes that are similar to it. As an example, I asked the tree to return 10 recommendations for Kimi no Na wa, and the output is as follows:

In [7]:
genre_dummies = data['genre'].str.get_dummies(sep=',')#One Hot Encode
type_dummies = pd.get_dummies(data['type'])          
data10 = pd.concat([data, genre_dummies], axis = 1)
data10 = pd.concat([data10, type_dummies], axis = 1)
data10 = data10.drop(columns = ['genre','type','anime_id'])
data10 = data10[data10['episodes'] != 'Unknown']
data10
Out[7]:
name episodes rating members Adventure Cars Comedy Dementia Demons Drama ... Supernatural Thriller Vampire Yaoi Movie Music ONA OVA Special TV
0 Kimi no Na wa. 1 9.37 200630 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
1 Fullmetal Alchemist: Brotherhood 64 9.26 793665 1 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 1
2 Gintama° 51 9.25 114262 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
3 Steins;Gate 24 9.17 673572 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
4 Gintama&#039; 51 9.16 151266 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12289 Toushindai My Lover: Minami tai Mecha-Minami 1 4.15 211 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
12290 Under World 1 4.28 183 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
12291 Violence Gekiga David no Hoshi 4 4.88 219 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
12292 Violence Gekiga Shin David no Hoshi: Inma Dens... 1 4.98 175 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
12293 Yasuji no Pornorama: Yacchimae!! 1 5.46 142 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0

11830 rows × 92 columns

In [8]:
copy = data10.drop(columns = ['name']).astype(float)
scaler = MinMaxScaler() #Scales each column from 0-1
features = scaler.fit_transform(copy)
kdt = KDTree(features) #Nearest Neighbor algorithm for anime recommendations
index = kdt.query([copy.iloc[0]], k=10, return_distance=False) #Returns 10 anime recommendations for Kimi no Na wa.
for ind in index:
    print(data10.iloc[ind]['name']) #prints name of anime rcommendation
40                           Death Note
86                   Shingeki no Kyojin
804                    Sword Art Online
1      Fullmetal Alchemist: Brotherhood
159                        Angel Beats!
19      Code Geass: Hangyaku no Lelouch
841                              Naruto
3                           Steins;Gate
445                    Mirai Nikki (TV)
131                           Toradora!
Name: name, dtype: object

Preprocessing User Rating Table¶

Now onto preprocessing and analyzing user.csv. The end result will be a collaborative based filtering recommend system. The dataset is not as complex as the last one, and only contains each user's rating on a particular anime. If we graph the distribution of anime ratings, we see that the majority of users gave their anime a high score of 9 and 10, or -1. It makes sense because the majority of people who are willing to give a rating are those who are extremely passionate about the anime, and many people could not be bothered to give a rating of an anime otherwise. What is surprising though is the low amount of ratings that are a 4 or lower. Since passionate people rate animes, it would make sense for there to be a good number of 1s as well. I expected a bimodal distribution with one peak at 8-10 and another at -1 through 2, but the number of 1s and 2s occur the least frequently here. There are many users who watch anime, however, and this dataset only contains ratings from 70,000 ish users. The users that were picked are possibly biased and not random. It is ok to use their data in a recommendation system, however, which is what we will do next.

In [9]:
data2 = data2.dropna()
data2 = data2.drop_duplicates(subset=['anime_id','user_id'], keep= 'last')
plt.hist(data2['rating'], edgecolor = 'black')
plt.title('Histogram of Anime Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')
Out[9]:
Text(0, 0.5, 'Frequency')
No description has been provided for this image

The Collaborative Based Filtering Approach¶

Collaborative based filtering is the opposite of content based filtering. While content based filtering uses data of items to find similar items, collaborative based filtering uses data from users to find similar users. Once a similar cluster of users are found, a recommendation is made given what similar users like. If you buy a nail, for example, the system will look at other users who buy nails and realize those users often bought a hammer with their nail, so they will recommend you a hammer. In content based filtering, the system will not necessarily see the connection between nail and hammer, and will likely recommend similar nails. The advantages is that it is much simpler than content based filtering. If the data is there, the implementation is relatively straightforward. It can find great connections that content based filtering cannot find, like the nail and hammer association. On the other hand, scalability is an issue. It can be difficult to process a larger amount of customer interests, but if not enough data is collected, the recommendation could be inaccurate. Collaborative based also has cold-start problem, which often occurs when new products are released. The following is a simple collaborative based filtering implementation.
Since there are too many users, I decided to only sample users who have watched more than 500 anime. This saves memory and increases runtime, while preserving a good sized sample of 1800+ users. First, I need to pivot the table to show each user's rating on each anime, remembering to give each anime the user did not watch the average rating they gave. An alternative method is to give the users who did not watch the anime the average rating of the anime. After that, I performed cosine_similarity on the table, which compares each user to every other user, and assigns them a value based on how similar they are. I then clustered similar users together(arbitrarily decided on 20 clusters), then found anime recommendations based on what animes people in the same cluster liked. I did this by sorting by highest mean rating, and taking the top 5 values. I could slightly modify the code to further improve and elaborate on this method by testing it on users with less than 500 anime reviews. I did an example by asking it to find recommendations for user 54, and the output is shown below:

In [10]:
data2 = data2.groupby('user_id').filter(lambda x: len(x)>500) # Decrease Dataset for runtime and prevent notebook from crash on memory
data3 = data2.groupby('user_id').mean().drop(columns = 'anime_id')
data4 = pd.pivot_table(data2.reset_index(), 
               index='user_id', columns='anime_id', values='rating') #Pivot table for user rating of each anime
for index, row in data3.iterrows(): 
    data4.loc[index] = data4.loc[index].fillna(row['rating']) #Fill in not viwed show with user's average rating
data4
Out[10]:
anime_id 1 5 6 7 8 15 16 17 18 19 ... 34252 34283 34324 34325 34349 34358 34367 34475 34476 34519
user_id
17 4.351082 4.351082 7.000000 4.351082 4.351082 4.351082 4.351082 4.351082 4.351082 10.000000 ... 4.351082 4.351082 4.351082 4.351082 4.351082 4.351082 4.351082 4.351082 4.351082 4.351082
54 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
201 0.856031 0.856031 0.856031 0.856031 0.856031 0.856031 0.856031 0.856031 0.856031 0.856031 ... 0.856031 0.856031 0.856031 0.856031 0.856031 0.856031 0.856031 0.856031 0.856031 0.856031
226 8.000000 7.680593 8.000000 7.680593 7.680593 7.680593 7.680593 7.680593 7.680593 7.680593 ... 7.680593 7.680593 7.680593 7.680593 7.680593 7.680593 7.680593 7.680593 7.680593 7.680593
271 7.372287 7.372287 7.372287 7.372287 7.372287 7.372287 7.372287 7.372287 7.372287 7.372287 ... 7.372287 7.372287 7.372287 7.372287 7.372287 7.372287 7.372287 7.372287 7.372287 7.372287
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
73378 9.000000 9.000000 7.935083 7.935083 7.935083 7.935083 7.935083 7.935083 8.000000 7.935083 ... 7.935083 7.935083 7.935083 7.935083 7.935083 7.935083 7.935083 7.935083 7.935083 7.935083
73395 10.000000 9.000000 10.000000 8.266667 8.266667 9.000000 8.266667 8.266667 8.266667 8.266667 ... 8.266667 8.266667 8.266667 8.266667 8.266667 8.266667 8.266667 8.266667 8.266667 8.266667
73408 10.000000 10.000000 9.000000 0.742481 0.742481 0.742481 -1.000000 0.742481 0.742481 0.742481 ... 0.742481 0.742481 0.742481 0.742481 0.742481 0.742481 0.742481 0.742481 0.742481 0.742481
73499 9.000000 7.832504 9.000000 7.832504 7.832504 10.000000 7.832504 7.832504 7.832504 7.832504 ... 7.832504 7.832504 7.832504 7.832504 7.832504 7.832504 7.832504 7.832504 7.832504 7.832504
73502 8.486275 8.486275 8.486275 9.000000 8.486275 8.486275 10.000000 8.486275 8.486275 8.486275 ... 8.486275 8.486275 8.486275 8.486275 8.486275 8.486275 8.486275 8.486275 8.486275 8.486275

1843 rows × 11140 columns

In [11]:
cos = cosine_similarity(data4) #Cosine similarity measures how similar two users are 
np.fill_diagonal(cos, 0 )
similar_anime =pd.DataFrame(cos,index=data4.index)
similar_anime.columns=data4.index
similar_anime.head() # measures how similar each user is to any other user
Out[11]:
user_id 17 54 201 226 271 294 342 392 446 478 ... 73272 73286 73340 73356 73362 73378 73395 73408 73499 73502
user_id
17 0.000000 -0.978626 0.717752 0.978291 0.978297 0.978963 -0.141789 0.971066 0.978606 0.978851 ... 0.976770 0.962211 0.978055 0.978534 0.978011 0.979068 0.978981 0.657635 0.979240 0.978218
54 -0.978626 0.000000 -0.729428 -0.999320 -0.998067 -0.999544 0.149240 -0.986644 -0.998229 -0.999552 ... -0.997657 -0.978811 -0.998040 -0.999252 -0.998600 -0.999432 -0.999557 -0.671318 -0.999348 -0.998878
201 0.717752 -0.729428 0.000000 0.729388 0.728997 0.729626 -0.084838 0.722958 0.728312 0.729244 ... 0.725408 0.719273 0.729466 0.728688 0.729336 0.729428 0.729890 0.533101 0.729849 0.729096
226 0.978291 -0.999320 0.729388 0.000000 0.997508 0.998871 -0.148120 0.986036 0.997613 0.998805 ... 0.997040 0.978624 0.997417 0.998559 0.997872 0.998692 0.998888 0.670239 0.998675 0.998172
271 0.978297 -0.998067 0.728997 0.997508 0.000000 0.997637 -0.146689 0.985432 0.996541 0.997832 ... 0.995750 0.978020 0.996259 0.997381 0.996754 0.997731 0.997797 0.669478 0.997595 0.997054

5 rows × 1843 columns

In [12]:
kmeans = KMeans(n_clusters=20).fit(similar_anime) #Use K means to group similar users together. Other approaches available.
labels = kmeans.labels_
data4['label'] = labels
target = data4.iloc[1]['label'] #As an example, we are seeing what anime to recommend for user_id 54
data4 = data4.groupby('label')  
In [13]:
movies = []
x = data4.get_group(target).drop(columns = 'label').transpose()
x['mean'] = x.mean(axis=1)
x = x.sort_values(by=['mean'], ascending = False)       #Will see how each user judged a movie based on cluster.
count = 0
#Will pick movies with top 5 average score. Did not look to exclude movies that the current watched, but is easily implementable.
for index, row in x.iterrows():                
    if (count < 5):
        movies.append(index)
        count = count + 1         
    else:
        break
x
Out[13]:
user_id 54 917 940 1579 1870 2243 2264 2864 3325 3391 ... 61622 62209 64174 65468 68017 68721 68787 68795 69121 mean
anime_id
1690 -1.0 -0.960396 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -0.997605 ... -0.989051 -1.0 -1.0 -1.0 -1.0 -0.983636 -1.0 -1.0 -1.0 -0.899399
908 -1.0 -0.960396 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -0.997605 ... -0.989051 -1.0 -1.0 -1.0 -1.0 -0.983636 -1.0 -1.0 -1.0 -0.899712
2167 -1.0 -0.960396 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -0.997605 ... -0.989051 -1.0 -1.0 -1.0 -1.0 -1.000000 -1.0 -1.0 -1.0 -0.900175
8861 -1.0 -1.000000 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -0.997605 ... -0.989051 -1.0 -1.0 -1.0 -1.0 -1.000000 -1.0 -1.0 -1.0 -0.900275
19363 -1.0 -1.000000 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.000000 ... -1.000000 -1.0 -1.0 -1.0 -1.0 -1.000000 -1.0 -1.0 -1.0 -0.900356
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
205 -1.0 -1.000000 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -0.997605 ... -1.000000 -1.0 -1.0 -1.0 -1.0 -1.000000 -1.0 -1.0 -1.0 -0.999735
226 -1.0 -1.000000 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -0.997605 ... -1.000000 -1.0 -1.0 -1.0 -1.0 -1.000000 -1.0 -1.0 -1.0 -0.999736
2001 -1.0 -1.000000 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -0.997605 ... -1.000000 -1.0 -1.0 -1.0 -1.0 -0.983636 -1.0 -1.0 -1.0 -0.999781
2993 -1.0 -1.000000 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.000000 ... -1.000000 -1.0 -1.0 -1.0 -1.0 -1.000000 -1.0 -1.0 -1.0 -0.999807
356 -1.0 -1.000000 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.000000 ... -1.000000 -1.0 -1.0 -1.0 -1.0 -0.983636 -1.0 -1.0 -1.0 -0.999854

11140 rows × 113 columns

In [14]:
#Print Movies that user 54 would like
for movie in movies:
    print(data[data['anime_id'] == movie]['name'])
807    Bokurano
Name: name, dtype: object
1742    Fullmetal Alchemist: Premium Collection
Name: name, dtype: object
223    Clannad
Name: name, dtype: object
4480    Yosuga no Sora: In Solitude, Where We Are Leas...
Name: name, dtype: object
179    Gin no Saji 2nd Season
Name: name, dtype: object

Conclusion¶

Before conducting collaborative based filtering and content based filtering recommendation, I had to do some prliminary cleaning and exploratory analysis on each dataset. For the anime table, we delve into relations between rating, type and genre. It seemed at first that there was a correlation between rating and genre, but that proved to be inconclusive. Later during content based filtering, I chose genre, type, episode number, rating, and members as features, and used kd tree to find the 10 nearest neighbors of an anime as a recommendation. I was able to include the entire dataset, and did not need to utilize individual user rating information. When performing collaborative based filtering, however, there were too many users who watched anime, and there were too many animes they have seen. This forced me to cut to only evaluate a portion of the data in consideration of memory and runtime. By pivoting the table to see what rating each user gave to each anime, I was able to perform cosine_similarity and compare each user to other users in the same cluster to give a recommendation. There are benefits and disadvantages to both approaches. A hybrid approach could utilize the benefits of each to give a recommendation, but was not used here. Anime is a beautiful medium, and exploring these two recommendation systems can help share anime to even more people.