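I have the following datasets for a movie database:
Ratings: UserID, MovieID, Rating
Movies: MovieID, Genre
Users: UserID, Gender, Age
I wrote a Pig script to find the female users in the 20-30 age group who rated the highest-rated movies. Here is my code so far: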
users_input = load '/users.dat' USING PigStorage('\u003B') as (UserID: long, gender: chararray, age: int, occupation: int, zip: long);
movies_input = load '/movies.dat' USING PigStorage('\u003B') as (MovieID: long, title: chararray, genre: chararray);
ratings_input = load '/ratings.dat' USING PigStorage('\u003B') as (UserID: long, MovieID: long, rating: int, timestamp: chararray);
movie_filter = filter movies_input by (genre matches '.*Action.*') OR (genre matches '.*War.*');
temp = COGROUP movie_filter by MovieID, ratings_input by MovieID;
temp1 = FILTER temp BY COUNT(movie_filter) > 0;
temp2 = FOREACH temp1 GENERATE group, AVG(ratings_input.rating) AS ratings;
temp3 = ORDER temp2 BY ratings DESC;
temp4 = LIMIT temp3 1;
temp5 = FOREACH temp4 GENERATE ratings;
temp6 = FILTER temp3 BY (temp5.ratings == ratings);
female_users = filter users_input by gender == 'F';
age_users = filter female_users by age >=20 AND age <=30;
age_use = FOREACH age_users GENERATE UserID;
MovID = FOREACH temp6 GENERATE group;
all_users_records = FILTER ratings_input BY (MovID.group == MovieID);
all_users = FOREACH all_users_records GENERATE UserID;
female_aged_records = FILTER all_users BY (UserID == age_use.UserID);
female_aged_users = FOREACH female_aged_records GENERATE UserID;
store all_users into '/output_pig' using PigStorage();
When I run this, I end up with the error: "Scalar has more than one row in the output. 1st : (11), 2nd :(24)"
Can someone help me? Thanks in advance.
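As others have said, this is not a very helpful error message. You probably have a dot where you need a double semicolon.
@jhofman, I think you mean the double colon (the relation operator) '::' in place of the dot.
In the end, the Pig script should look like this: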
...
temp2 = FOREACH temp1 GENERATE group, AVG(ratings_input::rating) AS ratings;
...
temp6 = FILTER temp3 BY (temp5::ratings == ratings);
...
all_users_records = FILTER ratings_input BY (MovID::group == MovieID);
all_users = FOREACH all_users_records GENERATE UserID;
female_aged_records = FILTER all_users BY (UserID == age_use::UserID);
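
For anyone who lands here with the same error: as far as I understand it, Pig treats `alias.field` as a scalar when `alias` is a different relation from the one being filtered, and a scalar is only legal if that relation has exactly one row. In the question, `age_use` holds many UserIDs, which is presumably where the (11) and (24) in the message come from. A JOIN avoids the problem entirely. Below is a rough, untested sketch of that approach. It assumes the same files, delimiter, and schemas as the question; every alias other than the field names is my own invention.

```pig
-- Sketch only: same inputs as the question, aliases are hypothetical.
users   = LOAD '/users.dat'   USING PigStorage('\u003B') AS (UserID:long, gender:chararray, age:int, occupation:int, zip:long);
movies  = LOAD '/movies.dat'  USING PigStorage('\u003B') AS (MovieID:long, title:chararray, genre:chararray);
ratings = LOAD '/ratings.dat' USING PigStorage('\u003B') AS (UserID:long, MovieID:long, rating:int, timestamp:chararray);

-- Keep only Action or War movies, then keep only the ratings for those movies.
wanted     = FILTER movies BY (genre MATCHES '.*Action.*') OR (genre MATCHES '.*War.*');
wanted_ids = FOREACH wanted GENERATE MovieID;
rated      = JOIN ratings BY MovieID, wanted_ids BY MovieID;
rated_flat = FOREACH rated GENERATE ratings::UserID AS UserID, ratings::MovieID AS MovieID, ratings::rating AS rating;

-- Average rating per movie.
by_movie   = GROUP rated_flat BY MovieID;
avg_rating = FOREACH by_movie GENERATE group AS MovieID, AVG(rated_flat.rating) AS avg_r;

-- max_avg has exactly one row, so using max_avg.max_r as a scalar is fine here.
all_avg = GROUP avg_rating ALL;
max_avg = FOREACH all_avg GENERATE MAX(avg_rating.avg_r) AS max_r;
best    = FILTER avg_rating BY avg_r == max_avg.max_r;

-- There may be several top movies and many users, so match them with JOINs, not scalars.
best_ids     = FOREACH best GENERATE MovieID;
best_ratings = JOIN rated_flat BY MovieID, best_ids BY MovieID;
raters       = FOREACH best_ratings GENERATE rated_flat::UserID AS UserID;

women     = FILTER users BY (gender == 'F') AND (age >= 20) AND (age <= 30);
women_ids = FOREACH women GENERATE UserID;

matched            = JOIN raters BY UserID, women_ids BY UserID;
matched_ids        = FOREACH matched GENERATE raters::UserID AS UserID;
female_aged_users  = DISTINCT matched_ids;

STORE female_aged_users INTO '/output_pig' USING PigStorage();
```

Note that this stores `female_aged_users`, whereas the script in the question stores `all_users`, which skips the gender and age filter. The `::` prefix appears here only to disambiguate fields after a JOIN, which is the case it is meant for.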