Programming Project 3: Processing MTA Subway Entrance Data

2y ago
27 Views
3 Downloads
232.07 KB
8 Pages
Last View : 18d ago
Last Download : 3m ago
Upload by : Ryan Jay
Transcription

CSci 335 Software Design and Analysis IIIProject 3: Processing MTA Subway Entrance DataProf. Stewart WeissProgramming Project 3: Processing MTA Subway Entrance Data1OverviewIn this programming project, you will work with a dataset consisting of all entrances and exits to the stationsof the New York City Transit Authority. This dataset, like the others you have used in this course, is part oftheNew York City OpenDataproject, and is provided by the Metropolitan Transit Authority (MTA). It hasbeen cleaned up by NYC's Department of Information Technology and Telecommunications (DoITT). Theterm subway entrance (exit) is used here to mean an entrance (exit) to any station, whether it is above or1below the ground. The dataset has a row for every distinct subway entrance and exit . Each row providesthe entrance's spatial location, name, which is really its street address, and the train lines that are accessiblefrom it. With this information, a rich set of queries is possible. In particular, you will design a program thatdetermines which entrances are part of the same stations, which stations are part of the same train lines,and which stations and/or trains are closest to a given GPS location. As you will soon see, solving theseproblems will require using algorithms from Chapter 8 (on-line disjoint set operations), and hash tables, aswell as other material you have learned about in this course.2ObjectivesThis project is designed with a few objectives in mind: to give you more experience working with real, large, open data sets. to give you experience programming the on-line disjoint set problem. to give you experience creating a very simple hash table. to give you a problem that has practical application and that can be extended to become a usefulapplication.3About the Data SetThe data set is visualized on a map at this ubway-Entrances/drex-xx56/dataYou can download it in CSV format with this 3hwy/rows.csv?accessType DOWNLOAD.The dataset that is downloaded will have as its rst row, the labels of its columns. For this assignment, youshould delete that row, so that the program can assume all rows are actual data rows. A version of it withoutthe rst line (which contains column labels) is also posted on the server in the directory/data/biocs/b/student.accounts/cs335 sw/resources/project3 files/This particular dataset is relatively small - only 1928 lines, each of which is just ve columns. Your projectwill read the CSV le and store its rows into an array (or vector). As it reads the rows and stores them, itwill do some preliminary processing. After the data is stored in memory, the program will process variousqueries about the dataset. If you need a refresher about CSV les, please read assignment 1 or 2. Rememberthat elds can contain embedded commas. (Fields can contain commas if they are within quoted strings,e.g., Brooklyn, New York is a single eld.)There does not appear to be an on-line data dictionary for this dataset.The table below de nes its veelds. In the table, the following terms are used:decimal1 SomeA xed decimal numeric literal, with an optional leading negative sign, such as -40.32 or 7.1234567stations have exits that are not also entrances, and in this case the name eld of the row indicates this.This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 Int'l nc-nd/4.0/1

CSci 335 Software Design and Analysis IIIProject 3: Processing MTA Subway Entrance Dataline identi erProf. Stewart WeissA string used by the MTA to de ne a transit line. These are either uppercase letters Athrough Z, or single digits 1 through 7, or two- or three-letter uppercase strings such as GS orSIR.Field NumberField NameType and Format1ObjectIDA unique positive integer identifying the particular2URL3Namesubway entrance.A URL that provides service information about thesubway entrance.A string that speci es a unique name for this subwayentrance, which is always its street address or a similarlocator. This eld might contain a substring (exit) that4indicates that it is an exit-only.The GeomThe GPS coordinates, in the formatPOINT (decimal decimal)where there is white space between the word POINT andthe rst parenthesis, and between the two decimals.5Note: the rst number is the longitude and thesecond is the latitude (the reverse of the tree dataset!)LineA string that consists of either a singleline identifier,or a sequence ofline identifiersseparated by hyphens.A sample row from this le, wrapped here because of the limited page width, looks like:151,http://web.mta.info/nyct/service/,Parsons Blvd & Archer Ave at NE corner,POINT (-73.799784000399 40.70235300071414),E-J-Z3.1 Subway Entrances, Subway Stations, and LinesAnyone who uses the New York City subway system knows that there is some ambiguity in the meaning ofthe term subway station . For example, the 51st Street station of the 6 line is connected to the LexingtonAvenue station of the E-M line. Is this one station or two? In this project, the answer is one. The termsubway station refers to the transitive closure of the set of all stations that are connected to each other bypedestrian passages without having to pay an additional fare, and without having to exit the subway system,even if there is a free entrance by doing so. Thus, the 59th Street 6 line is not part of the Lexington AvenueF line, because one has to leave the station to get to the other.3.1.1 ConnectivityGiven any pair of subway entrances, either they are part of the same subway station or they are not. Whentwo entrances are part of the same station, we say they aremoment's thought should convince you that connectivity isconnected .Otherwise they aresymmetric, re exive,anddisjoint .transitive,Aand hence,an equivalence relation, and that this relation partitions the collection of all subway entrances into a set ofnon-empty, disjoint sets such that every entrance belongs to a unique set, which we call aEvery subway station has a unique name, which we call itsstation name .subway station .In this project, thestation nameof a given station is any one of the names of the subway entrances that belong to that station. By the natureof how the stations will be identi ed, their names may change as the le is being read, but once the entire leis read, every station will have a unique, stable, station name. This will become clear after you understandhow stations are created.Alineconsists of the set of all stations that service that line. One station may be part of many lines, andhence lines do not partition the collection of stations into disjoint sets. Although most lines de ne a sequenceof stations, determined by the order in which a train running on that line visits each station, some do notbecause they branch. Therefore, in this project, we do not attempt to order the stations for a given line.Each line is identi ed by itsline identifier,Determining Connectivity.de ned above.The problem in determining connectivity, meaning which entrances are partof the same station, is that there is no explicit column in the dataset that tells us this. Therefore, we mustThis work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 Int'l nc-nd/4.0/2

CSci 335 Software Design and Analysis IIIProject 3: Processing MTA Subway Entrance DataProf. Stewart Weissuse a heuristic rule to determine this. The rule we shall use is the following:De nition.E1Two subway entrancesline identifiersandE2areconnected the set of the distance between them is at most 0.28 kilometers.or there is a third entranceE0if eitherfor each is identical, andsuch thatE1is connected toThe distance constraint may be subject to change.E0andE2is connected toE0 .Your program should make this distance an easilychangeable parameter. With this de nition, the dataset can be used to construct the collection of distinctsubway stations. In addition, we can assign a unique location to each station by making it the centroid ofthe locations of its entrances. The centroid is the arithmetic mean of the locations of each of its entrances.We can approximate that mean using the average of the latitude and longitude values. Namely, if a stationhas entrances whose GPS points arep1 (x1 , y1 ), p2 (x2 , y2 ), . . . , pn (xn , yn )c (1/n)(nXxk ,k 14nXthen the centroid isyk )(1)k 1Program Invocation, Usage, and BehaviorThe program is invoked from the command line and expects two command line arguments, which specifyrespectively (1) thecsv le to be opened for reading and (2) the sequence of user commands to be processed.If two les are not speci ed, it is a usage error and the program must write an appropriate and meaningfulerror message onto thestandard error stream , after which it must exit.If a le that is speci ed does notexist or cannot be opened for some reason, the program must write an appropriate and meaningful errormessage onto the standard error stream and then exit.Assuming that both les are opened successfully, the program must read the entire csv le, line by line,and process these lines according to the rules speci ed below. Once the entire csv le has been read andprocessed, the program will read the commands from the command le and process them one after the other.These commands are described and explained in Section 4.3 below.4.1 Case Sensitivity and White Space SensitivityThe program should treat all names of stations and station entrances case insensitively. This means thatthe following strings all refer to the same name:Nassau St & Frankfort St at SE cornernassau St & frankfort St at SE cornerNassau st & frankfort St at se cornerIt should treat any sequence of space characters as a single space character. Thus, the following strings referto the same name:Nassau St & Frankfort St at SE cornerNassau St&FrankfortSt at SE cornerThis is true of names entered as arguments to commands as well. This makes it easier for the user of theprogram, but harder for the programmer.4.2 Processing the Input Data FileIt is the task of the main program to read the input le, parse its lines, construct aobject for each row, and make the calls to aof entrances. TheSubway EntranceSubway SystemSubway Entranceclass to insert that object into the collectionobject should represent a single subway entrance. TheSubway Systemclass must manage the collection of entrances so that, as each new entrance is found, it is checked againstall other entrances to determine if it is connected to any of them, and if so, to update its sets. It is thereforea class that contains a container storingLineSubway Entranceobjects and, as you will see, a container storingobjects.This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 Int'l nc-nd/4.0/3

CSci 335 Software Design and Analysis IIIProject 3: Processing MTA Subway Entrance DataProf. Stewart Weiss4.3 Command ProcessingAfter the CSV le has been processed, the program enters a command processing loop, in which it readscommands from the second le speci ed on the command line. The syntax of the valid commands from thatbold text is the command literal and the italicized text is its parameter list. Alloutput lists should use the names as they are found in the original le, not names whose casehas been changed.le is as follows. TheCommandDescriptionlist line stationsline identifierlist all stationsLists thestation namesof all subway stations that service thegiven line.Lists thesystem.list entrances station nameLists thestation namesnamesof all subway stations in the subwayof all subway entrances for the given station. Thelist should not include exit-only entrances.nearest station longitudelatitudeLists thestation namesof all subway stations that are closest tothe point (longitude,latitude ). There may be more than onebecause two or more might be the same distance from the point.nearest lines longitudelatitudeLists theline identi ersof all subway lines that are closest tothe point (longitude,latitude ). There may be more than onebecause two or more lines might be at a station that is nearest tothe point, or because two stations might be at the same distancefrom the point.nearest entrance longitudelatitudeLists thenamesof all subway entrances that are closest to thepoint (longitude,latitude ). There may be more than onebecause two or more might be the same distance from the point.4.3.1 Distance Between Two Points on Sphere (The Haversine Formula)TheHaversineformula (see https: //en.wikipedia.org/wiki/Haversine formula) can be used to computethe approximate distance between two points when they are each de ned by their latitude and longitudein degrees.The distance is approximate because (1) the earth is not really a sphere, and (2) numericalround-o errors occur. Nonetheless, for points that are no more than ten kilometers apart, the formula isaccurate enough. Given the following notationd: the distance between the two points (along a great circle of the sphere,r: the radius of the sphere,ϕ1 , ϕ2 :latitude of point 1 and latitude of point 2, in radiansλ1 , λ2 :longitude of point 1 and longitude of point 2, in radiansthe formula iss2r · arcsin2sin ϕ2 ϕ12 2 cos (ϕ1 ) cos (ϕ2 ) sin λ2 λ12 !(2)A C function to compute this formula in a numerically e cient way is given in Listing 1. Rememberthat latitude and longitude are not interchangeable, and that the values of these are given in the commandsand in the le with longitude followed by latitude.Listing 1: Haversine Function (corrected version)#i n c l u d e cmath # Tobuild ,linktothemathlibraryusing lmThis work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 Int'l nc-nd/4.0/4

CSci 335 Software Design and Analysis IIIProject 3: Processing MTA Subway Entrance Dataconstdouble R 6372.8constd o u b l e TO RAD doublehaversine (M PI /double180.0;lat1 ,Prof. Stewart Weiss//radius//conversiondoublelon1 ,ofearthofdoubleinkmdegreeslat2 ,toradsdoublelon2 ){ lat1 ; lat2 ;TO RAD l o n 1 ;TO RAD l o n 2 ;( lat2 lat1 )/2;( lon2 lon1 )/ 2 ;lat1 TO RADlat2 TO RADlon1 lon2 doubledLa t doubledLon doublea s i n ( dLat ) ;doubleb s i n ( dLon ) ;return2 R a s i n ( s q r t ( a a c o s ( l a t 1 ) c o s ( l a t 2 ) b b ) ) ;}5Program Organization and LogicThe program must process the commands e ciently.This section discusses some of the logical issues,technical tools, and algorithms that this suggests you use.5.1 Bit MasksBit masks are an important tool in solving several of the problems in this assignment. For example, giventwo subway entrances, each with its own set of lines that it services, how can you quickly decide if the twosets are identical? Of course you could design a solution that iterates over train lines, and which would bevery ine cient, but you can also design a solution that uses constant time to solve this problem, if each setis represented by a bit mask. How?Similarly, given aline identifier,how can you nd the set of all stations that service the line, assumingthat this set is not stored for each line? For each station, you would need to check if theline identifieris part of the set of lines that it services. This can also be done in constant time if the station has a bit maskand the line has a bit mask. How?Given that there are at mostthan one character, a64-bit26single letter lines, and7single digit lines, and just a few lines with moreinteger has more than enough bits to de ne bit masks to represent sets of lines.5.2 Determining Connectivity (Subway Station Creation)Determining the disjoint sets of subway stations and which subway entrances are part of the same stationrequires using the nd and smart union operations from Chapter 8, which implies thatSubway Entranceobjects should be stored in an array or a vector. In the lecture notes for Chapter 8, simple integers are usedas an example to illustrate the algorithms, but your project must de ne an array whose elements are classobjects and that also have a member that is the parent index. Thus you need to generalize those algorithms.Each time a newSubway Entranceobject is created, it is placed into a set of size one. The object in thisset is then compared against all currently existing sets to see if they should be connected to this one. If so,this is a union operation. If it is not connected to any existing sets, it remains a set of size one. Eventually,all subway entrances are compared to all others in this way, when the entire le has been read, and the setshave been formed. At this point the subway stations can be given stable subway station names, which aresimply the names of the subway entrances at the roots of the parent trees that represent them. In addition,their locations as GPS coordinates can be computed based on Equation 1.5.3 Finding Stations E cientlyOne of the problems the program must solve is how to nd a particular station e ciently.If the onlyrepresentation of a subway station is the parent tree for that station stored in an array or vector of suchtrees, then when given a station name to nd, as when a command is given to list all entrances for thatThis work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 Int'l nc-nd/4.0/5

CSci 335 Software Design and Analysis IIIProject 3: Processing MTA Subway Entrance DataProf. Stewart Weissstation, the program would have to do an iterative search, taking too much time.Instead, it can createa hash table whose keys are station names, and whose values are the index values in the array or vectorat which the parent tree root is found. (It can also store the subway station location there too, if it is astructure or class object.) Therefore, once all stations have been created, a hash table can be created tostoreSubway StationSubway StationSubway Station Hash.objects for fast access later. (What data and operations should aobject encapsulate?) This hash table should be represented by a class named5.4 LinesA subway line is a set of stations. Should the program construct these sets as it reads the input le, or onlywhen a command is issued to list the stations belonging to the line? How would you represent these sets?Since subway stations are not stabilized until after all data has been read, it is a challenging task to constructthe sets as the input le is processed. Instead, it is better to do so in a lazy way, only when a command isinvoked to list the line's stations. Then how can this be done e ciently? As was mentioned in Section 5.1above, if a line has a bit mask and the hash table has a method to enumerate allSubway Stationobjectsin it, each of which has a bit mask, then the set of all stations serving that line can be determined with asingle pass across the set of subway stations, and these stations can then be stored, or at least some type ofLine object. Thus a Line objectSubway Station objects. How can you createreference to them can be stored, in a linked list or vector associated with astores a bit mask and some type of reference to a container ofthis bit mask e ciently?How can you represent the set of allLineobjects in such a way that there is fast access to them? Sinceyou never delete lines and only insert them and search for them, this suggests that they should be storedin a hash table. Your program should create a hash table whose keys are line identi ers and whose valuesareLineobjects. It is possible to design a very simple hash function for this table. In fact, it is possibleto de ne a perfect hash function that is extremely fast, given the simple form ofprogram will be evaluated in part on your accomplishing this task.itself, namedSubway Line Hash.line identifiers.YourThis hash table should be a class in5.5 The Subway SystemSubway System object. This object must encapsulate the twoSubway Station Hash and Subway Line Hash, and the vector that stores the Subway Stationtrees. These objects need not be exposed to the clients. Instead, the Subway System should provideThe subway system is represented by ahash tables,parentmethods to perform all operations required by the program, so that the main program's only interaction iswith theSubway System.You are free to design this class interface as you see t.5.6 Required FilesRegardless of how you de ne the class interfaces for your classes, each class must be represented by a separatepair of les, its header le and its implementation le.Minimally, this implies that your program wouldneed to contain the following les, assuming they are named in the obvious way:main.cppsubway entrance.hsubway entrance.cppsubway station.hsubway station.cppsubway system.hsubway system.cppsubway station hash.hsubway station hash.cppsubway line hash.hsubway line hash.cppREADMEMakefileThis work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 Int'l nc-nd/4.0/6

CSci 335 Software Design and Analysis IIIProject 3: Processing MTA Subway Entrance DataTheREADMEProf. Stewart Weissle must contain a running log of your progress and changes and thoughts and possibly frus-trations during this project, or your eureka moments. It can also contain documentation of the program.There is no hard rule about it. I will provide a Make le that can be used regardless of how you name theles.6Testing Your ProgramYou should make sure that you test the program on a much smaller data set for which you can determinethe correct output manually. You should create your own small test les for that purpose. (Feel free to sharethose with other students on Piazza. )You should make sure that your program's results are consistent with what is described in this speci cationby running the program on carefully designed test inputs and examining the outputs produced to make surethey are correct.7Programming RulesYour program must conform to the programming rules described in theProgramming Rulesdocumenton the course website. It is to be your work and your work alone.8Grading RubricThe program will be graded based on the following rubric, based on 100 points. A program that cannot run because it fails to compile or link on acslab host receives only 20%.This20% will be assessed using the rest of the rubric below. Meeting the functional requirements of the assignment: 50% Performance and Design. These are inseparable in this assignment. It includes e cient solutions tothe problems as well as choices of algorithms, data structures, modularity, organization: 25% Documentation: 20% Style and proper naming: 5%This implies that a program that does not compile on a9cslabhost cannot receive more than 20 points.Submitting the AssignmentThis assignment is due by the end of the day (i.e. 11:59PM, EST) on May 14, 2018. Create a directorynamed namedusername project3 where usernameis to be replaced by your CS Department network loginname. Put all project-related source-code les and theREADMEandMakefileinto that directory.place any executable les, data les, or object les into this directory.Do notYou will lose 1% for each2le that does not belong there, and you will lose 2% if you do not name the directory correctly .Next, create a zip archive for this directory by running the zip commandzip -r username project3.zip ./username project3This will compress all of your les into the le namedusername project3.zip.Do not use the tar compressutility. If you do not zip the directory correctly so that all les, when extracted with the commandusername project3.zip,unzipare not in a properly named directory, your program will lose 2%.The submit command that you will use issubmit cs335 project.It requires two arguments: the numberof the project and the pathname of your le. Thus, if your le is namedusername project3.zipand it isin your current working directory you would typesubmit cs335 project 32Iusername project3.ziphave scripts that process your submissions automatically and misnamed les force me to manually override them.This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 Int'l nc-nd/4.0/7

CSci 335 Software Design and Analysis IIIProject 3: Processing MTA Subway Entrance DataThe program will copy your le into theproject3Prof. Stewart 335 sw/projects/project3/and if it is successful, it will display the message, File .successfully submitted. You will not be able to read this le, nor will anyone else except for me. But you can double-check that thecommand succeeded by typing the commandls -l /data/biocs/b/student.accounts/cs335 sw/projects/project3and making sure you see a non-empty le there.If you put a solution there and then decide to change it before the deadline, just replace it by the newversion. Once the deadline has passed, you cannot do this. I will grade whatever version is there at the endof the day on the due date. You cannot resubmit the program after the due date.This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 Int'l nc-nd/4.0/8

subway entrance. 2 URL A URL that provides service information about the subway entrance. 3 Name A string that speci es a unique name for this subway entrance, which is always its street address or a similar locator. This eld might contain a substring (exit) that indicates that it is an

Related Documents:

MTA values its employees and MTA employees value their jobs. The average tenure at MTA is 10.4 Years. That’s nearly 3x longer than the national private sector average (3.7 years). MTA employee contributions are more affordable than in large private companies. MTA offe

China alone contributes 50-70% of the new capacity during another two round of expansion in 2019 and 2023. Average Annual New Capacity of Basic Petrochemicals 1996-2000 2007-2009 2012-2013 2018-2019 2021-2025 Note: Basic chemicals including ethylene, propylene, butadiene, benzene, toluene, PX and methanol MTA 14 30 21 15 25 MTA MTA MTA MTA 11% .

formed by an MTA. The most common MTA for UNIX systems is Sendmail, and MTA for Windows is Mi-crosoft Exchange 2000/2003. In addition to stable host-based e-mail servers, Microsoft Corporation has devel-oped LDAP/Active-directory servers and B2B-servers that enhance mail-delivery practices. Users normally do not deal with the MTA.

air load, the dryer’s refrigeration system only operates to maintain the mass at design temperature. The result is near-zero power consumption with immediate restart capability. dryer power consumption load on dryer standard dryer MTA savings DE Hybrid MTA savings MTA savings compressed air flow dryer powe

Sep 26, 2017 · tools needed for marketers to select and apply MTA solutions with confi dence. As part of this process, we uncovered that successful MTA deployment begins with a proper data strategy. As a fi rst step, we created the enclosed MTA DataMap , designed as a concise reference tool of all of the d

MTA Chairman & CEO Prendergast Endorses LIRR Infrastructure Agenda The newest MTA chief is working on the railroad. During a Special Executive Breakfast co-hosted by the Long Island Contractors’ Association (LICA) and the Long Island Association (LIA), MTA Chairman & Chief Executive Officer Thomas Prendergast presented his credentials to Long Island on

FTA ADA Paratransit Compliance Review: Maryland Transit Administration (MTA Mobility) April 2016 8 4. Introduction to MTA Mobility Maryland Transit Administration (MTA), a regional transportation authority that is a division of the Maryland Department of Transportation (MD DOT), operates public transit services in the Baltimore metropolitan area.

For more details on invoice submission, go to www.MTA.info and click the links Doing Business with Us, Procurement, Invoice Processing. Sign up for ACH direct deposit To sign up for ACH direct deposit of your invoice payments, log on to My MTA Vendor Portal, click on BSC Forms on the hom