Abstract
Automated Vehicles (AVs) promise significant advances in transportation. Critical to these improvements is understanding AVs’ longitudinal behavior, which relies heavily on real-world trajectory data. Existing open-source AV trajectory datasets, however, often fall short in refinement, reliability, and completeness, hindering effective performance metrics analysis and model development. This study addresses these challenges by creating a Unified longitudinal trajectory dataset for AVs (Ultra-AV) to analyze their microscopic longitudinal driving behaviors. This dataset compiles data from 14 distinct sources, encompassing various AV types, test sites, and experiment scenarios. We established a three-step data processing procedure: (1) extraction of longitudinal trajectory data, (2) general data cleaning, and (3) data-specific cleaning to obtain the longitudinal trajectory data and car-following trajectory data. The validity of the processed data is affirmed through performance evaluations across safety, mobility, stability, and sustainability, along with an analysis of the relationships between variables in car-following models. Our work not only furnishes researchers with standardized data and metrics for longitudinal AV behavior studies but also sets guidelines for data collection and model development.
Background & Summary
The advent of Automated Vehicles (AVs) marks a revolutionary change in the realm of transportation1. Various stakeholders, including transportation agencies, policymakers, urban planners, the automotive industry, and customers are paying attention to the potential impact of AV on traffic flow. Numerous studies have established a definitive and quantifiable link between macro-level traffic flow and micro-level longitudinal driving behavior2,3. This connection underscores the importance of understanding microscopic longitudinal AV behaviors to fully grasp their broader impacts on traffic. There is a particular emphasis on car-following behavior, the critical component in longitudinal driving behavior and arguably the most fundamental element in traffic flow4,5. The key to studying and comprehending the car-following driving behavior of AVs lies in the availability of real-world trajectory data, which contain a sequence of spatial positions, velocities, and ground truth accelerations over time and thus provide invaluable insights into AV behavior6. Access to such AV trajectory data is imperative for stakeholders to generate reliable insights informing policy, infrastructure development, management strategies, traffic solutions, and AV design.
Numerous studies indicated that car-following behavior essentially impacts road traffic performance including safety, mobility, stability, and sustainability7,8,9,10,11,12,13. AV trajectory data can offer direct insights into the impact on traffic in terms of these performance metrics. Safety is a priority for many stakeholders. In car-following behavior, the principal focus involves assessing the likelihood and timing of a rear-end collision, considering the vehicle’s relative position and speed with respect to its preceding vehicle. Regarding mobility metrics, AVs have the potential to alter the driving strategy and following distance, thereby affecting throughput and traffic flow efficiency. While adopting an aggressive driving strategy or maintaining shorter following distances might enhance efficiency, such approaches would compromise stability, leading to diminished comfort and rising safety hazards. AVs are also expected to reduce overall fuel consumption of road traffic, thereby contributing to the achievement of environmental sustainability for future transportation. By directly analyzing AV trajectory data, stakeholders can assess these performance metrics, and develop strategies that enhance the positive impacts of AVs while mitigating potential negative impacts.
Although AV trajectory data can provide intuitive insights into the performance metrics of AVs, a significant limitation arises from the limited scenarios in which these data are collected. For example, trajectory data may be collected within a specific speed range or when the AV followed a vehicle adhering to a predetermined path14. Thus, performance metrics derived from such constrained scenarios might present a biased view, failing to capture corner cases. This drawback underscores another crucial role of AV trajectory data: the accurate calibration of robust models that mirror real-world conditions. These accurately calibrated models enable the exploration of broader impacts on traffic through simulation, including examining AV driving behavior in corner cases15. Furthermore, the models facilitate the prediction of interactions between AVs and human-driven vehicles. With accurate simulation results, stakeholders can lay the groundwork for decision-making and strategic planning in anticipation of the forthcoming mixed traffic16.
Recently, there has been a surge in AV perception datasets gathered through on-board cameras and Light Detection and Ranging (LiDAR) sensors, such as the BDD100K17, Argoverse18,19, Waymo perception20, KITTI21, nuScenes22, ONCE23, and ZOD24 datasets. These perception datasets are primarily used to predict the motion states of vehicles surrounding the AV and to address basic safety conditions in AVs. However, they fall short of capturing the complex driving behaviors of AVs. In stark contrast, AV trajectory data, whose collection depends on Global Positioning System (GPS) and Inertial Measurement Units (IMU), remain exceedingly scarce despite their critical importance as previously discussed. This scarcity is largely due to automakers’ reluctance to voluntarily share trajectory data collected with their automated driving technology. The high costs associated with renting test sites and vehicles with automated driving technology also pose barriers for researchers seeking to collect these essential data.
Despite these challenges, researchers worldwide have published several trajectory datasets of different sizes collected under varying conditions, as summarized in the Appendix. However, these datasets often fall short in both content and data processing methods, which limits their utility for comprehensive and precise studies of car-following behavior25. In terms of content, these individual datasets have deficiencies in equity and sustainability. The sensors used for data collection, the tested AVs, the test sites, and the experimental settings vary among datasets. Consequently, relying on a single dataset for research cannot yield comprehensive conclusions. Additionally, we found that most datasets do not analyze the energy consumption of AVs, which restricts research in the area of sustainability. In terms of data processing methods, the deficiencies of these datasets can be categorized into three aspects: refinement, reliability, and completeness. Firstly, not all datasets are specifically designed to collect trajectory data; they may include car-following trajectory data alongside extraneous information, such as data on laterally adjacent vehicles that do not influence the AV’s car-following behavior20,26. This necessitates a selective refinement process to isolate the relevant trajectory data. Secondly, some datasets only contain raw data, which may include outliers or anomalies resulting from measurement errors, thus compromising data reliability. Thirdly, these datasets occasionally lack crucial details (e.g., vehicle length), requiring researchers to make educated guesses to fill these gaps. In light of these facts, there is currently no unified and well-processed trajectory dataset available that encompasses multiple AVs across diverse experimental conditions and scenarios. This absence hinders the feasibility of conducting comprehensive studies on the impact of AVs on transportation.
To address this gap, this study proposes a unified approach to processing trajectory datasets of AVs by enhancing their refinement, reliability, and completeness. While it would be impractical to conduct empirical research on all available vehicles with automated driving technology, developing a dataset through systematic and structured processes that compile the results of experimental campaigns conducted by various global research teams, including the authors’ group, the Connected & Autonomous Transportation Systems Laboratory (CATS Lab), can provide substantial insights into the longitudinal behavior of AVs. Similar to how standard datasets such as ImageNet27, KITTI21, and NGSIM28 have revolutionized their respective fields, this work aims to establish an open-source Unified Longitudinal Trajectory dataset for AVs (Ultra-AV) for future research on the longitudinal behavior of AVs29. The Ultra-AV dataset will facilitate the analysis of data, the development of models, and the identification of characteristics that influence AVs’ impact on transportation.
This study has the following contributions:
- This study systematically reviewed open-source AV trajectory datasets and detailed their collection scenarios and conditions.
- This study developed a unified trajectory data format that includes essential elements for car-following behavior analysis of AVs, such as the position, speed, and acceleration of both the following AV (FAV) and the lead vehicle (LV).
- This study introduced a standardized trajectory data processing methodology that involves multiple steps to enhance the refinement, reliability, and completeness of the data.
- This study validated the processed unified trajectory dataset through three key approaches: data collection methods, analysis of performance metrics, and development of AV models.
To summarize, we leverage available open-source AV datasets to facilitate research. A comprehensive workflow for processing multiple open-source datasets to compile this dataset is illustrated in Fig. 1.
One omission in this paper is that we did not consider AVs’ lateral behaviors, such as lane-changing. This is because trajectory datasets that include AV lateral behaviors are scarce, and the corresponding analyses and processing are more complex. However, we noted that some trajectory datasets, such as the Central Ohio ACC Dataset30, Waymo Open Dataset20, and Argoverse 2 Motion Forecasting Dataset19, contain lane-changing behaviors. To fully understand AV behaviors, it is essential to consider both longitudinal and lateral behaviors. Additionally, we noticed that no dataset contains fuel consumption data. We suggest that future data collection studies incorporate this attribute.
Methods
Efforts have been made globally to gather trajectory data related to AVs. Our selection criteria for trajectory datasets are as follows:
1. The dataset must include AVs equipped with Advanced Driving Assistance Systems (ADAS) or Automated Driving Systems (ADS), and the vehicle must be in autonomous driving mode during data collection. For instance, the Oxford RobotCar Dataset31 and the Apollo Space Trajectory Dataset16 are not considered in this study because the autonomous driving mode was not activated during data collection.
2. The dataset must include trajectory data of both the FAV and LV in a car-following state, with each frame containing at least one of the following parameters: position, speed, or acceleration. For example, our analysis dataset does not include figure or video datasets such as the Argoverse 2 Sensor Dataset19.
3. The car-following trajectories in the dataset must encompass sufficiently complete behaviors, so we only consider datasets with trajectory lengths of at least 9 seconds. For example, our analysis does not consider the Argoverse 1 Motion Forecasting Dataset18 since it only consists of 5-second trajectories.
4. We only consider open or currently maintained datasets. Datasets that are no longer maintained are not considered. For instance, the Lyft Level 5 Dataset32 is excluded because its official website is no longer accessible.
Finally, we have examined 14 open-source datasets, each providing distinct insights into AV behavior across various driving conditions and scenarios. These open-source datasets are from seven providers:
- Vanderbilt ACC Dataset33. Collected in Nashville, Tennessee by the Vanderbilt University research group. (https://acc-dataset.github.io/datasets/)
- MicroSimACC Dataset34. Collected in four cities in Florida, including Delray Beach, Loxahatchee, Boca Raton, and Parkland by the Florida Atlantic University research group. (https://github.com/microSIM-ACC/ICE)
- CATS Open Datasets35. Three datasets were gathered in Tampa, Florida, and Madison, Wisconsin by the CATS Lab. (https://github.com/CATS-Lab/Filed-Experiment-Data-ACC_Data, https://github.com/CATS-Lab/Filed-Experiment-Data-AV_Platooning_Data, and https://github.com/MarkMaaaaa/CATS-UWMadison-AV-Data/tree/main)
- OpenACC Database36. Four datasets were collected across Italy, Sweden, and Hungary by the European Commission’s Joint Research Centre. (https://data.europa.eu/data/datasets/9702c950-c80f-4d2f-982f-44d06ea0009f?locale=en)
- Central Ohio ACC Datasets30. Two datasets were collated in Ohio by UCLA’s Mobility Lab and Transportation Research Center. (https://catalog.data.gov/dataset/advanced-driver-assistance-system-adas-equipped-single-vehicle-data-for-central-ohio and https://catalog.data.gov/dataset/advanced-driver-assistance-system-adas-equipped-two-vehicle-data-for-central-ohio)
- Waymo Open Dataset20,37. Two datasets were collected in six cities including San Francisco, Mountain View, and Los Angeles in California, Phoenix in Arizona, Detroit in Michigan, and Seattle in Washington by Waymo. (https://waymo.com/open/ and https://data.mendeley.com/datasets/wfn2c3437n/2)
- Argoverse 2 Motion Forecasting Dataset19. Collected from Austin in Texas, Detroit in Michigan, Miami in Florida, Pittsburgh in Pennsylvania, Palo Alto in California, and Washington, D.C. by Argo AI with researchers from Carnegie Mellon University and the Georgia Institute of Technology. (https://www.argoverse.org/av2.html)
To highlight the diversity of the selected datasets, we analyze their differences from several aspects. First, the locations of the test sites for the datasets are depicted in Fig. 2. The reviewed datasets were collected in several cities in the United States and Europe, which ensures diversity and representativeness among the selected cities38.
Additionally, the majority of the datasets reviewed involve long AV trajectories, which have been widely used in the analysis of AV behavior in the literature. However, the Waymo Open Dataset’s Waymo Motion Dataset and the Argoverse 2 Motion Forecasting Dataset contain comparatively shorter trajectories, with durations of 9.1 seconds and 11 seconds at 10 Hz, respectively. These datasets are primarily employed in motion forecasting research. Nevertheless, such datasets are typically collected in urban areas with complex traffic environments, which provide the opportunity to analyze AV behavior in challenging conditions. Consequently, this paper includes analyses of these two datasets.
Finally, we analyze the detailed differences of these datasets from four aspects: data collection, AV information, test sites, and experiment settings. In terms of data collection sensors, many datasets utilize the Ublox GNSS from Ublox company (https://www.u-blox.com/). A few datasets, such as the OpenACC ZalaZone Dataset, Waymo Open Dataset, and Argoverse 2 Motion Forecasting Dataset, combine data from multiple sensors to achieve higher accuracy. Regarding the tested AVs, most datasets test only 1-2 commercial AVs. Some datasets, including the CATS UWM Dataset, Waymo Open Dataset, and Argoverse 2 Motion Forecasting Dataset, test AVs developed by their respective teams. In addition, the OpenACC Dataset includes more than 20 types of AV models, which provides the opportunity to compare the performance of different AV models. The test sites of the datasets vary, including public roads, closed test tracks, and different road types such as highways, freeways, urban roads, and rural roads. Experiment settings include the speed and headway settings of the AVs, as well as whether the driving state is naturalistic. We discuss more details about these classifications in the Appendix.
To enhance the refinement, reliability, and completeness of these datasets, this study proposes a three-step process to develop the Ultra-AV dataset: (1) extraction of longitudinal trajectory data; (2) general data cleaning to remove anomalies and errors; and (3) data-specific cleaning tailored for car-following behavior.
Step 1: Extraction of longitudinal trajectory data
The first step of the data processing aims to obtain the unified longitudinal trajectory data, which we identify and store in a unified data format. Before explaining the extraction process, we define the longitudinal trajectory used in this study. Define the index set \({\mathscr{I}}=\{1,\ldots ,I\}\) of longitudinal trajectories, each comprising a series of consecutive data points, where \(I\) is the total number of trajectories. Each trajectory \(i\in {\mathscr{I}}\) contains a series of consecutive time stamps \({{\mathcal{T}}}_{i}=\{{t}_{i0},{t}_{i1},\ldots ,{t}_{i{T}_{i}}\}\) with the same time gap \(\Delta t\), where \({T}_{i}\) is the number of time stamps for trajectory \(i\). Although the datasets we reviewed organize data in a similar “trajectory” format, they may contain different FAVs or LVs within different lanes in the same trajectory. To keep consistency, we define a longitudinal trajectory as one in which a single FAV \({c}_{i}^{{\rm{f}}}\) tracks the same LV \({c}_{i}^{{\rm{l}}}\) in the same lane, without changing lanes, throughout all time stamps in \({{\mathcal{T}}}_{i}\). In a longitudinal trajectory \(i\), the data point at time stamp \(t\) corresponds to a state vector \({{\bf{s}}}_{it}=[{a}_{it}^{{\rm{f}}},{d}_{it},{v}_{it}^{{\rm{f}}},\Delta {v}_{it}]\), where \({a}_{it}^{{\rm{f}}}\) denotes the longitudinal acceleration of \({c}_{i}^{{\rm{f}}}\), \({d}_{it}\) is the spatial gap (i.e., bumper-to-bumper distance) between \({c}_{i}^{{\rm{l}}}\) and \({c}_{i}^{{\rm{f}}}\), \({v}_{it}^{{\rm{f}}}\) is the velocity of \({c}_{i}^{{\rm{f}}}\), and \(\Delta {v}_{it}\) is the velocity difference between \({c}_{i}^{{\rm{l}}}\) and \({c}_{i}^{{\rm{f}}}\).
Extracting the longitudinal trajectory set \({\mathscr{I}}\) necessitates identifying \({c}_{i}^{{\rm{l}}}\) and \({c}_{i}^{{\rm{f}}}\). The datasets can be categorized into two types by the identification procedure required. In the first category, exemplified by the Vanderbilt ACC Dataset and CATS Open Datasets, the relationship between \({c}_{i}^{{\rm{l}}}\) and \({c}_{i}^{{\rm{f}}}\) is labeled in the dataset. The second category, including the Central Ohio Datasets, Waymo Open Dataset, and Argoverse 2 Motion Forecasting Dataset, provides information on all surrounding vehicles but does not specifically label the relationship. Thus, this paper proposes a unified algorithm to identify LVs among large numbers of trajectories. The identification algorithm is as follows:
1. Segment trajectories according to the AV’s lane-changing behaviors. For the Central Ohio Dataset, which includes the lane ID where the vehicles are located, processing is straightforward: we segment each trajectory into multiple consecutive trajectories so that each segmented trajectory maintains a consistent lane ID throughout its duration. For datasets that only offer vehicles’ positions without specific lane IDs, such as the Waymo Motion Dataset and the Argoverse 2 Motion Forecasting Dataset, identifying consistent-lane trajectories poses a challenge. To address this, we employ linear regression to identify trajectories that exhibit straight-driving behaviors, indicative of a consistent lane ID. Here, we denote the set of trajectories from the original dataset as \({{\mathcal{J}}}^{0}\) to differentiate it from the set of longitudinal trajectories \({\mathscr{I}}\) obtained after the processing. We also denote the Euler coordinate position of the center of mass of vehicle \(k\in {{\mathcal{K}}}_{j}\) in trajectory \(j\in {{\mathcal{J}}}^{0}\) at time stamp \(t\) as \(({p}_{jtk}^{{\rm{x}}},{p}_{jtk}^{{\rm{y}}})\), where \(k=0\) represents the FAV and \(k\in {{\mathcal{K}}}_{j}\backslash \{0\}\) represents a surrounding vehicle. For each trajectory \(j\in {{\mathcal{J}}}^{0}\), we apply the least squares method to fit a linear model with \({\{{p}_{jt0}^{{\rm{x}}}\}}_{t\in {{\mathcal{T}}}_{j}}\) as inputs and \({\{{p}_{jt0}^{{\rm{y}}}\}}_{t\in {{\mathcal{T}}}_{j}}\) as outputs. We then compute the R-squared (\({R}^{2}\)) of the linear model of trajectory \(j\). Trajectories whose \({R}^{2}\) is less than a threshold are not considered straight lines. The threshold is set to 0.9, as determined from our preliminary experimental results. Finally, these trajectories are excluded from set \({{\mathcal{J}}}^{0}\), and the new trajectory set is denoted as \({{\mathcal{J}}}^{1}\).
2. Identify preceding vehicles of the FAV. For each trajectory \(j\in {{\mathcal{J}}}^{1}\) and time stamp \(t\in {{\mathcal{T}}}_{j}\), we define the set of vehicles \({\bar{{\mathcal{K}}}}_{jt}\) where each vehicle \(k\in {\bar{{\mathcal{K}}}}_{jt}\) must meet the following criteria: (1) \(k\in {{\mathcal{K}}}_{j}\backslash \{0\}\), (2) \(k\) is located in the same lane as the FAV, and (3) \(k\) is located in front of the FAV. For the Central Ohio Dataset, which provides both the lane ID and Frenet coordinates30 of surrounding vehicles, the identification of \({\bar{{\mathcal{K}}}}_{jt}\) is straightforward. However, datasets such as the Waymo Motion Dataset and the Argoverse 2 Motion Forecasting Dataset primarily consist of trajectories represented in Euler coordinates. To process these data, we first exclude vehicles not moving in the same direction as the FAV by removing any vehicle \(k\) for which the dot product \({{\boldsymbol{\sigma }}}_{0t}\cdot {{\boldsymbol{\sigma }}}_{kt}\) is negative, where \({{\boldsymbol{\sigma }}}_{0t}=[{p}_{j(t-1)0}^{{\rm{x}}}-{p}_{jt0}^{{\rm{x}}},{p}_{j(t-1)0}^{{\rm{y}}}-{p}_{jt0}^{{\rm{y}}}]\) and \({{\boldsymbol{\sigma }}}_{kt}=[{p}_{j(t-1)k}^{{\rm{x}}}-{p}_{jtk}^{{\rm{x}}},{p}_{j(t-1)k}^{{\rm{y}}}-{p}_{jtk}^{{\rm{y}}}]\) are the direction vectors of the FAV and vehicle \(k\), respectively. Next, we define the vector connecting the FAV and vehicle \(k\), \({{\boldsymbol{\sigma }}}_{0kt}=[{p}_{jt0}^{{\rm{x}}}-{p}_{jtk}^{{\rm{x}}},{p}_{jt0}^{{\rm{y}}}-{p}_{jtk}^{{\rm{y}}}]\). We use the dot product of the unit vectors along \({{\boldsymbol{\sigma }}}_{0t}\) and \({{\boldsymbol{\sigma }}}_{0kt}\) (i.e., the cosine of the angle between them) to verify alignment in direction, removing vehicles where this value is less than 0.984. This threshold is calculated from the car width and the lane width. Taking the average width of a mid-size car as 6 feet39 and the lane width as 10 feet40 gives a maximum lateral deviation of 4 feet. We assume the AV follows the 3-second rule41 at a minimum speed of 5 mph, which results in a minimum spatial gap of 22 feet. Thus, the maximum angle is \(\theta =\arctan (4/22)\approx \arctan (0.182)\), and \(\cos (\theta )\approx 0.984\). This threshold ensures that only vehicles moving in a closely similar direction to the FAV are retained for further analysis.
3. Identify the LV. For each trajectory \(j\in {{\mathcal{J}}}^{1}\) and time stamp \(t\in {{\mathcal{T}}}_{j}\), calculate the spatial headway \({h}_{jtk}\) by Equation (1) for each vehicle \(k\in {\bar{{\mathcal{K}}}}_{jt}\) relative to the FAV, where \({h}_{jtk}\) is defined as the distance between the centers of the FAV and the surrounding vehicle.
$${h}_{jtk}=\sqrt{{({p}_{jtk}^{{\rm{x}}}-{p}_{jt0}^{{\rm{x}}})}^{2}+{({p}_{jtk}^{{\rm{y}}}-{p}_{jt0}^{{\rm{y}}})}^{2}}$$ (1)
The vehicle with the smallest \({h}_{jtk}\) is considered the LV for that time stamp, \({c}_{jt}^{{\rm{l}}}=\arg {\min }_{k}\,{h}_{jtk}\). If a trajectory \(j\) involves multiple LVs over time, divide it into several longitudinal trajectories, each consistent with a single LV. Collect these into set \({\mathscr{I}}\).
4. Enhance identification by the relationship between spatial headway and speed. Our preliminary experimental results identified several inaccuracies in the output of the preceding steps. To enhance the algorithm, an additional step has been integrated into the trajectory processing workflow. For each trajectory \(i\in {\mathscr{I}}\), we compare the change in spatial gap, \(\Delta {d}_{i}={d}_{i{T}_{i}}-{d}_{i0}\), to the change estimated from speed differences, \(\Delta {\widehat{d}}_{i}={\sum }_{t\in {{\mathcal{T}}}_{i}}\Delta t\cdot \Delta {v}_{it}\). In a consistent car-following scenario, these two changes should align closely. Therefore, if the relative difference \(\frac{| \Delta {d}_{i}-\Delta {\widehat{d}}_{i}| }{\Delta {d}_{i}}\) exceeds a threshold of 0.2, which is based on the preliminary experimental results, trajectory \(i\) is deemed inaccurate and removed from set \({\mathscr{I}}\).
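The identification steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the input layout (tuples of previous and current positions), the use of the cosine of the angle for the 0.984 test, the absolute value in the denominator of the gap check, and the handling of degenerate cases are all assumptions made for illustration.

```python
import math

def r_squared(xs, ys):
    """R^2 of the least-squares line fitted to the points (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        # R^2 is undefined without variation in x or y; this sketch
        # returns 0.0 (assumption, the source does not cover this case).
        return 0.0
    return sxy ** 2 / (sxx * syy)

def cosine_threshold(car_width_ft=6.0, lane_width_ft=10.0,
                     min_speed_mph=5.0, rule_s=3.0):
    """cos(theta) for the largest lateral offset at the smallest gap."""
    deviation_ft = lane_width_ft - car_width_ft                # 4 ft
    min_gap_ft = min_speed_mph * 5280.0 / 3600.0 * rule_s      # about 22 ft
    return math.cos(math.atan(deviation_ft / min_gap_ft))

def _cosine(u, v):
    nu, nv = math.hypot(u[0], u[1]), math.hypot(v[0], v[1])
    if nu == 0 or nv == 0:
        return 0.0
    return (u[0] * v[0] + u[1] * v[1]) / (nu * nv)

def find_leader(fav_prev, fav_now, others, cos_min=0.984):
    """Return the id of the nearest aligned vehicle ahead, or None.

    others maps a vehicle id to (previous_xy, current_xy).
    """
    sigma_0 = (fav_prev[0] - fav_now[0], fav_prev[1] - fav_now[1])
    best_id, best_h = None, math.inf
    for k, (prev, now) in others.items():
        sigma_k = (prev[0] - now[0], prev[1] - now[1])
        # Drop vehicles moving against the FAV's direction of travel.
        if sigma_0[0] * sigma_k[0] + sigma_0[1] * sigma_k[1] < 0:
            continue
        # Drop vehicles that are not closely aligned ahead of the FAV.
        sigma_0k = (fav_now[0] - now[0], fav_now[1] - now[1])
        if _cosine(sigma_0, sigma_0k) < cos_min:
            continue
        h = math.hypot(now[0] - fav_now[0], now[1] - fav_now[1])  # Eq. (1)
        if h < best_h:
            best_id, best_h = k, h
    return best_id

def gap_is_consistent(gaps, speed_diffs, dt, tol=0.2):
    """Compare the observed gap change with the integrated speed difference."""
    delta_d = gaps[-1] - gaps[0]
    delta_d_hat = sum(dt * dv for dv in speed_diffs)
    if delta_d == 0:
        return delta_d_hat == 0  # assumption: avoid division by zero
    return abs(delta_d - delta_d_hat) / abs(delta_d) <= tol
```

With the default arguments, `cosine_threshold()` reproduces the value of about 0.984 derived in step 2. A trajectory would be kept only if `r_squared` is at least 0.9 for the FAV path and `gap_is_consistent` returns `True` for the identified FAV and LV pair.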
Following the refinement of longitudinal trajectories, key labels relevant to analyzing FAV behaviors are extracted from the processed data and formatted consistently. The labels retained are listed in Table 1, where each label is described with its definition and calculation methods. Notably, a default value of 4.5 meters, corresponding to the average length of a mid-size vehicle39, is assigned to vehicle length if it is not explicitly provided in the dataset. This standardization ensures uniformity in the data.
Table 2 shows the statistical results after Step 1 extraction of longitudinal trajectory data. To simplify the expression, we label datasets by IDs 1-14, where: 1 = Vanderbilt Two-vehicle ACC Dataset; 2 = MicroSimACC Dataset; 3 = CATS ACC Dataset; 4 = CATS Platoon Dataset; 5 = CATS UWM Dataset; 6 = OpenACC Casale Dataset; 7 = OpenACC Vicolungo Dataset; 8 = OpenACC Asta Dataset; 9 = OpenACC ZalaZone Dataset; 10 = Ohio Single-vehicle Dataset; 11 = Ohio Two-vehicle Dataset; 12 = Waymo Perception Dataset; 13 = Waymo Motion Dataset; 14 = Argoverse 2 Motion Forecasting Dataset. Note that all the tables in this paper follow these IDs. These results indicate the range of data collected in each dataset. For example, the CATS UWM Dataset, OpenACC ZalaZone Dataset, Waymo Motion Dataset, and Argoverse 2 Motion Forecasting Dataset (datasets 5, 9, 13, and 14) have a low average speed, suggesting that the scenarios in these four datasets are primarily low-speed environments. The CATS UWM Dataset and OpenACC ZalaZone Dataset mainly test low-speed environments, while the Waymo Motion Dataset and Argoverse 2 Motion Forecasting Dataset are primarily collected in urban environments where the traffic conditions are complex and AVs usually travel at low speeds. Additionally, there are some outliers, such as the maximum and minimum \({a}^{{\rm{l}}}\) and \({a}^{{\rm{f}}}\) in dataset 2, and the maximum \(d\) in datasets 9 and 10, which would not occur in a normal driving process. We suppose that these data are caused by sensor errors and should be removed from the dataset.
Step 2: General data cleaning
Because outliers are present in the data obtained from Step 1, the general data cleaning in Step 2 focuses on enhancing the reliability of the trajectory dataset by removing outliers and imputing missing values. Raw datasets may include abnormal values due to sensor errors, such as accelerations exceeding 100 m/s2. Such errors can become more pronounced when calculating additional metrics through differentiation. We define these abnormal values as outliers and remove them by excluding data more than η standard deviations (std) from the mean. The cleaning process includes the following procedures:
1. Mark missing values and outliers. Each label’s mean and standard deviation are calculated, excluding the previously identified missing data. We then mark outliers that fall outside the range [mean − η ⋅ std, mean + η ⋅ std]. The calculation of the mean and standard deviation and the marking are repeated iteratively, without considering the marked outliers, until no new outliers are marked. The labels to be identified and their respective η values are summarized in Table 3. We use a conservative η to ensure this step removes only genuine outliers without affecting general data.
Table 3 Identified criteria for Step 2 and Step 3.
2. Remove or impute the marked data points. Based on experience, if a label includes ten or more consecutive marked data points, we remove all these points to maintain accuracy in trajectory analysis. However, if a run contains fewer than ten consecutive marked data points, we use linear interpolation to replace the marked data and thereby minimize data loss. Note that this interpolation is only done within the same trajectory.
3. Re-organize the trajectory ID. After removing some data points, certain trajectories may become discontinuous. To address this, we follow these steps to re-organize the “Trajectory_ID” and “Time_index” labels: 1) Split any trajectories where “Time_index” is discontinuous into multiple new trajectories, assigning each a new “Trajectory_ID”. 2) Remove short trajectories that contain fewer than 70 data points. This threshold is based on preliminary experiments. 3) Renumber the “Trajectory_ID” and “Time_index” columns to start from 0, ensuring a continuous sequence. 4) Update the labels “Position_LV” and “Position_FAV” to reflect changes in trajectory segmentation.
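The three cleaning procedures can be sketched for a single label of a single trajectory as follows. This is a hedged illustration rather than the released processing code: the function names, the use of `None` for missing values, the choice of the population standard deviation, and the decision to drop marked runs that touch the start or end of a trajectory are assumptions.

```python
import statistics

def mark_outliers(values, eta):
    """Iteratively mark points outside mean +/- eta * std (None = missing)."""
    marked = [v is None for v in values]
    while True:
        kept = [v for v, m in zip(values, marked) if not m]
        if len(kept) < 2:
            break
        mean = statistics.fmean(kept)
        std = statistics.pstdev(kept)  # assumption: population std
        newly = [i for i, (v, m) in enumerate(zip(values, marked))
                 if not m and abs(v - mean) > eta * std]
        if not newly:
            break  # no new outliers: the marking has converged
        for i in newly:
            marked[i] = True
    return marked

def repair(values, marked, max_run=10):
    """Interpolate short marked runs; flag long runs for removal.

    Returns (repaired_values, keep_mask).
    """
    out = list(values)
    keep = [True] * len(values)
    i, n = 0, len(values)
    while i < n:
        if not marked[i]:
            i += 1
            continue
        j = i
        while j < n and marked[j]:
            j += 1
        run = j - i
        if run >= max_run or i == 0 or j == n:
            # Long run, or no neighbour on one side to interpolate from
            # (edge handling is an assumption of this sketch).
            for k in range(i, j):
                keep[k] = False
        else:
            left, right = out[i - 1], out[j]
            for k in range(i, j):
                w = (k - i + 1) / (run + 1)
                out[k] = left + w * (right - left)
        i = j
    return out, keep

def split_segments(keep, min_len=70):
    """Return (start, end) index pairs of kept runs with >= min_len points."""
    segments, start = [], None
    for i, k in enumerate(list(keep) + [False]):
        if k and start is None:
            start = i
        elif not k and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    return segments
```

Each returned segment would then receive a new “Trajectory_ID”, with “Time_index” renumbered from 0 inside it.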
After these cleaning procedures, we obtained the longitudinal trajectory dataset, which includes both free-flow and car-following trajectories. Given the critical importance of car-following behavior in the study of AV longitudinal behaviors, we specifically extracted the car-following trajectories in the next step for further analysis and validation.
The processed data statistics after Step 2 are recorded in Table 4. After this step, the outliers initially found in Table 2 have been removed. The data removal was carefully limited to a small quantity to ensure that the processing did not significantly influence the overall data distribution. Consequently, the mean and standard deviation remain largely consistent with those in Table 2. Nevertheless, some values that do not typically occur in car-following scenarios still exist in Table 4. For example, the minimum speeds of the OpenACC ZalaZone Dataset and Waymo Perception Dataset (datasets 9 and 12) are less than 0 m/s. The maximum acceleration and deceleration of the Ohio Two-vehicle Dataset and Waymo Motion Dataset (datasets 11 and 13) exceed 10 m/s2. In the OpenACC Vicolungo Dataset and Ohio Two-vehicle Dataset (datasets 7 and 11), the maximum spatial gaps are around 250 m.
Step 3: Data-specific cleaning
In this step, we define a hard margin for certain labels to identify car-following trajectories. Although researchers broadly agree on the concept of car-following, there is no universally accepted definition42. Thus, we propose several thresholds, derived from both a review of the relevant literature and empirical analysis of the data, to identify car-following behavior. These thresholds are also summarized in Table 3:
- A minimum speed threshold of 0.1 m/s, below which FAVs are considered to be stationary based on empirical observations.
- A spatial distance threshold whereby an LV is situated within 120 meters in the same lane as the FAV, distinguishing car-following from free-flow traffic conditions43.
- An FAV acceleration range of -5 m/s2 to 5 m/s2, following the literature44.
The remaining process of this step is similar to Step 2. Data points that fall outside these established thresholds are removed or rectified through linear interpolation. After these three steps, we finally obtained the car-following trajectory dataset.
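As a small illustration of the thresholds listed above, the following Python sketch marks samples that fall outside the car-following margins. The function names and the `(speed, gap, acceleration)` tuple layout are hypothetical, and treating the boundary values themselves as valid is an assumption.

```python
def outside_car_following(speed_fav, gap, accel_fav,
                          v_min=0.1, gap_max=120.0, accel_abs_max=5.0):
    """Return True if a sample violates any Step 3 threshold.

    speed_fav in m/s, gap in m, accel_fav in m/s^2.
    """
    return (speed_fav < v_min
            or gap > gap_max
            or abs(accel_fav) > accel_abs_max)

def mark_samples(samples):
    """Mark each (speed_fav, gap, accel_fav) tuple that is out of range."""
    return [outside_car_following(*s) for s in samples]
```

The marked samples would then be removed or interpolated in the same way as the outliers marked in Step 2.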
Table 5 shows the statistical results after Step 3 data-specific cleaning. Following the removal of non-car-following scenario data, the values in Table 4 have been adjusted to normal ranges. The average speeds across all datasets have increased, especially in the Ohio Single-vehicle Dataset, Ohio Two-vehicle Dataset, Waymo Perception Dataset, and Argoverse 2 Motion Forecasting Dataset (datasets 10, 11, 12, and 14). This suggests the presence of numerous stationary scenarios within these datasets. The average spatial gaps in the Ohio Single-vehicle Dataset, Ohio Two-vehicle Dataset, and Waymo Perception Dataset (datasets 10, 11, and 12) have also significantly increased, indicating that these datasets initially contained many instances of small gaps.
Figure 3 displays the statistical distributions of the key variables in car-following behavior analysis, including spatial gap, relative speed, FAV speed, and FAV acceleration, for the final data after the three-step data processing. From Fig. 3, we can infer the testing scenarios of the data, some of which are described in the Appendix. For example, the CATS UWM Dataset was collected in a low-speed environment, while the OpenACC Casale Dataset was collected in a high-speed environment. Additionally, the shape of each distribution offers hints about the form of the car-following model. For example, the distributions of \({a}^{{\rm{f}}}\) from the four datasets in the OpenACC Database clearly show two distinct peaks, one on either side of zero, possibly indicating that the vehicles’ accelerating and decelerating behaviors are piecewise.
Data Records
The Ultra-AV dataset, which includes both the longitudinal trajectory dataset and the car-following trajectory dataset, is obtained after Step 2 (general data cleaning) and Step 3 (data-specific cleaning). The dataset, in CSV format and accompanied by a descriptive document, can be freely accessed via the figshare repository29 (https://figshare.com/articles/dataset/Ultra-AV_A_unified_longitudinal_trajectory_dataset_for_automated_vehicle/26339512). The raw data for the 14 original datasets can be found in the corresponding literature or websites.
Technical Validation
In this section, we validate the processed unified trajectory dataset from three aspects. First, we introduce the data collection methods used in the CATS Open Datasets. Then, we analyze the performance of the car-following trajectory dataset through four metrics. Finally, we analyze the relationships between the variables in the car-following model.
Data Collection
First, we introduce the AV platform developed by the CATS Lab as a reference solution for future AV trajectory data collection. The CATS Lab has developed a complete AV platform, which has been installed in two lab-owned Lincoln MKZ vehicles. The platform is built upon the Robot Operating System (ROS), which provides a robust framework for parallel computing and is particularly well-suited for robotics and autonomous applications. In addition, the platform allows for direct electronic control over the vehicle’s functions by integrating the Drive-By-Wire (DBW) system.
The developed system includes the perception system, operation system, and dynamical system, as shown in Fig. 4. The perception system comprises advanced sensing technologies such as LiDAR and cameras that provide real-time data on the vehicle’s surroundings. LiDAR and GPS navigation units offer high-precision location tracking capabilities. The operation system is a hierarchical structure with an upper-level computer and a lower-level control. This system uses the Transmission Control Protocol/Internet Protocol (TCP/IP) for networking and communication, interlinking with the Controller Area Network (CAN) for vehicle control. The dynamical system features an electrically powered acceleration/braking/steering system to manipulate the vehicle’s longitudinal and lateral motions. The final output uses the CAN to transmit the signals for brake, throttle, and steering angle.
Data collection is accomplished by the perception system, which obtains precise vehicle trajectory data from various sensors, including LiDAR, GPS, and cameras. The early data collected by the CATS Lab, including the CATS ACC Dataset and CATS Platoon Dataset, primarily used the Ublox GPS system to gather real-time GPS positions and speeds. The real-time spacing between the two vehicles was obtained from the distance between their GPS positions. Preliminary testing indicated that the GPS receivers had a position accuracy of 0.26 m and a speed accuracy of 0.089 m/s. Because of this limited precision and the packet loss during transmission, we used LiDAR for data collection in the CATS UWM Dataset. To achieve higher data precision, a feasible approach is to design algorithms that integrate data from multiple sensors. The Waymo Open Dataset and Argoverse Dataset have already employed similar technologies. Therefore, we recommend that researchers integrate multiple sensors in future data collection to obtain high-precision data.
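As an illustration of deriving spacing from two GPS fixes, the sketch below uses the haversine great-circle distance. The choice of haversine, the spherical Earth radius, and the `antenna_offset_m` parameter (a hypothetical correction from antenna-to-antenna distance to bumper-to-bumper gap) are assumptions; the text above does not specify how the distance was computed.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to guard against rounding slightly above 1 for near-antipodal points.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(h)))


def spacing_m(lv_fix, fav_fix, antenna_offset_m: float = 0.0) -> float:
    """Spacing between LV and FAV from their (lat, lon) fixes.

    antenna_offset_m is a hypothetical constant subtracted to convert the
    antenna-to-antenna distance into a bumper-to-bumper gap.
    """
    d = haversine_m(lv_fix[0], lv_fix[1], fav_fix[0], fav_fix[1])
    return max(d - antenna_offset_m, 0.0)
```

With a position accuracy of about 0.26 m per receiver, the error of a spacing computed this way can be a noticeable fraction of a short gap, which is consistent with the move to LiDAR described above.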
Performance Measurement
In this section, we analyze how each dataset impacts road traffic performance in four metrics, i.e., safety, mobility, stability, and sustainability. These metrics are the goals of the intelligent transportation system45 and have been widely used in the literature7,8,9,10,11,12. We first define the measurements of these four metrics.
The safety of the FAV in car-following behavior is measured by the Time-To-Collision46 (TTC), which represents the risk or proximity of a vehicle to a potential collision. The TTC at time t is defined as the time that remains until a collision between two vehicles would occur if the collision course and speed difference were maintained. The higher the TTC, the safer the situation, and vice versa46. The TTC in trajectory i at time t can be calculated as follows:
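Under the definition quoted above, TTC is the spatial gap divided by the closing speed. The following is a minimal sketch of that standard form; the function name and the convention of returning infinity when the FAV is not closing in on the LV are assumptions for illustration.

```python
import math


def time_to_collision(gap_m: float, v_fav: float, v_lv: float) -> float:
    """TTC in seconds from the spatial gap (m) and the two speeds (m/s).

    A collision course exists only while the follower is faster than the
    leader; otherwise TTC is reported as infinite (assumed convention).
    """
    closing_speed = v_fav - v_lv
    if closing_speed <= 0.0:
        return math.inf
    return gap_m / closing_speed
```

For example, a 30 m gap with the FAV at 15 m/s and the LV at 10 m/s gives a TTC of 6 s.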
The mobility is measured by the time headway, defined by the time difference between consecutive arrival instants of two vehicles passing a certain detector site on the same lane47. The time headway is considered a direct measure of road capacity. A short time headway will increase road capacity and thus increase mobility, and vice versa48. The time headway in trajectory i at time t can be calculated as follows:
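In trajectory data without a fixed detector, time headway is commonly approximated by the spatial gap divided by the follower's speed. The sketch below uses that approximation; it ignores vehicle length (a front-to-front headway would add the LV length to the gap) and returns infinity for a stationary follower, both of which are assumptions for illustration.

```python
import math


def time_headway(gap_m: float, v_fav: float) -> float:
    """Approximate time headway in seconds: spatial gap (m) over FAV speed (m/s)."""
    if v_fav <= 0.0:
        return math.inf  # headway is undefined for a stationary follower
    return gap_m / v_fav
```

For example, a 30 m gap at 20 m/s corresponds to a headway of 1.5 s.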
The stability is measured by the squared error of acceleration at time t. Larger variations in acceleration are considered a lack of smoothness, indicating discomfort and potential safety risks. The larger the squared error of acceleration, the lower the traffic stability, and vice versa. It can be calculated as follows:
The sustainability is measured by the fuel consumption rate of the FAV. We utilize the average value of three classical vehicle fuel consumption models to measure the fuel consumption, including the Virginia Tech Microscopic (VT-Micro) model49, Vehicle Specific Power (VSP) model50, and Australian Road Research Board (ARRB) model51,52. We use \({F}_{it}^{{\rm{VTM}}},{F}_{it}^{{\rm{VSP}}},{F}_{it}^{{\rm{ARRB}}}\) to represent the fuel consumption calculated by the three models in trajectory i at time \(t\in {{\mathcal{T}}}_{i}\), respectively. To simplify the notation, in equations (5)–(8), we use vit and ait to represent the velocity and acceleration of the vehicle in trajectory i at time \(t\in {{\mathcal{T}}}_{i}\). The expression of the three models is shown as follows:
- VT-Micro model:
$${F}_{it}^{{\rm{VTM}}}=\exp \left(\mathop{\sum }\limits_{{n}_{1}=0}^{3}\mathop{\sum }\limits_{{n}_{2}=0}^{3}{f}_{{n}_{1}{n}_{2}}^{{\rm{VTM}}}{\left({v}_{it}\right)}^{{n}_{1}}{\left({a}_{it}\right)}^{{n}_{2}}\right)$$(5)
where \({n}_{1},{n}_{2}\) are the power indexes and \({f}_{{n}_{1}{n}_{2}}^{{\rm{VTM}}}\) are constant coefficients, which are available in Table 6. The notation E denotes a power of ten in scientific notation.
Table 6 Coefficients of the VT-Micro model.
- VSP model:
$$VS{P}_{it}={v}_{it}\left({f}_{1}^{{\rm{VSP}}}{a}_{it}+{f}_{2}^{{\rm{VSP}}}\delta +{f}_{3}^{{\rm{VSP}}}\right)+{f}_{4}^{{\rm{VSP}}}\cdot {v}_{it}^{3}$$(6)
$${F}_{it}^{{\rm{VSP}}}=\left\{\begin{array}{ll}{f}_{5}^{{\rm{VSP}}}, & \,{\rm{if}}\,VS{P}_{it} < -10\\ {f}_{6}^{{\rm{VSP}}}VS{P}_{it}^{2}+{f}_{7}^{{\rm{VSP}}}VS{P}_{it}+{f}_{8}^{{\rm{VSP}}}, & \,{\rm{if}}\,-\,10\le VS{P}_{it}\le 10\\ {f}_{9}^{{\rm{VSP}}}VS{P}_{it}+{f}_{10}^{{\rm{VSP}}}, & \,{\rm{if}}\,VS{P}_{it} > 10\end{array}\right.$$(7)
where δ denotes the road grade, which is set to 0 in this paper since we assume the road grade can be neglected at most experiment sites, and \(VS{P}_{it}\) is the vehicle-specific power in trajectory i at time \(t\in {{\mathcal{T}}}_{i}\). The parameters are \({f}_{1}^{{\rm{VSP}}}=1.1\), \({f}_{2}^{{\rm{VSP}}}=9.81\), \({f}_{3}^{{\rm{VSP}}}=0.132\), \({f}_{4}^{{\rm{VSP}}}=3.02E-4\), \({f}_{5}^{{\rm{VSP}}}=0.00248\), \({f}_{6}^{{\rm{VSP}}}=0.00198\), \({f}_{7}^{{\rm{VSP}}}=0.0397\), \({f}_{8}^{{\rm{VSP}}}=0.201\), \({f}_{9}^{{\rm{VSP}}}=0.0793\), and \({f}_{10}^{{\rm{VSP}}}=0.00248\).
- ARRB model:
$${F}_{it}^{{\rm{ARRB}}}={f}_{1}^{{\rm{ARRB}}}+{f}_{2}^{{\rm{ARRB}}}{v}_{it}+{f}_{3}^{{\rm{ARRB}}}{v}_{it}^{2}+{f}_{4}^{{\rm{ARRB}}}{v}_{it}^{3}+{f}_{5}^{{\rm{ARRB}}}{v}_{it}\cdot {a}_{it}+{f}_{6}^{{\rm{ARRB}}}{v}_{it}\left(\max {\left(0,{a}_{it}\right)}^{2}\right)$$(8)
where the parameters are \({f}_{1}^{{\rm{ARRB}}}=0.666,{f}_{2}^{{\rm{ARRB}}}=0.019,{f}_{3}^{{\rm{ARRB}}}=0.001,{f}_{4}^{{\rm{ARRB}}}=5E-4,{f}_{5}^{{\rm{ARRB}}}=0.122\), and \({f}_{6}^{{\rm{ARRB}}}=0.793\).
In equations (5)–(8), the units for vit and ait are m/s and m/s², respectively. The units for \({F}_{it}^{{\rm{VTM}}},{F}_{it}^{{\rm{VSP}}},{F}_{it}^{{\rm{ARRB}}}\) are L/s, g/s, and ml/s, respectively. To perform unit conversions, we assume the density of fuel is 800 g/L (the density of gasoline at room temperature is approximately 720 to 775 g/L, and diesel is about 830 to 850 g/L). Therefore, the total energy consumption equation is formulated as:
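A minimal sketch of Eqs. (5)–(8) and of the averaging with unit conversion described above is given below. The combined value is taken as the mean of the three model outputs after converting each to L/s (g/s divided by 800 g/L, ml/s divided by 1000), which follows from the stated units but is an assumption about the exact form. The VT-Micro coefficients must be supplied by the caller from Table 6 (they are not reproduced here), and \({f}_{8}^{{\rm{VSP}}}\) is taken as 0.201, which keeps the piecewise function nearly continuous at VSP = ±10.

```python
import math
from typing import Sequence

FUEL_DENSITY_G_PER_L = 800.0  # fuel density assumed in the text


def vt_micro_l_per_s(v: float, a: float, coeffs: Sequence[Sequence[float]]) -> float:
    """Eq. (5): exp of a bivariate cubic polynomial in speed and acceleration.

    coeffs[n1][n2] multiplies v**n1 * a**n2; a 4x4 table taken from Table 6.
    """
    exponent = sum(
        coeffs[n1][n2] * v**n1 * a**n2 for n1 in range(4) for n2 in range(4)
    )
    return math.exp(exponent)


def vsp_g_per_s(v: float, a: float, grade: float = 0.0) -> float:
    """Eqs. (6)-(7): fuel rate in g/s from vehicle-specific power."""
    vsp = v * (1.1 * a + 9.81 * grade + 0.132) + 3.02e-4 * v**3
    if vsp < -10.0:
        return 0.00248
    if vsp <= 10.0:
        # f8 = 0.201 is an assumed reading of the constant term.
        return 0.00198 * vsp**2 + 0.0397 * vsp + 0.201
    return 0.0793 * vsp + 0.00248


def arrb_ml_per_s(v: float, a: float) -> float:
    """Eq. (8): fuel rate in ml/s."""
    return (
        0.666 + 0.019 * v + 0.001 * v**2 + 5e-4 * v**3
        + 0.122 * v * a + 0.793 * v * max(0.0, a) ** 2
    )


def mean_fuel_l_per_s(v: float, a: float, vtm_coeffs: Sequence[Sequence[float]]) -> float:
    """Average of the three models after converting each to L/s (assumed form)."""
    f_vtm = vt_micro_l_per_s(v, a, vtm_coeffs)
    f_vsp = vsp_g_per_s(v, a) / FUEL_DENSITY_G_PER_L
    f_arrb = arrb_ml_per_s(v, a) / 1000.0
    return (f_vtm + f_vsp + f_arrb) / 3.0
```

Passing the coefficient table as an argument keeps the sketch independent of any particular vehicle class; the speeds and accelerations are in m/s and m/s², as stated above.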
Figure 5 displays the distributions of the four indicators across all datasets. The distribution of TTC is predominantly right-skewed, with most of the probability mass at low values and peaks below 50 s. Notably, the Waymo Motion Dataset and Argoverse 2 Motion Forecasting Dataset peak at the lowest values, around 10 s, and their probability density is concentrated in the lower TTC range. Given that a smaller TTC indicates greater risk, this distribution suggests a higher risk associated with AVs in complex traffic environments.
The distribution of τ is primarily concentrated between 1 and 5 s. The CATS Platoon Dataset shows multiple peaks due to testing with four levels of time headway settings. In contrast, the CATS UWM Dataset exhibits a larger τ, indicating poorer mobility of its AV’s car-following model compared to the smaller τ observed in the OpenACC Database and Vanderbilt ACC Dataset. This difference may also be caused by the different experimental settings: the CATS UWM Dataset was tested in a low-speed environment with a minimum safety distance, leading to a larger τ, while the other two datasets were tested at higher speeds.
Regarding the distribution of F, some datasets, including the CATS UWM Dataset, OpenACC ZalaZone Dataset, Waymo Motion Dataset, and Argoverse 2 Motion Forecasting Dataset, predominantly have F below 0.001 L/s. Conversely, the Vanderbilt ACC Dataset, OpenACC Casale Dataset, and OpenACC Vicolungo Dataset exhibit higher F, averaging over 0.005 L/s. The variance in F across these datasets can be explained by the positive correlation between energy consumption and speed. According to Table 5, the datasets with the lower F correspond to those with the lowest average speeds, while the datasets with higher F have the highest average speeds. Additionally, differences in the vehicle car-following models’ energy efficiency across the datasets also influence it. This result reflects the distinct energy consumption characteristics associated with the vehicles in each dataset.
Lastly, the distribution of α is centered near zero for most datasets, with the probability density decreasing as α increases. Among them, the CATS ACC Dataset shows fluctuations in its distribution. The reason is that the acceleration precision of the original data is limited to one decimal place, resulting in insufficient data accuracy. This causes α to be concentrated in a few areas. Additionally, the CATS Platoon Dataset, OpenACC Vicolungo Dataset, and Waymo Perception Dataset exhibit slightly higher densities at larger α, indicating relatively poorer vehicle stability in these datasets.
Overall, the results presented in Fig. 5 are consistent with the literature. For example, TTC in most datasets is concentrated below 50 s, and the time headway τ is mainly between 1 and 2 s. This demonstrates that the Ultra-AV dataset can be used for research on AV behavior analysis. The four metrics used in this paper can be used to evaluate trajectory datasets collected in the future and can also serve as standards for the development of car-following models. Additionally, we observe the relationships between different metrics in Fig. 5: 1. τ and TTC show a negative correlation. 2. τ and F show a negative correlation. 3. τ and α show a negative correlation. This indicates that improving some metrics might lead to worse outcomes in others. For example, increasing mobility by reducing τ might reduce the spatial gap between vehicles, thereby reducing TTC and compromising vehicle safety. A future research direction is to develop models that make trade-offs among these four metrics to achieve an overall optimum.
Car-following Model Development
Although the trajectory data from FAVs offer valuable insights into their performance, these data may stem from a limited set of conditions, so the derived performance metrics may not reflect the full range of driving scenarios. Developing accurate and robust car-following models for simulation across a broader range of scenarios is therefore advantageous. Researchers have developed various car-following models, including the linear ACC model53, the nonlinear intelligent driver model54, and data-driven models55. Whatever the model structure, they all adopt \({a}_{it}^{{\rm{f}}}\) as the output and \({d}_{it}\), \({v}_{it}^{{\rm{f}}}\), and \(\Delta {v}_{it}\) as the inputs. Thus, we analyze the relationship between the output acceleration and the three input variables with scatter plots and correlation analysis.
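To make the input-output structure concrete, the sketch below fits the simplest such model, a linear map from gap, FAV speed, and speed difference to FAV acceleration, by ordinary least squares. It is an illustrative assumption and not the repository's `model_calibration.py`; the function names and the generic form a = b0 + b1·d + b2·v + b3·Δv are hypothetical choices (a linear ACC law with a constant time-gap policy can be rewritten in this form).

```python
from typing import List, Sequence


def solve_linear_system(A: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: inputs are collinear")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_linear_car_following(
    d: Sequence[float], v_f: Sequence[float], dv: Sequence[float], a_f: Sequence[float]
) -> List[float]:
    """Least-squares coefficients [b0, b1, b2, b3] of a = b0 + b1*d + b2*v + b3*dv."""
    rows = [[1.0, di, vi, dvi] for di, vi, dvi in zip(d, v_f, dv)]
    k = 4
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, a_f)) for i in range(k)]
    return solve_linear_system(xtx, xty)
```

The normal equations are adequate for a four-parameter fit on well-scaled data; for larger or ill-conditioned problems a QR-based solver from a numerical library would be the safer choice.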
Figure 6 displays the relationship between the output acceleration and the three input variables. To accurately depict the relationships among the variables, a moving average with a window length of three was applied to smooth the \({a}_{it}^{{\rm{f}}}\) data before plotting. In Fig. 6, a notably nonlinear positive correlation between \({a}_{it}^{{\rm{f}}}\) and \({d}_{it}\) is evident, particularly in the datasets from the OpenACC Database, where it can be characterized by a logarithmic curve. In addition, most datasets show a linear positive correlation between \({a}_{it}^{{\rm{f}}}\) and \(\Delta {v}_{it}\). In contrast, there is no clear relationship between \({a}_{it}^{{\rm{f}}}\) and \({v}_{it}^{{\rm{f}}}\). This arises because \({a}_{it}^{{\rm{f}}}\) is influenced by all three variables collectively and cannot be explained by a single variable. Therefore, a deeper analysis of the relationships among variables in car-following behavior is necessary, as is the development of specific car-following models. Moreover, car-following models may vary across vehicle types, so more detailed studies should analyze data from the same vehicle. This paper does not delve into specific models, but researchers can refer to the latest reviews in this field56,57 to conduct studies using the provided data.
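A moving average with a window length of three can be sketched as follows. The centered window and the shrinking window at the two ends of the series are assumptions, since the text does not specify the alignment or the edge handling.

```python
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int = 3) -> List[float]:
    """Centered moving average; the window shrinks at the series boundaries."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(values)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed
```

A short window such as three removes sample-to-sample noise in the acceleration signal while leaving the slower car-following dynamics largely intact.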
The relationship between \({a}^{{\rm{f}}}\) and the three variables can also be revealed with correlation coefficients. Table 7 shows the Pearson and Spearman correlation coefficients between \({a}^{{\rm{f}}}\) and d, \({v}^{{\rm{f}}}\), and Δv for all datasets. According to Table 7, \({a}^{{\rm{f}}}\) has a positive correlation with d and Δv, and a negative correlation with \({v}^{{\rm{f}}}\). This result is consistent with intuition: when d is large, the FAV accelerates to close the gap; when \({v}^{{\rm{f}}}\) is large, the FAV slows down to stabilize at a following speed; and when Δv is positive, the FAV accelerates to match the LV speed. The correlation coefficient for d generally ranges from 0 to 0.4, indicating a weak correlation. For \({v}^{{\rm{f}}}\), the coefficient ranges from −0.2 to 0, indicating little to no linear or monotonic correlation. For Δv, the coefficient is above 0.5, indicating a strong correlation. Since the Pearson and Spearman correlation coefficients only reflect linear and monotonic relationships, respectively, we suppose that the influence of \({v}^{{\rm{f}}}\) on \({a}^{{\rm{f}}}\) in the car-following model is nonlinear. This insight motivates the development of piecewise linear or nonlinear car-following models.
The scatter plots and the correlation analysis show that the Ultra-AV dataset reflects clear relationships between acceleration and the other variables. These relationships are consistent with the literature and with intuition, validating that our dataset can be used for the development of car-following models. The analysis also indicates a certain nonlinearity in the relationships between the variables and acceleration, particularly for \({v}^{{\rm{f}}}\). Researchers are therefore encouraged to consider the nonlinear relationship between \({v}^{{\rm{f}}}\) and \({a}^{{\rm{f}}}\) in future model development.
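The two coefficients can be computed with the standard library alone, as in the sketch below. The function names are hypothetical; Spearman's coefficient is obtained as the Pearson coefficient of the ranks, with ties receiving their average rank.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: strength of the monotonic relationship."""
    return pearson(average_ranks(x), average_ranks(y))
```

A monotonic but curved relationship, such as y = x² on positive x, gives a Spearman coefficient of 1 while the Pearson coefficient stays below 1. A relationship that changes direction can drive both toward zero, which is why low coefficients alone do not rule out a nonlinear dependence.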
Overview of the AV longitudinal trajectory open datasets
Comprehensive details for the datasets reviewed in this paper, including data collection, AV information, test sites, and experiment settings, are summarized in Table 8. The datasets are labeled by IDs 1-14, where: 1 = Vanderbilt Two-vehicle ACC Dataset; 2 = MicroSimACC Dataset; 3 = CATS ACC Dataset; 4 = CATS Platoon Dataset; 5 = CATS UWM Dataset; 6 = OpenACC Casale Dataset; 7 = OpenACC Vicolungo Dataset; 8 = OpenACC Asta Dataset; 9 = OpenACC ZalaZone Dataset; 10 = Ohio Single-vehicle Dataset; 11 = Ohio Two-vehicle Dataset; 12 = Waymo Perception Dataset; 13 = Waymo Motion Dataset; 14 = Argoverse 2 Motion Forecasting Dataset.
The content recorded in the table includes four aspects: data collection, AV information, test sites, and experiment settings.
- Data Collection: This includes the sensors used for data collection, the accuracy of the sensors, and the frequency of the data in the datasets.
- AV Information: This includes the automation level, brand, model, model year, powertrain, and dimensions of the tested AVs. The automation level is categorized into two levels: ADAS and ADS, which usually refer to low-level and high-level AVs, respectively. The powertrain is divided into three types: internal combustion engine vehicle (ICEV), hybrid vehicle (HV), and electric vehicle (EV).
- Test Sites: This includes the accessibility of the test sites, the road type, and the weather conditions during testing. Accessibility can be either public, indicating tests conducted on open roads, or closed, indicating tests conducted in a closed test facility.
- Experiment Settings: This includes the duration of the tests, the speed settings, the headway settings, and the drive type. The drive type can be either naturalistic or artificial. The former represents a scenario where the LV is in a natural driving state, while the latter represents a scenario where speed oscillations are artificially induced during the experiment.
Regarding the AVs used in testing, most datasets test only 1-2 AVs. The OpenACC Dataset includes numerous AVs. Specifically:
- OpenACC Vicolungo Dataset tests the following AVs: Ford S-Max 2018 ICEV SUV, KIA Niro 2019 HV SUV, Mini Cooper 2018 ICEV Hatchback, Mitsubishi Outlander PHEV 2018 HV SUV, Mitsubishi SpaceStar 2018 ICEV SUV, Peugeot 3008GTLine 2018 ICEV SUV, VW GolfE 2018 EV SUV.
- OpenACC Asta Dataset tests the following AVs: Audi A6 2018 ICEV Sedan, Audi A8 2018 ICEV Sedan, BMW X5 2018 ICEV SUV, Mercedes AClass 2019 ICEV Sedan, Tesla Model3 2019 EV Sedan.
- OpenACC ZalaZone Dataset tests the following AVs: Audi A4 Avant 2019 HV SUV, Audi E-Tron 2019 EV SUV, BMW I3S 2018 HV Hatchback, Jaguar I-Pace 2019 EV Hatchback, Mazda 3 2019 ICEV Sedan, Mercedes-Benz GLE 450 4Matic 2019 HV SUV, Smart BME Addv (developed by Budapest University of Technology and Economics), Skoda Octavia RS 2019 ICEV SUV, Tesla Model3 2019 EV Sedan, Tesla ModelS 2019 EV Sedan, Tesla ModelX 2016 EV Hatchback, Toyota RAV4 2019 HV SUV.
Additionally, the Ohio ACC Datasets include tests of two retrofitted AVs: a retrofitted Tesla Sedan and a retrofitted Ford Fusion Sedan from AutonomouStuff Company. The Ford Fusion Hybrid, tested in the Argoverse 2 Motion Forecasting Dataset, is integrated with Argo AI self-driving technology.
Regarding the data collection sensors, specific details for some datasets are as follows:
- The MicroSimACC Dataset collects trajectory data from the vehicles’ onboard computers using an On-Board Diagnostics (OBD) II data logger.
- The OpenACC Asta Dataset uses the RT-Range S multiple target ADAS measurement solution by Oxford Technical Solutions Company (https://www.oxts.com/).
- The OpenACC ZalaZone Dataset employs three sensors: the Race Logic VBOX with 0.02 m position accuracy and 0.03 m/s speed accuracy, the Ublox 9 with 0.3 m position accuracy and 0.14 m/s speed accuracy, and a tracker app from ZalaZone with 10 m position accuracy and 0.28 m/s speed accuracy.
- The Tesla vehicle in the Ohio ACC Dataset is equipped with a 32-line Velodyne LiDAR with 0.03 m position accuracy, two pluggable USB monocular cameras, and the RT3000 with 0.01 m position accuracy from OXTS company. In the Ford vehicle, the RT3000 is replaced by the Novatel SPAN from Novatel Company (https://novatel.com/).
- The Waymo Open Dataset uses LiDAR and a high-resolution pinhole camera.
- The Argoverse 2 Motion Forecasting Dataset uses three sensors: a 32-line Velodyne LiDAR with 0.03 m position accuracy, high-resolution ring cameras, and front-facing stereo cameras.
Code availability
The code for data extraction and analysis has been documented and made accessible at https://github.com/CATS-Lab/Filed-Experiment-Data-ULTra-AV. The detailed contents include:
• Readme.md: A general description of the raw data for the dataset.
• main.py: The main function calls data processing and analysis functions for each dataset.
• trajectory_extraction.py: Code used in Step 1 to extract AV longitudinal trajectories.
• data_transformation.py: Code used in Step 1 to convert all datasets to a unified format.
• data_cleaning.py: Code used in Steps 2 and 3 for data cleaning.
• data_analysis.py: Code used to analyze data statistics, plot traffic performance of datasets, and plot scatter plots.
• model_calibration.py: An example tool to use the processed data to calibrate a linear car-following model.
We also recommend using other software packages such as R to effectively analyze the trajectory data. These tools are well-suited for handling the dataset’s format.
Data usage is restricted to research purposes only. Any commercial exploitation of the data requires separate approval and possibly additional agreements.
References
Calvert, S. et al. Traffic flow of connected and automated vehicles: Challenges and opportunities. Road vehicle automation 4 235–245 (2018).
Jin, W. L. On the equivalence between continuum and car-following models of traffic flow. Transportation Research Part B: Methodological 93, 543–559, https://doi.org/10.1016/j.trb.2016.08.007 (2016).
Kerner, B. S. Physics of automated driving in framework of three-phase traffic theory https://doi.org/10.1103/PhysRevE.97.042303 (2018).
Jiang, R. et al. On some experimental features of car-following behavior and how to model them. Transportation Research Part B: Methodological 80, 338–354, https://doi.org/10.1016/j.trb.2015.08.003 (2015).
Qu, X., Zhang, J. & Wang, S. On the stochastic fundamental diagram for freeway traffic: Model development, analytical properties, validation, and extensive applications. Transportation Research Part B: Methodological 104, 256–271, https://doi.org/10.1016/j.trb.2017.07.003 (2017).
Chen, X. et al. Follownet: A comprehensive benchmark for car-following behavior modeling. Scientific Data 10, 828 (2023).
Axelsson, J. Safety in vehicle platooning: A systematic literature review. IEEE Transactions on Intelligent Transportation Systems 18, 1033–1045 (2016).
Wang, M. et al. Delay-compensating strategy to enhance string stability of adaptive cruise controlled vehicles. Transportmetrica B: Transport Dynamics 6, 211–229 (2018).
Ubiergo, G. A. & Jin, W.-L. Mobility and environment improvement of signalized networks through vehicle-to-infrastructure (v2i) communications. Transportation Research Part C: Emerging Technologies 68, 70–82, https://doi.org/10.1016/j.trc.2016.03.010 (2016).
Feng, S., Yan, X., Sun, H., Feng, Y. & Liu, H. X. Intelligent driving intelligence test for autonomous vehicles with naturalistic and adversarial environment. Nature Communications 12, 748 (2021).
Ma, C., Yu, C. & Yang, X. Trajectory planning for connected and automated vehicles at isolated signalized intersections under mixed traffic environment. Transportation research part C: emerging technologies 130, 103309 (2021).
Ma, C., Yu, C., Zhang, C. & Yang, X. Signal timing at an isolated intersection under mixed traffic environment with self-organizing connected and automated vehicles. Computer-Aided Civil and Infrastructure Engineering 38, 1955–1972 (2023).
Ma, K. & Wang, H. Influence of exclusive lanes for connected and autonomous vehicles on freeway traffic flow. IEEE Access 7, 50168–50178 (2019).
Shi, X. & Li, X. Empirical study on car-following characteristics of commercial automated vehicles with different headway settings. Transportation Research Part C: Emerging Technologies 128, https://doi.org/10.1016/j.trc.2021.103134 (2021).
Feng, S. et al. Dense reinforcement learning for safety validation of autonomous vehicles. Nature 615, 620–627 (2023).
Ma, Y. et al. Trafficpredict: Trajectory prediction for heterogeneous traffic-agents. In Proceedings of the AAAI conference on artificial intelligence, vol. 33, 6120–6127 (2019).
Yu, F. et al. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2636–2645 (2020).
Chang, M.-F. et al. Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8748–8757 (2019).
Wilson, B. et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. arXiv preprint arXiv:2301.00493 (2023).
Sun, P. et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2446–2454 (2020).
Geiger, A., Lenz, P., Stiller, C. & Urtasun, R. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research 32, 1231–1237 (2013).
Caesar, H. et al. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11621–11631 (2020).
Mao, J. et al. One million scenes for autonomous driving: Once dataset. arXiv preprint arXiv:2106.11037 (2021).
Alibeigi, M. et al. Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 20178–20188 (2023).
Zhou, H., Ma, K. & Li, X. A review on trajectory datasets on advanced driver assistance system equipped-vehicles. In 2024 IEEE Intelligent Vehicles Symposium (IV), 1947–1952, https://doi.org/10.1109/IV55156.2024.10588821 (2024).
Kesting, A., Treiber, M. & Helbing, D. General lane-changing model mobil for car-following models. Transportation Research Record 1999, 86–94 (2007).
Deng, J. et al. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255 (IEEE, 2009).
Punzo, V., Borzacchiello, M. T. & Ciuffo, B. On the assessment of vehicle trajectory data accuracy and application to the next generation simulation (ngsim) program data. Transportation Research Part C: Emerging Technologies 19, 1243–1262 (2011).
Zhou, H., Ma, K., Liang, S., Li, X. & Qu, X. Ultra-AV: A unified longitudinal trajectory dataset for automated vehicle https://doi.org/10.6084/m9.figshare.26339512.v1 (2024).
Xia, X. et al. An automated driving systems data acquisition and analytics platform. Transportation Research Part C: Emerging Technologies 151, 104120 (2023).
Maddern, W., Pascoe, G., Linegar, C. & Newman, P. 1 year, 1000 km: The oxford robotcar dataset. The International Journal of Robotics Research 36, 3–15 (2017).
Houston, J. et al. One thousand and one hours: Self-driving motion prediction dataset. In Conference on Robot Learning, 409–418 (PMLR, 2021).
Wang, Y., Gunter, G., Nice, M. & Work, D. B. Estimating adaptive cruise control model parameters from on-board radar units. arXiv preprint arXiv:1911.06454 (2019).
Yang, M. et al. Microsimacc: an open database for field experiments on the potential capacity impact of commercial adaptive cruise control (acc). Transportmetrica A: Transport Science 1–30 (2024).
Shi, X. & Li, X. Empirical study on car-following characteristics of commercial automated vehicles with different headway settings. Transportation Research Part C: Emerging Technologies 128, 103134 (2021).
Makridis, M., Mattas, K., Anesiadou, A. & Ciuffo, B. Openacc. an open database of car-following experiments to study the properties of commercial acc systems. Transportation Research Part C: Emerging Technologies 125, 103047 (2021).
Hu, X., Zheng, Z., Chen, D., Zhang, X. & Sun, J. Processing, assessing, and enhancing the waymo autonomous vehicle open dataset for driving behavior research. Transportation Research Part C: Emerging Technologies 134, 103490 (2022).
Xu, X., Zheng, Z., Hu, Z., Feng, K. & Ma, W. A unified dataset for the city-scale traffic assignment model in 20 us cities. Scientific Data 11, 325 (2024).
Ibiknle, D. Average car sizes & dimensions (2023).
National Association of City Transportation Officials (NACTO). Lane width - urban street design guide. Accessed: 2023-05-06 (2023).
DriveSafe Online. Safe following distance: Follow the 3 second rule Accessed: 2024-05-06 (2020).
Liu, T., Fu, R. et al. The relationship between different safety indicators in car-following situations. In 2018 IEEE Intelligent Vehicles Symposium (IV), 1515–1520 (IEEE, 2018).
Mai, M., Wang, L. & Prokop, G. Advancement of the car following model of wiedemann on lower velocity ranges for urban traffic simulation. Transportation Research Part F: Traffic Psychology and Behaviour 61, 30–37 (2019).
Alotibi, F. & Abdelhakim, M. Anomaly detection for cooperative adaptive cruise control in autonomous vehicles using statistical learning and kinematic model. IEEE Transactions on Intelligent Transportation Systems 22, 3468–3478 (2020).
Wang, F.-Y. et al. Transportation 5.0: The dao to safe, secure, and sustainable intelligent transportation systems. IEEE Transactions on Intelligent Transportation Systems (2023).
Minderhoud, M. M. & Bovy, P. H. Extended time-to-collision measures for road traffic safety assessment. Accident Analysis & Prevention 33, 89–97 (2001).
Ha, D.-H., Aron, M. & Cohen, S. Time headway variable and probabilistic modeling. Transportation Research Part C: Emerging Technologies 25, 181–201 (2012).
Li, X. Trade-off between safety, mobility and stability in automated vehicle following control: An analytical method. Transportation Research Part B: Methodological 166, 1–18 (2022).
Zegeye, S., De Schutter, B., Hellendoorn, J., Breunesse, E. & Hegyi, A. Integrated macroscopic traffic flow, emission, and fuel consumption model for control purposes. Transportation Research Part C: Emerging Technologies 31, 158–171 (2013).
Duarte, G. O., Gonçalves, G. A., Baptista, P. C. & Farias, T. L. Establishing bonds between vehicle certification data and real-world vehicle fuel consumption–a vehicle specific power approach. Energy Conversion and Management 92, 251–265 (2015).
Akcelik, R. Efficiency and drag in the power-based model of fuel consumption. Transportation Research Part B: Methodological 23, 376–385 (1989).
Knoop, V. L. et al. Platoon of sae level-2 automated vehicles on public roads: Setup, traffic interactions, and stability. Transportation Research Record 2673, 311–322 (2019).
Ma, K. et al. String stability of automated vehicles based on experimental analysis of feedback delay and parasitic lag. Transportation Research Part C: Emerging Technologies 145, 103927 (2022).
Treiber, M., Hennecke, A. & Helbing, D. Congested traffic states in empirical observations and microscopic simulations. Physical Review E 62, 1805 (2000).
Zhu, M., Wang, X. & Wang, Y. Human-like autonomous car-following model with deep reinforcement learning. Transportation Research Part C: Emerging Technologies 97, 348–368 (2018).
Brackstone, M. & McDonald, M. Car-following: a historical review. Transportation Research Part F: Traffic Psychology and Behaviour 2, 181–196 (1999).
Wang, Z., Shi, Y., Tong, W., Gu, Z. & Cheng, Q. Car-following models for human-driven vehicles and autonomous vehicles: A systematic review. Journal of Transportation Engineering, Part A: Systems 149, 04023075 (2023).
Acknowledgements
This research is sponsored by the National Science Foundation, USA, through Grants CMMI #1932452 and CMMI #2343167.
Author information
Contributions
H.Z. conducted the experiments and analyzed the results; K.M. conceived the experiment; S.L. conducted the experiments; X.L. and X.Q. revised the paper. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhou, H., Ma, K., Liang, S. et al. A unified longitudinal trajectory dataset for automated vehicle. Sci Data 11, 1123 (2024). https://doi.org/10.1038/s41597-024-03795-y
DOI: https://doi.org/10.1038/s41597-024-03795-y